Identify and separate observations into distinct categories.
Examples: tag emails as spam, diagnose a particular disease in a patient, identify defective products in a factory line, discard background events in physics analysis, etc...
Any algorithm or procedure that maps a set of inputs into a discrete value is called a "Classifier"
In general this will be some kind of function of N variables (or features) grouping events into classes, according to the values of the associated variables.
Binary classification: only two categories/classes are considered
Confusion matrix:
S (predicted) | B (predicted) | |
S (real) | True positives | False negatives |
B (real) | False positives | True negatives |
Type I Error
Type II Error
Confusion matrix:
Receiver Operating Characteristic (ROC)
Simulation from the CTA experiment
Goal: reject showers caused by hadrons, while keeping showers from gamma rays
10 features (variables)
We will try two methods
- Likelihood ratio
- Boosted Decision Trees (BDT)
You will be given a ROOT file with two TTree objects inside (one for Signal events, one for Background events), each having 10 Branches
❯ root -l magic04.root
root [0]
Attaching file magic04.root as _file0...
(TFile *) 0x555bf47811c0
root [1] .ls
TFile** magic04.root
TFile* magic04.root
KEY: TTree Signal;1
KEY: TTree Background;1
root [2] Signal->Print()
******************************************************************************
*Tree :Signal : *
*Entries : 12332 : Total = 501716 bytes File Size = 395259 *
* : : Tree compression factor = 1.00 *
******************************************************************************
*Br 0 :fLength : fLength/F *
*Entries : 12332 : Total Size= 50149 bytes All baskets in memory *
*Baskets : 1 : Basket Size= 32000 bytes Compression= 1.00 *
*............................................................................*
*Br 1 :fWidth : fWidth/F *
*Entries : 12332 : Total Size= 50141 bytes All baskets in memory *
*Baskets : 1 : Basket Size= 32000 bytes Compression= 1.00 *
*............................................................................*
*Br 2 :fSize : fSize/F *
*Entries : 12332 : Total Size= 50133 bytes All baskets in memory *
*Baskets : 1 : Basket Size= 32000 bytes Compression= 1.00 *
*............................................................................*
*Br 3 :fConc : fConc/F *
*Entries : 12332 : Total Size= 50133 bytes All baskets in memory *
*Baskets : 1 : Basket Size= 32000 bytes Compression= 1.00 *
*............................................................................*
*Br 4 :fConc1 : fConc1/F *
*Entries : 12332 : Total Size= 50141 bytes All baskets in memory *
*Baskets : 1 : Basket Size= 32000 bytes Compression= 1.00 *
*............................................................................*
*Br 5 :fAsym : fAsym/F *
*Entries : 12332 : Total Size= 50133 bytes All baskets in memory *
*Baskets : 1 : Basket Size= 32000 bytes Compression= 1.00 *
*............................................................................*
*Br 6 :fM3Long : fM3Long/F *
*Entries : 12332 : Total Size= 50149 bytes All baskets in memory *
*Baskets : 1 : Basket Size= 32000 bytes Compression= 1.00 *
*............................................................................*
*Br 7 :fM3Trans : fM3Trans/F *
*Entries : 12332 : Total Size= 50157 bytes All baskets in memory *
*Baskets : 1 : Basket Size= 32000 bytes Compression= 1.00 *
*............................................................................*
*Br 8 :fAlpha : fAlpha/F *
*Entries : 12332 : Total Size= 50141 bytes All baskets in memory *
*Baskets : 1 : Basket Size= 32000 bytes Compression= 1.00 *
*............................................................................*
*Br 9 :fDist : fDist/F *
*Entries : 12332 : Total Size= 50133 bytes All baskets in memory *
*Baskets : 1 : Basket Size= 32000 bytes Compression= 1.00 *
*............................................................................*
We will begin with an "exploratory" exercise.
Before attempting to build a classifier it is always a good idea to visualize the dataset and get a feeling of how different the classes are, among the features of the dataset.
Main goals:
You will find the data at ~vformato/2023/data/magic04.root on the course VM
Opening a file
TFile* my_file = TFile::Open("filename.root");
// ...
my_file->Close();
Getting objects from files
TH1D* histo_from_file = my_file->Get<TH1D>("histo");
// Older alternative:
// TH1D* histo_from_file = (TH1D*) my_file->GetObject("histo");
if (!histo_from_file){
std::cerr << "Could not find histo on file!\n";
}
Looping over a TTree:
TTree* my_tree = my_file->Get<TTree>("tree");
int intVar;
float floatVar;
my_tree->SetBranchAddress("intVar", &intVar);
my_tree->SetBranchAddress("floatVar", &floatVar);
size_t n_events = my_tree->GetEntries();
for (size_t iev = 0; iev < n_events; ++iev) {
my_tree->GetEntry(iev);
// now you can use intVar and/or floatVar
}
ROOT tutorials and most people will teach to use new to create objects such as histograms and trees.
This is because the ROOT interpreter has some internal mechanism of ownership and memory management but in general it's not considered a good practice.
However, if you follow "good practice" you'll have problems due to how ROOT expects objects to live.
TTree* my_tree = my_file->Get<TTree>("tree");
int intVar;
float floatVar;
my_tree->SetBranchAddress("intVar", &intVar);
my_tree->SetBranchAddress("floatVar", &floatVar);
TH1D* my_histo_i = new TH1D("histo_i", "Title;Xlabel;Ylabel", 100, 0, 100);
TH1D* my_histo_f = new TH1D("histo_f", "Title;Xlabel;Ylabel", 100, 0, 100);
size_t n_events = my_tree->GetEntries();
for (size_t iev = 0; iev < n_events; ++iev) {
my_tree->GetEntry(iev);
my_histo_i->Fill(intVar);
my_histo_f->Fill(floatVar);
}
Histograms and graphs can be drawn on screen using the Draw() method.
It is often useful to do it on a pre-created TCanvas.
This allows you to specify the size of the window, divide the canvas into multiple sub-plots, and even print it as a png/pdf...
// ...
TH1D* my_histo1 = new TH1D("histo1", "Title;Xlabel;Ylabel", 100, 0, 100);
TH1D* my_histo2 = new TH1D("histo2", "Title;Xlabel;Ylabel", 100, 0, 100);
size_t n_events = my_tree->GetEntries();
for (size_t iev = 0; iev < n_events; ++iev) {
//...
}
TCanvas* my_canvas = new TCanvas("canvas", "My Title", 0, 0, 1024, 600);
my_canvas->Divide(2, 1); // 2 sub-canvas in one line
my_canvas->cd(1);
my_histo1->Draw();
my_canvas->cd(2);
my_histo2->Draw();
my_canvas->Print("my_plot.png");
As a first step try to visualize the distribution of each feature by comparing signal vs background.
You want to get a feeling of which variable is more powerful and which ones might need some normalization / scaling / transformation.
This will still be very useful since it's basically the first step in order to build a likelihood function.
As a first step try to visualize the distribution of each feature by comparing signal vs background.
You want to get a feeling of which variable is more powerful and which ones might need some normalization / scaling / transformation.
This will still be very useful since it's basically the first step in order to build a likelihood function.
(Bonus points if you manage to do also correlation plots)
Now we will build a Maximum Likelihood estimator
The Likelihood function is defined as the joint probability for observing the simultaneous realization of \(N\) random variables, as a function of their p.d.f. parameters
$$\mathcal L(\mathbf \theta) \equiv \prod_{i=1}^N P (\mathbf x_i; \mathbf \theta) $$
Usually you'd want to use this to find the set of parameters that maximise this function and use it as an estimator for some physical observable(s)
$$\hat \theta = \argmax_\theta \mathcal L (\theta)$$
But it can also be used as a binary (or multi-class) classifier
Under the assumption that all observations are i.i.d
Suppose we have a set of \(N\) features (i.e. variables) \(\mathbf x\) and our observations all belong to the set of classes \( \mathcal Y = \{+1, -1\} \) (i.e. signal and background)
Now let us define as \(f_{+1}\) and \(f_{-1}\) as the joint p.d.f. for \(\mathbf x\) for the two classes
$$ f_{+1} (\mathbf x) = f(\mathbf x \, | \, Y = +1) \, , \, f_{-1} (\mathbf x) = f(\mathbf x \, | \, Y = -1)$$
also known as conditional probabilities
Let us now define the Likelihood ratio as
$$ \lambda = \frac {\mathcal L_{+1}}{\mathcal L_{-1}} \equiv \frac {f_{+1} (\mathbf x)}{f_{-1} (\mathbf x)} $$
and we can generally assign a class to an event if \( \lambda > k \) for a given value of \(k\) that we will choose in order to reach the desired level of accuracy (and/or purity)
Now we turn our attention to the conditional probabilities
Even though it is almost always never the case, we can start by assuming all the features are independent from each other, so:
$$ \mathcal L_{\pm 1} = f_{\pm 1} (\mathbf x) =\prod_{i=1}^N P_i^{\pm 1} (x_i; \mathbf \theta) $$
and we can use the marginal p.d.f. of each feature to compute the two likelihoods.
It is usually helpful to work in term of the log-likelihood:
$$ \log \mathcal L_{\pm 1} = \sum_{i=1}^N \log P_i^{\pm 1} (x_i; \mathbf \theta) $$
which takes more easy-to-handle values
How do we proceed on building our estimator?
The split between two samples allows us to check for overfitting (i.e. if the classifier distributions don't agree between the two samples, we might have problems)
Remember: after you compute the p.d.f. they should have unitary integral!
Boosted Decision Trees
Likelihood classifiers are often not powerful enough.
In addition, they fall short for several reasons:
Several techniques can perform much better:
Suppor vector machines, Boosted decision trees, Dense neural networks, etc...
Likelihood classifiers are often not powerful enough.
In addition, they fall short for several reasons:
Several techniques can perform much better:
Suppor vector machines, Boosted decision trees, Dense neural networks, etc...
Let's start with a classical cut-based analysis
You might start by choosing a variable, applying a threshold and keeping only events above/below the threshold.
Rinse and repeat with another variable, until you reach the desired purity.
The only problem is that with each variable \(i\) you cut on, you select events with some efficiency \(\varepsilon_i\), and the total efficiency after all your selections will be
$$ \varepsilon = \prod_i \varepsilon_i $$
which will decrease significantly unless all your cuts have a very high efficiency...
What if instead of rejecting events if they fail one cut, we continue analyzing them?
Start with the "root" node, it will contain all our events. Then
This algorithm is "greedy", building on locally optimal choices regardless of the overall result.
At each node all variables are always considered. Even if already used for splitting another node.
DTs are "human readable". Just a bunch of selections.
To evaluate the tree, follow the cuts depending on each variable value until you reach a leaf node. The result could either be the leaf purity, or a binary decision based on some purity (\(s/s+b\)) threshold.
There are several parameters involved in building a decision tree
Our choice for splitting points should be based on some solid principles. For example, we might choose to split depending on the value of some function, related to the node impurity. Such a function should be:
Then we can quantify how much the separation improves after a split by evaluating the "impurity decrease" for a split \(S\) on a node \(t\)
$$ \Delta i(S,t) = i(t) - p_P i(t_P) - p_F i(t_F)$$
where \(p_{P/F}\) is the fraction of events that pass/fail the split condition. Our goal then reduces to finding the optimal split \(S^*\) such that
$$ \Delta i(S^*,t) = \max_{S \in \{\text{splits}\}} i(S, t) $$
Some common choices for the impurity function:
Decision trees are particularly good at dealing with high number of variables, and are resilient to many problems that affect other classifiers
That being said, they have a shortcomings as well, for example:
Trees can overfit, as any classifier, and there are several methods to mitigate this
Trees can overfit, as any classifier, and there are several methods to mitigate this
Several kind of ensemble learning for DTs:
Boosting attempts building trees that are progressively more specialized in classifying previously misclassified events.
Let's take a look at the first implementation as an example:
This initial idea can be further generalized
Consider a number \(N_\text{tree}\) of classifiers, where the \(k\)-th tree has been trained on sample \(\mathcal T_k\) with \(N_k\) events. Each event has a weight \(w^k_i\) and variables \(\mathbf x_i\), with class \(y_i\).
Now, for each sample \(k\)
The final classifier output will be a function \(F(T_1, ..., T_{N_\text{tree}})\), tipically a weighted average
$$ \lambda_i = \sum_{k=1}^{N_\text{tree}} \alpha_k T_k (\mathbf x_i) $$
and will be an almost-continuous variable
Take tree \(T_k\) and let us write the misclassification for event \(i\) as
$$ m_i^k = \mathcal I (y_i T_k(\mathbf x_i) < 0) $$
or (in case the tree output is the purity)
$$m_i^k = \mathcal I (y_i [T_k(\mathbf x_i)-0.5] < 0) $$
where \(\mathcal I(X)\) is 1 if \(X\) is true and 0 otherwise.
This way we can define the misclassification rate as
$$ R(T_k) = \varepsilon_k = \frac{\sum_{i=1}^{N_k} w_i^k m_i^k}{\sum_{i=1}^{N_k} w_i^k} $$
and assign the weight to the tree
$$ \alpha_k = \beta \log \frac{1 - \varepsilon_k}{\varepsilon_k} $$
where \(\beta\) is a free parameter (often called learning rate)
Now we can create the sample for the next tree by just adjusting the weight of each event
$$ w_i^k \rightarrow w_i^{k+1} = w_i^k e^{\alpha_k m_i^k} $$
Note how the weight increases only for misclassified events! Trees become increasingly focused on wrongly labelled events and will try harder to correctly classify them
The final output will then be
$$ \lambda_i = \frac{1}{\sum_k \alpha_k} \sum_k \alpha_k T_k (\mathbf x_i) $$
Interesting side note:
$$ \varepsilon \leq \prod_k 2 \sqrt {\varepsilon_k (1 - \varepsilon_k)} $$
which means that it goes to zero with \(N_\text{tree} \rightarrow \infty\) , which means overfitting
Another way to approach the same problem is to turn it into a minimization problem, where adding trees will go towards decreasing a chosen loss function
Let's take the classifier at step \(k\), \(T_k\), we aim now to improve it incrementally
$$ T_{k+1} (\mathbf x) = T_k(\mathbf x) + h(\mathbf x) $$
Now, instead of training a new classifier T_{k+1} we choose to train another classifier, specialized in fitting the residual \(h(\mathbf x)\)
This particular formulation can be seen as minimizing a quadratic loss function of the form
$$ L(\mathbf x, y) = \frac{1}{2} (y - T(\mathbf x))^2 $$
in fact
$$ \frac{\partial L}{\partial T(\mathbf x)} = T(\mathbf x) - y $$
It can be shown that AdaBoost can be recovered as a special case in which
$$ L(\mathbf x, y) = e^{-y T(\mathbf x)}$$
We will use the TMVA framework inside ROOT to build our tree(s). It will help us automate a lot of what we've seen so far, so that we don't have to worry with implementing everything from scratch this time.
We will use the TMVA framework inside ROOT to build our tree(s). It will help us automate a lot of what we've seen so far, so that we don't have to worry with implementing everything from scratch this time.
Let's see how we can train a BDT. We start with loading the TMVA library and creating a factory object.
TMVA::Tools::Instance();
auto factory = std::make_unique<TMVA::Factory>(
"StatExam", output_tfile,
"!V:!Silent:Color:DrawProgressBar:Transformations=I;D;P;G,D:AnalysisType=Classification");
We will use the TMVA framework inside ROOT to build our tree(s). It will help us automate a lot of what we've seen so far, so that we don't have to worry with implementing everything from scratch this time.
Let's see how we can train a BDT. We start with loading the TMVA library and creating a factory object. Then we create a "dataloader".
TMVA::Tools::Instance();
auto factory = std::make_unique<TMVA::Factory>(
"StatExam", output_tfile,
"!V:!Silent:Color:DrawProgressBar:Transformations=I;D;P;G,D:AnalysisType=Classification");
auto dataloader = std::make_unique<TMVA::DataLoader>("dataset");
We then need to inform the dataloader about the variables in our dataset
// for each variable:
dataloader->AddVariable(variable_name, variable_title, "", 'F');
We then need to inform the dataloader about the variables in our dataset
// for each variable:
dataloader->AddVariable(variable_name, variable_title, "", 'F');
You can even define new variables as combination of existing ones
dataloader->AddVariable("var1 + (var2 / 3)", "Combined variable", "", 'F');
We then need to inform the dataloader about the variables in our dataset
// for each variable:
dataloader->AddVariable(variable_name, variable_title, "", 'F');
You can even define new variables as combination of existing ones
dataloader->AddVariable("var1 + (var2 / 3)", "Combined variable", "", 'F');
then, add the two trees to the dataloader and let it "prepare" the training and validation sets
dataloader->AddSignalTree(tree_sig);
dataloader->AddBackgroundTree(tree_bkg);
dataloader->PrepareTrainingAndTestTree("", "SplitMode=Alternate:NormMode=NumEvents:!V");
We then need to inform the dataloader about the variables in our dataset
// for each variable:
dataloader->AddVariable(variable_name, variable_title, "", 'F');
You can even define new variables as combination of existing ones
dataloader->AddVariable("var1 + (var2 / 3)", "Combined variable", "", 'F');
then, add the two trees to the dataloader and let it "prepare" the training and validation sets
dataloader->AddSignalTree(tree_sig);
dataloader->AddBackgroundTree(tree_bkg);
dataloader->PrepareTrainingAndTestTree("", "SplitMode=Alternate:NormMode=NumEvents:!V");
the parameters you see manage how the two sets are created:
Check Section 3.1.4 of the TMVA manual for all the possible options
Now let's add a classification method from our factory
factory->BookMethod(dataloader.get(), TMVA::Types::kBDT, "BDT_R",
"!H:!V:NTrees=250:MinNodeSize=1.0%:MaxDepth=3:BoostType=AdaBoost:AdaBoostBeta=0.5:"
"SeparationType=GiniIndex:nCuts=50");
here you have a lot of parameters to choose and possibly optimize, some of will you should recognize already.
Check section 8.13 of the TMVA manual for a detailed explanation.
You can repeat this step, each time adding a new classifier that will be trained. Just remember to give it a different name (third parameter).
Now let's add a classification method from our factory
factory->BookMethod(dataloader.get(), TMVA::Types::kBDT, "BDT_R",
"!H:!V:NTrees=250:MinNodeSize=1.0%:MaxDepth=3:BoostType=AdaBoost:AdaBoostBeta=0.5:"
"SeparationType=GiniIndex:nCuts=50");
here you have a lot of parameters to choose and possibly optimize, some of will you should recognize already.
Check section 8.13 of the TMVA manual for a detailed explanation.
You can repeat this step, each time adding a new classifier that will be trained. Just remember to give it a different name (third parameter).
Final step: start the training and let TMVA compute the performance of all booked classifiers
// Train MVAs using the set of training events
factory->TrainAllMethods();
// Evaluate all MVAs using the set of test events
factory->TestAllMethods();
// Evaluate and compare performance of all configured MVAs
factory->EvaluateAllMethods();
At the end, the factory will write all the results in the TFile you provided as third argument in the factory constructor.
Inside you will find several directories with useful info.
There will be one "Method_" directory for each classifier you trained, and from there you can access the ROC and other properties of the classifier.
You can run the TMVA GUI to inspect the result of the training
// run in the root interpreter
TMVA::TMVAGui("output_file.root")
You can run the TMVA GUI to inspect the result of the training.
For example:
You can run the TMVA GUI to inspect the result of the training.
For example:
Train your BDTs!
Choose different parameters, try AdaBoost vs Gradient Boosting, or a different impurity function for splitting.
Document what you see and compare the results in your report.
Main goal: compare the ROC curves you get with the one from your homemade likelihood, and try to explain how and why they differ.