Binary classification

Classification

Identify and separate observations into distinct categories.

Examples: tag emails as spam, diagnose a particular disease in a patient, identify defective products in a factory line, discard background events in physics analysis, etc...

Any algorithm or procedure that maps a set of inputs into a discrete value is called a "Classifier"

In general this will be some kind of function of N variables (or features) grouping events into classes, according to the values of the associated variables.

Binary classification: only two categories/classes are considered

Relevant concepts

Confusion matrix:

Summary of the probability of correct and incorrect predictions

	S (predicted)	B (predicted)
S (real)	True positives	False negatives
B (real)	False positives	True negatives

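Type I error: a false positive (a background event classified as signal)

Type II error: a false negative (a signal event classified as background)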
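As a minimal sketch, the four entries of the confusion matrix can be counted from a list of classifier scores and true labels. The names `ConfusionMatrix` and `confusion_matrix`, and the convention "signal if score > threshold", are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counts of the four possible outcomes of a binary classifier.
struct ConfusionMatrix {
  std::size_t true_pos = 0;   // signal predicted as signal
  std::size_t false_neg = 0;  // signal predicted as background (Type II error)
  std::size_t false_pos = 0;  // background predicted as signal (Type I error)
  std::size_t true_neg = 0;   // background predicted as background
};

// An event is predicted as signal when its score is above the threshold.
// is_signal[i] holds the true class of event i.
ConfusionMatrix confusion_matrix(const std::vector<double>& scores,
                                 const std::vector<bool>& is_signal,
                                 double threshold) {
  assert(scores.size() == is_signal.size());
  ConfusionMatrix cm;
  for (std::size_t i = 0; i < scores.size(); ++i) {
    const bool predicted_signal = scores[i] > threshold;
    if (is_signal[i]) {
      if (predicted_signal) ++cm.true_pos; else ++cm.false_neg;
    } else {
      if (predicted_signal) ++cm.false_pos; else ++cm.true_neg;
    }
  }
  return cm;
}
```

Dividing each row by its total (all real signal, all real background) turns the counts into the probabilities shown in the table.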

Receiver Operating Characteristic (ROC)

Only valid for binary classifiers
Shows performance of the classifier for all working points
Can be summarized in a single value (Area Under Curve - AUC) to represent classification power
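One possible way to build the ROC and its AUC from scores and true labels is sketched below. `RocPoint`, `roc_curve` and `area_under_curve` are illustrative names, and the trapezoidal rule is just one simple choice for the integration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// One working point of the classifier.
struct RocPoint {
  double false_pos_rate;  // fraction of background accepted as signal
  double true_pos_rate;   // fraction of signal accepted as signal
};

// Lower the threshold step by step; each distinct score gives one point.
std::vector<RocPoint> roc_curve(const std::vector<double>& scores,
                                const std::vector<bool>& is_signal) {
  assert(scores.size() == is_signal.size());
  const double n_sig = static_cast<double>(
      std::count(is_signal.begin(), is_signal.end(), true));
  const double n_bkg = static_cast<double>(scores.size()) - n_sig;
  assert(n_sig > 0 && n_bkg > 0);

  // Event indices sorted by decreasing score.
  std::vector<std::size_t> order(scores.size());
  std::iota(order.begin(), order.end(), std::size_t{0});
  std::sort(order.begin(), order.end(),
            [&](std::size_t a, std::size_t b) { return scores[a] > scores[b]; });

  std::vector<RocPoint> roc;
  roc.push_back({0.0, 0.0});  // threshold above every score
  double tp = 0.0, fp = 0.0;
  for (std::size_t k = 0; k < order.size(); ++k) {
    if (is_signal[order[k]]) tp += 1.0; else fp += 1.0;
    // Events with equal scores cross the threshold together.
    const bool last_of_group =
        k + 1 == order.size() || scores[order[k + 1]] != scores[order[k]];
    if (last_of_group) roc.push_back({fp / n_bkg, tp / n_sig});
  }
  return roc;
}

// Area under the curve with the trapezoidal rule.
double area_under_curve(const std::vector<RocPoint>& roc) {
  double area = 0.0;
  for (std::size_t i = 1; i < roc.size(); ++i) {
    area += 0.5 * (roc[i].true_pos_rate + roc[i - 1].true_pos_rate) *
            (roc[i].false_pos_rate - roc[i - 1].false_pos_rate);
  }
  return area;
}
```

An AUC of 1 corresponds to perfect separation, while 0.5 is what random guessing would give.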

The dataset

Simulation from the CTA experiment

Goal: reject showers caused by hadrons, while keeping showers from gamma rays

10 features (variables)

We will try two methods

- Likelihood ratio

- Boosted Decision Trees (BDT)

The dataset

You will be given a ROOT file with two TTree objects inside (one for Signal events, one for Background events), each having 10 Branches

❯ root -l magic04.root
root [0]
Attaching file magic04.root as _file0...
(TFile *) 0x555bf47811c0
root [1] .ls
TFile**		magic04.root	
 TFile*		magic04.root	
  KEY: TTree	Signal;1	
  KEY: TTree	Background;1	
root [2] Signal->Print()
******************************************************************************
*Tree    :Signal    :                                                        *
*Entries :    12332 : Total =          501716 bytes  File  Size =     395259 *
*        :          : Tree compression factor =   1.00                       *
******************************************************************************
*Br    0 :fLength   : fLength/F                                              *
*Entries :    12332 : Total  Size=      50149 bytes  All baskets in memory   *
*Baskets :        1 : Basket Size=      32000 bytes  Compression=   1.00     *
*............................................................................*
*Br    1 :fWidth    : fWidth/F                                               *
*Entries :    12332 : Total  Size=      50141 bytes  All baskets in memory   *
*Baskets :        1 : Basket Size=      32000 bytes  Compression=   1.00     *
*............................................................................*
*Br    2 :fSize     : fSize/F                                                *
*Entries :    12332 : Total  Size=      50133 bytes  All baskets in memory   *
*Baskets :        1 : Basket Size=      32000 bytes  Compression=   1.00     *
*............................................................................*
*Br    3 :fConc     : fConc/F                                                *
*Entries :    12332 : Total  Size=      50133 bytes  All baskets in memory   *
*Baskets :        1 : Basket Size=      32000 bytes  Compression=   1.00     *
*............................................................................*
*Br    4 :fConc1    : fConc1/F                                               *
*Entries :    12332 : Total  Size=      50141 bytes  All baskets in memory   *
*Baskets :        1 : Basket Size=      32000 bytes  Compression=   1.00     *
*............................................................................*
*Br    5 :fAsym     : fAsym/F                                                *
*Entries :    12332 : Total  Size=      50133 bytes  All baskets in memory   *
*Baskets :        1 : Basket Size=      32000 bytes  Compression=   1.00     *
*............................................................................*
*Br    6 :fM3Long   : fM3Long/F                                              *
*Entries :    12332 : Total  Size=      50149 bytes  All baskets in memory   *
*Baskets :        1 : Basket Size=      32000 bytes  Compression=   1.00     *
*............................................................................*
*Br    7 :fM3Trans  : fM3Trans/F                                             *
*Entries :    12332 : Total  Size=      50157 bytes  All baskets in memory   *
*Baskets :        1 : Basket Size=      32000 bytes  Compression=   1.00     *
*............................................................................*
*Br    8 :fAlpha    : fAlpha/F                                               *
*Entries :    12332 : Total  Size=      50141 bytes  All baskets in memory   *
*Baskets :        1 : Basket Size=      32000 bytes  Compression=   1.00     *
*............................................................................*
*Br    9 :fDist     : fDist/F                                                *
*Entries :    12332 : Total  Size=      50133 bytes  All baskets in memory   *
*Baskets :        1 : Basket Size=      32000 bytes  Compression=   1.00     *
*............................................................................*

Lesson 01

We will begin with an "exploratory" exercise.

Before attempting to build a classifier it is always a good idea to visualize the dataset and get a feeling of how different the classes are, among the features of the dataset.

Refreshing the basics

Main goals:

Get the fundamentals of classification
Compare different classifiers
Get familiar with C++ and ROOT
(this will mean writing and running most of the code yourself)

You will find the data at ~vformato/2023/data/magic04.root on the course VM

Refreshing the basics

Opening a file

If you just need to read from the file, TFile::Open() is the easiest way.
It returns a TFile*; when you're done, either call TFile::Close on it or delete it.
If you don't, you can sometimes get strange crashes.

TFile* my_file = TFile::Open("filename.root");

// ...

my_file->Close();

Refreshing the basics

Getting objects from files

Prefer using TFile::Get<T> rather than the untyped TFile::Get
(you can avoid a cast, just make sure you use the right type between <>)
You get a raw pointer back, but it will be null if the object is not in the file.
(You have to check that manually)
Objects such as histograms and trees read from a file are owned by it, and are deleted when the file is closed.
HUGE SOURCE OF BUGS

TH1D* histo_from_file = my_file->Get<TH1D>("histo");
// Older alternative (untyped Get plus a cast):
// TH1D* histo_from_file = (TH1D*) my_file->Get("histo");

if (!histo_from_file){
  std::cerr << "Could not find histo on file!\n";
}

Refreshing the basics

Looping over a TTree:

After retrieving the tree from file you need a "placeholder" in memory
You tell the tree where the placeholder for each variable is with SetBranchAddress
Loop over all the entries in the tree and call GetEntry for each of them. Now the placeholders will contain the value of the tree variable for that particular entry.

TTree* my_tree = my_file->Get<TTree>("tree");

int intVar;
float floatVar;

my_tree->SetBranchAddress("intVar", &intVar);
my_tree->SetBranchAddress("floatVar", &floatVar);

size_t n_events = my_tree->GetEntries();
for (size_t iev = 0; iev < n_events; ++iev) {
  my_tree->GetEntry(iev);
  
  // now you can use intVar and/or floatVar
}

Creating objects

ROOT tutorials and most people will teach you to use new to create objects such as histograms and trees.

This works because ROOT has its own internal mechanism for ownership and memory management, but in general using a raw new is not considered good practice.

However, if you follow "good practice" you may run into problems because of how ROOT expects objects to live.

TTree* my_tree = my_file->Get<TTree>("tree");

int intVar;
float floatVar;

my_tree->SetBranchAddress("intVar", &intVar);
my_tree->SetBranchAddress("floatVar", &floatVar);

TH1D* my_histo_i = new TH1D("histo_i", "Title;Xlabel;Ylabel", 100, 0, 100);
TH1D* my_histo_f = new TH1D("histo_f", "Title;Xlabel;Ylabel", 100, 0, 100);

size_t n_events = my_tree->GetEntries();
for (size_t iev = 0; iev < n_events; ++iev) {
  my_tree->GetEntry(iev);
  
  my_histo_i->Fill(intVar);
  my_histo_f->Fill(floatVar);
}

Drawing objects

Histograms and graphs can be drawn on screen using the Draw() method.

It is often useful to do it on a pre-created TCanvas.
This allows you to specify the size of the window, divide the canvas into multiple sub-plots, and even print it as a png/pdf...

// ...
TH1D* my_histo1 = new TH1D("histo1", "Title;Xlabel;Ylabel", 100, 0, 100);
TH1D* my_histo2 = new TH1D("histo2", "Title;Xlabel;Ylabel", 100, 0, 100);

size_t n_events = my_tree->GetEntries();
for (size_t iev = 0; iev < n_events; ++iev) {
  //...
}

TCanvas* my_canvas = new TCanvas("canvas", "My Title", 0, 0, 1024, 600);
my_canvas->Divide(2, 1); // 2 sub-canvas in one line

my_canvas->cd(1);
my_histo1->Draw();

my_canvas->cd(2);
my_histo2->Draw();

my_canvas->Print("my_plot.png");

Goal for today

As a first step try to visualize the distribution of each feature by comparing signal vs background.

You want to get a feeling of which variable is more powerful and which ones might need some normalization / scaling / transformation.

This will still be very useful since it's basically the first step in order to build a likelihood function.

(Bonus points if you manage to do also correlation plots)

Lesson 02/03

Now we will build a likelihood-based classifier

What is a Likelihood

The Likelihood function is defined as the joint probability of observing the simultaneous realization of $N$ random variables, as a function of their p.d.f. parameters. Under the assumption that all observations are i.i.d.:
$$\mathcal L(\mathbf \theta) \equiv \prod_{i=1}^N P (\mathbf x_i; \mathbf \theta) $$

Usually you'd want to find the set of parameters that maximises this function and use it as an estimator for some physical observable(s)

$$\hat \theta = \operatorname*{arg\,max}_\theta \mathcal L (\theta)$$

But it can also be used as a binary (or multi-class) classifier

Likelihood ratio classifier

Suppose we have a set of $N$ features (i.e. variables) $\mathbf x$ and our observations all belong to the set of classes $ \mathcal Y = \{+1, -1\} $ (i.e. signal and background)

Now let us define $f_{+1}$ and $f_{-1}$ as the joint p.d.f. of $\mathbf x$ for the two classes

$$ f_{+1} (\mathbf x) = f(\mathbf x \, | \, Y = +1) \, , \, f_{-1} (\mathbf x) = f(\mathbf x \, | \, Y = -1)$$

also known as class-conditional probability densities

Let us now define the Likelihood ratio as

$$ \lambda = \frac {\mathcal L_{+1}}{\mathcal L_{-1}} \equiv \frac {f_{+1} (\mathbf x)}{f_{-1} (\mathbf x)} $$

and we can assign an event to the signal class if $ \lambda > k $, for a threshold $k$ that we choose in order to reach the desired level of accuracy (and/or purity)

log-likelihood

Now we turn our attention to the conditional probabilities

Even though it is almost never the case, we can start by assuming that all the features are independent of each other, so:
$$ \mathcal L_{\pm 1} = f_{\pm 1} (\mathbf x) =\prod_{i=1}^N P_i^{\pm 1} (x_i; \mathbf \theta) $$

and we can use the marginal p.d.f. of each feature to compute the two likelihoods.

It is usually helpful to work in terms of the log-likelihood:

$$ \log \mathcal L_{\pm 1} = \sum_{i=1}^N \log P_i^{\pm 1} (x_i; \mathbf \theta) $$

which takes values that are easier to handle numerically (a sum of logarithms instead of a product of many small numbers)
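A small sketch of why the logarithm is easier to handle: with many small factors the plain product underflows to zero in double precision, while the sum of logarithms stays finite. The function names and the numbers used to exercise them are purely illustrative.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Plain product of the per-feature probabilities.
double likelihood(const std::vector<double>& p) {
  double product = 1.0;
  for (double p_i : p) product *= p_i;
  return product;
}

// Same quantity in logarithmic form: a sum instead of a product.
double log_likelihood(const std::vector<double>& p) {
  double sum = 0.0;
  for (double p_i : p) {
    assert(p_i > 0.0);  // log(0) is not defined
    sum += std::log(p_i);
  }
  return sum;
}
```

For example, 400 factors of 1e-3 give a product of exactly 0 in double precision, while the log-likelihood is about -2763.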

log-likelihood

How do we proceed on building our estimator?

We split our data into two sets: Training and Validation
The set size is up to you, a simple 50-50 split is often enough
We build the marginal p.d.f. for each feature and each class. A very rudimentary but effective approach is to store them as histograms.
For each event in the Validation sample we then compute the likelihood ratio, and plot its distribution for the two classes.
We can then compute the confusion matrix and the ROC (by varying the threshold $k$ and measuring the true-positive rate vs the false-positive rate for each $k$ value)

The split between two samples allows us to check for overfitting (i.e. if the classifier distributions don't agree between the two samples, we might have problems)
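A random split into Training and Validation could be sketched as follows. `split_indices`, the default 50-50 fraction and the fixed seed are assumptions made for this example; the function returns event indices that can then be used when looping over the tree.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Returns {training indices, validation indices} for n_events events.
std::pair<std::vector<std::size_t>, std::vector<std::size_t>>
split_indices(std::size_t n_events, double train_fraction = 0.5,
              unsigned seed = 42) {
  assert(train_fraction > 0.0 && train_fraction < 1.0);

  // Shuffle 0 .. n_events-1 with a fixed seed so the split is reproducible.
  std::vector<std::size_t> idx(n_events);
  std::iota(idx.begin(), idx.end(), std::size_t{0});
  std::mt19937 rng(seed);
  std::shuffle(idx.begin(), idx.end(), rng);

  // The first part goes to training, the rest to validation.
  const auto n_train =
      static_cast<std::ptrdiff_t>(train_fraction * static_cast<double>(n_events));
  std::vector<std::size_t> train(idx.begin(), idx.begin() + n_train);
  std::vector<std::size_t> valid(idx.begin() + n_train, idx.end());
  return {train, valid};
}
```

Apply the split separately to the Signal and the Background tree, so that both classes are represented in both samples.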

Remember: once you have built the p.d.f.s, each of them must be normalized to unit integral!
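The histogram-based approach can be sketched without ROOT as below (in the exercise a TH1 would play the role of `HistPdf`). The class, the clamping of out-of-range values into the edge bins and the small floor used to avoid log(0) are all assumptions of this example, not a prescribed implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Fixed-range 1D histogram used as a rudimentary p.d.f. estimate.
class HistPdf {
public:
  HistPdf(std::size_t n_bins, double lo, double hi)
      : lo_(lo), hi_(hi), content_(n_bins, 0.0) {
    assert(n_bins > 0 && hi > lo);
  }

  void fill(double x) { content_[bin_of(x)] += 1.0; }

  // Scale the contents so that sum(content * bin_width) == 1.
  void normalize() {
    double total = 0.0;
    for (double c : content_) total += c;
    assert(total > 0.0);
    for (double& c : content_) c /= total * bin_width();
  }

  // Density at x; the floor avoids log(0) in empty bins (ad hoc choice).
  double density(double x, double floor = 1e-6) const {
    return std::max(content_[bin_of(x)], floor);
  }

  double bin_width() const {
    return (hi_ - lo_) / static_cast<double>(content_.size());
  }

private:
  // Values outside [lo, hi) are clamped into the first or last bin.
  std::size_t bin_of(double x) const {
    const double pos = (x - lo_) / bin_width();
    if (pos < 0.0) return 0;
    if (pos >= static_cast<double>(content_.size())) return content_.size() - 1;
    return static_cast<std::size_t>(pos);
  }

  double lo_, hi_;
  std::vector<double> content_;
};

// log(lambda) = sum_i [ log P_i^{+1}(x_i) - log P_i^{-1}(x_i) ]
double log_likelihood_ratio(const std::vector<double>& x,
                            const std::vector<HistPdf>& pdf_sig,
                            const std::vector<HistPdf>& pdf_bkg) {
  assert(x.size() == pdf_sig.size() && x.size() == pdf_bkg.size());
  double log_lambda = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    log_lambda += std::log(pdf_sig[i].density(x[i])) -
                  std::log(pdf_bkg[i].density(x[i]));
  }
  return log_lambda;
}
```

Fill one `HistPdf` per feature and per class from the Training sample, call `normalize()` once, then evaluate `log_likelihood_ratio` on each Validation event.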

Lesson 04

Boosted Decision Trees

Why?

Likelihood classifiers are often not powerful enough.

In addition, they fall short for several reasons:

Curse of high dimensionality: The more variables you have the more data you need to train one. Especially if you don't settle for 1D projections but you want to build a proper multivariate likelihood.
Correlation between variables is hard to account for (basically for the same reason)
They are suboptimal when the separation between signal and background is weak.

Several techniques can perform much better:

Support vector machines, Boosted decision trees, Dense neural networks, etc...

Decision trees

Let's start with a classical cut-based analysis

You might start by choosing a variable, applying a threshold and keeping only events above/below the threshold.

Rinse and repeat with another variable, until you reach the desired purity.

The only problem is that with each variable $i$ you cut on, you select events with some efficiency $\varepsilon_i$, and the total efficiency after all your selections will be

$$ \varepsilon = \prod_i \varepsilon_i $$

which will decrease significantly unless all your cuts have a very high efficiency...
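As a worked example with purely illustrative numbers: five cuts, each keeping 90% of the signal, already give

$$ \varepsilon = 0.9^5 \approx 0.59 $$

and ten such cuts give $0.9^{10} \approx 0.35$, so most of the signal is lost even though each individual cut looks harmless.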

What if instead of rejecting events if they fail one cut, we continue analyzing them?

Building a decision tree

Start with the "root" node, it will contain all our events. Then

Check if some "stopping condition" applies
For each variable, create a list of all the events in the node, sorted along that variable
For each list find the optimal splitting point that maximizes sig/bkg separation. If no split improves the separation declare the node a "leaf" and stop.
Select the variable whose best split gives the largest separation and create two child nodes (one with events that pass, one with events that fail)
Iterate this procedure on each node

Building a decision tree

This algorithm is "greedy", building on locally optimal choices regardless of the overall result.

At each node all variables are always considered. Even if already used for splitting another node.

DTs are "human readable". Just a bunch of selections.

To evaluate the tree, follow the cuts depending on each variable value until you reach a leaf node. The result could either be the leaf purity, or a binary decision based on some purity ($s/(s+b)$) threshold.
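Evaluating a tree can be sketched as a simple walk from the root to a leaf. The `Node` layout (variable index, threshold, two children, leaf purity) and the convention "pass if x > threshold" are assumptions chosen for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

struct Node {
  // Internal node: events with x[variable] > threshold go to `pass`,
  // all the others go to `fail`.
  std::size_t variable = 0;
  double threshold = 0.0;
  std::unique_ptr<Node> pass;
  std::unique_ptr<Node> fail;

  // Leaf node: signal purity s / (s + b) of the training events in it.
  double purity = 0.5;

  bool is_leaf() const { return !pass && !fail; }
};

// Follow the cuts until a leaf is reached and return its purity.
double evaluate(const Node& root, const std::vector<double>& x) {
  const Node* current = &root;
  while (!current->is_leaf()) {
    assert(current->pass && current->fail);  // internal nodes have two children
    assert(current->variable < x.size());
    current = x[current->variable] > current->threshold ? current->pass.get()
                                                        : current->fail.get();
  }
  return current->purity;
}
```

A binary decision is then just `evaluate(root, x) > purity_threshold`, with a threshold of your choice.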

Decision tree parameters

There are several parameters involved in building a decision tree

Signal/background normalization, this is basically the relative amount (or weighted relative amount) of signal events and background events. A sample with a 50/50 population is said to be "balanced".
However, this affects only the top nodes of the tree since deeper nodes tend to be more balanced anyway (after all the easy cuts are discovered, the more "difficult" ones remain)
Minimum leaf size
To reduce the impact of statistical fluctuations you might choose to avoid splitting a node if the number of events is below a given threshold
Maximum tree depth
To reduce tree complexity and mitigate overfitting

How to split a tree? 🪓

Our choice for splitting points should be based on some solid principles. For example, we might choose to split depending on the value of some function, related to the node impurity. Such a function should be:

Maximal if the node is balanced (50% signal, 50% background)
Minimal if the node only contains events from a single class
Symmetrical w.r.t. signal and background
Strictly concave (should favor purer nodes)

Then we can quantify how much the separation improves after a split by evaluating the "impurity decrease" for a split $S$ on a node $t$

$$ \Delta i(S,t) = i(t) - p_P i(t_P) - p_F i(t_F)$$

where $p_{P/F}$ is the fraction of events that pass/fail the split condition. Our goal then reduces to finding the optimal split $S^*$ such that

$$ \Delta i(S^*,t) = \max_{S \in \{\text{splits}\}} \Delta i(S, t) $$
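The search for the optimal split on a single variable could look like the sketch below. It assumes the Gini index $2p(1-p)$ as impurity function, unweighted events, and "pass" meaning $x$ above the threshold; `Event`, `Split` and `best_split` are illustrative names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Event {
  double x;        // value of the variable being scanned
  bool is_signal;  // true class
};

struct Split {
  double threshold;
  double gain;  // impurity decrease, 0 if no split helps
};

// Gini index 2p(1-p) of a node with the given signal/background counts.
double gini(double n_sig, double n_bkg) {
  const double n = n_sig + n_bkg;
  if (n == 0.0) return 0.0;
  const double p = n_sig / n;
  return 2.0 * p * (1.0 - p);
}

// Try a threshold halfway between each pair of consecutive distinct values
// and keep the one with the largest i(t) - p_P i(t_P) - p_F i(t_F).
Split best_split(std::vector<Event> events) {
  assert(!events.empty());
  std::sort(events.begin(), events.end(),
            [](const Event& a, const Event& b) { return a.x < b.x; });

  double n_sig = 0.0, n_bkg = 0.0;
  for (const Event& e : events) (e.is_signal ? n_sig : n_bkg) += 1.0;
  const double n = n_sig + n_bkg;
  const double parent = gini(n_sig, n_bkg);

  Split best{events.front().x, 0.0};
  double s_fail = 0.0, b_fail = 0.0;  // events below the threshold
  for (std::size_t i = 0; i + 1 < events.size(); ++i) {
    (events[i].is_signal ? s_fail : b_fail) += 1.0;
    if (events[i].x == events[i + 1].x) continue;  // no cut between equal values
    const double s_pass = n_sig - s_fail;
    const double b_pass = n_bkg - b_fail;
    const double p_fail = (s_fail + b_fail) / n;
    const double p_pass = 1.0 - p_fail;
    const double gain = parent - p_pass * gini(s_pass, b_pass) -
                        p_fail * gini(s_fail, b_fail);
    if (gain > best.gain) best = {0.5 * (events[i].x + events[i + 1].x), gain};
  }
  return best;
}
```

Running this for every variable and keeping the largest `gain` gives the split for the node; a `gain` of 0 means the node becomes a leaf.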

How to split a tree? 🪓

Some common choices for the impurity function:

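With $p$ the signal purity of the node ($p_s = p$, $p_b = 1 - p$):

Misclassification error $i(t) = 1 - \max(p, 1-p)$
Cross entropy $i(t) = -\sum_{j=s,b} p_j \log p_j$
Gini index $i(t) = 1 - \sum_{j=s,b} p_j^2 = 2p(1-p)$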
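The three impurity functions are short enough to write down directly; here is a sketch as functions of the signal purity $p$. The function names are illustrative, and the Gini index is written as $2p(1-p)$ (a different overall factor would not change which split is optimal).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// p is the signal purity s / (s + b) of the node, between 0 and 1.

double misclassification_error(double p) {
  assert(p >= 0.0 && p <= 1.0);
  return 1.0 - std::max(p, 1.0 - p);
}

double cross_entropy(double p) {
  assert(p >= 0.0 && p <= 1.0);
  // Convention: 0 * log(0) = 0, so pure nodes have zero impurity.
  auto term = [](double q) { return q > 0.0 ? -q * std::log(q) : 0.0; };
  return term(p) + term(1.0 - p);
}

double gini_index(double p) {
  assert(p >= 0.0 && p <= 1.0);
  return 2.0 * p * (1.0 - p);
}
```

All three are maximal at $p = 0.5$, vanish for pure nodes and are symmetric under $p \leftrightarrow 1-p$, as required above.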

Variables

Decision trees are particularly good at dealing with a high number of variables, and are resilient to many problems that affect other classifiers

They are less sensitive to the "curse of dimensionality". Training time $\propto nN \log N$ where $n$ is the number of variables and $N$ the number of events.
They are invariant under monotone transformation of any variable
They are immune to the presence of duplicate variables
They can handle both continuous and discrete variables
They can handle correlation between variables, even if suboptimally

That being said, they have shortcomings as well, for example:

They are very sensitive to the training sample composition (adding just a few events might result in completely different splits)

Ensemble learning

Trees can overfit, like any classifier, and there are several methods to mitigate this

Early stopping
Already discussed, though stopping early might prevent further improvement
Pruning
Remove overly specialized branches to mitigate overfitting, does not help with training stability
Ensemble learning
Helps with stability: train several weak classifiers and leverage their collective results to build a more powerful classifier.

Several kinds of ensemble learning for DTs:

Bagging (train each tree on a bootstrap resampling of the training sample)
Random forests (bagging, with each split additionally restricted to a random subset of the variables)
Boosting

Boosting

Boosting attempts to build trees that are progressively more specialized in classifying previously misclassified events.

Let's take a look at the first implementation as an example:

Train tree $T_1$ on a sample with $N$ events
Train tree $T_2$ on a new sample with $N$ events where half were misclassified by $T_1$
Train tree $T_3$ on a sample with events where $T_1$ and $T_2$ disagree
Take the majority vote between $T_1$, $T_2$, $T_3$ as the resulting classifier

This initial idea can be further generalized

Boosting

Consider a number $N_\text{tree}$ of classifiers, where the $k$-th tree has been trained on sample $\mathcal T_k$ with $N_k$ events. Each event has a weight $w^k_i$ and variables $\mathbf x_i$, with class $y_i$.

Now, for each sample $k$

Train classifier $T_k$ on sample $\mathcal T_k$
Compute weight $\alpha_k$ for classifier $T_k$
"Transform" sample $\mathcal T_k$ into sample $\mathcal T_{k+1}$

The final classifier output will be a function $F(T_1, ..., T_{N_\text{tree}})$, typically a weighted average

$$ \lambda_i = \sum_{k=1}^{N_\text{tree}} \alpha_k T_k (\mathbf x_i) $$

and will be an almost-continuous variable

AdaBoost

Take tree $T_k$ and let us write the misclassification for event $i$ as

$$ m_i^k = \mathcal I (y_i T_k(\mathbf x_i) < 0) $$

or (in case the tree output is the purity)

$$m_i^k = \mathcal I (y_i [T_k(\mathbf x_i)-0.5] < 0) $$

where $\mathcal I(X)$ is 1 if $X$ is true and 0 otherwise.

This way we can define the misclassification rate as

$$ R(T_k) = \varepsilon_k = \frac{\sum_{i=1}^{N_k} w_i^k m_i^k}{\sum_{i=1}^{N_k} w_i^k} $$

and assign the weight to the tree

$$ \alpha_k = \beta \log \frac{1 - \varepsilon_k}{\varepsilon_k} $$

where $\beta$ is a free parameter (often called learning rate)

AdaBoost

Now we can create the sample for the next tree by just adjusting the weight of each event

$$ w_i^k \rightarrow w_i^{k+1} = w_i^k e^{\alpha_k m_i^k} $$

Note how the weight increases only for misclassified events! Trees become increasingly focused on previously misclassified events and will try harder to classify them correctly

The final output will then be

$$ \lambda_i = \frac{1}{\sum_k \alpha_k} \sum_k \alpha_k T_k (\mathbf x_i) $$
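The AdaBoost formulas above translate almost line by line into code. In this sketch `adaboost_update` and `boosted_output` are illustrative names, `misclassified[i]` plays the role of $m_i^k$, and the default $\beta = 1$ is just a choice made for the example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One boosting step: computes epsilon_k and alpha_k for the current tree and
// updates the event weights in place. Returns alpha_k.
double adaboost_update(std::vector<double>& weights,
                       const std::vector<bool>& misclassified,
                       double beta = 1.0) {
  assert(weights.size() == misclassified.size());
  double sum_w = 0.0, sum_wrong = 0.0;
  for (std::size_t i = 0; i < weights.size(); ++i) {
    sum_w += weights[i];
    if (misclassified[i]) sum_wrong += weights[i];
  }
  assert(sum_w > 0.0);
  const double epsilon = sum_wrong / sum_w;
  // The tree must be better than a coin flip, and not perfect.
  assert(epsilon > 0.0 && epsilon < 0.5);

  const double alpha = beta * std::log((1.0 - epsilon) / epsilon);
  for (std::size_t i = 0; i < weights.size(); ++i) {
    if (misclassified[i]) weights[i] *= std::exp(alpha);
  }
  return alpha;
}

// Normalized weighted average of the individual tree outputs for one event.
double boosted_output(const std::vector<double>& alphas,
                      const std::vector<double>& tree_outputs) {
  assert(alphas.size() == tree_outputs.size() && !alphas.empty());
  double sum_alpha = 0.0, weighted = 0.0;
  for (std::size_t k = 0; k < alphas.size(); ++k) {
    sum_alpha += alphas[k];
    weighted += alphas[k] * tree_outputs[k];
  }
  assert(sum_alpha > 0.0);
  return weighted / sum_alpha;
}
```

With $\beta = 1$ the update makes the misclassified events carry exactly half of the total weight seen by the next tree: one wrong event out of four equally weighted ones ends up with weight 3 against three events of weight 1.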

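Interesting side note: the misclassification rate $\varepsilon$ of the combined classifier on the training sample is bounded by

$$ \varepsilon \leq \prod_k 2 \sqrt {\varepsilon_k (1 - \varepsilon_k)} $$

Every factor is smaller than 1 as long as $\varepsilon_k$ stays away from 0.5, so the bound goes to zero with $N_\text{tree} \rightarrow \infty$: the training sample ends up perfectly classified, which means overfitting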

Gradient boosting

Another way to approach the same problem is to turn it into a minimization problem, where adding trees will go towards decreasing a chosen loss function

Let's take the classifier at step $k$, $T_k$, we aim now to improve it incrementally

$$ T_{k+1} (\mathbf x) = T_k(\mathbf x) + h(\mathbf x) $$

Now, instead of training a new classifier $T_{k+1}$ from scratch, we choose to train another classifier, specialized in fitting the residual $h(\mathbf x) = y - T_k(\mathbf x)$

This particular formulation can be seen as minimizing a quadratic loss function of the form

$$ L(\mathbf x, y) = \frac{1}{2} (y - T(\mathbf x))^2 $$

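in fact

$$ \frac{\partial L}{\partial T(\mathbf x)} = T(\mathbf x) - y $$

so the residual $y - T(\mathbf x)$ is exactly the negative gradient of the loss with respect to the classifier output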

It can be shown that AdaBoost can be recovered as a special case in which

$$ L(\mathbf x, y) = e^{-y T(\mathbf x)}$$
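To make the residual-fitting idea concrete, here is a toy sketch of gradient boosting with the quadratic loss on a single variable. The weak learner is a one-cut regression "stump" rather than a full tree, and the names `Stump`, `fit_stump`, `BoostedModel`, `gradient_boost` and the learning rate are assumptions of this example.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Weak learner: a single cut on one variable with a constant output per side.
struct Stump {
  double threshold = 0.0;
  double below = 0.0;  // output for x <= threshold
  double above = 0.0;  // output for x > threshold
  double operator()(double x) const { return x > threshold ? above : below; }
};

// Least-squares fit of a stump to the targets r, trying every x as threshold.
Stump fit_stump(const std::vector<double>& x, const std::vector<double>& r) {
  assert(x.size() == r.size() && !x.empty());
  Stump best;
  double best_sse = std::numeric_limits<double>::infinity();
  for (double t : x) {
    double sum_below = 0.0, sum_above = 0.0;
    std::size_t n_below = 0, n_above = 0;
    for (std::size_t i = 0; i < x.size(); ++i) {
      if (x[i] > t) { sum_above += r[i]; ++n_above; }
      else          { sum_below += r[i]; ++n_below; }
    }
    Stump s;
    s.threshold = t;
    s.below = n_below ? sum_below / static_cast<double>(n_below) : 0.0;
    s.above = n_above ? sum_above / static_cast<double>(n_above) : 0.0;
    double sse = 0.0;  // sum of squared errors of this stump
    for (std::size_t i = 0; i < x.size(); ++i) {
      const double d = r[i] - s(x[i]);
      sse += d * d;
    }
    if (sse < best_sse) { best_sse = sse; best = s; }
  }
  return best;
}

struct BoostedModel {
  double learning_rate;
  std::vector<Stump> stumps;
  double operator()(double x) const {
    double sum = 0.0;
    for (const Stump& h : stumps) sum += learning_rate * h(x);
    return sum;
  }
};

// Each new stump is fitted to the residuals y - T_k(x) of the current model.
BoostedModel gradient_boost(const std::vector<double>& x,
                            const std::vector<double>& y,
                            std::size_t n_trees, double learning_rate) {
  assert(x.size() == y.size());
  BoostedModel model{learning_rate, {}};
  std::vector<double> prediction(x.size(), 0.0), residual(x.size());
  for (std::size_t k = 0; k < n_trees; ++k) {
    for (std::size_t i = 0; i < x.size(); ++i) residual[i] = y[i] - prediction[i];
    const Stump h = fit_stump(x, residual);
    for (std::size_t i = 0; i < x.size(); ++i) prediction[i] += learning_rate * h(x[i]);
    model.stumps.push_back(h);
  }
  return model;
}
```

With labels $y = \pm 1$, the sign of the model output can be used as the class decision; a learning rate below 1 makes each step smaller and the convergence more gradual.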

Detailed treatment: here

Lecture 04b

We will use the TMVA framework inside ROOT to build our tree(s). It will help us automate a lot of what we've seen so far, so that we don't have to worry about implementing everything from scratch this time.

Lecture 04b

Let's see how we can train a BDT. We start with loading the TMVA library and creating a factory object. Then we create a "dataloader".

TMVA::Tools::Instance();

auto factory = std::make_unique<TMVA::Factory>(
    "StatExam", output_tfile,
    "!V:!Silent:Color:DrawProgressBar:Transformations=I;D;P;G,D:AnalysisType=Classification");
    
auto dataloader = std::make_unique<TMVA::DataLoader>("dataset");

Lecture 04b

We then need to inform the dataloader about the variables in our dataset

// for each variable:
dataloader->AddVariable(variable_name, variable_title, "", 'F');

You can even define new variables as combination of existing ones

dataloader->AddVariable("var1 + (var2 / 3)", "Combined variable", "", 'F');

then, add the two trees to the dataloader and let it "prepare" the training and validation sets

dataloader->AddSignalTree(tree_sig);
dataloader->AddBackgroundTree(tree_bkg);

dataloader->PrepareTrainingAndTestTree("", "SplitMode=Alternate:NormMode=NumEvents:!V");

the parameters you see manage how the two sets are created:

You can apply a "selection cut" (empty in the example)
We are choosing to alternately assign each event to training or validation, in order, and to split the sets in half, keeping the original signal/background balance.

Check Section 3.1.4 of the TMVA manual for all the possible options

Lecture 04b

Now let's add a classification method to our factory

factory->BookMethod(dataloader.get(), TMVA::Types::kBDT, "BDT_R",
                    "!H:!V:NTrees=250:MinNodeSize=1.0%:MaxDepth=3:BoostType=AdaBoost:AdaBoostBeta=0.5:"
                    "SeparationType=GiniIndex:nCuts=50");

here you have a lot of parameters to choose and possibly optimize, some of which you should already recognize.

Check section 8.13 of the TMVA manual for a detailed explanation.

You can repeat this step, each time adding a new classifier that will be trained. Just remember to give it a different name (third parameter).

Final step: start the training and let TMVA compute the performance of all booked classifiers

// Train MVAs using the set of training events
factory->TrainAllMethods();

// Evaluate all MVAs using the set of test events
factory->TestAllMethods();

// Evaluate and compare performance of all configured MVAs
factory->EvaluateAllMethods();

Lecture 04b

At the end, the factory will write all the results to the TFile you provided as second argument in the factory constructor.

Inside you will find several directories with useful info.

There will be one "Method_" directory for each classifier you trained, and from there you can access the ROC and other properties of the classifier.

Lecture 04b

You can run the TMVA GUI to inspect the result of the training

// run in the root interpreter
TMVA::TMVAGui("output_file.root")

For example:

visualize the training set
visualize correlations in the training set
check the results of your training
visualize each BDT

Lecture 04b

Train your BDTs!

Choose different parameters, try AdaBoost vs Gradient Boosting, or a different impurity function for splitting.

Document what you see and compare the results in your report.

Main goal: compare the ROC curves you get with the one from your homemade likelihood, and try to explain how and why they differ.