Ntuples for AMS-Italy Analysis
Introduction to the framework and data-format
# MOTIVATION
Resource optimization
With multiple groups each producing their own set of ntuples, a lot of data is replicated on disk, which wastes resources.
Groups also compete for computing resources for ntuple production.
Code exchange
Using the same data makes it easier to exchange selections and algorithms/procedures.
Reproducibility / readability
Custom data formats are most often produced in a custom way, with custom processing.
Additionally, in many cases, only the code owner can easily understand what's going on in the analysis code.
# PRINCIPLES
# STARTING
Requirements:
This mainly applies if you want to install NAIA on your personal machine. For distributed use (CNAF / CERN), all requirements and NAIA binaries are distributed via CVMFS
/cvmfs/ams.cern.ch/Offline/amsitaly/public/install/x86_64-centos7-gcc9.3/naia
and the correct environment can be set up with a dedicated script
/cvmfs/ams.cern.ch/Offline/amsitaly/public/install/x86_64-centos7-gcc9.3/naia/v1.0.0/setenvs/setenv_gcc6.26_cc7.sh
# STARTING
If you are building NAIA on your machine, the installation is quite easy:
# clone NAIA code
git clone ssh://git@gitlab.cern.ch:7999/ams-italy/naia.git -b v1.0.1 # (clone via SSH)
# setup build and final install directories
mkdir naia.build naia.install
# build NAIA
cd naia.build
cmake ../naia -DCMAKE_INSTALL_PREFIX=../naia.install
make all install
To use the NAIA ntuples your project will need:
# DATA MODEL
Our data model starts with the NAIAChain object
This is the main way to open a NAIA rootfile: it takes care of loading all the relevant TTrees and setting up what we call the "read-on-demand" mechanism (more on this later).
Example:
// ...
#include "Chain/NAIAChain.h"

int main(int argc, char const *argv[]) {
  // Create a chain object
  NAIA::NAIAChain chain;
  // add one (or more) file to it
  chain.Add("somefile.root");
  // setup the read-on-demand mechanism // N.B: important and mandatory!
  chain.SetupBranches();
}
# DATA MODEL
Once your chain is created and ready to use, you can easily loop over all the events in the chain, with the help of the Event class
// ...
#include "Chain/NAIAChain.h"

int main(int argc, char const *argv[]) {
  NAIA::NAIAChain chain;
  chain.Add("somefile.root");
  chain.SetupBranches();

  // Event loop!
  // (Event lives in the NAIA namespace, so it is qualified here)
  for (NAIA::Event& event : chain){
    // your analysis here :)
  }
}
(you can use the NAIAChain::GetEvent() method for index-based looping, if needed)
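To picture how the two looping styles relate, here is a minimal, self-contained sketch. It does not use NAIA: ChainSketch, EventSketch and the two Sum helpers are hypothetical stand-ins, and the only assumption is that the range-based loop is a thin wrapper around an index-based GetEvent call.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical event type, standing in for a real event.
struct EventSketch {
  int run;
};

// Hypothetical chain: owns some events and exposes both access styles.
class ChainSketch {
public:
  explicit ChainSketch(std::vector<EventSketch> events) : m_events(std::move(events)) {}

  std::size_t GetEntries() const { return m_events.size(); }
  EventSketch &GetEvent(std::size_t i) { return m_events.at(i); }

  // Minimal iterator: dereferencing simply forwards to GetEvent.
  class Iterator {
  public:
    Iterator(ChainSketch *chain, std::size_t index) : m_chain(chain), m_index(index) {}
    EventSketch &operator*() { return m_chain->GetEvent(m_index); }
    Iterator &operator++() {
      ++m_index;
      return *this;
    }
    bool operator!=(const Iterator &other) const { return m_index != other.m_index; }

  private:
    ChainSketch *m_chain;
    std::size_t m_index;
  };

  Iterator begin() { return Iterator(this, 0); }
  Iterator end() { return Iterator(this, GetEntries()); }

private:
  std::vector<EventSketch> m_events;
};

// Range-based loop over the chain.
inline int SumRunsRangeBased(ChainSketch &chain) {
  int sum = 0;
  for (EventSketch &event : chain)
    sum += event.run;
  return sum;
}

// Equivalent index-based loop.
inline int SumRunsIndexBased(ChainSketch &chain) {
  int sum = 0;
  for (std::size_t i = 0; i < chain.GetEntries(); ++i)
    sum += chain.GetEvent(i).run;
  return sum;
}
```

Both helpers visit the same events in the same order; the index-based form is mainly useful when you need the entry number itself or want to process only a sub-range.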
# DATA MODEL
NAIA also provides a simple way of skimming a chain and saving only interesting events in the output file
// ...
#include "Chain/NAIAChain.h"

int main(int argc, char const *argv[]) {
  NAIA::NAIAChain chain;
  chain.Add("somefile.root");
  chain.SetupBranches();

  auto handle = chain.CreateSkimTree("skimmed.root", "");

  // Event loop!
  for (NAIA::Event& event : chain){
    // is_interesting is a placeholder for your own selection
    if( is_interesting(event) ){
      handle.Fill();
    }
  }

  handle.Write();
}
We create a SkimTreeHandle from the original chain
The second argument of CreateSkimTree specifies the branches we're not interested in saving, as a semicolon-separated list
We fill it when we find an interesting event
When we're done we write it to disk
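As an illustration of the exclusion-list format only, the sketch below splits a semicolon-separated string into individual branch names. It is not taken from NAIA: SplitBranchList is a hypothetical helper, and the branch names in the usage note are made up.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split "A;B;C" into {"A", "B", "C"}, skipping empty entries.
// An empty string therefore yields an empty list (nothing excluded).
inline std::vector<std::string> SplitBranchList(const std::string &list) {
  std::vector<std::string> names;
  std::istringstream stream(list);
  std::string name;
  while (std::getline(stream, name, ';')) {
    if (!name.empty())
      names.push_back(name);
  }
  return names;
}
```

With this helper, a string such as "TofPlusData;EcalPlusData" (hypothetical names) gives two entries, while the empty string used in the skimming example gives none.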
# DATA MODEL
The Event class is probably the most important one, but also the most boring since it's basically a proxy class containing a collection of Containers
Containers are the real building blocks of the NAIA datamodel.
Each container is associated with a single branch in the main TTree and reads the corresponding branch data only when first accessed.
(This means that if you never use a particular container in your analysis, you’ll never read the corresponding data from file)
# DATA MODEL
Container is the general term to define a class in the NAIA data model that groups several variables, according to specific criteria (e.g. all the variables related to the TOF).
Most containers come in two variants: the Base and the Plus variant
The Base variant contains variables that are accessed by almost every analysis or that are accessed most often
The Plus variant contains variables that won't be needed by everyone, or may be needed less frequently
The important thing is that you always access variables using the -> operator on the container. This is how read-on-demand is implemented; accessing the variables in any other way leads to wrong results.
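To see why going through -> matters, here is a minimal sketch of a lazily loaded container. It is not NAIA's implementation: LazyContainer, TofDataSketch, NextEvent and the loader callback are hypothetical names, and the "branch read" is simulated by a function that counts how often it is called.

```cpp
#include <functional>
#include <utility>

// Hypothetical payload standing in for the data of one branch.
struct TofDataSketch {
  float beta;
};

// Hypothetical lazy wrapper: the payload is loaded on first use of operator->.
template <class Data> class LazyContainer {
public:
  explicit LazyContainer(std::function<Data()> loader) : m_loader(std::move(loader)) {}

  // Would be called once per event by the chain: invalidates the cached data.
  void NextEvent() { m_loaded = false; }

  // Every read goes through here, so the load can be triggered on demand.
  Data *operator->() {
    if (!m_loaded) {
      m_data = m_loader();
      m_loaded = true;
    }
    return &m_data;
  }

private:
  std::function<Data()> m_loader;
  Data m_data{};
  bool m_loaded = false;
};

// Counter for the simulated branch reads.
inline int &ReadCount() {
  static int count = 0;
  return count;
}

// Build a container whose loader pretends to read a branch from file.
inline LazyContainer<TofDataSketch> MakeTofContainer() {
  return LazyContainer<TofDataSketch>([] {
    ++ReadCount();
    return TofDataSketch{0.98f};
  });
}
```

In this sketch the data is read at most once per event, and never if the container is not touched. Any access path that bypasses operator-> would see stale or default-initialized data, which is the kind of wrong result the slide warns about.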
# DATA MODEL
Variables in NAIA are a bit more complex, for valid reasons:
We want our data model to be as light as possible (especially since we're processing and saving every single event)
This implies that if something's missing we don't want to write anything to disk
We achieve this by using associative containers (mostly std::map) and realizing that in many cases there are patterns we can exploit.
Example: several variables come in "flavors" or are computed by different reconstructions
# DATA MODEL
Doing AMS analysis means constantly dealing with "one value for each X type", where X could be a charge reconstruction method, track fitting algorithm, ECAL BDT estimator, and so on…
Example: For tracker hits, there are three available charge reconstruction methods: STD, Hu Liu, Yi Jia.
In general, these reconstructions are not always guaranteed to succeed
We handle those by defining the following:
// one number per charge reconstruction type
template <class T> using TrackChargeVariable = std::map<TrTrack::ChargeRecoType, T>;
Where TrTrack::ChargeRecoType is an enum with three values
enum ChargeRecoType {
  STD, ///< Standard tracker charge reconstruction
  HL,  ///< Hu Liu reconstruction
  YJ,  ///< Yi Jia reconstruction
};
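The effect of "not writing anything when something is missing" can be sketched with a plain std::map keyed by an enum. The enum and alias below are standalone copies made for illustration (declared here as an enum class outside any TrTrack scope), and GetCharge and MakeExampleCharge are hypothetical helpers, not part of NAIA.

```cpp
#include <map>

// Standalone copy of the enum from the slide, for illustration only.
enum class ChargeRecoType { STD, HL, YJ };

// One value per charge reconstruction type, as on the slide.
template <class T> using TrackChargeVariable = std::map<ChargeRecoType, T>;

// Hypothetical helper: returns true and fills `value` only if the requested
// reconstruction is present. It uses find, so it never inserts a new key
// (std::map::operator[] would insert a default-constructed value instead).
template <class T>
bool GetCharge(const TrackChargeVariable<T> &var, ChargeRecoType type, T &value) {
  auto it = var.find(type);
  if (it == var.end())
    return false;
  value = it->second;
  return true;
}

// Example variable where the HL reconstruction failed and was not stored.
inline TrackChargeVariable<float> MakeExampleCharge() {
  return {{ChargeRecoType::STD, 2.01f}, {ChargeRecoType::YJ, 1.98f}};
}
```

A failed reconstruction simply has no entry in the map, so it costs nothing on disk, and the reader has to check for the key before using the value.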
# DATA MODEL
Why an enum?!?
Well... which is more readable and understandable?
float inner_charge = event.trTrackBase->InnerCharge[2];
or
float inner_charge = event.trTrackBase->InnerCharge[TrTrack::ChargeRecoType::YJ];
Readability debates aside, this avoids the confusion usually brought by magic numbers (you might not remember after a few weeks that Yi Jia reconstruction is at index 2, for example)
In addition, if this ever changes in the future and it is moved to index 3, you won't have to modify your code in the second case.
For this reason almost every variable structure in NAIA is accessed using specific enums.
# DATA MODEL
Some variables have nested structures, for example in TrTrackPlus we have:
///< Track hit charge (X and Y-side) for each layer, for each charge reconstruction.
LayerVariable<TrackChargeVariable<TrackSideVariable<float>>> LayerCharge;
where for each layer, then for each reconstruction, then for each side we store a number.
But it is not guaranteed that the track will have a hit on, say, Layer 1. Or that the underlying cluster is correctly identified on the X side. How do we check for this?
For this, there is a dedicated ContainsKeys function, which checks if the desired elements (identified by some keys, i.e. the aforementioned enums) exist in the structure
if (ContainsKeys(event.trTrackPlus->LayerCharge, layer_idx, TrTrack::ChargeRecoType::YJ, TrTrack::Side::X)) {
  // do stuff...
}
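One way such a check can be written for nested maps is a small variadic template that consumes one key per nesting level. The sketch below is an assumption about the general technique, not NAIA's actual ContainsKeys: the function name ContainsKeysSketch, the two enums, the LayerChargeSketch alias and the layer numbering are all hypothetical.

```cpp
#include <map>

// Base case: no keys left to check, so the element exists.
template <class Value> bool ContainsKeysSketch(const Value &) { return true; }

// Recursive case: look up the first key, then descend into the nested map.
template <class Key, class Value, class FirstKey, class... OtherKeys>
bool ContainsKeysSketch(const std::map<Key, Value> &container, FirstKey first, OtherKeys... others) {
  auto it = container.find(static_cast<Key>(first));
  if (it == container.end())
    return false;
  return ContainsKeysSketch(it->second, others...);
}

// Hypothetical stand-ins for the NAIA enums and nested variable type.
enum class ChargeRecoSketch { STD, HL, YJ };
enum class SideSketch { X, Y };
using LayerChargeSketch =
    std::map<unsigned int, std::map<ChargeRecoSketch, std::map<SideSketch, float>>>;

// Example: on layer 1 only the Y side has a YJ charge; nothing else is stored.
inline LayerChargeSketch MakeExampleLayerCharge() {
  LayerChargeSketch charge;
  charge[1][ChargeRecoSketch::YJ][SideSketch::Y] = 2.0f;
  return charge;
}
```

Because the lookup uses find at every level, checking for a missing element never inserts empty entries into the structure, which operator[] would do.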
The full list of variable structures can be found in the NAIA documentation.
# DATA MODEL
One quick way of discarding uninteresting events without reading almost anything is by using the event mask.
The mask is simply a bitmask where every bit represents a particular Category. If the event satisfies a given Category, the corresponding bit in the mask will be set.
// ...
#include "Chain/NAIAChain.h"

int main(int argc, char const *argv[]) {
  NAIA::NAIAChain chain;
  chain.Add("somefile.root");
  chain.SetupBranches();

  auto handle = chain.CreateSkimTree("skimmed.root", "");

  // Event loop!
  for (NAIA::Event& event : chain){
    // keep only events flagged as charge 1 by both TOF and Tracker
    if( event.CheckMask(NAIA::Category::Charge1_Tof | NAIA::Category::Charge1_Trk) ){
      handle.Fill();
    }
  }

  handle.Write();
}
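The bitmask logic behind such a check can be sketched in a few lines. This is an illustration only: CategorySketch, its bit positions and CheckMaskSketch are hypothetical, and the sketch assumes that all requested bits must be set for the check to pass (the class documentation is the reference for the actual behaviour of CheckMask).

```cpp
#include <cstdint>

// Hypothetical categories: each one owns a single bit. Names and bit
// positions are illustrative and do not reflect the real NAIA::Category.
enum class CategorySketch : std::uint32_t {
  Charge1_Tof = 1u << 0,
  Charge1_Trk = 1u << 1,
  Charge2_Tof = 1u << 2,
};

// Combine two categories into a mask.
constexpr std::uint32_t operator|(CategorySketch a, CategorySketch b) {
  return static_cast<std::uint32_t>(a) | static_cast<std::uint32_t>(b);
}

// True only if every bit requested in `required` is set in `event_mask`.
constexpr bool CheckMaskSketch(std::uint32_t event_mask, std::uint32_t required) {
  return (event_mask & required) == required;
}
```

The check is a single AND and comparison on an integer, which is why a mask-based preselection can discard events without reading any of the heavier containers.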
# DATA MODEL
To do an analysis you need more than Events: you also need information such as the livetime, or the number of generated MC events...
Each NAIA file contains two additional trees just for that
One contains data about the ISS position, its orientation, and physical quantities connected to them, as well as some time-averaged data about the run itself. This kind of data is usually retrieved in AMS analysis from the RTI (Real Time Information) database. This database stores data with a time granularity of one second, and it can be accessed using the gbatch library.
We don't want any dependency on gbatch so the entire RTI database is converted to a TTree that has only one branch, which contains objects of the RTIInfo class, one for each second of the current run.
// inside the event loop
// Get the RTI info for the current event
NAIA::RTIInfo &rti_info = chain.GetEventRTIInfo();
# DATA MODEL
The tree can be accessed from outside the event loop as well
TChain* rti_chain = chain.GetRTITree();
NAIA::RTIInfo* rti_info = new NAIA::RTIInfo();
rti_chain->SetBranchAddress("RTIInfo", &rti_info);
for(unsigned long long isec=0; isec < rti_chain->GetEntries(); ++isec){
  rti_chain->GetEntry(isec);
  // your analysis here :)
}
This is useful if you only need to recompute the livetime or want to analyse RTI data alone
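As a rough picture of what "recomputing the livetime" from per-second records means, the sketch below sums a live fraction over a list of seconds. It is a toy under stated assumptions: RTISecondSketch and its three fields are hypothetical and much simpler than the real RTIInfo class, and the quality flag stands in for whatever RTI selection your analysis applies.

```cpp
#include <vector>

// Hypothetical per-second record; the real RTIInfo class is different
// and has many more members.
struct RTISecondSketch {
  unsigned int utc_time; // second this record refers to
  float livetime;        // fraction of that second the detector was live, in [0, 1]
  bool good;             // hypothetical quality flag
};

// Sum the live fraction over all good seconds: the result is in seconds.
inline double TotalLivetime(const std::vector<RTISecondSketch> &seconds) {
  double total = 0.0;
  for (const auto &sec : seconds) {
    if (sec.good)
      total += sec.livetime;
  }
  return total;
}
```

In a real analysis the loop body would sit inside the RTI tree loop shown above, reading the relevant quantities from each RTIInfo entry instead of from a vector.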
# DATA MODEL
The second tree contains useful information about the original AMSRoot file from which the current NAIA file was derived.
This information is stored in the FileInfo TTree, which usually has only a single entry for each NAIA root-file.
(Having this data in a TTree allows us to chain multiple NAIA root-files and still be able to retrieve the FileInfo data for the current run we’re processing)
This tree has one branch, which contains objects of the FileInfo class and, if the NAIA root-file is a Monte Carlo file, an additional branch containing objects of the MCFileInfo class.
// inside the event loop
// Get the File infos for the current event
NAIA::FileInfo &file_info = chain.GetEventFileInfo();
// if this is a MC file
NAIA::MCFileInfo &mcfile_info = chain.GetEventMCFileInfo();
# DATA MODEL
In this case too, the tree can be accessed from outside the event loop
TChain* file_chain = chain.GetFileInfoTree();
NAIA::FileInfo* file_info = new NAIA::FileInfo();
NAIA::MCFileInfo* mcfile_info = new NAIA::MCFileInfo();
file_chain->SetBranchAddress("FileInfo", &file_info);
if(chain.IsMC()){
  file_chain->SetBranchAddress("MCFileInfo", &mcfile_info);
}
for(unsigned long long i=0; i < file_chain->GetEntries(); ++i){
  file_chain->GetEntry(i);
  // do stuff with file_info
  if(chain.IsMC()){
    // do stuff with mcfile_info
  }
}
# USAGE
There are some examples in NAIA that should guide you in building your analysis with NAIA. These are divided by usage type:
CMake: (recommended)
It is a powerful cross-platform build system that is used to specify in a generic way how programs should be compiled and generate the corresponding Makefiles
It is especially useful when your project makes use of other packages / libraries that need to be imported
# CMakeLists.txt:
project(testNAIA)
set(CMAKE_CXX_STANDARD 14)
find_package(NAIA 1.0.1 REQUIRED)
add_executable(main src/main.cpp)
target_link_libraries(main PUBLIC NAIA::NAIAChain)
It becomes quite effective as the size of the project increases (many executables/libraries)
Note: following good CMake practices, NAIA internally defines everything that is needed in terms of targets.
The NAIA::NAIAChain target internally knows all the include paths, preprocessor macros, library paths, libraries that it needs so that CMake can propagate these requirements to all targets linking against NAIA::NAIAChain, such as your own library targets or executables.
To compile just run
mkdir build
cd build
cmake .. -DNAIA_DIR=${path_to_your_naia_install}/cmake
make
# USAGE
Makefile:
If you want to write your own Makefile you can take a look at the existing examples and update the NAIA_DIR variable inside. Remember to add include paths/libraries if needed or if something changes between NAIA versions.
NAIA_DIR=/path/to/your/naia/install
CC = g++
CFLAGS = $(shell root-config --cflags) -DSPDLOG_FMT_EXTERNAL
INCLUDES = -I$(NAIA_DIR)/include -I./include
LFLAGS = $(shell root-config --libs) -L $(NAIA_DIR)/lib -Wl,-rpath=$(NAIA_DIR)/lib
LIBS = -lNAIAUtility -lNAIAChain -lNAIAContainers
SRCS = src/main.cpp
OBJS = $(SRCS:.cpp=.o)
MAIN = main

.PHONY: depend clean

all: $(MAIN)
	@echo main has been compiled

$(MAIN): $(OBJS)
	$(CC) $(CFLAGS) $(INCLUDES) -o $(MAIN) $(OBJS) $(LFLAGS) $(LIBS)

.cpp.o:
	$(CC) $(CFLAGS) $(INCLUDES) -c $< -o $@

clean:
	$(RM) *.o src/*.o *~ $(MAIN)

depend: $(SRCS)
	makedepend $(INCLUDES) $^

# DO NOT DELETE THIS LINE -- make depend needs it
# USAGE
ROOT macros:
In this case the libraries and include paths are set up by a dedicated macro
// load.C:
{
  TString naia_dir = "/path/to/your/naia/install/dir";
  // note the space after ".include", which separates the command from the path
  gROOT->ProcessLine(".include " + naia_dir + "/include");
  gSystem->SetDynamicPath(naia_dir + "/lib:" + gSystem->GetDynamicPath());
  gSystem->Load("libNAIAUtility");
  gSystem->Load("libNAIAContainers");
  gSystem->Load("libNAIAChain");
  gROOT->ProcessLine("#define SPDLOG_FMT_EXTERNAL");
  gROOT->ProcessLine("#include \"Chain/NAIAChain.h\"");
}
and then run as
root load.C main.cpp
# USAGE
Bonus: RDataFrame
Sometimes you just need to make a quick plot, and macros are made for that. One very cool option is to use RDataFrame. In this case you won't use a NAIAChain: you work directly with the naked branches (and you still need load.C)
void plot_innercharge() {
  // enable multi-threading
  ROOT::EnableImplicitMT();

  TFile *infile = TFile::Open("/storage/gpfs_ams/ams/groups/AMS-Italy/ntuples/v0.0.1/ISS.B1130/pass7/1591361896.root");
  TTree *tree = infile->Get<TTree>("NAIAChain");
  ROOT::RDataFrame rdf{*tree};

  // apply two cuts:
  // - Track chisquare Y < 10 (inner tracker only fit)
  // - At least 5 XY hits
  // define the variable to be plotted
  auto augmented_d =
      rdf.Filter(
             [](NAIA::TrTrackBaseData &trtrack) {
               return trtrack.TrChiSq[NAIA::TrTrack::Fit::Kalman][NAIA::TrTrack::Span::InnerOnly][NAIA::TrTrack::Side::Y] < 10;
             },
             {"TrTrackBaseData"})
          .Filter([](NAIA::TrTrackBaseData &trtrack) { return trtrack.LayerChargeXY.size() > 4; }, {"TrTrackBaseData"})
          .Define("TrInnerCharge",
                  [](NAIA::TrTrackBaseData &trtrack) { return trtrack.InnerCharge[NAIA::TrTrack::ChargeRecoType::YJ]; },
                  {"TrTrackBaseData"});

  // create a histogram of the tracker inner charge variable
  auto chargeHist = augmented_d.Histo1D({"hh", "", 300, 0, 12}, "TrInnerCharge");

  TCanvas *cc = new TCanvas();
  chargeHist->DrawClone();
}
# USAGE
Note the usage of "TrTrackBaseData" rather than "TrTrackBase" (it's the name of the actual tree branch). There is no NAIA read-on-demand here, but RDataFrame already reads only the branches you actually use.
Also, you get speedups for free, since it's automatically multi-threaded (very useful for final-stage plots or quicklooks!)
# COMMUNITY
NAIA has a simple landing page that provides quick and easy access to a manual and class documentation for each version (including changelogs and a reference on where the data is stored at CNAF)
# COMMUNITY
A simple user manual should guide you on the steps needed to install and use NAIA, and provides a quick reference on the datamodel and ideas behind NAIA.
(mostly what you have seen on these slides, but a bit more descriptive)
And of course, for all the details on classes, methods and so on, there is a doxygen page for every version.
# COMMUNITY
In addition, you are kindly encouraged to join the AMS-Italy Discord server.
This is particularly useful as a community gathering point to chat and discuss common activities, meetings, analysis and tools.
# COMMUNITY
Finally, for any bug, issue, or feature request for NAIA, you can always go to the main repository on GitLab and open an issue to describe your needs.
There are already two templates: one for bug reporting, the other for feature requests.
# TOOLS
NAIA is just a data model and framework for AMS analysis. We can think of it as the foundational layer, but there is room for creating useful tools to further the data analysis experience.
We currently have in the works:
(This will be one of the topics for today's final discussion)