NAIA

Ntuples for AMS-Italy Analysis

Introduction to the framework and data-format

Motivations

# MOTIVATION

1. Resource optimization

With multiple groups each producing their own set of ntuples, a lot of data is replicated on disk, which wastes storage. Groups also end up competing for computing resources during ntuple production.

2. Code exchange

Using the same data makes it easier to exchange selections, algorithms, and procedures.

3. Reproducibility / readability

Most often, custom data formats are produced in a custom way, with custom processing. Additionally, in many cases only the code owner can easily understand what is going on in the analysis code.

Driving principles

# PRINCIPLES
  • Don't throw anything out
    • This means processing and saving all the events from the original AMS-Root files 
  • Don't require network access
    • All the needed data should be inside NAIA files (e.g. no online access to RTI csv files on cvmfs)
  • Try to cover at least 90% of use-cases
    • Variable list comes from an internal survey including every analysis group
    • For special kinds of analyses needing specialized variables, we plan to support user-defined Tree-friending

Driving principles

# PRINCIPLES
  • Don't read what you don't need
    • Only perform I/O reads when variables are accessed. Allow skipping uninteresting events before any branch reading even occurs.
  • Easy to understand
    • Code should be readable and expressive.
    • Variable names and usage should make the programmer's intention clear, at least at an intuitive level.
    • Function names should be descriptive and hint at what the result of the function is.
  • Easy to use
    • Automatic installation for local development. CVMFS binary releases for usage on clusters.

Getting started

# STARTING

Requirements:

  • A C++ compiler with full C++14 support
    (tested with GCC 9.3.0 and higher)
  • CMake version 3.13 or higher
  • ROOT version 6.22 or higher compiled with C++14 support
    (suggested 6.26/02)

These requirements mainly apply if you want to install NAIA on your personal machine. For use on shared clusters (CNAF / CERN), all requirements and NAIA binaries are distributed via CVMFS:

/cvmfs/ams.cern.ch/Offline/amsitaly/public/install/x86_64-centos7-gcc9.3/naia

and the correct environment can be set up with a dedicated script:

/cvmfs/ams.cern.ch/Offline/amsitaly/public/install/x86_64-centos7-gcc9.3/naia/v1.0.0/setenvs/setenv_gcc6.26_cc7.sh 

Getting started

# STARTING

If you are building NAIA on your own machine, the installation is quite easy:

# clone NAIA code 
git clone ssh://git@gitlab.cern.ch:7999/ams-italy/naia.git -b v1.0.1 # (clone via SSH)
# setup build and final install directories 
mkdir naia.build naia.install
# build NAIA
cd naia.build
cmake ../naia -DCMAKE_INSTALL_PREFIX=../naia.install
make all install

To use the NAIA ntuples your project will need:

  • the headers in naia.install/include
  • the naia.install/lib/libNAIAUtility.so library
  • the naia.install/lib/libNAIAContainers.so library
  • the naia.install/lib/libNAIAChain.so library

The NAIA data model

# DATA MODEL

Our data model starts with the NAIAChain object

This is the main way to open a NAIA rootfile: it takes care of loading all the relevant TTrees and setting up what we call the "read-on-demand" mechanism (more on this later).

 

Example:

// ...
#include "Chain/NAIAChain.h"

int main(int argc, char const *argv[]) {
  // Create a chain object
  NAIA::NAIAChain chain;
  // add one (or more) file to it
  chain.Add("somefile.root");
  // setup the read-on-demand mechanism // N.B: important and mandatory!
  chain.SetupBranches();
}

The NAIA data model

# DATA MODEL

Once your chain is created and ready to use, you can easily loop over all the events in the chain with the help of the Event class:

// ...
#include "Chain/NAIAChain.h"

int main(int argc, char const *argv[]) {

  NAIA::NAIAChain chain;
  chain.Add("somefile.root");
  chain.SetupBranches();

  // Event loop!
  for (Event& event : chain){
    // your analysis here :)
  }
}

(you can use the NAIAChain::GetEvent() method for index-based looping, if needed) 
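For example, a minimal sketch of index-based looping (the exact GetEvent signature and the availability of a GetEntries method are assumptions here; check the class documentation for your NAIA version):

// ...
#include "Chain/NAIAChain.h"

int main(int argc, char const *argv[]) {

  NAIA::NAIAChain chain;
  chain.Add("somefile.root");
  chain.SetupBranches();

  // index-based event loop: GetEntries() and GetEvent(i) are assumed
  // to behave like their TChain counterparts
  for (unsigned long long i = 0; i < chain.GetEntries(); ++i) {
    auto &&event = chain.GetEvent(i);
    // your analysis here :)
  }
}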

The NAIA data model

# DATA MODEL

NAIA also provides a simple way of skimming a chain and saving only the interesting events in the output file:

// ...
#include "Chain/NAIAChain.h"

int main(int argc, char const *argv[]) {

  NAIA::NAIAChain chain;
  chain.Add("somefile.root");
  chain.SetupBranches();

  auto handle = chain.CreateSkimTree("skimmed.root", "");

  // Event loop!
  for (Event& event : chain){
    if( is_interesting(event) ){
      handle.Fill();
    }
  }
  
  handle.Write();
}

In this example:

  • We create a SkimTreeHandle from the original chain with CreateSkimTree.
  • The second argument of CreateSkimTree lets us specify branches we're not interested in saving, as a semicolon-separated list.
  • We fill the handle whenever we find an interesting event.
  • When we're done, we write it to disk.

The NAIA data model

# DATA MODEL

The Event class is probably the most important one, but also the most boring since it's basically a proxy class containing a collection of Containers

 

Containers are the real building blocks of the NAIA data model.

 

Each container is associated with a single branch in the main TTree, and the corresponding branch data is read only when it is first accessed.

(This means that if you never use a particular container in your analysis, you’ll never read the corresponding data from file)

The NAIA data model

# DATA MODEL

Container is the general term to define a class in the NAIA data model that groups several variables, according to specific criteria (e.g. all the variables related to the TOF).

 

Most containers come in two variants: the Base and the Plus variant

The Base variant contains variables that are accessed by almost every analysis or that are accessed most often

The Plus variant contains variables that won't be needed by everyone, or may be needed less frequently

 

The important thing is that you always access variables using the -> operator on the container: this is how read-on-demand is implemented, and accessing the data any other way can lead to wrong results.
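For instance, a hedged sketch inside the event loop (the trTrackBase / trTrackPlus members and the InnerCharge / LayerCharge variables are taken from later slides; exact names may differ between NAIA versions):

// Base container: the branch is read from file on the first -> access
float q_inner = event.trTrackBase->InnerCharge[TrTrack::ChargeRecoType::STD];

// Plus container: read only if you actually use it. Containers you never
// touch never trigger any I/O for their branch.
auto &layer_charge = event.trTrackPlus->LayerCharge;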

The NAIA data model

# DATA MODEL

Variables in NAIA are a bit more complex, for valid reasons:

 

We want our data model to be as light as possible (especially since we're processing and saving every single event)

 

This implies that if something's missing we don't want to write anything to disk

  • e.g. if there's no hit on Tracker L1 we don't want to write 0, -9999, or any other sentinel value to keep track of this. We don't want to write anything at all.

We achieve this by using associative containers (mostly std::map) and realizing that in many cases there are patterns we can exploit.

 

Example: several variables come in "flavors" or are computed by different reconstructions 

The NAIA data model

# DATA MODEL

Doing AMS analysis means constantly dealing with "one value for each X type", where X could be a charge reconstruction method, track fitting algorithm, ECAL BDT estimator, and so on…

 

Example: For tracker hits, there are three available charge reconstruction methods: STD, Hu Liu, Yi Jia.

In general, these reconstructions are not guaranteed to always succeed.

We handle this by defining the following:

// one number per charge reconstruction type
template <class T> using TrackChargeVariable = std::map<TrTrack::ChargeRecoType, T>;

Where TrTrack::ChargeRecoType is an enum with three values

enum ChargeRecoType {
  STD, ///< Standard tracker charge reconstruction
  HL,  ///< Hu Liu reconstruction
  YJ,  ///< Yi Jia reconstruction
};
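Since a reconstruction may have failed, its key can simply be absent from the map. A minimal sketch of checking for it by hand, assuming InnerCharge (shown on the next slide) is a TrackChargeVariable<float> as defined above (NAIA also provides a dedicated ContainsKeys helper, shown later):

// a failed reconstruction is simply not stored in the std::map,
// so look the key up before reading it
const auto &inner_charge = event.trTrackBase->InnerCharge;
auto it = inner_charge.find(TrTrack::ChargeRecoType::YJ);
if (it != inner_charge.end()) {
  float q = it->second; // Yi Jia inner tracker charge
}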

The NAIA data model

# DATA MODEL

Why an enum?!?

Well... which is more readable and understandable?

float inner_charge = event.trTrackBase->InnerCharge[2];

or

float inner_charge = event.trTrackBase->InnerCharge[TrTrack::ChargeRecoType::YJ];

Readability debates aside, this avoids the confusion usually brought by magic numbers (you might not remember after a few weeks that Yi Jia reconstruction is at index 2, for example)

In addition, if this ever changes in the future and YJ is moved to index 3, you won't have to modify your code in the second case.

 

For this reason almost every variable structure in NAIA is accessed using specific enums.

The NAIA data model

# DATA MODEL

Some variables have nested structures, for example in TrTrackPlus we have:

///< Track hit charge (X and Y-side) for each layer, for each charge reconstruction.
LayerVariable<TrackChargeVariable<TrackSideVariable<float>>> LayerCharge; 

where for each layer, then for each reconstruction, then for each side we store a number.

But it is not guaranteed that the track will have a hit on, say, Layer 1. Or that the underlying cluster is correctly identified on the X side. How do we check for this?

For this, there is a dedicated ContainsKeys function, which checks if the desired elements (identified by some keys, i.e. the aforementioned enums) exist in the structure

if (ContainsKeys(event.trTrackPlus->LayerCharge, layer_idx, TrTrack::ChargeRecoType::YJ, TrTrack::Side::X)) {
  // do stuff...
}
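After a successful check, the nested value can be read directly, for example (same names as in the snippet above):

// all three keys are present at this point, so the chained lookups are safe
float layer_q = event.trTrackPlus->LayerCharge[layer_idx][TrTrack::ChargeRecoType::YJ][TrTrack::Side::X];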

You can find the full list of variable structures in the class documentation.

The NAIA data model

# DATA MODEL

One quick way of discarding uninteresting events, while reading almost nothing from disk, is the event mask.

The mask is simply a bitmask where every bit represents a particular Category. If the event satisfies a given Category, the corresponding bit in the mask will be set.

 

// ...
#include "Chain/NAIAChain.h"

int main(int argc, char const *argv[]) {

  NAIA::NAIAChain chain;
  chain.Add("somefile.root");
  chain.SetupBranches();

  auto handle = chain.CreateSkimTree("skimmed.root", "");

  // Event loop!
  for (Event& event : chain){
    if( event.CheckMask(NAIA::Category::Charge1_Tof | NAIA::Category::Charge1_Trk) ){
      handle.Fill();
    }
  }
  
  handle.Write();
}

The NAIA data model

# DATA MODEL

To do an analysis you don't only need events: you also need information about the livetime, the number of generated MC events, and so on.

Each NAIA file contains two additional trees just for that

One contains data about the ISS position, its orientation, and physical quantities connected to them, as well as some time-averaged data about the run itself. This kind of data is usually retrieved in AMS analysis from the RTI (Real Time Information) database. This database stores data with a time granularity of one second, and it can be accessed using the gbatch library.

We don't want any dependency on gbatch, so the entire RTI database is converted to a TTree that has only one branch, which contains objects of the RTIInfo class, one for each second of the current run.

// inside the event loop
// Get the RTI info for the current event
NAIA::RTIInfo &rti_info = chain.GetEventRTIInfo();

The NAIA data model

# DATA MODEL

The tree can be accessed from outside the event loop as well

TChain* rti_chain = chain.GetRTITree();
NAIA::RTIInfo* rti_info = new NAIA::RTIInfo();
rti_chain->SetBranchAddress("RTIInfo", &rti_info);

for(unsigned long long isec=0; isec < rti_chain->GetEntries(); ++isec){
  rti_chain->GetEntry(isec);

  // your analysis here :)
}

Clearly useful if you just have to recompute the livetime or analyse only the RTI data.

The NAIA data model

# DATA MODEL

The second tree contains useful information about the original AMSRoot file from which the current NAIA file was derived.

This information is stored in the FileInfo TTree, which usually has only a single entry for each NAIA root-file.

(Having this data in a TTree allows us to chain multiple NAIA root-files and still be able to retrieve the FileInfo data for the current run we’re processing)

This tree has one branch, which contains objects of the FileInfo class and, if the NAIA root-file is a Monte Carlo file, an additional branch containing objects of the MCFileInfo class.

// inside the event loop
// Get the File infos for the current event
NAIA::FileInfo &file_info = chain.GetEventFileInfo();
// if this is a MC file
NAIA::MCFileInfo &mcfile_info = chain.GetEventMCFileInfo();

The NAIA data model

# DATA MODEL

In this case too, the tree can be accessed from outside the event loop:

TChain* file_chain = chain.GetFileInfoTree();
NAIA::FileInfo* file_info = new NAIA::FileInfo();
NAIA::MCFileInfo* mcfile_info = new NAIA::MCFileInfo();

file_chain->SetBranchAddress("FileInfo", &file_info);
if(chain.IsMC()){
  file_chain->SetBranchAddress("MCFileInfo", &mcfile_info);
}

for(unsigned long long i=0; i < file_chain->GetEntries(); ++i){
  file_chain->GetEntry(i);

  // do stuff with file_info

  if(chain.IsMC()){
    // do stuff with mcfile_info
  }
}

Using NAIA in your analysis

# USAGE

NAIA ships with some examples that should guide you in building your analysis. These are divided by usage type:

CMake: (recommended)

CMake is a powerful cross-platform build system used to specify, in a generic way, how programs should be compiled, and to generate the corresponding Makefiles.

It is especially useful when your project makes use of other packages / libraries that need to be imported

# CMakeLists.txt:
cmake_minimum_required(VERSION 3.13)
project(testNAIA)
set(CMAKE_CXX_STANDARD 14)

find_package(NAIA 1.0.1 REQUIRED)

add_executable(main src/main.cpp)
target_link_libraries(main PUBLIC NAIA::NAIAChain)

It becomes especially effective as the size of the project increases (many executables/libraries).

Using NAIA in your analysis

# USAGE

Note: following good CMake practices, NAIA internally defines everything that is needed in terms of targets.

The NAIA::NAIAChain target internally knows all the include paths, preprocessor macros, library paths, libraries that it needs so that CMake can propagate these requirements to all targets linking against NAIA::NAIAChain, such as your own library targets or executables.

To compile just run

mkdir build
cd build
cmake .. -DNAIA_DIR=${path_to_your_naia_install}/cmake
make

Using NAIA in your analysis

# USAGE

Makefile

If you want to write your own Makefile you can take a look at the existing examples and update the NAIA_DIR variable inside. Remember to add include paths/libraries if needed or if something changes between NAIA versions.

NAIA_DIR=/path/to/your/naia/install

CC = g++
CFLAGS = $(shell root-config --cflags) -DSPDLOG_FMT_EXTERNAL
INCLUDES = -I$(NAIA_DIR)/include -I./include
LFLAGS = $(shell root-config --libs) -L $(NAIA_DIR)/lib -Wl,-rpath=$(NAIA_DIR)/lib
LIBS = -lNAIAUtility -lNAIAChain -lNAIAContainers

SRCS = src/main.cpp
OBJS = $(SRCS:.cpp=.o)

MAIN = main

.PHONY: depend clean

all:    $(MAIN)
	@echo  main has been compiled

$(MAIN): $(OBJS) 
	$(CC) $(CFLAGS) $(INCLUDES) -o $(MAIN) $(OBJS) $(LFLAGS) $(LIBS)

.cpp.o:
	$(CC) $(CFLAGS) $(INCLUDES) -c $<  -o $@

clean:
	$(RM) *.o src/*.o *~ $(MAIN)

depend: $(SRCS)
	makedepend $(INCLUDES) $^

# DO NOT DELETE THIS LINE -- make depend needs it

Using NAIA in your analysis

# USAGE

ROOT macros:

In this case the libraries and include paths are set up by a dedicated macro:

// load.C: 
{
  TString naia_dir = "/path/to/your/naia/install/dir";
  gROOT->ProcessLine(".include" + naia_dir + "/include");
  gSystem->SetDynamicPath(naia_dir + "/lib:" + gSystem->GetDynamicPath());
  gSystem->Load("libNAIAUtility");
  gSystem->Load("libNAIAContainers");
  gSystem->Load("libNAIAChain");

  gROOT->ProcessLine("#define SPDLOG_FMT_EXTERNAL");
  gROOT->ProcessLine("#include \"Chain/NAIAChain.h\"");
}

and then run as 

root load.C main.cpp

Using NAIA in your analysis

# USAGE

Bonus: RDataFrame

Sometimes you just need to make a quick plot, and macros are perfect for that. One very cool option is to use RDataFrame. You won't use a NAIAChain in this case; you're working directly with the naked branches (and you still need load.C).

void plot_innercharge() {
  // enable multi-threading
  ROOT::EnableImplicitMT();

  TFile *infile = TFile::Open("/storage/gpfs_ams/ams/groups/AMS-Italy/ntuples/v0.0.1/ISS.B1130/pass7/1591361896.root");
  TTree *tree = infile->Get<TTree>("NAIAChain");
  ROOT::RDataFrame rdf{*tree};

  // apply two cuts:
  // - Track chisquare Y < 10 (inner tracker only fit)
  // - At least 5 XY hits
  // define the variable to be plotted
  auto augmented_d =
      rdf.Filter(
             [](NAIA::TrTrackBaseData &trtrack) {
               return trtrack.TrChiSq[NAIA::TrTrack::Fit::Kalman][NAIA::TrTrack::Span::InnerOnly][NAIA::TrTrack::Side::Y] < 10;
             },
             {"TrTrackBaseData"})
          .Filter([](NAIA::TrTrackBaseData &trtrack) { return trtrack.LayerChargeXY.size() > 4; }, {"TrTrackBaseData"})
          .Define("TrInnerCharge",
                  [](NAIA::TrTrackBaseData &trtrack) { return trtrack.InnerCharge[NAIA::TrTrack::ChargeRecoType::YJ]; },
                  {"TrTrackBaseData"});

  // create a histogram of the tracker inner charge variable
  auto chargeHist = augmented_d.Histo1D({"hh", "", 300, 0, 12}, "TrInnerCharge");

  TCanvas *cc = new TCanvas();
  chargeHist->DrawClone();
}

Using NAIA in your analysis

# USAGE

Note the usage of "TrTrackBaseData" rather than "TrTrackBase" (the former is the name of the actual tree branch), and that there is no explicit read-on-demand setup (RDataFrame handles this by default).

Also, you get speedups for free, since it's automatically multi-threaded (very useful for final-stage plots or quicklooks!)


Support and community

# COMMUNITY

NAIA has a simple landing page that provides quick and easy access to a manual and class documentation for each version (including changelogs and a reference on where the data is stored at CNAF)

Support and community

# COMMUNITY

A simple user manual guides you through the steps needed to install and use NAIA, and provides a quick reference on the data model and the ideas behind NAIA.

(mostly what you have seen on these slides, but a bit more descriptive)


And of course, for all the details on classes, methods, and so on, there is a doxygen page for every version.

Support and community

# COMMUNITY

In addition, you are kindly encouraged to join the AMS-Italy Discord server.

This is particularly useful as a community gathering point to chat and discuss common activities, meetings, analysis and tools.

Support and community

# COMMUNITY

Finally, for any bug, issue, or feature request for NAIA you can always go to the main repository on GitLab and open an issue to describe your needs.

There are already two templates: one for bug reporting, the other for feature requests. 

Tools in the work

# TOOLS

NAIA is just a data model and framework for AMS analysis. We can think of it as the foundational layer, but there is room for useful tools built on top of it to improve the data-analysis experience.

We currently have in the works:

  • A NAIA adapter for ROOT's TSelector framework (NaiaTSelector)
  • A common selection library (NSL)
  • A ROOT-based spline fitting library (RSpline)
  • A rewrite of the plugin system initially proposed for the dbar analysis

(This will be one of the topics for today's final discussion)

Happy coding :)
