Reproducible data processing
with Datalad: intro to
What // Why // How

Felix Hoffstaedter

 

Systems Medicine

Institute of Neuroscience and Medicine

Brain and Behaviour (INM-7)

Forschungszentrum Jülich

 

Institute of Systems Neuroscience

Heinrich-Heine University Düsseldorf

Germany

  1. jonas.github.io/tig
    
  2. handbook.datalad.org
  3. container Software

Needed for trying out examples

Singularity

Data

Results

Pipeline

DATA PROCESSING

 Code

Data

Results

Pipeline

DATA PROCESSING

 Code

- archived -

- archived -

Infrastructure

- changed -

I'll find the scripts,

give me a minute ...

Data

Results

Pipeline

Reproducible DATA PROCESSING

 Code

- archived -

- archived -

Containerization

Singularity

 Upload your Code

 share data

Data

Results

Pipeline

DATA PROCESSING

 Code

tracking changes in any set of files

Data

Results

Pipeline

DATA PROCESSING

 Code

Data

Containerized

 Code

* Clone everything *

*  everywhere *

* without *

* Data *

* content *

Reproducible DATA PROCESSING

Data

Results

Pipeline

DATA PROCESSING

 Code

Why Datalad?

  • Seeing everything without carrying everything
    • metadata without data content
  • Having different versions of the same data at once
    • metadata versioning relating to same data
  • Data carries their own (git) history
    • documentation is only as good as we make it
  • Dataset nesting relates datasets (raw-derivatives)

Why Datalad?

  • Seeing everything without carrying everything
    • metadata without data content

How?

Why Datalad?

  • Seeing everything without carrying everything
    • metadata without data content
# get datalad.datasets
datalad install ///openneuro

cd ds003097

datalad get -n .

How?

Why Datalad?

  • Seeing everything without carrying everything
    • metadata without data content
  • Having different versions of the same data at once
    • metadata versioning relating to same data
# let's BIDSify HCP data
datalad clone https://github.com/datalad-datasets/hcp-structural-preprocessed.git
cd hcp-structural-preprocessed
datalad run -m "BIDSify data" "./.bids_conversion/convert.sh"

Why Datalad?

  • Seeing everything without carrying everything
    • metadata without data content
  • Having different versions of the same data at once
    • metadata versioning relating to same data
  • Data carries their own (git) history
    • documentation is only as good as we make it
  • Dataset nesting relates datasets (raw-derivatives)

Why Datalad?

https://gin.g-node.org/

# let's get processed example data: VBM 

datalad clone https://gin.g-node.org/felixh/AOMIC_datasets

datalad get -n AOMIC_ID1000

cd AOMIC_ID1000
datalad get sub-0111/report/catreportj_sub-0111_run-3_T1w.jpg

Why Datalad?

https://gin.g-node.org/

# let's get processed example data: VBM 

datalad clone https://gin.g-node.org/felixh/AOMIC_datasets

datalad get -n AOMIC_ID1000

cd AOMIC_ID1000
datalad get sub-0111/report/catreportj_sub-0111_run-3_T1w.jpg

Important lessons

  • Code:
  • Data:
  • Processing issues: redo everything (really)
  • Results: script QC, look at 5% max
  • Statistics: redo QC fitting your method

framework SETUP

DataLad YouTube channel

FAIRly Big Workflow

Text

use case: large-scale medical data processing

  • Tested on different computing infrastructures
    • HPC system (inode limits) - JURECA | SLURM
    • HTC system (storage limits) - Juseless | HTCondor
  • Medical data under strict data usage constraints
  • MATLAB-based software component - CAT

42,715 participants

76 TB of data

43 milion of files

What is FAIR?

This image was created by Scriberia for The Turing Way community and is used under a CC-BY licence.

Wilkinson et al. (2016) The FAIR Guiding Principles for scientific data management and stewardship, Sci. Data, doi: 10.1038/sdata.2016.18

results consolidation

  • Final consolidation of results! 🐙

Reproducible data processing with Datalad

By Felix Hoffstaedter

Reproducible data processing with Datalad

  • 94