Felix Hoffstaedter
Systems Medicine
Institute of Neuroscience and Medicine
Brain and Behaviour (INM-7)
Forschungszentrum Jülich
Institute of Systems Neuroscience
Heinrich-Heine University Düsseldorf
Germany
DataLad Team &
Contributors
Christian Gaser &
Robert Dahnke
Voxel-Based Morphometry
Computational Anatomy Toolbox
Why Large Scale Neuroscience?
Reproducible Data Processing
What's DataLad anyway?
The FAIRly Big Workflow: Bootstrap-Execute-Consolidate
Important lessons
EXECUTE the following to create and publish the Workflow dataset:
# clone repo with workflow bootstrap scripts and build workflow
git clone git@jugit.fz-juelich.de:f.hoffstaedter/lsn_cat12.git
./lsn_cat12/LSN_Tutorial_bootstrap_CAT-MCR_AOMIC-PIOP2.sh
# start processing of 3 subjects in parallel
cd AOMIC-PIOP2_LSN_cat12.8/; ./code/process.sub sub-0001 &
./code/process.sub sub-0012 & ./code/process.sub sub-0123; wait  # let background jobs finish
# consolidate results and publish dataset to OSF
./code/results.merger
datalad get *
datalad create-sibling-osf --title LSN_Tutorial_CAT-MCR_AOMIC-PIOP2 -s osf
datalad push --to osf
EXECUTE the following to create and publish the Workflow dataset (macOS):
# clone repo with workflow bootstrap scripts and build workflow
git clone git@jugit.fz-juelich.de:f.hoffstaedter/lsn_cat12.git
./lsn_cat12/LSN_Tutorial_bootstrap_CAT-MCR_AOMIC-PIOP2_mac.sh
# start processing of 3 subjects in parallel
cd AOMIC-PIOP2_LSN_cat12.8/; ./code/process.sub sub-0001 &
./code/process.sub sub-0012 & ./code/process.sub sub-0123; wait  # let background jobs finish
# consolidate results and publish dataset to OSF
./code/results.merger; datalad get *
datalad create-sibling-osf --title LSN_Tutorial_CAT-MCR_AOMIC-PIOP2 -s osf
datalad push --to osf
when you shouldn't look at all datasets (n>=200)
when frequent reprocessing is not desirable
when the data is growing in size over time
when the data is used in other projects
when you actually want to share data & results
[Diagram: Data, Results, Pipeline, Code]
[Diagram: Data, Results, Pipeline, Code; two components marked "archived"; Infrastructure marked "changed"]
"I'll find the scripts, give me a minute ..."
[Diagram: Data, Results, Pipeline, Code; two components marked "archived"]
Containerization: Singularity
Upload your code, share data
[Diagram: Data, Results, Pipeline, Code]
tracking changes in any set of files
[Diagram: Data, Containerized, Code]
Clone everything, everywhere, without Data content
[Diagram: Data, Code]
quick'n'dirty: Matlab Compiler Runtime
# project name and SAMPLE to be processed
PROJECT="LSN_cat12.8"
SAMPLE="AOMIC-PIOP2"
CWD=$(pwd) # get current work dir
### define the input RIA-store only to clone from
input_store="ria+file://${CWD}/inputstore"
### define the output RIA-store to push all results to
output_store="ria+file://${CWD}/dataladstore"
### define the location of the store all analysis inputs will be obtained from
raw_store="https://github.com/OpenNeuroDatasets/ds002790.git"
### define the container store -- container_store="XXX"
### define the temporary working directory to clone and process each subject on
temporary_store=/tmp
input_store: empty Workflow to clone from
output_store: storage to push Results to
raw_store: get raw Data from
container_store:
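Before bootstrapping, it can help to see which local directories the store URLs resolve to and whether the scratch space is usable. The following is a minimal sketch under the settings above; `store_path` is a hypothetical helper name, not part of DataLad.

```shell
#!/bin/bash
# Sketch: print where each RIA store URL points and check the scratch space.
CWD=$(pwd)
input_store="ria+file://${CWD}/inputstore"
output_store="ria+file://${CWD}/dataladstore"
temporary_store=/tmp

# store_path (hypothetical helper): strip the "ria+file://" scheme
# to obtain the local directory behind a store URL
store_path() {
  printf '%s\n' "${1#ria+file://}"
}

for store in "$input_store" "$output_store"; do
  echo "${store} -> $(store_path "$store")"
done

# jobs clone and process each subject below this directory,
# so it has to exist and be writable
if [ -d "$temporary_store" ] && [ -w "$temporary_store" ]; then
  echo "temporary_store ${temporary_store}: ok"
else
  echo "temporary_store ${temporary_store}: missing or not writable" >&2
fi
```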
datalad create -c yoda AOMIC-PIOP2_LSN_cat12.8
cd AOMIC-PIOP2_LSN_cat12.8
CAT="CAT12.8.1_r1980_R2017b_MCR_Linux"
SPM="http://www.neuro.uni-jena.de/cat12"
datalad run -m "download ${CAT} standalone version for Linux" \
"wget ${SPM}/${CAT}.zip; unzip -d code ${CAT}.zip; rm -f ${CAT}.zip;"
.
├── .gitattributes
├── CHANGELOG.md
├── code
│ ├── .gitattributes
│ └── README.md
└── README.md
datalad clone -d . ${raw_store} inputs/${SAMPLE}
git commit --amend -m "Register ${SAMPLE} BIDS dataset as input"
datalad create-sibling-ria -s ${PROJECT}_in "${input_store}" --new-store-ok
datalad create-sibling-ria -s ${PROJECT}_out "${output_store}" --new-store-ok
input_store
output_store
# the actual compute job specification
cat > code/participant_job << EOT
#!/bin/bash
...
EOT
chmod +x code/participant_job
datalad save -m "Participant compute job implementation"
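The body of `participant_job` is elided here. One detail that matters whenever a script is written through a heredoc is whether the delimiter is quoted: with an unquoted `EOT` the shell expands variables while the file is written, with a quoted `'EOT'` the text is stored verbatim and expanded only when the job runs. The sketch below illustrates the difference with two toy files; the file names and their contents are made up for illustration and are not the real job script.

```shell
#!/bin/bash
# Sketch: quoted vs. unquoted heredoc delimiters (toy files, not the real job script).
workdir=$(mktemp -d)
SAMPLE="AOMIC-PIOP2"

# unquoted delimiter: ${SAMPLE} is expanded now, while the file is written;
# \$1 is escaped so that it survives as a literal $1
cat > "${workdir}/expanded_job" << EOT
#!/bin/bash
echo "sample: ${SAMPLE}, subject: \$1"
EOT

# quoted delimiter: the text is written verbatim and expanded when the job runs
cat > "${workdir}/literal_job" << 'EOT'
#!/bin/bash
echo "sample: ${SAMPLE:-unset}, subject: $1"
EOT

bash "${workdir}/expanded_job" sub-0001   # sample name was baked in at write time
bash "${workdir}/literal_job" sub-0001    # SAMPLE is not exported, so the job sees none
```

Which variant fits depends on whether a value should be fixed at bootstrap time or supplied by the job environment.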
cat > code/process.sub << EOT
#!/bin/bash
# the job expects these environment variables for labeling and synchronization
...
EOT
chmod +x code/process.sub
datalad save -m "individual job submission"
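Without a scheduler, several subjects can be processed side by side by starting jobs in the background and waiting for all of them before results are merged. This is a minimal sketch of that pattern; `process_subject` is a hypothetical stand-in for `./code/process.sub`, and the log directory is a temporary one chosen for the example.

```shell
#!/bin/bash
# Sketch: run a few subjects in the background and wait for all of them.
# process_subject is a hypothetical stand-in for ./code/process.sub
process_subject() {
  local subject=$1
  sleep 1                      # placeholder for the real processing
  echo "${subject} done"
}

logdir=$(mktemp -d)
pids=()
for subject in sub-0001 sub-0012 sub-0123; do
  process_subject "$subject" > "${logdir}/${subject}.log" 2>&1 &
  pids+=("$!")
done

# wait for every job and count the ones that failed
failed=0
for pid in "${pids[@]}"; do
  wait "$pid" || failed=$((failed + 1))
done
echo "jobs finished, ${failed} failed"
```

In the workflow, the function call would be replaced by `./code/process.sub "$subject"`, and the merge step would only run once the loop over `wait` has returned.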
# the logfiles folder is to be ignored by git
mkdir logs
echo logs >> .gitignore
cat > code/process.condor_submit << EOT
universe = vanilla
# resource requirements for each job
request_cpus = 1
request_memory = 4G
request_disk = 5G
# tell condor that a job is self contained and the executable
# is enough to bootstrap the computation on the execute node
...
cat > code/process.sbatch << EOT
#!/bin/bash -x
#SBATCH --account=runthings
#SBATCH --time=24:00:00
#SBATCH --job-name=FAIRlyBig
...
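The scheduler files are truncated here, so their content is not reproduced. A common pattern, sketched below under the assumption that subjects follow the BIDS `sub-*` directory naming, is to derive one job per subject from the input dataset. A throwaway directory stands in for `inputs/AOMIC-PIOP2`, and the job list file is a hypothetical intermediate, not a file the workflow is known to create.

```shell
#!/bin/bash
# Sketch: derive one job per subject from the BIDS input directory.
# A throwaway directory stands in for inputs/AOMIC-PIOP2 here.
inputs=$(mktemp -d)
mkdir "${inputs}/sub-0001" "${inputs}/sub-0012" "${inputs}/sub-0123" "${inputs}/derivatives"

joblist=$(mktemp)
for dir in "${inputs}"/sub-*/; do
  subject=$(basename "$dir")
  # one line per subject; a scheduler wrapper would submit each of them
  echo "./code/process.sub ${subject}" >> "$joblist"
done

cat "$joblist"
njobs=$(wc -l < "$joblist")
echo "$((njobs)) jobs listed"
```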
git remote -v
LSN_cat12.8_in /home/DATA/inputstore/147/7683f-18a7-4c59-9e90-b1027865d0a2 (fetch)
LSN_cat12.8_in /home/DATA/inputstore/147/7683f-18a7-4c59-9e90-b1027865d0a2 (push)
LSN_cat12.8_in-storage
LSN_cat12.8_out /home/DATA/dataladstore/147/7683f-18a7-4c59-9e90-b1027865d0a2 (fetch)
LSN_cat12.8_out /home/DATA/dataladstore/147/7683f-18a7-4c59-9e90-b1027865d0a2 (push)
LSN_cat12.8_out-storage
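The remote paths above follow the RIA store layout: the first three characters of the dataset ID name a directory, and the remainder of the ID names the dataset folder inside it. The sketch below reproduces that mapping; `ria_dataset_path` is a hypothetical helper name, and the ID is reassembled from the paths shown above.

```shell
#!/bin/bash
# Sketch: map a DataLad dataset ID to its location inside a RIA store.
# ria_dataset_path is a hypothetical helper name.
ria_dataset_path() {
  local store=$1 id=$2
  # first three characters form the directory, the remainder the dataset folder
  printf '%s/%s/%s\n' "$store" "${id:0:3}" "${id:3}"
}

# ID reassembled from the remote URLs; inside a dataset it can be read with
#   git config -f .datalad/config datalad.dataset.id
dataset_id="1477683f-18a7-4c59-9e90-b1027865d0a2"

ria_dataset_path /home/DATA/inputstore "$dataset_id"
ria_dataset_path /home/DATA/dataladstore "$dataset_id"
```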
AOMIC-PIOP2_LSN_cat12.8
├── CHANGELOG.md
├── code
│ ├── CAT12.8.1_r1980_R2017b_MCR_Linux
│ ├── cat_standalone_segment_enigmaTEST.m
│ ├── finalize_job_outputs_ENIGMA.sh
│ ├── participant_job
│ ├── process.condor_dag
│ ├── process.condor_submit
│ ├── process.sub
│ ├── README.md
│ └── results.merger
├── inputs
│ └── AOMIC-PIOP2
├── logs
└── README.md
input_store
output_store
temporary workdir
git merge -m "Merge results" $(git branch -al | grep 'job-' | tr -d ' ')
# clean git annex branch
git annex fsck -f LSN_cat12.8_out-storage
# declare local data clone as dead
git annex dead here
# datalad push merged results
datalad push --data nothing --to LSN_cat12.8_out
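The merge command collects its arguments from a small pipeline that filters the branch listing. To see what that filter selects without touching a repository, the sketch below applies it to a made-up listing; the branch names are hypothetical examples, and real input would come from `git branch -al`.

```shell
#!/bin/bash
# Sketch: what the branch filter inside the merge command selects.
# The listing below is made up; real input comes from `git branch -al`.
list_job_branches() {
  grep 'job-' | tr -d ' '
}

sample_listing='* master
  git-annex
  remotes/origin/job-sub-0001
  remotes/origin/job-sub-0012
  remotes/origin/job-sub-0123'

branches=$(printf '%s\n' "$sample_listing" | list_job_branches)
printf '%s\n' "$branches"
echo "$(printf '%s\n' "$branches" | grep -c .) branches would be merged"
```

Running the filter on its own before the actual merge is a cheap way to check that every expected subject branch is present.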
output_store
└── sub-0123
├── inforoi.tar.gz -> ../.git/annex/objects/4F/m9/MD5E-s584388--4adb0751878d8095a958b31a31589402.tar.gz/MD5E-s584388--4adb0751878d8095a958b31a31589402.tar.gz
├── native.tar.gz -> ../.git/annex/objects/J9/G7/MD5E-s12808637--50f4a108e3ff1d20fbb4b9101f57462e.tar.gz/MD5E-s12808637--50f4a108e3ff1d20fbb4b9101f57462e.tar.gz
├── surface.tar.gz -> ../.git/annex/objects/4g/x0/MD5E-s45--0791f35d8dde0bd16669b238b35eb389.tar.gz/MD5E-s45--0791f35d8dde0bd16669b238b35eb389.tar.gz
└── vbm.tar.gz -> ../.git/annex/objects/wk/gp/MD5E-s17103529--8108c63f231d80d7b7a034c26d70b8e4.tar.gz/MD5E-s17103529--8108c63f231d80d7b7a034c26d70b8e4.tar.gz
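Each result file is a symlink whose target ends in a git-annex key of the form `BACKEND-sSIZE--CHECKSUM.ext`, so file size and checksum can be read without retrieving the content. A minimal sketch of parsing such a key follows; `parse_annex_key` is a hypothetical helper name, and the key is copied from the listing above.

```shell
#!/bin/bash
# Sketch: read backend, size and checksum from a git-annex key.
# parse_annex_key is a hypothetical helper name.
parse_annex_key() {
  local key=$1
  local backend=${key%%-*}          # text before the first "-"
  local rest=${key#*-s}             # drop "BACKEND-s"
  local size=${rest%%--*}           # digits before "--"
  local name=${rest#*--}            # checksum plus file extension
  echo "backend=${backend} size=${size} checksum=${name%%.*}"
}

# the key is the last path component of the symlink target
target="../.git/annex/objects/4F/m9/MD5E-s584388--4adb0751878d8095a958b31a31589402.tar.gz/MD5E-s584388--4adb0751878d8095a958b31a31589402.tar.gz"
parse_annex_key "$(basename "$target")"
```

In a clone, `readlink sub-0123/inforoi.tar.gz` would supply the target string.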
datalad clone osf://g2rmn LSN_CAT12.8_AOMIC-PIOP2
datalad clone osf://3w2zq LSN_CAT12.8_AOMIC-PIOP1
42,715 participants
76 TB of data
43 million files
This image was created by Scriberia for The Turing Way community and is used under a CC-BY licence.
Wilkinson et al. (2016) The FAIR Guiding Principles for scientific data management and stewardship, Sci. Data, doi: 10.1038/sdata.2016.18
DataLad YouTube channel