Project course:
Reproducible data analysis with Snakemake, using bioinformatics as an example
Johannes Köster
University of Duisburg-Essen
Data Analysis
Data analysis
[Figure: dataset → results]
Steps:
- aggregation
- transformation
- filtration
- visualization
Via:
- command line tools
- small scripts
Data analysis
[Figure: multiple datasets → results]
From raw data to final figures:
- document parameters, tools, versions
- execute without manual intervention
automation
scalability
Handle parallelization:
execute for tens to thousands of datasets
Avoid redundancy:
- when adding datasets
- when resuming from failures
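The idea of avoiding redundancy can be illustrated with a make-style freshness check: a step is rerun only if its output is missing or older than one of its inputs. The following is a minimal sketch in plain Python, not Snakemake's actual implementation; the function names `needs_update` and `run_step` are hypothetical and chosen only for illustration.

```python
import os


def needs_update(output, inputs):
    """Return True if `output` is missing or older than any file in `inputs`."""
    if not os.path.exists(output):
        return True
    out_mtime = os.path.getmtime(output)
    return any(os.path.getmtime(path) > out_mtime for path in inputs)


def run_step(output, inputs, action):
    """Run `action(inputs, output)` only when the output is out of date.

    Returns True if the step was executed and False if it was skipped.
    """
    if needs_update(output, inputs):
        action(inputs, output)
        return True
    return False
```

With such a check, adding a new dataset triggers only the steps for that dataset, and a rerun after a failure skips every step whose output already exists and is up to date.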
Handle deployment:
be able to easily execute analyses on a different system/platform/infrastructure
portability
scalability
automation
Reproducible data analysis
Bioinformatics
www.copdri.com 2016
Cells
www.austincc.edu 2016
Chromosomes
www.medicalxpress.com 2016
Genes
biosocialmethods.isr.umich.edu 2016
From genes to proteins
Transcript activity/expression:
the more RNA, the more protein
transcript
DNA:
- made of 4 nucleotides: Adenine, Guanine, Cytosine, Thymine
- a text over the alphabet \(\{A, C, G, T \}\)
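Treating DNA as a text over a four-letter alphabet means that ordinary string operations apply to it. A minimal sketch, assuming sequences are given as uppercase Python strings; the names `is_dna` and `nucleotide_counts` are hypothetical.

```python
from collections import Counter

DNA_ALPHABET = frozenset("ACGT")


def is_dna(sequence):
    """Return True if every character of `sequence` is one of A, C, G, T."""
    return set(sequence) <= DNA_ALPHABET


def nucleotide_counts(sequence):
    """Count how often each nucleotide occurs in a DNA sequence."""
    if not is_dna(sequence):
        raise ValueError("sequence contains characters outside {A, C, G, T}")
    return Counter(sequence)
```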
Questions
- Which mutations does the genome of a patient contain?
- Which transcripts are particularly active/inactive with a certain disease?
- Which bacteria appear in a sample?
- Who is the murderer?
Methods
- statistical modelling
- algorithm and data structure engineering
- machine learning
- data analysis
Here: Analysis of RNA-sequencing data
Illumina Inc. 2018
Illumina sequencing
en.wikipedia.org 2018
Result:
short (paired-end) reads
Transcript quantification with RNA-sequencing
en.wikipedia.org 2018
- obtain many short reads from RNA (50-100 million)
- map them against a reference genome
- quantify transcript expression by counting reads on each transcript
...ACGCTAGCAGCGTAGCGGAGCTATTGCGGAGCTGAGCGTATCGGAGAGATCGGATCTGGATCGAGATCTGAGCTGAGCTAGCTGGCTAGCGATCGGAGGAGCTAGCGATATTCGAGGAGGCGTATCGTAGC...
Gene and transcript sequence
CGGAGCTATTGCGG
GGAGCTATTGCGGA
GGATCGAGATCT
GGATCGAGATCT
CGGAGGAGCTAG
CGGAGGAGCTAG
TCGGAGGAGCTA
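The counting step can be sketched with the reads shown above. This is a toy illustration that assigns a read to a transcript when it occurs there as an exact substring; real read mappers also tolerate mismatches and handle both strands, which is not modelled here. The transcript names and the choice of fragments are hypothetical, and `count_reads` is not a function of any existing tool.

```python
from collections import Counter

# Hypothetical transcript fragments, cut from the reference sequence above.
TRANSCRIPTS = {
    "transcript_1": "AGCGGAGCTATTGCGGAGCT",
    "transcript_2": "CTGGATCGAGATCTGAG",
    "transcript_3": "GATCGGAGGAGCTAGCG",
}

# The short reads from the figure above.
READS = [
    "CGGAGCTATTGCGG",
    "GGAGCTATTGCGGA",
    "GGATCGAGATCT",
    "GGATCGAGATCT",
    "CGGAGGAGCTAG",
    "CGGAGGAGCTAG",
    "TCGGAGGAGCTA",
]


def count_reads(transcripts, reads):
    """Count, per transcript, the reads that occur in it as an exact substring."""
    counts = Counter({name: 0 for name in transcripts})
    for read in reads:
        for name, sequence in transcripts.items():
            if read in sequence:
                counts[name] += 1
    return counts


if __name__ == "__main__":
    for name, count in count_reads(TRANSCRIPTS, READS).items():
        print(name, count)
```

A transcript with more reads is taken to be more strongly expressed; in this toy example, transcript_3 collects three reads and the other two collect two each.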
Semester plan
Phase 1
Phase 2
Phase 3
11.10.2018 | Introduction, Snakemake tutorial |
18.10.2018 | Snakemake tutorial |
25.10.2018 | Snakemake tutorial |
01.11.2018 | Preparation of the talks |
08.11.2018 | Preparation of the talks |
15.11.2018 | Talks (30 min each) |
22.11.2018 | Workflow implementation |
29.11.2018 | Workflow implementation |
06.12.2018 | Workflow implementation |
13.12.2018 | Workflow implementation |
20.12.2018 | Workflow implementation |
10.01.2019 | Workflow implementation |
17.01.2019 | Workflow implementation |
24.01.2019 | Preparation of the final presentations |
31.01.2019 | Final presentations |
Phase 4