Data analysis in paleogenetics with EAGER
Alexander Peltzer, June 27th 2017
Motivation
- Large amounts of aDNA data generated with next-generation sequencing (NGS) technologies
- Analysis is difficult due to:
  - Size of the datasets
  - Complexity of the datasets (sequencing errors, deamination)
  - Contamination of sequenced samples and libraries
Size of data
(NextSeq 500, HiSeq 2000 are not even listed here anymore!)
Image by Illumina Inc. (c)
Complexity
How do you know what's a variant and what's an error?
Contamination
Contamination estimation is a key component for aDNA projects!
Renaud et al. 2015
Motivation
- Only a few aDNA workflows/pipelines are available
- Mostly Bash/Perl/Python scripts that are difficult to apply
- Many tools are tailored to older methods: applications built for Sanger sequencing won't work with NGS data
- Paleomix (Schubert et al. 2014) is one of the few exceptions
EAGER Pipeline overview
EAGER: Focus
- Automated workflow
- Standard operating procedures for aDNA analysis projects
- Usable and installable for non-bioinformatics experts
EAGER Features
- Raw read processing, quality assessment of NGS data
- Mapping methods (BWA, BWA-MEM, Bowtie2, Stampy)
- Authentication (mapDamage, DamageProfiler)
- Variant calling & filtering (ANGSD, GATK, ...)
- Graphical user interface!
EAGER GUI
EAGER Features
- Multiple sample mode: apply the same settings to multiple files
- ReportTable: provides reports of analysis runs
- Statistics: quality control, SNP calling statistics, mapping results
Fail safe
- Log files record errors and caveats
- Tracked versions of tools (and EAGER)
- Reproducible - you can check in your log file what happened!
- Restart on error - if something breaks, restart it - EAGER picks up where it left off!
# EAGER Version used for this run: 1.92.21
################
#CreateResultsDirectories at 2017-05-19T18:26:11.228 was executed with the following commandline:
mkdir -p /home/peltzer/palshare/peltzer/2017-05-18_Thesis_Runs_Peltzer/EAGER_Evaluation/Mummies_WGS_Screening/Sample_JK2968/0-FastQC/.tmp /home/peltzer/palshare/peltzer/2017-05-18_Thesis_Runs_Peltzer/EAGER_Evaluation/Mummies_WGS_Screening/Sample_JK2968/1-AdapClip/.tmp /home/peltzer/palshare/peltzer/2017-05-18_Thesis_Runs_Peltzer/EAGER_Evaluation/Mummies_WGS_Screening/Sample_JK2968/3-Mapper/.tmp /home/peltzer/palshare/peltzer/2017-05-18_Thesis_Runs_Peltzer/EAGER_Evaluation/Mummies_WGS_Screening/Sample_JK2968/4-Samtools/.tmp /home/peltzer/palshare/peltzer/2017-05-18_Thesis_Runs_Peltzer/EAGER_Evaluation/Mummies_WGS_Screening/Sample_JK2968/5-DeDup/.tmp /home/peltzer/palshare/peltzer/2017-05-18_Thesis_Runs_Peltzer/EAGER_Evaluation/Mummies_WGS_Screening/Sample_JK2968/6-QualiMap/.tmp /home/peltzer/palshare/peltzer/2017-05-18_Thesis_Runs_Peltzer/EAGER_Evaluation/Mummies_WGS_Screening/Sample_JK2968/7-DnaDamage/.tmp /home/peltzer/palshare/peltzer/2017-05-18_Thesis_Runs_Peltzer/EAGER_Evaluation/Mummies_WGS_Screening/Sample_JK2968/8-Preseq/.tmp
################
## Runtime of Module was: 0 seconds.
################
#FastQCdefault at 2017-05-19T18:26:11.29 was executed with the following commandline:
fastqc -o /home/peltzer/palshare/peltzer/2017-05-18_Thesis_Runs_Peltzer/EAGER_Evaluation/Mummies_WGS_Screening/Sample_JK2968/0-FastQC --extract -f fastq /home/peltzer/palshare/peltzer/Mummies/RAW/2015-05-22_SequencingRun462/Sample_JK2968/JK2968_TGAAGGTCAGCAGA_L001_R1_001.fastq.gz /home/peltzer/palshare/peltzer/Mummies/RAW/2015-05-22_SequencingRun462/Sample_JK2968/JK2968_TGAAGGTCAGCAGA_L001_R2_001.fastq.gz
################
#Picked up JAVA_TOOL_OPTIONS: -Djava.io.tmpdir=/home/peltzer/palshare/peltzer/2017-05-18_Thesis_Runs_Peltzer/EAGER_Evaluation/Mummies_WGS_Screening/Sample_JK2968/0-FastQC/.tmp
Skipping '' which didn't exist, or couldn't be read
Started analysis of JK2968_TGAAGGTCAGCAGA_L001_R1_001.fastq.gz
Approx 5% complete for JK2968_TGAAGGTCAGCAGA_L001_R1_001.fastq.gz
Approx 10% complete for JK2968_TGAAGGTCAGCAGA_L001_R1_001.fastq.gz
Approx 15% complete for JK2968_TGAAGGTCAGCAGA_L001_R1_001.fastq.gz
Approx 20% complete for JK2968_TGAAGGTCAGCAGA_L001_R1_001.fastq.gz
Approx 25% complete for JK2968_TGAAGGTCAGCAGA_L001_R1_001.fastq.gz
Additions
Specific tools for aDNA analysis
- Designed specifically for samples with low amounts of aDNA
- Keep as much data as possible
Clip & Merge
- Most aDNA projects work with paired-end (PE) data with negative insert size (fragments shorter than the two reads combined, so the reads overlap)
- Problem: overestimation of coverage in the overlap region
- Solution: merge these reads, improving base qualities in the overlap region
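A minimal Python sketch of the merging idea, not the actual Clip & Merge implementation. It assumes adapters are already clipped, that qualities are lists of Phred scores, and that the higher-quality base wins in the overlap. The function names and the `min_overlap` / `max_mismatches` parameters are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_pair(fwd_seq, fwd_qual, rev_seq, rev_qual,
               min_overlap=10, max_mismatches=1):
    """Merge an overlapping read pair into one read.

    Returns (sequence, qualities), or None if no acceptable overlap exists.
    Illustrative only: thresholds and the quality rule are assumptions.
    """
    # Bring the reverse read onto the forward strand.
    rc_seq = reverse_complement(rev_seq)
    rc_qual = rev_qual[::-1]

    # Try the longest overlap first: end of forward read vs. start of reverse read.
    for overlap in range(min(len(fwd_seq), len(rc_seq)), min_overlap - 1, -1):
        mismatches = sum(a != b for a, b in zip(fwd_seq[-overlap:], rc_seq[:overlap]))
        if mismatches <= max_mismatches:
            break
    else:
        return None  # no overlap found, the pair stays unmerged

    start = len(fwd_seq) - overlap
    seq = list(fwd_seq[:start])
    qual = list(fwd_qual[:start])

    # In the overlap, keep whichever base call has the higher quality.
    for i in range(overlap):
        if fwd_qual[start + i] >= rc_qual[i]:
            seq.append(fwd_seq[start + i])
            qual.append(fwd_qual[start + i])
        else:
            seq.append(rc_seq[i])
            qual.append(rc_qual[i])

    # Append the part of the reverse read that extends past the forward read.
    seq.extend(rc_seq[overlap:])
    qual.extend(rc_qual[overlap:])
    return "".join(seq), qual
```

Each position in the overlap is then counted once instead of twice, which is what avoids the inflated coverage.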
CircularMapper
- For bacterial data (or mtDNA)
- Improves results on circular genomes
- "Extend & Split" Approach
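One way to picture "Extend & Split", as a sketch under assumptions rather than CircularMapper's actual code: the first bases of the circular reference are appended to its end so that reads spanning the junction can align, and alignments are then mapped back to original coordinates. Both function names are hypothetical, positions are 0-based, and alignments are assumed to be no longer than the reference.

```python
def extend_reference(sequence, extension):
    """Append the first `extension` bases to the end of a circular reference."""
    if not 0 < extension <= len(sequence):
        raise ValueError("extension must be between 1 and the reference length")
    return sequence + sequence[:extension]


def wrap_alignment(start, length, reference_length):
    """Map an alignment on the extended reference back to the original one.

    Returns a list of (start, length) segments: one segment for an ordinary
    alignment, two when the alignment spans the end/start junction.
    """
    start %= reference_length
    end = start + length
    if end <= reference_length:
        return [(start, length)]
    first = reference_length - start  # bases before the junction
    return [(start, first), (0, length - first)]


reference = "ACGTTGCA"
extended = extend_reference(reference, 3)
read = "CAAC"  # spans the junction of the circular reference
position = extended.find(read)  # stand-in for a real mapper
segments = wrap_alignment(position, len(read), len(reference))
```

The read cannot be placed on the linear reference, but it aligns on the extended one and is split into two segments at the junction.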
DeDup
- PCR duplicate removal for merged PE data
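- Problem: Samtools treats merged reads like single-end reads and only compares the 5' mapping position, so distinct fragments of different lengths can be removed as duplicates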
- Solution: Take both 5' and 3' ends into account, check quality information
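The idea can be sketched in Python as follows. This is an illustration, not DeDup's implementation: the `Read` record, its fields, and the use of mean base quality to choose the kept read are assumptions.

```python
from collections import namedtuple

# Hypothetical minimal record for a mapped, merged read.
Read = namedtuple("Read", "name start end strand mean_quality")


def remove_duplicates(reads):
    """Keep one read per (start, end, strand), preferring the highest quality.

    Reads that share only the 5' position but end at different positions are
    treated as distinct fragments and are all kept.
    """
    best = {}
    for read in reads:
        key = (read.start, read.end, read.strand)
        kept = best.get(key)
        if kept is None or read.mean_quality > kept.mean_quality:
            best[key] = read
    return sorted(best.values(), key=lambda r: (r.start, r.end))


reads = [
    Read("a", 100, 150, "+", 30.0),
    Read("b", 100, 150, "+", 35.0),  # same ends as "a": duplicate, better quality
    Read("c", 100, 170, "+", 28.0),  # same 5' end, different 3' end: kept
]
unique = remove_duplicates(reads)
```

A 5'-only comparison would keep just one of the three reads here, whereas using both ends keeps "b" and "c".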
Other features
- Docker support: Pipeline can be installed on (almost) any Linux workstation in <10 minutes
- Documentation!
Documentation
Tutorials
- Online at eager.readthedocs.io
- ePUB/PDF (use your e-reader!)
- Tool explanation, tips, ...
- Typical use cases
- Examples to get familiar with the pipeline
Tutorial Videos
- Setup tutorials (for other labs...)
Thanks for listening
(let's get some hands-on experience)
Standard Slides for small EAGER basic talk.