Data analysis in paleogenetics with EAGER
Alexander Peltzer, June 27th 2017
- Large amounts of aDNA data are generated with next-generation sequencing (NGS) technologies
- Analysis rather difficult due to:
  - Size of datasets
  - Complexity of the datasets (sequencing errors, deamination)
  - Contamination of sequenced samples and libraries
Size of data
(NextSeq 500, HiSeq 2000 are not even listed here anymore!)
How do you know what's a variant and what's an error?
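One way to approach this question is to look at where mismatches fall: deamination typically shows up as an excess of C to T substitutions near the 5' end of reads. The sketch below is a minimal illustration, assuming gap-free (reference, read) pairs that are already oriented 5' to 3'. The function name `c_to_t_frequency` and the toy sequences are hypothetical, and this is not how mapDamage or DamageProfiler are implemented.

```python
def c_to_t_frequency(alignments, positions=5):
    """Per-position C->T frequency over the first `positions` bases.

    `alignments` is an iterable of (reference, read) string pairs of
    gap-free alignments, both written 5' to 3' (an assumption of this sketch).
    """
    c_total = [0] * positions   # reference C observed at this position
    c_to_t = [0] * positions    # ... of which the read shows a T
    for ref, read in alignments:
        for i in range(min(positions, len(ref), len(read))):
            if ref[i] == "C":
                c_total[i] += 1
                if read[i] == "T":
                    c_to_t[i] += 1
    # Positions without any reference C report 0.0 instead of dividing by zero.
    return [ct / total if total else 0.0 for ct, total in zip(c_to_t, c_total)]


# Toy data (hypothetical): two of three reads carry a C->T at the first base.
toy_alignments = [
    ("CACGT", "TACGT"),
    ("CCGTA", "TCGTA"),
    ("CAGTC", "CAGTC"),
]
print(c_to_t_frequency(toy_alignments, positions=3))
```

A frequency that is elevated at the first positions and drops towards the read interior points to damage, whereas mismatches spread evenly along the read look more like sequencing errors or true variants.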
Contamination estimation is a key component for aDNA projects!
- Only a few aDNA workflows/pipelines are available
- Mostly Bash/Perl/Python scripts that are difficult to apply
- Many tools are tailored to older methods: applications built for Sanger sequencing do not work with NGS data
- PALEOMIX (Schubert et al. 2014) is one of the few exceptions
EAGER Pipeline overview
EAGER: Focus
- Automated workflow
- Standard operating procedures for aDNA analysis projects
- Usable and installable for non-bioinformatics experts
EAGER Features
- Raw read processing, quality assessment of NGS data
- Mapping methods (BWA, BWA-MEM, Bowtie2, Stampy)
- Authentication (mapDamage, DamageProfiler)
- Variant Calling & Filtering (ANGSD, GATK, ...)
- Graphical user interface!
EAGER Features (continued)
- Multiple sample mode: Execute same settings on multiple files
- ReportTable: Provide reports of analysis runs
- Statistics: Quality control, SNP Calling statistics, mapping results
Fail safe
- Logfiles with errors, caveats
- Tracked versions of tools (and EAGER)
- Reproducible: you can check in the log file what happened!
- Restart on error: if something breaks, restart the run and EAGER picks up where it left off!
Specific tools for aDNA analysis
- Designed specifically for low amounts of aDNA
- Keep as much data as possible
Clip & Merge
- Most aDNA projects work with paired-end (PE) data with a negative insert size, i.e. the two reads of a pair overlap
- Problem: coverage is overestimated in the overlap region
- Solution: merge the two reads into one and improve base qualities in the overlap region
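The merging idea can be sketched as follows. This is a simplified illustration, not the pipeline's actual merging code: it assumes adapters are already clipped, qualities are given as lists of Phred integers, the longest acceptable overlap wins, and the overlap keeps the higher-quality base (real tools may combine qualities differently). The names `merge_pair`, `min_overlap` and `max_mismatches` are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def merge_pair(fwd, fwd_qual, rev, rev_qual, min_overlap=10, max_mismatches=1):
    """Merge an overlapping read pair into one (sequence, qualities) tuple.

    Returns None if no overlap of at least `min_overlap` bases is found.
    """
    # Bring the reverse read onto the same strand as the forward read.
    rc = reverse_complement(rev)
    rc_qual = rev_qual[::-1]
    # Try the longest overlap first: suffix of fwd against prefix of rc.
    for overlap in range(min(len(fwd), len(rc)), min_overlap - 1, -1):
        a, b = fwd[-overlap:], rc[:overlap]
        if sum(x != y for x, y in zip(a, b)) > max_mismatches:
            continue
        head = len(fwd) - overlap
        seq = list(fwd[:head])
        qual = list(fwd_qual[:head])
        for i in range(overlap):
            qa, qb = fwd_qual[head + i], rc_qual[i]
            # Keep the base with the higher quality (ties go to the forward read).
            if qa >= qb:
                seq.append(a[i])
                qual.append(qa)
            else:
                seq.append(b[i])
                qual.append(qb)
        seq.extend(rc[overlap:])
        qual.extend(rc_qual[overlap:])
        return "".join(seq), qual
    return None


# Toy example (hypothetical): a 20 bp fragment sequenced with two 15 bp reads.
fragment = "ACGTTGCAAGGCTTAACCGG"
forward = fragment[:15]
reverse = reverse_complement(fragment[5:])
merged = merge_pair(forward, [30] * 15, reverse, [20] * 15)
print(merged)
```

After merging, each position of the fragment is represented once, so the overlap region no longer counts twice towards coverage.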
CircularMapper
- For bacterial data (or mtDNA)
- Improves results on circular genomes
- "Extend & Split" Approach
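A rough sketch of the "Extend & Split" idea, under stated assumptions: the reference is extended by copying bases from its start to its end so that reads spanning the origin can align, and alignments on the extended reference are then split back onto the original coordinates. The function names and the use of plain string search in place of a real mapper are hypothetical simplifications, and coordinates are 0-based.

```python
def extend_reference(seq, extension):
    """Append the first `extension` bases of a circular reference to its end."""
    return seq + seq[:extension]


def split_alignment(start, read_length, genome_length):
    """Map an alignment on the extended reference back to the original.

    Returns a list of (start, length) pieces in original coordinates.
    """
    end = start + read_length
    if start >= genome_length:
        # Alignment lies entirely in the extension: shift it back to the start.
        return [(start - genome_length, read_length)]
    if end <= genome_length:
        # Alignment does not touch the origin: nothing to do.
        return [(start, read_length)]
    # Alignment spans the origin: one piece at the end, one at the start.
    first = genome_length - start
    return [(start, first), (0, read_length - first)]


# Toy example (hypothetical): a read that spans the origin of a 10 bp genome.
genome = "ACGTACGGTT"
read = "TTACG"  # last 2 bases of the genome followed by its first 3
extended = extend_reference(genome, len(read) - 1)
hit = extended.find(read)  # stand-in for a real read mapper
print(genome.find(read), hit, split_alignment(hit, len(read), len(genome)))
```

On the linear reference the read has no full-length match, while on the extended reference it maps and can be split into its two parts.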
DeDup
- PCR duplicate removal for merged PE data
- Problem: Samtools treats such data incorrectly
- Solution: Take both 5' and 3' ends into account, check quality information
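The rule above can be sketched like this: a merged read covers the whole fragment, so two reads count as duplicates only if both their start (5') and end (3') coordinates agree, and the copy with the best quality is kept. This is a minimal sketch, not the tool's actual implementation; the `Read` record, its `mean_quality` field and the function name are assumptions made for illustration.

```python
from collections import namedtuple

# Hypothetical minimal read record; a real tool would work on BAM records.
Read = namedtuple("Read", "name chrom start end mean_quality")


def remove_duplicates(reads):
    """Keep one read per (chrom, start, end), preferring the highest quality."""
    best = {}
    for read in reads:
        key = (read.chrom, read.start, read.end)  # both ends define a duplicate
        kept = best.get(key)
        if kept is None or read.mean_quality > kept.mean_quality:
            best[key] = read
    return sorted(best.values(), key=lambda r: (r.chrom, r.start, r.end))


# Toy data (hypothetical): r1 and r2 share both ends, r3 shares only the start.
reads = [
    Read("r1", "chrM", 100, 150, 30),
    Read("r2", "chrM", 100, 150, 35),
    Read("r3", "chrM", 100, 180, 25),
]
for read in remove_duplicates(reads):
    print(read.name, read.start, read.end)
```

A comparison that looked only at the start coordinate would also discard `r3`, even though its different end shows that it comes from a different fragment, which is the kind of data loss the "keep as much data as possible" goal tries to avoid.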
Other features
- Docker support: Pipeline can be installed on (almost) any Linux workstation in <10 minutes
- Documentation!
- Online at
- ePUB/PDF (use your e-reader!)
- Tool explanation, tips, ...
- Typical use cases
- Examples to get familiar with the pipeline
Tutorial Videos
- Setup tutorials (for other labs...)
Thanks for listening
(let's get some hands-on experience)
