Data analysis in paleogenetics with EAGER

Alexander Peltzer, June 27th 2017

Motivation

  • Large amounts of aDNA data generated with next-generation sequencing (NGS) technologies
  • Analysis is difficult due to:
    • Size of datasets
    • Complexity of the datasets (sequencing errors, deamination)
    • Contamination of sequenced samples and libraries

Size of data

(NextSeq 500, HiSeq 2000 are not even listed here anymore!)

Image by Illumina Inc. (c)

Complexity

How do you know what's a variant and what's an error?

Contamination

Contamination estimation is a key component for aDNA projects!

Renaud et al. 2015

Motivation

  • Only a few aDNA workflows/pipelines available
    • Mostly bash/perl/python scripts that are difficult to apply
    • Tools tailored to older methods: applications built for Sanger sequencing won't work with NGS data
  • Paleomix (Schubert et al. 2014) is one of the few exceptions

 

EAGER Pipeline overview 

EAGER: Focus

  • Automated workflow
  • Standard operating procedures for aDNA analysis projects
  • Usable and installable for non-bioinformatics experts

EAGER Features

  • Raw read processing, quality assessment of NGS data
  • Mapping methods (BWA, BWA-MEM, Bowtie2, Stampy)
  • Authentication (mapDamage, DamageProfiler)
  • Variant calling & filtering (ANGSD, GATK, ...)
  • Graphical user interface!

 

EAGER GUI

EAGER Features

  • Multiple sample mode: Execute same settings on multiple files
  • ReportTable: Provide reports of analysis runs
  • Statistics: Quality control, SNP Calling statistics, mapping results

 

Fail safe

  • Logfiles with errors, caveats
  • Tracked versions of tools (and EAGER)
  • Reproducible - you can check in your log file what happened!
  • Restart on error - if something breaks, restart it - EAGER picks up where it left off!

# EAGER Version used for this run: 1.92.21
################
#CreateResultsDirectories at 2017-05-19T18:26:11.228 was executed with the following commandline:
mkdir -p /home/peltzer/palshare/peltzer/2017-05-18_Thesis_Runs_Peltzer/EAGER_Evaluation/Mummies_WGS_Screening/Sample_JK2968/0-FastQC/.tmp /home/peltzer/palshare/peltzer/2017-05-18_Thesis_Runs_Peltzer/EAGER_Evaluation/Mummies_WGS_Screening/Sample_JK2968/1-AdapClip/.tmp /home/peltzer/palshare/peltzer/2017-05-18_Thesis_Runs_Peltzer/EAGER_Evaluation/Mummies_WGS_Screening/Sample_JK2968/3-Mapper/.tmp /home/peltzer/palshare/peltzer/2017-05-18_Thesis_Runs_Peltzer/EAGER_Evaluation/Mummies_WGS_Screening/Sample_JK2968/4-Samtools/.tmp /home/peltzer/palshare/peltzer/2017-05-18_Thesis_Runs_Peltzer/EAGER_Evaluation/Mummies_WGS_Screening/Sample_JK2968/5-DeDup/.tmp /home/peltzer/palshare/peltzer/2017-05-18_Thesis_Runs_Peltzer/EAGER_Evaluation/Mummies_WGS_Screening/Sample_JK2968/6-QualiMap/.tmp /home/peltzer/palshare/peltzer/2017-05-18_Thesis_Runs_Peltzer/EAGER_Evaluation/Mummies_WGS_Screening/Sample_JK2968/7-DnaDamage/.tmp /home/peltzer/palshare/peltzer/2017-05-18_Thesis_Runs_Peltzer/EAGER_Evaluation/Mummies_WGS_Screening/Sample_JK2968/8-Preseq/.tmp
################
## Runtime of Module was: 0 seconds.
################
#FastQCdefault at 2017-05-19T18:26:11.29 was executed with the following commandline:
fastqc -o /home/peltzer/palshare/peltzer/2017-05-18_Thesis_Runs_Peltzer/EAGER_Evaluation/Mummies_WGS_Screening/Sample_JK2968/0-FastQC --extract  -f fastq /home/peltzer/palshare/peltzer/Mummies/RAW/2015-05-22_SequencingRun462/Sample_JK2968/JK2968_TGAAGGTCAGCAGA_L001_R1_001.fastq.gz /home/peltzer/palshare/peltzer/Mummies/RAW/2015-05-22_SequencingRun462/Sample_JK2968/JK2968_TGAAGGTCAGCAGA_L001_R2_001.fastq.gz
################
#Picked up JAVA_TOOL_OPTIONS: -Djava.io.tmpdir=/home/peltzer/palshare/peltzer/2017-05-18_Thesis_Runs_Peltzer/EAGER_Evaluation/Mummies_WGS_Screening/Sample_JK2968/0-FastQC/.tmp
Skipping '' which didn't exist, or couldn't be read
Started analysis of JK2968_TGAAGGTCAGCAGA_L001_R1_001.fastq.gz
Approx 5% complete for JK2968_TGAAGGTCAGCAGA_L001_R1_001.fastq.gz
Approx 10% complete for JK2968_TGAAGGTCAGCAGA_L001_R1_001.fastq.gz
Approx 15% complete for JK2968_TGAAGGTCAGCAGA_L001_R1_001.fastq.gz
Approx 20% complete for JK2968_TGAAGGTCAGCAGA_L001_R1_001.fastq.gz
Approx 25% complete for JK2968_TGAAGGTCAGCAGA_L001_R1_001.fastq.gz
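
Checking such a log file by hand gets tedious for many samples. A minimal sketch of how the module entries could be pulled out with Python, assuming only the line formats visible in the excerpt above (the "#<Module> at <timestamp> was executed ..." header and the "## Runtime of Module was: ..." line). The function name parse_eager_log and the returned dictionary keys are made up for this illustration and are not part of EAGER:

```python
import re

# Patterns inferred from the log excerpt only; other EAGER versions may differ.
HEADER = re.compile(r"^#(\S+) at (\S+) was executed with the following commandline:")
RUNTIME = re.compile(r"^## Runtime of Module was: (\d+) seconds\.")


def parse_eager_log(text):
    """Return one entry per module header found in the log text.

    The runtime is attached to the most recent module header and stays
    None if no runtime line follows (for example after a crash).
    """
    modules = []
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            modules.append({
                "module": header.group(1),
                "started": header.group(2),  # kept as a string on purpose
                "runtime_seconds": None,
            })
            continue
        runtime = RUNTIME.match(line)
        if runtime and modules:
            modules[-1]["runtime_seconds"] = int(runtime.group(1))
    return modules
```

A module without a runtime entry is a candidate for where a run stopped; whether EAGER itself uses the log this way on restart is not stated here.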

Additions

Specific tools for aDNA analysis

  • Dealing with low levels of aDNA specifically
  • Keep as much data as possible

Clip & Merge

  • Most aDNA projects work with paired-end (PE) data with negative insert size (forward and reverse reads overlap)
  • Problematic: Overestimation of coverage in the overlap region
  • Merge these reads together, improve qualities in overlap regions
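
The merging idea can be sketched in a few lines of Python. This is a simplified illustration, not the Clip & Merge implementation: it assumes adapters are already clipped, looks for the longest suffix/prefix overlap between the forward read and the reverse-complemented reverse read, and keeps the base with the higher quality at each overlapping position. The names merge_pair, min_overlap and max_mismatch_fraction are assumptions made for this example:

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def merge_pair(fwd, fwd_qual, rev, rev_qual,
               min_overlap=10, max_mismatch_fraction=0.05):
    """Merge an overlapping read pair into one read.

    fwd/rev are the reads as sequenced, *_qual are lists of Phred scores.
    Returns (sequence, qualities) or None if no acceptable overlap exists.
    """
    rev_rc = reverse_complement(rev)
    rev_rc_qual = rev_qual[::-1]
    longest = min(len(fwd), len(rev_rc))
    # Try the longest possible overlap first, then shrink.
    for overlap in range(longest, max(min_overlap, 1) - 1, -1):
        a, b = fwd[-overlap:], rev_rc[:overlap]
        mismatches = sum(1 for x, y in zip(a, b) if x != y)
        if mismatches > int(overlap * max_mismatch_fraction):
            continue
        offset = len(fwd) - overlap
        seq = list(fwd[:offset])
        qual = list(fwd_qual[:offset])
        for i in range(overlap):
            fq, rq = fwd_qual[offset + i], rev_rc_qual[i]
            # Keep the better-supported base; ties go to the forward read.
            if fq >= rq:
                seq.append(a[i])
                qual.append(fq)
            else:
                seq.append(b[i])
                qual.append(rq)
        seq.extend(rev_rc[overlap:])
        qual.extend(rev_rc_qual[overlap:])
        return "".join(seq), qual
    return None
```

Each overlapping position ends up in the merged read exactly once, so the overlap region is no longer counted twice when coverage is computed.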

CircularMapper

  • For bacterial data (or mtDNA)
  • Improves results on circular genomes
  • "Extend & Split" Approach
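
A rough sketch of the "Extend & Split" idea in Python, under the assumption that "extend" means appending the first k bases of the reference to its end before mapping, and "split" means projecting alignments that reach into the appended part back onto the original coordinates. The function names and the 0-based coordinate convention are chosen for this illustration and do not reflect CircularMapper's actual interface:

```python
def extend_reference(sequence, k):
    """Append the first k bases to the end so reads can span the origin."""
    return sequence + sequence[:k]


def project_to_circular(pos, read_len, genome_len):
    """Map an alignment on the extended reference back to the circular genome.

    pos is the 0-based start on the extended reference. Returns a list of
    (start, length) segments in original coordinates: one segment for an
    ordinary read, two for a read that spans the origin.
    """
    if pos >= genome_len:
        # Read lies completely in the appended copy of the genome start.
        return [(pos - genome_len, read_len)]
    if pos + read_len > genome_len:
        # Read crosses the origin: split it at the end of the genome.
        first = genome_len - pos
        return [(pos, first), (0, read_len - first)]
    return [(pos, read_len)]


genome = "ACGTACGTAA"
extended = extend_reference(genome, 4)
read = "AAAC"  # last two bases of the genome followed by its first two
pos_linear = genome.find(read)      # not found on the linear reference
pos_extended = extended.find(read)  # found on the extended reference
segments = project_to_circular(pos_extended, len(read), len(genome))
```

In this toy case the read cannot be placed on the linear reference at all, while the extended reference places it and the projection yields two segments on either side of the origin.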

DeDup

  • PCR duplicate removal for merged PE data
  • Problem: Samtools rmdup handles merged reads like single-end reads and only considers their start position
  • Solution: Take both 5' and 3' ends into account, check quality information
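
The solution above can be sketched as follows. This is a simplified illustration rather than DeDup's code: reads are plain records, strand is ignored, and "quality information" is reduced to a single mean quality per read. The Read class and remove_duplicates function are hypothetical names for this example:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Read:
    name: str
    ref: str
    start: int           # 5' end of the merged read on the reference
    end: int             # 3' end of the merged read on the reference
    mean_quality: float


def remove_duplicates(reads):
    """Keep one read per (ref, start, end), preferring the best quality.

    Reads that share a start but differ in their end are treated as
    distinct molecules and are all kept.
    """
    best = {}
    for read in reads:
        key = (read.ref, read.start, read.end)
        kept = best.get(key)
        if kept is None or read.mean_quality > kept.mean_quality:
            best[key] = read
    return sorted(best.values(), key=lambda r: (r.ref, r.start, r.end))


reads = [
    Read("a", "chrM", 100, 150, 30.0),
    Read("b", "chrM", 100, 150, 35.0),  # duplicate of "a", better quality
    Read("c", "chrM", 100, 160, 20.0),  # same start, different end: kept
]
unique = remove_duplicates(reads)
```

Using both ends as the key is what keeps read "c" in the output; a start-only comparison would collapse all three reads into one.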

Other features

  • Docker support: Pipeline can be installed on (almost) any Linux workstation in <10 minutes
  • Documentation!

Documentation

Tutorials

  • Typical use cases
  • Examples to get familiar with the pipeline

Tutorial Videos

  • Setup tutorials (for other labs...)

Thanks for listening

(let's get some hands-on experience)

Standard Slides for small EAGER basic talk.
