Data analysis in paleogenetics with EAGER

Alexander Peltzer, June 27th 2017

Motivation

  • Large amounts of aDNA data generated by next-generation sequencing (NGS) technologies
  • Analysis rather difficult due to:
    • Size of datasets
    • Complexity of the datasets (sequencing errors, deamination)
    • Contamination of sequenced samples and libraries

Size of data

(NextSeq 500, HiSeq 2000 are not even listed here anymore!)

Image by Illumina Inc. (c)

Complexity

How do you know what's a variant and what's an error?

Contamination

Contamination estimation is a key component for aDNA projects!

Renaud et al. 2015

Motivation

  • Only a few aDNA workflows/pipelines available
    • Mostly bash/perl/python scripts, difficult to apply
    • Tools tailored to older methods: applications built for Sanger sequencing won't work with NGS data
  • Paleomix (Schubert et al. 2014) is one of the few exceptions

EAGER Pipeline overview

EAGER: Focus

  • Automated workflow
  • Standard operating procedures for aDNA analysis projects
  • Usable and installable for non-bioinformatics experts

EAGER Features

  • Raw read processing, quality assessment of NGS data
  • Mapping methods (BWA, BWAmem, Bowtie2, Stampy)
  • Authentication (mapDamage, DamageProfiler)
  • Variant Calling & Filtering (angsd, GATK, ...)
  • Graphical user interface!

EAGER GUI

EAGER Features

  • Multiple sample mode: Apply the same settings to multiple files
  • ReportTable: Provides reports of analysis runs
  • Statistics: Quality control, SNP calling statistics, mapping results

Fail safe

  • Logfiles with errors, caveats
  • Tracked versions of tools (and EAGER)
  • Reproducible - you can check in your log file what was executed!
  • Restart on error - if something breaks, restart it - EAGER picks up where it left off!

# EAGER Version used for this run: 1.92.21
################
#CreateResultsDirectories at 2017-05-19T18:26:11.228 was executed with the following commandline:
mkdir -p /home/peltzer/palshare/peltzer/2017-05-18_Thesis_Runs_Peltzer/EAGER_Evaluation/Mummies_WGS_Screening/Sample_JK2968/0-FastQC/.tmp /home/peltzer/palshare/peltzer/2017-05-18_Thesis_Runs_Peltzer/EAGER_Evaluation/Mummies_WGS_Screening/Sample_JK2968/1-AdapClip/.tmp /home/peltzer/palshare/peltzer/2017-05-18_Thesis_Runs_Peltzer/EAGER_Evaluation/Mummies_WGS_Screening/Sample_JK2968/3-Mapper/.tmp /home/peltzer/palshare/peltzer/2017-05-18_Thesis_Runs_Peltzer/EAGER_Evaluation/Mummies_WGS_Screening/Sample_JK2968/4-Samtools/.tmp /home/peltzer/palshare/peltzer/2017-05-18_Thesis_Runs_Peltzer/EAGER_Evaluation/Mummies_WGS_Screening/Sample_JK2968/5-DeDup/.tmp /home/peltzer/palshare/peltzer/2017-05-18_Thesis_Runs_Peltzer/EAGER_Evaluation/Mummies_WGS_Screening/Sample_JK2968/6-QualiMap/.tmp /home/peltzer/palshare/peltzer/2017-05-18_Thesis_Runs_Peltzer/EAGER_Evaluation/Mummies_WGS_Screening/Sample_JK2968/7-DnaDamage/.tmp /home/peltzer/palshare/peltzer/2017-05-18_Thesis_Runs_Peltzer/EAGER_Evaluation/Mummies_WGS_Screening/Sample_JK2968/8-Preseq/.tmp
################
## Runtime of Module was: 0 seconds.
################
#FastQCdefault at 2017-05-19T18:26:11.29 was executed with the following commandline:
fastqc -o /home/peltzer/palshare/peltzer/2017-05-18_Thesis_Runs_Peltzer/EAGER_Evaluation/Mummies_WGS_Screening/Sample_JK2968/0-FastQC --extract  -f fastq /home/peltzer/palshare/peltzer/Mummies/RAW/2015-05-22_SequencingRun462/Sample_JK2968/JK2968_TGAAGGTCAGCAGA_L001_R1_001.fastq.gz /home/peltzer/palshare/peltzer/Mummies/RAW/2015-05-22_SequencingRun462/Sample_JK2968/JK2968_TGAAGGTCAGCAGA_L001_R2_001.fastq.gz
################
#Picked up JAVA_TOOL_OPTIONS: -Djava.io.tmpdir=/home/peltzer/palshare/peltzer/2017-05-18_Thesis_Runs_Peltzer/EAGER_Evaluation/Mummies_WGS_Screening/Sample_JK2968/0-FastQC/.tmp
Skipping '' which didn't exist, or couldn't be read
Started analysis of JK2968_TGAAGGTCAGCAGA_L001_R1_001.fastq.gz
Approx 5% complete for JK2968_TGAAGGTCAGCAGA_L001_R1_001.fastq.gz
Approx 10% complete for JK2968_TGAAGGTCAGCAGA_L001_R1_001.fastq.gz
Approx 15% complete for JK2968_TGAAGGTCAGCAGA_L001_R1_001.fastq.gz
Approx 20% complete for JK2968_TGAAGGTCAGCAGA_L001_R1_001.fastq.gz
Approx 25% complete for JK2968_TGAAGGTCAGCAGA_L001_R1_001.fastq.gz
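
The "restart on error" idea can be illustrated with a small sketch. This is not EAGER's actual implementation: the function `run_pipeline`, the step names, and the `DONE.<step>` marker files are all hypothetical and only show one common way a pipeline can skip modules that already finished.

```python
from pathlib import Path


def run_pipeline(steps, workdir):
    """Run (name, function) steps in order, skipping steps already marked done.

    Assumption for illustration: a finished step is recorded as an empty
    marker file 'DONE.<name>' inside workdir.
    Returns the names of the steps executed in this call.
    """
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    executed = []
    for name, func in steps:
        marker = workdir / f"DONE.{name}"
        if marker.exists():
            continue  # finished in an earlier run, do not repeat it
        func()  # an exception here aborts the run before the marker is written
        marker.touch()
        executed.append(name)
    return executed
```

If a step raises an error, its marker is never written, so calling `run_pipeline` again with the same `workdir` resumes at exactly that step.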

Additions

Specific tools for aDNA analysis

  • Deal specifically with the low amounts of aDNA in a sample
  • Keep as much data as possible

Clip & Merge

  • Most aDNA projects work with paired-end (PE) data with negative insert size: fragments are shorter than the two reads combined, so the reads overlap
  • Problematic: Overestimation of coverage in the overlap region
  • Merge these reads, improve base qualities in the overlap region
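
A minimal sketch of the merging idea, not the Clip & Merge implementation itself. It assumes adapters are already clipped and that qualities are given as lists of Phred integers; the function names, `min_overlap`, and `max_mismatch_rate` are illustrative choices. In the overlap, the base with the higher quality is kept.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def merge_pair(fwd, fwd_qual, rev, rev_qual, min_overlap=10, max_mismatch_rate=0.05):
    """Merge a forward read and its reverse mate into one sequence.

    Returns (sequence, qualities), or None if no acceptable overlap is found.
    """
    rc = reverse_complement(rev)  # bring the mate onto the forward strand
    rc_qual = rev_qual[::-1]
    # Try the longest possible overlap first: suffix of fwd vs. prefix of rc.
    for overlap in range(min(len(fwd), len(rc)), min_overlap - 1, -1):
        a, b = fwd[-overlap:], rc[:overlap]
        mismatches = sum(x != y for x, y in zip(a, b))
        if mismatches > max_mismatch_rate * overlap:
            continue
        start = len(fwd) - overlap
        seq = list(fwd[:start])
        qual = list(fwd_qual[:start])
        for i in range(overlap):
            qa, qb = fwd_qual[start + i], rc_qual[i]
            seq.append(a[i] if qa >= qb else b[i])  # keep the better-supported base
            qual.append(max(qa, qb))
        seq.extend(rc[overlap:])
        qual.extend(rc_qual[overlap:])
        return "".join(seq), qual
    return None
```

Each position of the fragment then appears exactly once in the merged read, which avoids counting the overlap region twice when computing coverage.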

CircularMapper

  • For bacterial data (or mtDNA)
  • Improves mapping at the ends of the linearized reference of circular genomes
  • "Extend & Split" approach
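
The "Extend & Split" idea can be sketched as follows. This is a simplified illustration under stated assumptions, not CircularMapper's code: coordinates are 0-based, `k` is the extension length, and both function names are made up for this example.

```python
def extend_reference(ref, k):
    """Extend: append the first k bases to the end of the linearized reference."""
    return ref + ref[:k]


def to_circular_coords(start, read_len, genome_len):
    """Split: map an alignment on the extended reference back to the original.

    Returns a list of (start, length) segments on the original reference:
    one segment for ordinary reads, two for reads spanning the junction.
    """
    start = start % genome_len  # alignments starting in the extension wrap around
    end = start + read_len
    if end <= genome_len:
        return [(start, read_len)]
    return [(start, genome_len - start), (0, end - genome_len)]
```

A read that crosses the artificial break point of the circular genome cannot align to the plain reference, but it can align to the extended one and is then reported as two segments on the original coordinates.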

DeDup

  • PCR duplicate removal for merged PE data
  • Problem: Samtools rmdup treats merged reads like single-end reads and only compares the 5' start position
  • Solution: Take both 5' and 3' ends into account, check quality information
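
A minimal sketch of duplicate removal that uses both ends, assuming merged reads only. It is not the DeDup implementation: the `Read` tuple, its fields, and the rule of keeping the read with the highest summed base quality are assumptions made for illustration.

```python
from collections import namedtuple

# Hypothetical minimal read record: mapped start/end, strand, Phred qualities.
Read = namedtuple("Read", "name start end strand quals")


def remove_duplicates(reads):
    """Keep one read per (start, end, strand), preferring the best quality."""
    best = {}
    for read in reads:
        key = (read.start, read.end, read.strand)
        kept = best.get(key)
        # Reads sharing a key have the same length, so sums are comparable.
        if kept is None or sum(read.quals) > sum(kept.quals):
            best[key] = read
    return sorted(best.values(), key=lambda r: (r.start, r.end))
```

Two reads that share a start position but end at different positions are kept as distinct molecules here, whereas a comparison of start positions alone would discard one of them.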

Other features

  • Docker support: Pipeline can be installed on (almost) any Linux workstation in <10 minutes
  • Documentation!

Documentation

Tutorials

  • Typical use cases
  • Examples to get familiar with the pipeline

Tutorial Videos

  • Setup tutorials (for other labs...)

Thanks for listening

(let's get some hands-on experience)