Varlociraptor:
Towards a unified
theory of variant calling
Johannes Köster
https://koesterlab.github.io
small
structural
SNVs, Indels
Inversions, Duplications, Indels, ...
?
?
?
?
somatic
germline
small
structural
germline
somatic
Delly
GATK
Freebayes
Platypus
Pindel
Mutect
Strelka
Manta
Octopus
Lancet
small
structural
germline
somatic
Delly
GATK
Freebayes
Platypus
Pindel
Mutect
Strelka
Lancet
Manta
Octopus
arbitrary
Model
\(\xi_i \sim \text{Bernoulli}(\alpha \theta_c \tau + (1-\alpha) \theta_h \tau)\)
\(\omega_i \sim Bernoulli(\pi_i)\)
\(Z_i \mid \xi_i, \omega_i=1 \sim\)
\(\beta\)
- allele frequency
- contamination
- sampling bias
- typing uncertainty
- strand bias
- mapping uncertainty
Model
\(\Pr(Z_i \mid \theta_c,\theta_h,\beta) =\)
⊕ Bayes' theorem
= posterior probability for events of interest (combinations of allele frequencies)
Events
Variant calling grammar
samples:
normal:
resolution: 5
universe: "0.0 | 0.5 | 1.0 | ]0.0,0.5["
tumor:
resolution: 100
universe: "[0.0,1.0]"
contamination:
by: normal
fraction: 0.25
events:
germline: "normal:0.5 | normal:1.0"
somatic_tumor: "normal:0.0 & tumor:]0.0,1.0]"
Application: tumor/normal
##fileformat=VCFv4.3
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60
20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4
##fileformat=VCFv4.3
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:48:4:51,51 0/0:61:2
20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/2:17:2 1/1:40:3
⊖
Problem:
accumulation of errors
More complex: so far...
Variant calling grammar
samples:
normal:
resolution: 5
universe: "0.0 | 0.5 | 1.0 | ]0.0,0.5["
tumor:
resolution: 100
universe: "[0.0,1.0]"
contamination:
by: normal
fraction: 0.25
relapse:
resolution: 100
universe: "[0.0,1.0]"
contamination:
by: normal
fraction: 0.53
events:
germline: "normal:0.5 | normal:1.0"
somatic_normal: "normal:]0.0,0.5["
somatic_tumor: "normal:0.0 & tumor:]0.0,1.0]"
somatic_relapse: "normal:0.0 & tumor:0.0 & relapse:]0.0,1.0]"
Application: tumor/normal/relapse
samples:
romeo:
universe:
all: "0.0 | 0.5 | 1.0"
X: "0.0 | 1.0"
Y: "0.0 | 1.0"
juliet:
universe:
all: "0.0 | 0.5 | 1.0"
Y: "0.0"
events:
both_het: "juliet:0.5 & romeo:0.5"
both_hom: "juliet:1.0 & romeo:1.0"
juliet_only: "(juliet:0.5 | juliet:1.0) & romeo:0.0"
romeo_only: "(romeo:0.5 | romeo:1.0) & juliet:0.0"
Variant calling grammar
Application: chromosome ploidy
samples:
tumor:
universe: "[0.0,1.0]"
resolution: 100
events:
ffpe_artifact: "(C>T | G>A) & tumor:]0.0,0.05["
present: "((C>T | G>A) & tumor:]0.05,1.0]) | (!(C>T | G>A) & tumor:]0.0,1.0[)"
Variant calling grammar
Application: FFPE artifacts
Variant calling grammar
Upcoming application: pan-calling
events:
present_in_any: "count(present) > 0"
recurrent: "count(present) >= 2"
Variant calling grammar
Upcoming application: pan-calling
events:
somatic: "somatic_tumor | somatic_relapse"
recurrent: "count(somatic_tumor | somatic_relapse) >= 2"
White box model
https://varlociraptor.github.io
Acknowledgements
Köster lab
David Lähnemann
Jan Forster
Schönhuth lab
Louis Dijkstra (alumn.)
Tobias Marschall (alumn.)
Software
Precision/Recall
Allele frequency estimation
Allele frequency estimation
FDR control
Concordance on real data
Varlociraptor: Towards a unified theory of variant calling
By Johannes Köster
Varlociraptor: Towards a unified theory of variant calling
Talk at WGGC-Meeting
- 1,354