Varlociraptor: Towards a unified theory of variant calling
Johannes Köster
2021
small
structural
SNVs, Indels
Inversions, Duplications, Indels, ...
?
?
?
?
somatic
germline
Variant calling
small
structural
?
?
somatic
germline
Variant calling
tumor/normal
sample, cohort, pedigree
?
?
small
structural
germline
somatic
GATK
Freebayes
Platypus
Pindel
Mutect
Strelka
Octopus
Lancet
Variant calling
Delly
Manta
Gridss
##fileformat=VCFv4.3
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60
20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4
##fileformat=VCFv4.3
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:48:4:51,51 0/0:61:2
20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/2:17:2 1/1:40:3
⊖
Problem:
accumulation of errors
More complex: so far...
small
structural
germline
somatic
Delly
GATK
Freebayes
Platypus
Pindel
Mutect
Strelka
Lancet
Manta
Octopus
arbitrary
Gridss
Approach
Given:
- aligned NGS reads
- candidate variants
Find:
- classification of variants into events (somatic, germline, ...)
- unbiased allele frequency estimates
- while controlling FDR
Model
sample a
sample b
...
contaminates
Model
- insert size
- read sequence
- strand
- read orientation
- ...
Latent variables:
- allele frequency
- fragment from variant allele
- fragment correctly mapped
- biases/artifacts
Observed variables:
Model
\(\xi_i \sim \text{Bernoulli}(\alpha \theta_c \tau + (1-\alpha) \theta_h \tau)\)
\(\omega_i \sim Bernoulli(\pi_i)\)
\(Z_i \mid \xi_i, \omega_i=1,\beta,\delta \sim\)
\(\beta, \delta\)
- allele frequency
- contamination
- sampling bias
- typing uncertainty
- biases/artifacts (strand, orientation, softclip, homopolymer, ...)
- mapping uncertainty
Model
\(\Pr(Z_i \mid \theta_c,\theta_h,\beta) =\)
⊕ Bayes' theorem
= posterior probability for events
Bayesian FDR control
Variant calling grammar
species:
heterozygosity: 0.001
ploidy:
male:
all: 2
X: 1
Y: 1
female:
all: 2
X: 2
Y: 0
samples:
cersei:
sex: female
events:
het: "cersei:0.5"
hom: "cersei:1.0"
image: gameofthrones.fandom.com
Variant calling grammar
species:
heterozygosity: 0.001
ploidy:
male:
all: 2
X: 1
Y: 1
female:
all: 2
X: 2
Y: 0
samples:
cersei:
sex: female
jamie:
sex: female
joffrey:
sex: male
inheritance:
mendelian:
mother: cersei
father: jamie
events:
jamie: "jamie:0.5 & jamie:1.0"
cersei: "cersei:0.5 & cersei:1.0"
denovo_joffrey: "(joffrey:0.5 | joffrey:1.0) & !$cersei & !$jamie"
images: gameofthrones.fandom.com
Variant calling grammar
species:
heterozygosity: 0.001
ploidy:
male:
all: 2
X: 1
Y: 1
female:
all: 2
X: 2
Y: 0
samples:
cersei:
sex: female
somatic-effective-mutation-rate: 1e-10
tumor:
inheritance:
clonal:
from: cersei
contamination:
by: cersei
fraction: 0.1
somatic-effective-mutation-rate: 1e-6
events:
germline: "cersei:0.5 | cersei:1.0"
somatic: "cersei:]0.0,0.5["
somatic_tumor_low: "cersei:0.0 & tumor:]0.0,0.1["
somatic_tumor_high: "cersei:0.0 & tumor:[0.1,1.0]"
images: gameofthrones.fandom.com, 123rf.com
Variant calling grammar
samples:
cersei:
sex: female
somatic-effective-mutation-rate: 1e-10
tumor:
inheritance:
clonal:
from: cersei
contamination:
by: cersei
fraction: 0.1
somatic-effective-mutation-rate: 1e-6
relapse:
inheritance:
clonal:
from: cersei
contamination:
by: cersei
fraction: 0.2
somatic-effective-mutation-rate: 1e-6
expressions:
somatic_tumor: "cersei:0.0 & tumor:]0.0,1.0]"
events:
germline: "cersei:0.5 | cersei:1.0"
somatic: "cersei:]0.0,0.5["
somatic_tumor_no_increase: "$somatic_tumor & l2fc(relapse,tumor) < 1"
somatic_tumor_increase: "$somatic_tumor & l2fc(relapse,tumor) >= 1"
somatic_relapse: "cersei:0.0 & tumor:0.0 & relapse:]0.0,1.0]"
images: gameofthrones.fandom.com, 123rf.com
Variant calling grammar
species:
heterozygosity: 0.001
ploidy:
male:
all: 2
X: 1
Y: 1
female:
all: 2
X: 2
Y: 0
samples:
cersei_dna_illumina:
sex: female
cersei_dna_nanopore:
sex: female
inheritance:
clonal:
from: cersei
cersei_rna_illumina:
universe: [0.0,1.0]
events:
het: "cersei_dna_illumina:0.5 & cersei_dna_nanopore:0.5 & cersei_rna_illumina:]0.0,1.0]"
hom: "cersei_dna_illumina:1.0 & cersei_dna_nanopore:1.0 & cersei_rna_illumina:1.0"
White box model
chr1 1200 . GAAGCCATCTGCCCCATGGGAAAGTGACTTAAT G . .
SVLEN=-32;PROB_SOMATIC_TUMOR=0.110292;PROB_ABSENT=202.181;PROB_GERMLINE_HET=61.7704;
PROB_SOMATIC_NORMAL=16.0075;PROB_GERMLINE_HOM=4346.23;PROB_ARTIFACT=103.395
DP:AF:OBS:SB:ROB
36:0:11Np+<10Np+>8Np-<7Np->:.:.
48:0.102431:14Np+<13Np-<11Np->8Np+>1Vp-<1Vp+>:.:.
- posterior allele frequency estimate
- observations (count, posterior odds, type, strand, read orientation, ...)
- bias/artifact estimates
- event probabilities (PHRED scaled)
https://varlociraptor.github.io
Copy of Towards a unified theory of variant calling
By Johannes Köster
Copy of Towards a unified theory of variant calling
Talk at Genome Informatics 2020
- 894