Uncertainty aware and scenario agnostic genomic variant calling
Johannes Köster
Bioinformatics and Computational Oncology
Institute for AI in Medicine (IKIM)
2024
small
structural
SNVs, Indels
Inversions, Duplications, Indels, ...
?
?
?
?
somatic
germline
Variant calling
small
structural
?
?
somatic
germline
Variant calling
tumor/normal
sample, cohort, pedigree
?
?
small
structural
germline
somatic
GATK
Freebayes
Platypus
Pindel
Mutect
Strelka
Octopus
Lancet
Variant calling
Delly
Manta
Gridss
##fileformat=VCFv4.3
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60
20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4
##fileformat=VCFv4.3
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:48:4:51,51 0/0:61:2
20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/2:17:2 1/1:40:3
⊖
Problem:
accumulation of errors
More complex: so far...
small
structural
germline
somatic
Delly
GATK
Freebayes
Platypus
Pindel
Mutect
Strelka
Lancet
Manta
Octopus
arbitrary
Gridss
Varlociraptor:
Generic, scenario, variant type, and technology agnostic variant calling
Johannes Köster, Louis Dijkstra, Tobias Marschall, Alexander Schönhuth.
Varlociraptor: enhancing sensitivity and controlling false discovery rate in somatic indel discovery. Genome Biology 21, 2020
https://varlociraptor.github.io
Idea:
Use per sample latent allele frequency as central readout
(het: 0.5, hom: 1.0, somatic: [0,1], ...)
Allele frequency
as sampling process
red allele: 0.2
naive:
infer most likely true allele frequency from binomial model
but:
the room is dark and we cannot exactly see the colors of the balls
https://varlociraptor.github.io
Varlociraptor model building block
\(\xi_i \sim \text{Bernoulli}(\theta \tau)\)
\(\omega_i \sim Bernoulli(\pi_i)\)
\(Z_i \mid \xi_i, \omega_i=1,\beta,\delta \sim\)
\(\beta, \delta\)
- allele frequency
- sampling bias
- allele uncertainty
- biases/artifacts (strand, orientation, softclip, homopolymer, ...)
- alignment uncertainty
https://varlociraptor.github.io
Short and long read support
https://varlociraptor.github.io
Discovering nucleotide level and structural variants in cancer genome data from second and third generation sequencing technologies. Till Hartmann, PhD thesis 2024
\(\Pr(Z_i \mid \theta_c,\theta_h,\beta) =\)
⊕ prior assumptions
⊕ Bayes' theorem
= posterior probability for events
Bayesian FDR control
Unified latent variable model for variant calling
Variant calling grammar
species:
heterozygosity: 0.001
ploidy:
male:
all: 2
X: 1
Y: 1
female:
all: 2
X: 2
Y: 0
samples:
jane:
sex: female
events:
present: "jane:0.5 | jane:1.0"
https://varlociraptor.github.io
species:
heterozygosity: 0.001
ploidy:
male:
all: 2
X: 1
Y: 1
female:
all: 2
X: 2
Y: 0
samples:
jane:
sex: female
john:
sex: male
james:
sex: male
inheritance:
mendelian:
mother: jane
father: john
events:
john: "john:0.5 | john:1.0"
jane: "jane:0.5 | jane:1.0"
denovo_james: "(james:0.5 | james:1.0) & !$jane & !$john"
Variant calling grammar
https://varlociraptor.github.io
species:
heterozygosity: 0.001
ploidy:
male:
all: 2
X: 1
Y: 1
female:
all: 2
X: 2
Y: 0
samples:
jane:
sex: female
somatic-effective-mutation-rate: 1e-10
tumor:
inheritance:
clonal:
from: jane
contamination:
by: jane
fraction: 0.1
somatic-effective-mutation-rate: 1e-6
events:
germline: "jane:0.5 | jane:1.0"
somatic: "jane:]0.0,0.5["
somatic_tumor_low: "jane:0.0 & tumor:]0.0,0.1["
somatic_tumor_high: "jane:0.0 & tumor:[0.1,1.0]"
Variant calling grammar
https://varlociraptor.github.io
samples:
jane:
sex: female
somatic-effective-mutation-rate: 1e-10
tumor:
inheritance:
clonal:
from: jane
contamination:
by: jane
fraction: 0.1
somatic-effective-mutation-rate: 1e-6
relapse:
inheritance:
clonal:
from: jane
contamination:
by: jane
fraction: 0.2
somatic-effective-mutation-rate: 1e-6
expressions:
somatic_tumor: "jane:0.0 & tumor:]0.0,1.0]"
events:
germline: "jane:0.5 | jane:1.0"
somatic: "jane:]0.0,0.5["
somatic_tumor_no_increase: "$somatic_tumor & l2fc(relapse,tumor) < 1"
somatic_tumor_increase: "$somatic_tumor & l2fc(relapse,tumor) >= 1"
somatic_relapse: "jane:0.0 & tumor:0.0 & relapse:]0.0,1.0]"
Variant calling grammar
https://varlociraptor.github.io
species:
heterozygosity: 0.001
ploidy:
male:
all: 2
X: 1
Y: 1
female:
all: 2
X: 2
Y: 0
samples:
dna_illumina:
sex: female
dna_nanopore:
inheritance:
clonal:
from: dna_illumina
rna_illumina:
universe: [0.0,1.0]
events:
het: "dna_illumina:0.5 & dna_nanopore:0.5 & rna_illumina:]0.0,1.0]"
hom: "dna_illumina:1.0 & dna_nanopore:1.0 & rna_illumina:1.0"
rna_editing: "dna_illumina:0.0 & dna_nanopore:0.0 & rna_illumina:]0.0,1.0]"
Variant calling grammar
https://varlociraptor.github.io
Continuous testing
varlociraptor call variants --testcase-prefix testcase --testcase-locus CHROM:POS generic \
--scenario scenario.yaml --obs tumor=tumor.bcf normal=normal.bcf
Automatic test case generation:
145
public testcases (simulated + real benchmarks)
66
private testcases
(clinical)
https://varlociraptor.github.io
NCBench:
NGS-CN (SIG4) + PM4Onco
continuous benchmarking
GIAB, CHM-eval, SEQC, ...
Hanssen, F., Gabernet, G., Bäuerle, F., Stöcker, B., Wiegand, F., Smith, N.H., Mertes, C., Neogi, A.G., Brandhoff, L., Ossowski, A., Altmueller, J., Becker, K., Petzold, A., Sturm, M., Stöcker, T., Sivalingam, S., Brand, F., Schmidt, A., Buness, A., Probst, A.J., Motameny, S., Köster, J., 2024. NCBench: providing an open, reproducible, transparent, adaptable, and continuous benchmark approach for DNA-sequencing-based variant calling. F1000
Deployment and integration
snakemake-workflows/dna-seq-varlociraptor
nf-core/sarek
conda install varlociraptor
conda create --name varlociraptor varlociraptor
Beyond just variant calling
Haplotype quantification:
utilizing Varlociraptor's using posterior allele frequency distributions [Hamdiye Uzuner]
Methylation calling:
joint consideration of methylation and variation, any sequencing technology [Adrian Prinz]
Optical mapping evidence:
extension of statistical model to consider optical mapping label positions to determine SV allele likelihoods [Laura Kühle]
Phased variant impact prediction:
leveraging matching between Varlociraptor observations and variants [Felix Wiegand]
Fusion calling:
using generic breakends [Felix Mölder]
Koesterlab
Till Hartmann (alumn.)
David Lähnemann
Felix Mölder
Adrian Prinz
Hamdiye Uzuner
Felix Wiegand
Laura Kühle
Can Özkan
Andrea Tonk
Jan Forster (alumn.)
Others
Alex Schönhuth
Louis Dijkstra
Tobias Marschall
Alice McHardy
Alexander Schramm
Ina Pretzell
Michael Wessolly
Thomas Herold
Martin Schuler
Fabian Kilpert
Christopher Schröder
Jens Siveke
Sven Thorsten Liffers
Manuel Holtgrewe
Rob Patro
Nils Homer
Acknowledgements
Uncertainty aware and scenario agnostic genomic variant calling
By Johannes Köster
Uncertainty aware and scenario agnostic genomic variant calling
- 57