Uncertainty aware and scenario agnostic genomic variant calling

Johannes Köster

Bioinformatics and Computational Oncology
Institute for AI in Medicine (IKIM)

 

2024

small

structural

SNVs, Indels

Inversions, Duplications, Indels, ...

?

?

?

?

somatic

germline

Variant calling

small

structural

?

?

somatic

germline

Variant calling

tumor/normal

sample, cohort, pedigree

?

?

small

structural

germline

somatic

GATK

Freebayes

Platypus

Pindel

Mutect

Strelka

Octopus

Lancet

Variant calling

Delly

Manta

Gridss

##fileformat=VCFv4.3
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS      ID         REF   ALT    QUAL  FILTER   INFO                             FORMAT       NA00001  
20     14370    rs6054257  G     A      29    PASS    NS=3;DP=14;AF=0.5;DB;H2           GT:GQ:DP:HQ  0|0:48:1:51,51 
20     17330    .          T     A      3     q10     NS=3;DP=11;AF=0.017               GT:GQ:DP:HQ  0|0:49:3:58,50 
20     1110696  rs6040355  A     G,T    67    PASS    NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ  1|2:21:6:23,27 
20     1230237  .          T     .      47    PASS    NS=3;DP=13;AA=T                   GT:GQ:DP:HQ  0|0:54:7:56,60 
20     1234567  microsat1  GTC   G,GTCT 50    PASS    NS=3;DP=9;AA=G                    GT:GQ:DP     0/1:35:4       
##fileformat=VCFv4.3
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS      ID         REF   ALT    QUAL  FILTER   INFO                             FORMAT       NA00002          NA00003
20     14370    rs6054257  G     A      29    PASS    NS=3;DP=14;AF=0.5;DB;H2           GT:GQ:DP:HQ  1|0:48:8:51,51   1/1:43:5:.,.
20     17330    .          T     A      3     q10     NS=3;DP=11;AF=0.017               GT:GQ:DP:HQ  0|1:3:5:65,3     0/0:41:3
20     1110696  rs6040355  A     G,T    67    PASS    NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ  2|1:2:0:18,2     2/2:35:4
20     1230237  .          T     .      47    PASS    NS=3;DP=13;AA=T                   GT:GQ:DP:HQ  0|0:48:4:51,51   0/0:61:2
20     1234567  microsat1  GTC   G,GTCT 50    PASS    NS=3;DP=9;AA=G                    GT:GQ:DP     0/2:17:2         1/1:40:3

Problem:

accumulation of errors

More complex: so far...

small

structural

germline

somatic

Delly

GATK

Freebayes

Platypus

Pindel

Mutect

Strelka

Lancet

Manta

Octopus

arbitrary

Gridss

Varlociraptor:

Generic, scenario, variant type, and technology agnostic variant calling

Johannes Köster, Louis Dijkstra, Tobias Marschall, Alexander Schönhuth.
Varlociraptor: enhancing sensitivity and controlling false discovery rate in somatic indel discovery. Genome Biology 21, 2020

https://varlociraptor.github.io

Idea:

Use per sample latent allele frequency as central readout

(het: 0.5, hom: 1.0, somatic: [0,1], ...)

Allele frequency

as sampling process

red allele: 0.2

naive:

infer most likely true allele frequency from binomial model

but:

the room is dark and we cannot exactly see the colors of the balls

https://varlociraptor.github.io

Varlociraptor model building block

\(\xi_i \sim \text{Bernoulli}(\theta \tau)\)

\(\omega_i \sim Bernoulli(\pi_i)\)

\(Z_i \mid \xi_i, \omega_i=1,\beta,\delta \sim\)

\(\beta, \delta\)

  • allele frequency
  • sampling bias
  • allele uncertainty
  • biases/artifacts (strand, orientation, softclip, homopolymer, ...)
  • alignment uncertainty

https://varlociraptor.github.io

Short and long read support

https://varlociraptor.github.io

Discovering nucleotide level and structural variants in cancer genome data from second and third generation sequencing technologies. Till Hartmann, PhD thesis 2024

\(\Pr(Z_i \mid \theta_c,\theta_h,\beta) =\)

 prior assumptions

Bayes' theorem

= posterior probability for events

Bayesian FDR control

Unified latent variable model for variant calling

Variant calling grammar

species:
  heterozygosity: 0.001
  ploidy:
    male:
      all: 2
      X: 1
      Y: 1
    female:
      all: 2
      X: 2
      Y: 0

samples:
  jane:
    sex: female
events:
  present: "jane:0.5 | jane:1.0"

https://varlociraptor.github.io

species:
  heterozygosity: 0.001
  ploidy:
    male:
      all: 2
      X: 1
      Y: 1
    female:
      all: 2
      X: 2
      Y: 0

samples:
  jane:
    sex: female
  john:
    sex: male
  james:
    sex: male
    inheritance:
      mendelian:
        mother: jane
        father: john
events:
  john: "john:0.5 | john:1.0"
  jane: "jane:0.5 | jane:1.0"
  denovo_james: "(james:0.5 | james:1.0) & !$jane & !$john"

Variant calling grammar

https://varlociraptor.github.io

species:
  heterozygosity: 0.001
  ploidy:
    male:
      all: 2
      X: 1
      Y: 1
    female:
      all: 2
      X: 2
      Y: 0

samples:
  jane:
    sex: female
    somatic-effective-mutation-rate: 1e-10
  tumor:
    inheritance:
      clonal:
        from: jane
    contamination:
      by: jane
      fraction: 0.1
    somatic-effective-mutation-rate: 1e-6
events:
  germline: "jane:0.5 | jane:1.0"
  somatic: "jane:]0.0,0.5["
  somatic_tumor_low: "jane:0.0 & tumor:]0.0,0.1["
  somatic_tumor_high: "jane:0.0 & tumor:[0.1,1.0]"

Variant calling grammar

https://varlociraptor.github.io

samples:
  jane:
    sex: female
    somatic-effective-mutation-rate: 1e-10
  tumor:
    inheritance:
      clonal:
        from: jane
    contamination:
      by: jane
      fraction: 0.1
    somatic-effective-mutation-rate: 1e-6
  relapse:
    inheritance:
      clonal:
        from: jane
    contamination:
      by: jane
      fraction: 0.2
    somatic-effective-mutation-rate: 1e-6
expressions:
  somatic_tumor: "jane:0.0 & tumor:]0.0,1.0]"

events:
  germline: "jane:0.5 | jane:1.0"
  somatic: "jane:]0.0,0.5["
  somatic_tumor_no_increase: "$somatic_tumor & l2fc(relapse,tumor) < 1"
  somatic_tumor_increase: "$somatic_tumor & l2fc(relapse,tumor) >= 1"
  somatic_relapse: "jane:0.0 & tumor:0.0 & relapse:]0.0,1.0]"

Variant calling grammar

https://varlociraptor.github.io

species:
  heterozygosity: 0.001
  ploidy:
    male:
      all: 2
      X: 1
      Y: 1
    female:
      all: 2
      X: 2
      Y: 0

samples:
  dna_illumina:
    sex: female
  dna_nanopore:
    inheritance:
      clonal:
        from: dna_illumina
  rna_illumina:
    universe: [0.0,1.0]
events:
  het: "dna_illumina:0.5 & dna_nanopore:0.5 & rna_illumina:]0.0,1.0]"
  hom: "dna_illumina:1.0 & dna_nanopore:1.0 & rna_illumina:1.0"
  rna_editing: "dna_illumina:0.0 & dna_nanopore:0.0 & rna_illumina:]0.0,1.0]"

Variant calling grammar

https://varlociraptor.github.io

Continuous testing

varlociraptor call variants --testcase-prefix testcase --testcase-locus CHROM:POS generic \
  --scenario scenario.yaml --obs tumor=tumor.bcf normal=normal.bcf

Automatic test case generation:

145

public testcases (simulated + real benchmarks)

66

private testcases
(clinical)

https://varlociraptor.github.io

NCBench:

NGS-CN (SIG4) + PM4Onco

continuous benchmarking

GIAB, CHM-eval, SEQC, ...

Hanssen, F., Gabernet, G., Bäuerle, F., Stöcker, B., Wiegand, F., Smith, N.H., Mertes, C., Neogi, A.G., Brandhoff, L., Ossowski, A., Altmueller, J., Becker, K., Petzold, A., Sturm, M., Stöcker, T., Sivalingam, S., Brand, F., Schmidt, A., Buness, A., Probst, A.J., Motameny, S., Köster, J., 2024. NCBench: providing an open, reproducible, transparent, adaptable, and continuous benchmark approach for DNA-sequencing-based variant calling. F1000

Deployment and integration

snakemake-workflows/dna-seq-varlociraptor

nf-core/sarek

conda install varlociraptor

conda create --name varlociraptor varlociraptor

Beyond just variant calling

Haplotype quantification:

utilizing Varlociraptor's using posterior allele frequency distributions [Hamdiye Uzuner]

 

Methylation calling:

joint consideration of methylation and variation, any sequencing technology [Adrian Prinz]

 

Optical mapping evidence:

extension of statistical model to consider optical mapping label positions to determine SV allele likelihoods [Laura Kühle]

 

Phased variant impact prediction:

leveraging matching between Varlociraptor observations and variants [Felix Wiegand]

 

Fusion calling:

using generic breakends [Felix Mölder]

Koesterlab

Till Hartmann (alumn.)

David Lähnemann

Felix Mölder

Adrian Prinz

Hamdiye Uzuner

Felix Wiegand

Laura Kühle

Can Özkan

Andrea Tonk

Jan Forster (alumn.)

Others

Alex Schönhuth

Louis Dijkstra

Tobias Marschall

Alice McHardy

Alexander Schramm

Ina Pretzell

Michael Wessolly

Thomas Herold

Martin Schuler

Fabian Kilpert

Christopher Schröder

Jens Siveke

Sven Thorsten Liffers

Manuel Holtgrewe

Rob Patro

Nils Homer

Acknowledgements

Uncertainty aware and scenario agnostic genomic variant calling

By Johannes Köster

Uncertainty aware and scenario agnostic genomic variant calling

  • 63