Uncertainty aware analysis and exploration of genomic alterations
Johannes Köster
ADO 2024
Seeking for genomic variants
DNA
sequencing reads
aligned reads
genomic variants
sequencing
read alignment
variant calling
AACCGATTAACCGGAGTCCCGCGGTAGTTATTTACC
AACCGGAGTCCCGCGGTAGTTATTGACCCTCTCCGC
AGTCCCTCGGTAGTTATTTACCCTCTCCGCGTCCTTTC
ATCCGGAGTCCCAACCGATTAACCGGAGTCCCT
GAGTCGCAACCGATTAACCGGAGTCCCTCGGTAGTTAT
...GTAATCCGGAGTCGCAACCGATTAACCGGAGTCCCGCGGTAGTTATTTACCCTCTCCGCGTCCTTTCTA...
AACCGATTAACCGGAGTCCCTCGGTAGTTATTTACC
AACCGGAGTCCCTCGGTAGTTATTTACCCTCTCCGC
AGTCCCTCGGTAGTTATTTACCCTCTCCGCGTCCTTTC
ATCCGGAGTCGCAACCGATTAACCGGAGTCCCT
GAGTCGCAACCGATTAACCGGAGTCCCTCGGTAGTTAT
AACCGATTAACCGGAGTCCCGCGGTAGTTATTTACC
AACCGGAGTCCCGCGGTAGTTATTGACCCTCTCCGC
AGTCCCTCGGTAGTTATTTACCCTCTCCGCGTCCTTTC
ATCCGGAGTCCCAACCGATTAACCGGAGTCCCT
GAGTCGCAACCGATTAACCGGAGTCCCTCGGTAGTTAT
...GTAATCCGGAGTCGCAACCGATTAACCGGAGTCCCGCGGTAGTTATTTACCCTCTCCGCGTCCTTTCTA...
Sequencing
long reads:
short reads:
basecalling uncertainty:
posterior probability of incorrect base (base quality)
biorender.com
Read alignment
AACCGATTAACCGGAGTCCCGCGGTAGTTATTTACC
AACCGGAGTCCCGCGGTAGTTATTGACCCTCTCCGC
AGTCCCTCGGTAGTTATTTACCCTCTCCGCGTCCTTTC
ATCCGGAGTCCCAACCGATTAACCGGAGTCCCT
GAGTCGCAACCGATTAACCGGAGTCCCTCGGTAGTTAT
...GTAATCCGGAGTCGCAACCGATTAACCGGAGTCCCGCGGTAGTTATTTACCCTCTCCGCGTCCTTTCTA...
AACCGATTAACCGGAGTCCCTCGGTAGTTATTTACC
AACCGGAGTCCCTCGGTAGTTATTTACCCTCTCCGC
AGTCCCTCGGTAGTTATTTACCCTCTCCGCGTCCTTTC
ATCCGGAGTCGCAACCGATTAACCGGAGTCCCT
GAGTCGCAACCGATTAACCGGAGTCCCTCGGTAGTTAT
for each read:
find best position of a short text in a very long text (alphabet: A,C,G,T)
challenges:
- repetetive regions
- sequencing errors
- variants
GAGTCGCAACCGATTAACCGGAGTCCCTCGGTAGTTAT GAGTCGCAACCGATTAACCGGAGTCCCTCGGTAGTTAT
GTAATCCGGAGTCGCAACCGATTAACCGGAGTCCCGCGGTAGTTATTTACCCTCTCCGCGTCCTTTCTAGAGTCGCAACCGATTAACCGGAGTCCCGCGGTAGTTATGGCTGAT...
?
?
Alignment uncertainty
repetitive regions:
GAGTCGCAACCGATTAACCGGAGTCCCTCGGTAGTTAT GAGTCGCAACCGATTAACCGGAGTCCCTCGGTAGTTAT
GTAATCCGGAGTCGCAACCGATTAACCGGAGTCCCGCGGTAGTTATTTACCCTCTCCGCGTCCTTTCTAGAGTCGCAACCGATTAACCGGAGTCCCTCGGTAGTTATGGCTGAT...
?
?
sequencing errors:
GAGTCGCAAC-----AACCGGAGTCCCGCGGTAGTTAT GAGTCGCAACAACCGGAGTCCCGCGGTAGTTAT
GTAATCCGGAGTCGCAACCGATTAACCGGAGTCCCGCGGTAGTTATTTACCCTCTCCGCGTCCTTTCTAGAGTCGCAACAACCGGAGTCCCGCGGTAGTTATGGCTGAT...
?
?
variants:
Alignment uncertainty
repetitive regions and sequencing errors:
- theoretical: for all matches
- in practice: fast approximation
goal: report posterior probability for alignment to be at the wrong locus (the so-called mapping quality or MAPQ)
Alignment uncertainty
GAGTCGCAAC-----AACCGGAGTCCCGCGGTAGTTAT GAGTCGCAACAACCGGAGTCCCGCGGTAGTTAT
GTAATCCGGAGTCGCAACCGATTAACCGGAGTCCCGCGGTAGTTATTTACCCTCTCCGCGTCCTTTCTAGAGTCGCAACAACCGGAGTCCCGCGGTAGTTATGGCTGAT... -----
variants:
use pangenomes with known variants
⤷
⤷
Aligners:
Minigraph, vg giraffe
goal: report posterior probability for alignment to be at the wrong locus (the so-called mapping quality or MAPQ)
Variant calling
AACCGATTAACCGGAGTCCCGCGGTAGTTATTTACC
AACCGGAGTCCCGCGGTAGTTATTGACCCTCTCCGC
AGTCCCTCGGTAGTTATTTACCCTCTCCGCGTCCTTTC
ATCCGGAGTCCCAACCGATTAACCGGAGTCCCT
GAGTCGCAACCGATTAACCGGAGTCCCTCGGTAGTTAT
...GTAATCCGGAGTCGCAACCGATTAACCGGAGTCCCGCGGTAGTTATTTACCCTCTCCGCGTCCTTTCTA...
Given:
aligned reads and reference genome
Find:
genomic variants of sample
Varlociraptor
Given:
- relationships between samples
- aligned sequence reads per sample
- candidate variants
Find:
- unbiased allele frequency estimates
- classification of variants into events (somatic, germline, ...)
- while controlling FDR
AACCGATTAACCGGAGTCCCGCGGTAGTTATTTACC
AACCGGAGTCCCGCGGTAGTTATTGACCCTCTCCGC
AGTCCCTCGGTAGTTATTTACCCTCTCCGCGTCCTTTC
ATCCGGAGTCCCAACCGATTAACCGGAGTCCCT
GAGTCGCAACCGATTAACCGGAGTCCCTCGGTAGTTAT
...GTAATCCGGAGTCGCAACCGATTAACCGGAGTCCCGCGGTAGTTATTTACCCTCTCCGCGTCCTTTCTA...
https://varlociraptor.github.io
Allele frequency
Normal DNA:
0.0, 0.5, 1.0 + [0.0,0.5[
Tumor DNA:
[0.0,1.0]
Allele frequency
as sampling process
red allele: 0.2
naive:
infer most likely true allele frequency from binomial model
but:
the room is dark and we cannot exactly see the colors of the balls
Varlociraptor model
\(\xi_i \sim \text{Bernoulli}(\theta \tau)\)
\(\omega_i \sim Bernoulli(\pi_i)\)
\(Z_i \mid \xi_i, \omega_i=1,\beta,\delta \sim\)
\(\beta, \delta\)
- allele frequency
- sampling bias
- allele uncertainty
- biases/artifacts (strand, orientation, softclip, homopolymer, ...)
- alignment uncertainty
Variant calling grammar
species:
heterozygosity: 0.001
ploidy:
male:
all: 2
X: 1
Y: 1
female:
all: 2
X: 2
Y: 0
samples:
jane:
sex: female
events:
present: "jane:0.5 | jane:1.0"
species:
heterozygosity: 0.001
ploidy:
male:
all: 2
X: 1
Y: 1
female:
all: 2
X: 2
Y: 0
samples:
jane:
sex: female
john:
sex: male
james:
sex: male
inheritance:
mendelian:
mother: jane
father: john
events:
john: "john:0.5 | john:1.0"
jane: "jane:0.5 | jane:1.0"
denovo_james: "(james:0.5 | james:1.0) & !$jane & !$john"
Variant calling grammar
species:
heterozygosity: 0.001
ploidy:
male:
all: 2
X: 1
Y: 1
female:
all: 2
X: 2
Y: 0
samples:
jane:
sex: female
somatic-effective-mutation-rate: 1e-10
tumor:
inheritance:
clonal:
from: jane
contamination:
by: jane
fraction: 0.1
somatic-effective-mutation-rate: 1e-6
events:
germline: "jane:0.5 | jane:1.0"
somatic: "jane:]0.0,0.5["
somatic_tumor_low: "jane:0.0 & tumor:]0.0,0.1["
somatic_tumor_high: "jane:0.0 & tumor:[0.1,1.0]"
Variant calling grammar
samples:
jane:
sex: female
somatic-effective-mutation-rate: 1e-10
tumor:
inheritance:
clonal:
from: jane
contamination:
by: jane
fraction: 0.1
somatic-effective-mutation-rate: 1e-6
relapse:
inheritance:
clonal:
from: jane
contamination:
by: jane
fraction: 0.2
somatic-effective-mutation-rate: 1e-6
expressions:
somatic_tumor: "jane:0.0 & tumor:]0.0,1.0]"
events:
germline: "jane:0.5 | jane:1.0"
somatic: "jane:]0.0,0.5["
somatic_tumor_no_increase: "$somatic_tumor & l2fc(relapse,tumor) < 1"
somatic_tumor_increase: "$somatic_tumor & l2fc(relapse,tumor) >= 1"
somatic_relapse: "jane:0.0 & tumor:0.0 & relapse:]0.0,1.0]"
Variant calling grammar
species:
heterozygosity: 0.001
ploidy:
male:
all: 2
X: 1
Y: 1
female:
all: 2
X: 2
Y: 0
samples:
dna_illumina:
sex: female
dna_nanopore:
inheritance:
clonal:
from: dna_illumina
rna_illumina:
universe: [0.0,1.0]
events:
het: "dna_illumina:0.5 & dna_nanopore:0.5 & rna_illumina:]0.0,1.0]"
hom: "dna_illumina:1.0 & dna_nanopore:1.0 & rna_illumina:1.0"
rna_editing: "dna_illumina:0.0 & dna_nanopore:0.0 & rna_illumina:]0.0,1.0]"
Variant calling grammar
Reporting / Interpretation uncertainty
https://datavzrd.github.io
Interpretation uncertainty
https://varsome.com
Interpretation uncertainty
https://www.genomenexus.org
Interpretation uncertainty
https://alphamissense.hegelab.org
Conclusion
The search for genomic alterations requires the consideration of various uncertainties
Comprehensive pipelines should properly assess them and transparently present them to the user.
Pipeline:
https://snakemake.github.io/snakemake-workflow-catalog/?repo=snakemake-workflows/dna-seq-varlociraptor
Tools:
- Minigraph (https://github.com/lh3/minigraph)
- VG giraffe (https://github.com/vgteam/vg/wiki/Mapping-short-reads-with-Giraffe)
- Varlociraptor (https://varlociraptor.github.io)
- Datavzrd (https://datavzrd.github.io)
- Varsome (https://varsome.com)
- Genome Nexus (https://genomenexus.org)
- AlphaMissense (https://alphamissense.hegelab.org)
Uncertainty aware analysis and exploration of genomic alterations
By Johannes Köster
Uncertainty aware analysis and exploration of genomic alterations
Talk at ADO 2024
- 75