Uncertainty aware analysis and exploration of genomic alterations
Johannes Köster
ADO 2024
Searching for genomic variants
DNA
sequencing reads
aligned reads
genomic variants
sequencing
read alignment
variant calling
![](https://media.slid.es/uploads/362168/images/5343166/chromosomes.png)
AACCGATTAACCGGAGTCCCGCGGTAGTTATTTACC
AACCGGAGTCCCGCGGTAGTTATTGACCCTCTCCGC
AGTCCCTCGGTAGTTATTTACCCTCTCCGCGTCCTTTC
ATCCGGAGTCCCAACCGATTAACCGGAGTCCCT
GAGTCGCAACCGATTAACCGGAGTCCCTCGGTAGTTAT
...GTAATCCGGAGTCGCAACCGATTAACCGGAGTCCCGCGGTAGTTATTTACCCTCTCCGCGTCCTTTCTA...
AACCGATTAACCGGAGTCCCTCGGTAGTTATTTACC
AACCGGAGTCCCTCGGTAGTTATTTACCCTCTCCGC
AGTCCCTCGGTAGTTATTTACCCTCTCCGCGTCCTTTC
ATCCGGAGTCGCAACCGATTAACCGGAGTCCCT
GAGTCGCAACCGATTAACCGGAGTCCCTCGGTAGTTAT
Sequencing
![](https://media.slid.es/uploads/362168/images/11539321/pasted-from-clipboard.png)
long reads:
![](https://media.slid.es/uploads/362168/images/11539327/pasted-from-clipboard.png)
short reads:
basecalling uncertainty:
posterior probability of incorrect base (base quality)
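Base qualities are usually stored as Phred-scaled scores, where a score Q corresponds to an error probability of 10^(-Q/10). A minimal sketch of the conversion, assuming the common Phred+33 ASCII encoding of FASTQ files (the function names are illustrative only):

```python
def phred_to_error_prob(q: float) -> float:
    """Convert a Phred-scaled quality Q into the error probability 10^(-Q/10)."""
    return 10 ** (-q / 10)


def decode_fastq_qualities(quality_string: str, offset: int = 33) -> list[int]:
    """Decode an ASCII quality string (Phred+33 assumed) into integer scores."""
    return [ord(ch) - offset for ch in quality_string]


# Three bases with decreasing confidence of the basecaller.
quals = decode_fastq_qualities("I5+")
error_probs = [phred_to_error_prob(q) for q in quals]
```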
biorender.com
Read alignment
AACCGATTAACCGGAGTCCCGCGGTAGTTATTTACC
AACCGGAGTCCCGCGGTAGTTATTGACCCTCTCCGC
AGTCCCTCGGTAGTTATTTACCCTCTCCGCGTCCTTTC
ATCCGGAGTCCCAACCGATTAACCGGAGTCCCT
GAGTCGCAACCGATTAACCGGAGTCCCTCGGTAGTTAT
...GTAATCCGGAGTCGCAACCGATTAACCGGAGTCCCGCGGTAGTTATTTACCCTCTCCGCGTCCTTTCTA...
AACCGATTAACCGGAGTCCCTCGGTAGTTATTTACC
AACCGGAGTCCCTCGGTAGTTATTTACCCTCTCCGC
AGTCCCTCGGTAGTTATTTACCCTCTCCGCGTCCTTTC
ATCCGGAGTCGCAACCGATTAACCGGAGTCCCT
GAGTCGCAACCGATTAACCGGAGTCCCTCGGTAGTTAT
for each read:
find best position of a short text in a very long text (alphabet: A,C,G,T)
challenges:
- repetitive regions
- sequencing errors
- variants
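The matching problem can be illustrated with a brute-force search that slides the read along the reference and counts mismatches. This is only a toy sketch (no gaps, no index structure, hypothetical function name); real aligners rely on indexes and heuristics. It also shows why repeats are a challenge: several positions can tie for the best score.

```python
def best_positions(read: str, reference: str) -> tuple[int, list[int]]:
    """Return the minimal mismatch count and all start positions reaching it."""
    best_dist = len(read) + 1
    best_pos: list[int] = []
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        # Hamming distance between the read and this reference window.
        dist = sum(a != b for a, b in zip(read, window))
        if dist < best_dist:
            best_dist, best_pos = dist, [start]
        elif dist == best_dist:
            best_pos.append(start)
    return best_dist, best_pos


# The read occurs twice in this reference prefix, so its origin is ambiguous.
reference = "GTAATCCGGAGTCGCAACCGATTAACCGGAGTC"
dist, positions = best_positions("CCGGAGTC", reference)
```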
![](https://media.slid.es/uploads/362168/images/11539353/pasted-from-clipboard.png)
GAGTCGCAACCGATTAACCGGAGTCCCTCGGTAGTTAT GAGTCGCAACCGATTAACCGGAGTCCCTCGGTAGTTAT
GTAATCCGGAGTCGCAACCGATTAACCGGAGTCCCGCGGTAGTTATTTACCCTCTCCGCGTCCTTTCTAGAGTCGCAACCGATTAACCGGAGTCCCGCGGTAGTTATGGCTGAT...
?
?
Alignment uncertainty
repetitive regions:
GAGTCGCAACCGATTAACCGGAGTCCCTCGGTAGTTAT GAGTCGCAACCGATTAACCGGAGTCCCTCGGTAGTTAT
GTAATCCGGAGTCGCAACCGATTAACCGGAGTCCCGCGGTAGTTATTTACCCTCTCCGCGTCCTTTCTAGAGTCGCAACCGATTAACCGGAGTCCCTCGGTAGTTATGGCTGAT...
?
?
sequencing errors:
GAGTCGCAAC-----AACCGGAGTCCCGCGGTAGTTAT GAGTCGCAACAACCGGAGTCCCGCGGTAGTTAT
GTAATCCGGAGTCGCAACCGATTAACCGGAGTCCCGCGGTAGTTATTTACCCTCTCCGCGTCCTTTCTAGAGTCGCAACAACCGGAGTCCCGCGGTAGTTATGGCTGAT...
?
?
variants:
Alignment uncertainty
repetitive regions and sequencing errors:
- in theory: consider all possible matches
- in practice: fast approximation
![](https://media.slid.es/uploads/362168/images/11539259/pasted-from-clipboard.png)
goal: report posterior probability for alignment to be at the wrong locus (the so-called mapping quality or MAPQ)
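A rough sketch of how such a mapping quality could be derived, assuming we already have a likelihood of the read for every candidate locus and a uniform prior over loci. The cap of 60 and the function name are assumptions for illustration; actual aligners use fast approximations instead of enumerating all loci.

```python
import math


def mapping_quality(likelihoods: list[float], cap: int = 60) -> int:
    """Phred-scaled posterior probability that the best locus is wrong.

    Assumes a uniform prior over the candidate loci and that the list
    covers every plausible locus (real aligners only approximate this).
    """
    total = sum(likelihoods)
    p_wrong = 1.0 - max(likelihoods) / total
    if p_wrong <= 0.0:
        return cap
    return min(cap, round(-10 * math.log10(p_wrong)))


# One clearly best locus versus two equally good loci (a repeat).
unique_hit = mapping_quality([1e-3, 1e-9])
repeat_hit = mapping_quality([1e-3, 1e-3])
```

With two equally likely loci the probability of a wrong placement is 0.5, which gives a very low MAPQ.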
Alignment uncertainty
GAGTCGCAAC-----AACCGGAGTCCCGCGGTAGTTAT GAGTCGCAACAACCGGAGTCCCGCGGTAGTTAT
GTAATCCGGAGTCGCAACCGATTAACCGGAGTCCCGCGGTAGTTATTTACCCTCTCCGCGTCCTTTCTAGAGTCGCAACAACCGGAGTCCCGCGGTAGTTATGGCTGAT... -----
variants:
use pangenomes with known variants
![](https://media.slid.es/uploads/362168/images/11539267/pasted-from-clipboard.png)
Aligners:
Minigraph, vg giraffe
goal: report posterior probability for alignment to be at the wrong locus (the so-called mapping quality or MAPQ)
Variant calling
AACCGATTAACCGGAGTCCCGCGGTAGTTATTTACC
AACCGGAGTCCCGCGGTAGTTATTGACCCTCTCCGC
AGTCCCTCGGTAGTTATTTACCCTCTCCGCGTCCTTTC
ATCCGGAGTCCCAACCGATTAACCGGAGTCCCT
GAGTCGCAACCGATTAACCGGAGTCCCTCGGTAGTTAT
...GTAATCCGGAGTCGCAACCGATTAACCGGAGTCCCGCGGTAGTTATTTACCCTCTCCGCGTCCTTTCTA...
Given:
aligned reads and reference genome
Find:
genomic variants of sample
Varlociraptor
Given:
- relationships between samples
- aligned sequence reads per sample
- candidate variants
Find:
- unbiased allele frequency estimates
- classification of variants into events (somatic, germline, ...)
- while controlling FDR
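One common way to control the FDR from posterior probabilities is sketched below, assuming each candidate variant comes with a posterior probability of being a false positive. Calls are sorted by that probability, and the largest prefix whose mean stays below the threshold is kept. This illustrates the general idea and is not necessarily how Varlociraptor implements it; the function name is hypothetical.

```python
def control_fdr(error_probs: list[float], alpha: float) -> list[int]:
    """Return indices of calls kept so that the expected FDR is <= alpha.

    error_probs[i] is the posterior probability that call i is a false
    positive. The expected FDR of a set of calls is the mean of these values.
    """
    order = sorted(range(len(error_probs)), key=lambda i: error_probs[i])
    kept: list[int] = []
    cumulative = 0.0
    for i in order:
        cumulative += error_probs[i]
        # The running mean only grows, so we can stop at the first violation.
        if cumulative / (len(kept) + 1) > alpha:
            break
        kept.append(i)
    return kept


selected = control_fdr([0.001, 0.3, 0.02, 0.9, 0.04], alpha=0.05)
```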
AACCGATTAACCGGAGTCCCGCGGTAGTTATTTACC
AACCGGAGTCCCGCGGTAGTTATTGACCCTCTCCGC
AGTCCCTCGGTAGTTATTTACCCTCTCCGCGTCCTTTC
ATCCGGAGTCCCAACCGATTAACCGGAGTCCCT
GAGTCGCAACCGATTAACCGGAGTCCCTCGGTAGTTAT
...GTAATCCGGAGTCGCAACCGATTAACCGGAGTCCCGCGGTAGTTATTTACCCTCTCCGCGTCCTTTCTA...
https://varlociraptor.github.io
Allele frequency
Normal DNA:
0.0, 0.5, 1.0 (germline) + ]0.0,0.5[ (somatic)
Tumor DNA:
[0.0,1.0]
Allele frequency
as sampling process
red allele: 0.2
naive:
infer most likely true allele frequency from binomial model
but:
the room is dark and we cannot exactly see the colors of the balls
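The difference between the naive estimate and an estimate that accounts for uncertain observations can be sketched as follows. The assumption here is that each ball comes with a normalized likelihood `p_red[i]` of being red (strictly between 0 and 1), and that a simple grid search is good enough for illustration. All names are hypothetical.

```python
import math


def naive_estimate(n_red: int, n_total: int) -> float:
    """Maximum likelihood estimate under a plain binomial model."""
    return n_red / n_total


def log_likelihood(theta: float, p_red: list[float]) -> float:
    """Log-likelihood of frequency theta, marginalizing each ball's true color.

    p_red[i] is the likelihood of observation i if the ball is red and
    1 - p_red[i] if it is not (values assumed strictly between 0 and 1).
    """
    return sum(math.log(theta * p + (1 - theta) * (1 - p)) for p in p_red)


def grid_mle(p_red: list[float], steps: int = 200) -> float:
    """Pick the frequency on an evenly spaced grid that maximizes the likelihood."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda theta: log_likelihood(theta, p_red))


# Ten balls: two look red, eight look not red, each with 90% certainty.
observations = [0.9] * 2 + [0.1] * 8
naive = naive_estimate(2, 10)
uncertain = grid_mle(observations)
```

In this toy case the naive estimate is 0.2, while the estimate that accounts for the uncertain colors is 0.125.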
Varlociraptor model
$\xi_i \sim \mathrm{Bernoulli}(\theta\tau)$
$\omega_i \sim \mathrm{Bernoulli}(\pi_i)$
$Z_i \mid \xi_i, \omega_i = 1, \beta, \delta \sim$
![](https://media.slid.es/uploads/362168/images/5482648/pair-hmm.png)
β,δ
- allele frequency
- sampling bias
- allele uncertainty
- biases/artifacts (strand, orientation, softclip, homopolymer, ...)
- alignment uncertainty
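Marginalizing over the latent variables gives a per-read likelihood. The following is a sketch under stated assumptions, not a statement of the exact published model: the two latent variables are taken to be independent, $\pi_i$ is read as the probability that read $i$ is correctly aligned, reads are treated as independent given $\theta$, and the shorthand $a_i$, $p_i$, $o_i$ is introduced here for the likelihood of read $i$ under the variant allele, the reference allele, and a misplaced alignment.

```latex
\Pr(Z_i \mid \theta)
  = \pi_i \bigl( \theta\tau \, a_i + (1 - \theta\tau) \, p_i \bigr)
  + (1 - \pi_i) \, o_i,
\qquad
\Pr(Z \mid \theta) = \prod_i \Pr(Z_i \mid \theta)
```

Here $a_i = \Pr(Z_i \mid \xi_i = 1, \omega_i = 1, \beta, \delta)$, $p_i = \Pr(Z_i \mid \xi_i = 0, \omega_i = 1, \beta, \delta)$, and $o_i = \Pr(Z_i \mid \omega_i = 0)$.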
Variant calling grammar
species:
  heterozygosity: 0.001
  ploidy:
    male:
      all: 2
      X: 1
      Y: 1
    female:
      all: 2
      X: 2
      Y: 0
samples:
  jane:
    sex: female
events:
  present: "jane:0.5 | jane:1.0"
species:
  heterozygosity: 0.001
  ploidy:
    male:
      all: 2
      X: 1
      Y: 1
    female:
      all: 2
      X: 2
      Y: 0
samples:
  jane:
    sex: female
  john:
    sex: male
  james:
    sex: male
    inheritance:
      mendelian:
        mother: jane
        father: john
events:
  john: "john:0.5 | john:1.0"
  jane: "jane:0.5 | jane:1.0"
  denovo_james: "(james:0.5 | james:1.0) & !$jane & !$john"
Variant calling grammar
species:
  heterozygosity: 0.001
  ploidy:
    male:
      all: 2
      X: 1
      Y: 1
    female:
      all: 2
      X: 2
      Y: 0
samples:
  jane:
    sex: female
    somatic-effective-mutation-rate: 1e-10
  tumor:
    inheritance:
      clonal:
        from: jane
    contamination:
      by: jane
      fraction: 0.1
    somatic-effective-mutation-rate: 1e-6
events:
  germline: "jane:0.5 | jane:1.0"
  somatic: "jane:]0.0,0.5["
  somatic_tumor_low: "jane:0.0 & tumor:]0.0,0.1["
  somatic_tumor_high: "jane:0.0 & tumor:[0.1,1.0]"
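The event formulas use interval notation in which the direction of each bracket indicates whether the bound is included: `]0.0,0.5[` excludes both endpoints, while `[0.1,1.0]` includes both. A small sketch of that reading (the helper below is hypothetical and not part of Varlociraptor):

```python
def in_interval(spec: str, value: float) -> bool:
    """Check whether value lies in an interval like ']0.0,0.5[' or '[0.1,1.0]'."""
    spec = spec.strip()
    # '[' on the left and ']' on the right mean the bound is included.
    lower_closed = spec[0] == "["
    upper_closed = spec[-1] == "]"
    lower, upper = (float(x) for x in spec[1:-1].split(","))
    above = value >= lower if lower_closed else value > lower
    below = value <= upper if upper_closed else value < upper
    return above and below


# An allele frequency of exactly 0.1 counts as "high", not "low".
is_low = in_interval("]0.0,0.1[", 0.1)
is_high = in_interval("[0.1,1.0]", 0.1)
```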
Variant calling grammar
samples:
  jane:
    sex: female
    somatic-effective-mutation-rate: 1e-10
  tumor:
    inheritance:
      clonal:
        from: jane
    contamination:
      by: jane
      fraction: 0.1
    somatic-effective-mutation-rate: 1e-6
  relapse:
    inheritance:
      clonal:
        from: jane
    contamination:
      by: jane
      fraction: 0.2
    somatic-effective-mutation-rate: 1e-6
expressions:
  somatic_tumor: "jane:0.0 & tumor:]0.0,1.0]"
events:
  germline: "jane:0.5 | jane:1.0"
  somatic: "jane:]0.0,0.5["
  somatic_tumor_no_increase: "$somatic_tumor & l2fc(relapse,tumor) < 1"
  somatic_tumor_increase: "$somatic_tumor & l2fc(relapse,tumor) >= 1"
  somatic_relapse: "jane:0.0 & tumor:0.0 & relapse:]0.0,1.0]"
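The `l2fc(relapse,tumor)` term can be read as the log2 fold change of the allele frequency in the relapse sample relative to the tumor sample. Assuming it is defined as log2(AF_relapse / AF_tumor), a threshold of 1 corresponds to at least a doubling of the allele frequency. A minimal sketch with a hypothetical helper:

```python
import math


def l2fc(af_a: float, af_b: float) -> float:
    """Log2 fold change of allele frequency af_a relative to af_b (both > 0)."""
    return math.log2(af_a / af_b)


# A doubling from 0.2 to 0.4 reaches the threshold, a rise to 0.25 does not.
doubled = l2fc(0.4, 0.2)
stable = l2fc(0.25, 0.2)
```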
Variant calling grammar
species:
  heterozygosity: 0.001
  ploidy:
    male:
      all: 2
      X: 1
      Y: 1
    female:
      all: 2
      X: 2
      Y: 0
samples:
  dna_illumina:
    sex: female
  dna_nanopore:
    inheritance:
      clonal:
        from: dna_illumina
  rna_illumina:
    universe: "[0.0,1.0]"
events:
  het: "dna_illumina:0.5 & dna_nanopore:0.5 & rna_illumina:]0.0,1.0]"
  hom: "dna_illumina:1.0 & dna_nanopore:1.0 & rna_illumina:1.0"
  rna_editing: "dna_illumina:0.0 & dna_nanopore:0.0 & rna_illumina:]0.0,1.0]"
Variant calling grammar
![](https://media.slid.es/uploads/362168/images/11540454/anonymized_report.png)
Reporting / Interpretation uncertainty
https://datavzrd.github.io
![](https://media.slid.es/uploads/362168/images/11042610/Varsome.png)
Interpretation uncertainty
https://varsome.com
Interpretation uncertainty
![](https://media.slid.es/uploads/362168/images/11042611/GenomeNexus.png)
https://www.genomenexus.org
Interpretation uncertainty
![](https://media.slid.es/uploads/362168/images/11540473/pasted-from-clipboard.png)
https://alphamissense.hegelab.org
Conclusion
The search for genomic alterations involves uncertainty at every step: basecalling, read alignment, variant calling, and interpretation.
Comprehensive pipelines should assess these uncertainties properly and present them transparently to the user.
Pipeline:
https://snakemake.github.io/snakemake-workflow-catalog/?repo=snakemake-workflows/dna-seq-varlociraptor
Tools:
- Minigraph (https://github.com/lh3/minigraph)
- VG giraffe (https://github.com/vgteam/vg/wiki/Mapping-short-reads-with-Giraffe)
- Varlociraptor (https://varlociraptor.github.io)
- Datavzrd (https://datavzrd.github.io)
- Varsome (https://varsome.com)
- Genome Nexus (https://genomenexus.org)
- AlphaMissense (https://alphamissense.hegelab.org)