Johannes Köster
ADO 2024
DNA → (sequencing) → sequencing reads → (read alignment) → aligned reads → (variant calling) → genomic variants
AACCGATTAACCGGAGTCCCGCGGTAGTTATTTACC
AACCGGAGTCCCGCGGTAGTTATTGACCCTCTCCGC
AGTCCCTCGGTAGTTATTTACCCTCTCCGCGTCCTTTC
ATCCGGAGTCCCAACCGATTAACCGGAGTCCCT
GAGTCGCAACCGATTAACCGGAGTCCCTCGGTAGTTAT
...GTAATCCGGAGTCGCAACCGATTAACCGGAGTCCCGCGGTAGTTATTTACCCTCTCCGCGTCCTTTCTA...
AACCGATTAACCGGAGTCCCTCGGTAGTTATTTACC
AACCGGAGTCCCTCGGTAGTTATTTACCCTCTCCGC
AGTCCCTCGGTAGTTATTTACCCTCTCCGCGTCCTTTC
ATCCGGAGTCGCAACCGATTAACCGGAGTCCCT
GAGTCGCAACCGATTAACCGGAGTCCCTCGGTAGTTAT
long reads:
short reads:
basecalling uncertainty:
posterior probability of incorrect base (base quality)
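A minimal sketch of how such an error probability is commonly stored, assuming the usual Phred scaling (Q = -10 log10 p) and the Phred+33 ASCII offset used in FASTQ files; the slide itself does not name an encoding, and the function names are illustrative:

```python
import math


def error_prob_to_phred(p_error: float) -> float:
    """Convert the probability that a base is wrong into a Phred-scaled quality."""
    return -10.0 * math.log10(p_error)


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred-scaled quality back into an error probability."""
    return 10.0 ** (-q / 10.0)


def fastq_char_to_phred(ch: str, offset: int = 33) -> int:
    """Decode one FASTQ quality character (assumption: Phred+33 encoding)."""
    return ord(ch) - offset


if __name__ == "__main__":
    for ch in "I5!":
        q = fastq_char_to_phred(ch)
        print(ch, q, phred_to_error_prob(q))
```

A higher quality value means a lower probability that the base was called incorrectly.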
biorender.com
for each read:
find best position of a short text in a very long text (alphabet: A,C,G,T)
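The search problem above can be sketched as a brute-force scan that counts mismatches at every position. This is only an illustration of the problem statement, not how real aligners work (they rely on index structures and also handle insertions and deletions); the function name is hypothetical:

```python
def best_position(read: str, reference: str) -> tuple[int, int]:
    """Return (position, mismatches) of the placement with the fewest mismatches.

    Brute force over all start positions; ties go to the leftmost position.
    """
    best_pos, best_mismatches = -1, len(read) + 1
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches < best_mismatches:
            best_pos, best_mismatches = pos, mismatches
    return best_pos, best_mismatches


if __name__ == "__main__":
    # Reference excerpt and first read as shown on the slides.
    reference = "GTAATCCGGAGTCGCAACCGATTAACCGGAGTCCCGCGGTAGTTATTTACCCTCTCCGCGTCCTTTCTA"
    read = "AACCGATTAACCGGAGTCCCGCGGTAGTTATTTACC"
    print(best_position(read, reference))
```

The scan costs time proportional to read length times reference length, which is why it does not scale to whole genomes.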
challenges:
GAGTCGCAACCGATTAACCGGAGTCCCTCGGTAGTTAT GAGTCGCAACCGATTAACCGGAGTCCCTCGGTAGTTAT
GTAATCCGGAGTCGCAACCGATTAACCGGAGTCCCGCGGTAGTTATTTACCCTCTCCGCGTCCTTTCTAGAGTCGCAACCGATTAACCGGAGTCCCGCGGTAGTTATGGCTGAT...
?
?
repetitive regions:
GAGTCGCAACCGATTAACCGGAGTCCCTCGGTAGTTAT GAGTCGCAACCGATTAACCGGAGTCCCTCGGTAGTTAT
GTAATCCGGAGTCGCAACCGATTAACCGGAGTCCCGCGGTAGTTATTTACCCTCTCCGCGTCCTTTCTAGAGTCGCAACCGATTAACCGGAGTCCCTCGGTAGTTATGGCTGAT...
?
?
sequencing errors:
GAGTCGCAAC-----AACCGGAGTCCCGCGGTAGTTAT GAGTCGCAACAACCGGAGTCCCGCGGTAGTTAT
GTAATCCGGAGTCGCAACCGATTAACCGGAGTCCCGCGGTAGTTATTTACCCTCTCCGCGTCCTTTCTAGAGTCGCAACAACCGGAGTCCCGCGGTAGTTATGGCTGAT...
?
?
variants:
repetitive regions and sequencing errors:
goal: report posterior probability for alignment to be at the wrong locus (the so-called mapping quality or MAPQ)
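A rough sketch of the idea behind MAPQ, assuming each candidate locus has an alignment likelihood and all loci are equally likely a priori. The likelihood values, the function name, and the cap of 60 are illustrative assumptions, not the method of any particular aligner:

```python
import math


def mapping_quality(likelihoods: list[float], max_mapq: float = 60.0) -> float:
    """Phred-scaled posterior probability that the best candidate locus is wrong.

    Assumes a uniform prior over candidate loci; max_mapq is an arbitrary cap.
    """
    total = sum(likelihoods)
    p_wrong = 1.0 - max(likelihoods) / total
    if p_wrong <= 0.0:
        return max_mapq  # a single candidate leaves no visible alternative
    return min(max_mapq, -10.0 * math.log10(p_wrong))


if __name__ == "__main__":
    print(mapping_quality([1.0, 1.0]))      # two equally good loci (repeat)
    print(mapping_quality([0.999, 0.001]))  # one clearly better locus
```

Two equally good placements in a repetitive region give a wrong-locus probability of 0.5 and therefore a very low MAPQ.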
variants:
⤷ use pangenomes with known variants
Aligners:
Minigraph, vg giraffe
AACCGATTAACCGGAGTCCCGCGGTAGTTATTTACC
AACCGGAGTCCCGCGGTAGTTATTGACCCTCTCCGC
AGTCCCTCGGTAGTTATTTACCCTCTCCGCGTCCTTTC
ATCCGGAGTCCCAACCGATTAACCGGAGTCCCT
GAGTCGCAACCGATTAACCGGAGTCCCTCGGTAGTTAT
...GTAATCCGGAGTCGCAACCGATTAACCGGAGTCCCGCGGTAGTTATTTACCCTCTCCGCGTCCTTTCTA...
Given:
aligned reads and reference genome
Find:
genomic variants of sample
https://varlociraptor.github.io
Normal DNA:
0.0, 0.5, 1.0 + [0.0,0.5[
Tumor DNA:
[0.0,1.0]
as sampling process
red allele: 0.2
naive:
infer most likely true allele frequency from binomial model
but:
the room is dark and we cannot exactly see the colors of the balls
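The naive binomial estimate can be sketched as follows, assuming every read is classified correctly as supporting or not supporting the variant allele (that is, ignoring the "dark room" problem). The counts are made up for illustration:

```python
import math


def binomial_likelihood(k: int, n: int, theta: float) -> float:
    """Probability of seeing k variant reads among n reads at allele frequency theta."""
    return math.comb(n, k) * theta**k * (1.0 - theta) ** (n - k)


def naive_allele_frequency(k: int, n: int) -> float:
    """Maximum likelihood estimate under the binomial model: the observed fraction."""
    return k / n


if __name__ == "__main__":
    k, n = 2, 10  # hypothetical: 2 of 10 reads show the red allele
    grid = [i / 100 for i in range(101)]
    best = max(grid, key=lambda theta: binomial_likelihood(k, n, theta))
    print(naive_allele_frequency(k, n), best)
```

With few reads the likelihood is broad, so a point estimate alone hides a lot of uncertainty even before misread "colors" are taken into account.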
\(\xi_i \sim \text{Bernoulli}(\theta \tau)\)
\(\omega_i \sim \text{Bernoulli}(\pi_i)\)
\(Z_i \mid \xi_i, \omega_i=1,\beta,\delta \sim\)
\(\beta, \delta\)
species:
  heterozygosity: 0.001
  ploidy:
    male:
      all: 2
      X: 1
      Y: 1
    female:
      all: 2
      X: 2
      Y: 0
samples:
  jane:
    sex: female
events:
  present: "jane:0.5 | jane:1.0"
species:
  heterozygosity: 0.001
  ploidy:
    male:
      all: 2
      X: 1
      Y: 1
    female:
      all: 2
      X: 2
      Y: 0
samples:
  jane:
    sex: female
  john:
    sex: male
  james:
    sex: male
    inheritance:
      mendelian:
        mother: jane
        father: john
events:
  john: "john:0.5 | john:1.0"
  jane: "jane:0.5 | jane:1.0"
  denovo_james: "(james:0.5 | james:1.0) & !$jane & !$john"
species:
  heterozygosity: 0.001
  ploidy:
    male:
      all: 2
      X: 1
      Y: 1
    female:
      all: 2
      X: 2
      Y: 0
samples:
  jane:
    sex: female
    somatic-effective-mutation-rate: 1e-10
  tumor:
    inheritance:
      clonal:
        from: jane
    contamination:
      by: jane
      fraction: 0.1
    somatic-effective-mutation-rate: 1e-6
events:
  germline: "jane:0.5 | jane:1.0"
  somatic: "jane:]0.0,0.5["
  somatic_tumor_low: "jane:0.0 & tumor:]0.0,0.1["
  somatic_tumor_high: "jane:0.0 & tumor:[0.1,1.0]"
samples:
  jane:
    sex: female
    somatic-effective-mutation-rate: 1e-10
  tumor:
    inheritance:
      clonal:
        from: jane
    contamination:
      by: jane
      fraction: 0.1
    somatic-effective-mutation-rate: 1e-6
  relapse:
    inheritance:
      clonal:
        from: jane
    contamination:
      by: jane
      fraction: 0.2
    somatic-effective-mutation-rate: 1e-6
expressions:
  somatic_tumor: "jane:0.0 & tumor:]0.0,1.0]"
events:
  germline: "jane:0.5 | jane:1.0"
  somatic: "jane:]0.0,0.5["
  somatic_tumor_no_increase: "$somatic_tumor & l2fc(relapse,tumor) < 1"
  somatic_tumor_increase: "$somatic_tumor & l2fc(relapse,tumor) >= 1"
  somatic_relapse: "jane:0.0 & tumor:0.0 & relapse:]0.0,1.0]"
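To illustrate the threshold used in the two tumor events, here is a small sketch that assumes l2fc(relapse,tumor) denotes the log2 fold change of the relapse allele frequency over the tumor allele frequency. It works on point values for intuition only; Varlociraptor evaluates such events probabilistically, and the helper names are hypothetical:

```python
import math


def l2fc(af_a: float, af_b: float) -> float:
    """Log2 fold change of allele frequency a over allele frequency b (assumed meaning)."""
    return math.log2(af_a / af_b)


def classify_somatic_tumor(af_tumor: float, af_relapse: float) -> str:
    """Pick the event name for a variant with af_tumor > 0 and af_relapse > 0."""
    if l2fc(af_relapse, af_tumor) >= 1:
        return "somatic_tumor_increase"
    return "somatic_tumor_no_increase"


if __name__ == "__main__":
    print(classify_somatic_tumor(0.25, 0.5))  # frequency doubled
    print(classify_somatic_tumor(0.25, 0.3))  # less than doubled
```

Under this reading, a threshold of 1 means the allele frequency at least doubles from tumor to relapse.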
species:
  heterozygosity: 0.001
  ploidy:
    male:
      all: 2
      X: 1
      Y: 1
    female:
      all: 2
      X: 2
      Y: 0
samples:
  dna_illumina:
    sex: female
  dna_nanopore:
    inheritance:
      clonal:
        from: dna_illumina
  rna_illumina:
    universe: "[0.0,1.0]"
events:
  het: "dna_illumina:0.5 & dna_nanopore:0.5 & rna_illumina:]0.0,1.0]"
  hom: "dna_illumina:1.0 & dna_nanopore:1.0 & rna_illumina:1.0"
  rna_editing: "dna_illumina:0.0 & dna_nanopore:0.0 & rna_illumina:]0.0,1.0]"
https://datavzrd.github.io
https://varsome.com
https://www.genomenexus.org
https://alphamissense.hegelab.org
The search for genomic alterations requires considering various uncertainties.
Comprehensive pipelines should properly assess these uncertainties and present them transparently to the user.
Pipeline:
https://snakemake.github.io/snakemake-workflow-catalog/?repo=snakemake-workflows/dna-seq-varlociraptor
Tools: