Genotyping somatic insertions and deletions

Louis Dijkstra, Johannes Köster, Tobias Marschall, Alexander Schönhuth

HiTSeq 2016

Somatic indels

Deletions:

CAGCATTGAAATA----GGCACAT------CGAA   tumor
CAGCATTGAAATATATAGGCACAT------CGAA   healthy
CAGCATTGAAATATATAGGCACATGCTGCTCGAA   reference

Insertions:

CAGCATTGAAATATATAGGCACATGCTGCTCGAA   tumor
CAGCATTGAAATA----GGCACATGCTGCTCGAA   healthy
CAGCATTGAAATA----GGCACAT------CGAA   reference

somatic: present in the tumor only

germline: present in both the tumor and the healthy sample

Problem

Given:

aligned NGS reads from tumor and healthy sample

Find:

  • somatic indels
  • germline indels

and assess their significance while considering uncertainties

Allele frequency

Probability that genome copy in sample harbors variant

homozygous: 1.0

heterozygous: 0.5

absent: 0.0

heterozygous in all tumor cells:

0.5 × 0.75 = 0.375

heterozygous in red subclone:

0.5 × 0.18 = 0.09
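
The two products above follow one rule: the per-copy variant probability within the affected cells, multiplied by the fraction of the sample made up of those cells. A minimal sketch of that arithmetic follows. The function name and parameter names are hypothetical, and reading 0.75 and 0.18 as cell fractions of the sample is an assumption taken from the examples.

```python
def expected_allele_frequency(zygosity: float, cell_fraction: float) -> float:
    """Expected allele frequency of a variant in a mixed sample.

    zygosity:      probability that a genome copy in an affected cell
                   harbors the variant (1.0 homozygous, 0.5 heterozygous)
    cell_fraction: assumed fraction of cells in the sample that carry
                   the variant (e.g. all tumor cells, or one subclone)
    """
    if not (0.0 <= zygosity <= 1.0 and 0.0 <= cell_fraction <= 1.0):
        raise ValueError("both arguments must lie in [0, 1]")
    return zygosity * cell_fraction


# Heterozygous in all tumor cells, which make up 75% of the sample.
all_tumor_cells = expected_allele_frequency(0.5, 0.75)

# Heterozygous in a subclone that makes up 18% of the sample.
subclone = expected_allele_frequency(0.5, 0.18)
```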

Types of evidence

Maximum likelihood allele frequency:

healthy: 1/2

tumor: 4/7

Internal segment:

Split read:

Uncertainties:

  • alignment: correct locus?
  • typing: supports variant?
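
If both uncertainties are ignored, the maximum likelihood allele frequency is simply the fraction of reads that support the variant. The sketch below reproduces the values 1/2 and 4/7, assuming they come from 1 of 2 healthy reads and 4 of 7 tumor reads; the read lists are hypothetical stand-ins for the figure.

```python
from fractions import Fraction


def ml_allele_frequency(supports_variant):
    """Fraction of reads that support the variant (1) versus not (0).

    This is the maximum likelihood estimate only when every alignment
    and every typing decision is taken as certain.
    """
    if not supports_variant:
        raise ValueError("need at least one read")
    return Fraction(sum(supports_variant), len(supports_variant))


# Hypothetical read evidence: 1 = supports variant, 0 = supports reference.
healthy_reads = [1, 0]
tumor_reads = [1, 1, 1, 1, 0, 0, 0]

healthy_af = ml_allele_frequency(healthy_reads)
tumor_af = ml_allele_frequency(tumor_reads)
```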

Calculate:

likelihood of allele frequency while considering uncertainties

 

Available for each read:

  • probability that alignment is correct (MAPQ)
  • probability that alignment is associated with variant

 

Naive solution:

sum over all possible combinations (3^{|D|} summands)

Idea

Latent variable model

\omega_i \sim \text{Bernoulli}(\pi_i)
\xi_i ~|~ \theta \sim \text{Bernoulli}(\theta)
X_i ~|~ \omega_i = 1, \xi_i = 0 \sim f_\mu (\cdot)
X_i ~|~ \omega_i = 1, \xi_i = 1 \sim f_{\mu + \delta} (\cdot)
Y_i ~|~ \omega_i = 1, \xi_i = 0 \sim \text{Bernoulli}(\epsilon_0)
Y_i ~|~ \omega_i = 1, \xi_i = 1 \sim \text{Bernoulli}(1 - \epsilon_1)

typing

alignment

observation
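
To make the generative reading of the model concrete, the sketch below samples split-read observations Y_i for a given allele frequency. It covers only the Bernoulli part of the model. The behavior of misaligned reads is not specified on the slide, so the parameter q_mis is an assumption, as are the function name and all numeric values.

```python
import random


def sample_split_read_evidence(pis, theta, eps0, eps1, q_mis=0.5, seed=0):
    """Draw one observation Y_i per read from the latent variable model.

    pis:   per-read probability that the alignment is correct
    theta: allele frequency
    eps0:  P(Y=1 | aligned, read from reference allele)
    eps1:  P(Y=0 | aligned, read from variant allele)
    q_mis: ASSUMED P(Y=1 | misaligned); not given in the slides
    """
    rng = random.Random(seed)
    ys = []
    for pi in pis:
        aligned = rng.random() < pi          # alignment indicator
        from_variant = rng.random() < theta  # typing indicator
        if not aligned:
            p_one = q_mis
        elif from_variant:
            p_one = 1.0 - eps1
        else:
            p_one = eps0
        ys.append(1 if rng.random() < p_one else 0)
    return ys


# Hypothetical example: ten well-aligned reads at allele frequency 0.375.
example = sample_split_read_evidence([0.99] * 10, theta=0.375,
                                     eps0=0.05, eps1=0.05)
```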

\Pr(X,Y ~|~ \theta = f) = \prod_i \Pr(X_i ~|~ \theta = f) \prod_j \Pr(Y_j ~|~ \theta = f)

Likelihood in linear time
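
The factorization can be checked numerically on a small example. Each read has three latent states (misaligned; aligned from the reference allele; aligned from the variant allele), so the naive sum has 3^{|D|} summands, while the product form marginalizes each read separately. The sketch below covers only split-read observations Y. The constants EPS0, EPS1 and Q_MIS are assumed values, and Q_MIS (the behavior of misaligned reads) is not specified in the slides.

```python
from itertools import product
from math import prod

EPS0 = 0.05   # assumed P(Y=1 | aligned, reference allele)
EPS1 = 0.05   # assumed P(Y=0 | aligned, variant allele)
Q_MIS = 0.5   # ASSUMED P(Y=1 | misaligned); not given in the slides


def bernoulli(p, y):
    """Probability of outcome y under Bernoulli(p)."""
    return p if y == 1 else 1.0 - p


def state_terms(y, pi, theta):
    """Joint probability of observation y with each of the 3 latent states."""
    return (
        (1.0 - pi) * bernoulli(Q_MIS, y),          # misaligned
        pi * (1.0 - theta) * bernoulli(EPS0, y),   # aligned, reference
        pi * theta * bernoulli(1.0 - EPS1, y),     # aligned, variant
    )


def likelihood_naive(ys, pis, theta):
    """Sum over all 3^n joint state assignments (exponential time)."""
    per_read = [state_terms(y, pi, theta) for y, pi in zip(ys, pis)]
    return sum(prod(combo) for combo in product(*per_read))


def likelihood_linear(ys, pis, theta):
    """Product of per-read marginals (linear time)."""
    return prod(sum(state_terms(y, pi, theta)) for y, pi in zip(ys, pis))


# Hypothetical reads: observations and alignment probabilities.
ys = [1, 0, 1, 1]
pis = [0.99, 0.9, 0.8, 0.999]
naive = likelihood_naive(ys, pis, theta=0.375)
linear = likelihood_linear(ys, pis, theta=0.375)
```

Both functions return the same value up to floating point error, but the naive version already visits 81 combinations for four reads.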

Joint model

\Pr(\text{somatic} ~|~ Z^h, Z^c) \propto \int_0^1 \Pr(Z^h, Z^c ~|~ \theta_h = 0, \theta_c = f) \, df

tumor purity
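
The integral can be approximated numerically by evaluating the likelihood on a grid of tumor allele frequencies. The sketch below uses the midpoint rule and makes several simplifying assumptions that are not stated on the slide: the two samples are independent given their allele frequencies, only split-read style 0/1 observations are used, alignments are taken as correct, and tumor purity is not modeled. All function names are hypothetical.

```python
def read_likelihood(y, theta, eps0=0.0, eps1=0.0):
    """P(Y=y | theta) for one correctly aligned read."""
    p_one = theta * (1.0 - eps1) + (1.0 - theta) * eps0
    return p_one if y == 1 else 1.0 - p_one


def sample_likelihood(ys, theta, eps0=0.0, eps1=0.0):
    """Product of per-read likelihoods for one sample."""
    result = 1.0
    for y in ys:
        result *= read_likelihood(y, theta, eps0, eps1)
    return result


def somatic_score(healthy, tumor, steps=10_000, eps0=0.0, eps1=0.0):
    """Unnormalized posterior score for the somatic hypothesis.

    Fixes the healthy allele frequency at 0 and integrates the tumor
    likelihood over f in [0, 1] with the midpoint rule.
    """
    healthy_term = sample_likelihood(healthy, 0.0, eps0, eps1)
    width = 1.0 / steps
    integral = width * sum(
        sample_likelihood(tumor, (k + 0.5) * width, eps0, eps1)
        for k in range(steps)
    )
    return healthy_term * integral


# Hypothetical evidence: no variant reads in healthy, 4 of 7 in tumor.
score = somatic_score([0, 0], [1, 1, 1, 1, 0, 0, 0])
```

With error rates of zero, a single variant-supporting read in the healthy sample drives the score to 0, which is why nonzero error rates matter in practice.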

Simulation study

Healthy:

Venter's genome (30x)

Tumor:

Venter's genome + somatic variants (40x)

Results

Conclusion

A latent variable model for calling somatic insertions and deletions that considers

  • segment and split read evidence,
  • alignment uncertainty,
  • typing uncertainty,
  • tumor purity.

 

Benefit:

  • Assess significance of somatic variants.
  • Better recall and precision.
  • Estimate allele frequency.

https://prosic.github.io

https://bioconda.github.io

Acknowledgements

Louis Dijkstra

Tobias Marschall

Alexander Schönhuth
