# Genotyping somatic insertions and deletions

Louis Dijkstra, Johannes Köster, Tobias Marschall, Alexander Schönhuth

HiTSeq 2016

## Somatic indels

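Deletions:

CAGCATTGAAATA----GGCACAT------CGAA (tumor)
CAGCATTGAAATATATAGGCACAT------CGAA (healthy)
CAGCATTGAAATATATAGGCACATGCTGCTCGAA (reference)

Insertions:

CAGCATTGAAATATATAGGCACATGCTGCTCGAA (tumor)
CAGCATTGAAATA----GGCACATGCTGCTCGAA (healthy)
CAGCATTGAAATA----GGCACAT------CGAA (reference)

In both panels, the first indel (TATA) occurs only in the tumor and is somatic; the second (GCTGCT) is shared by the tumor and the healthy sample and is germline.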

## Problem

Given:

aligned NGS reads from a tumor sample and a healthy sample

Find:

• somatic indels
• germline indels

and assess their significance while considering uncertainties

## Allele frequency

Probability that genome copy in sample harbors variant

homozygous: 1.0

heterozygous: 0.5

absent: 0.0

heterozygous in all tumor cells:

0.5 × 0.75 = 0.375

heterozygous in red subclone:

0.5 × 0.18 = 0.09
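The two products above follow one pattern: the share of genome copies carrying the variant in an affected cell, times the share of cells in the sample that are affected. The sketch below spells that out. The function name and the reading of 0.75 and 0.18 as cell fractions are assumptions made for illustration; the slides do not state them explicitly.

```python
def sample_allele_frequency(zygosity_fraction: float, cell_fraction: float) -> float:
    """Expected allele frequency of a variant in a mixed sample.

    zygosity_fraction: share of genome copies carrying the variant in an
        affected cell (1.0 homozygous, 0.5 heterozygous, 0.0 absent).
    cell_fraction: share of cells in the sample that carry the variant
        (assumed meaning of the factors 0.75 and 0.18).
    """
    if not (0.0 <= zygosity_fraction <= 1.0 and 0.0 <= cell_fraction <= 1.0):
        raise ValueError("both arguments must lie in [0, 1]")
    return zygosity_fraction * cell_fraction


# Heterozygous in all tumor cells, tumor cells assumed to be 75% of the sample.
print(sample_allele_frequency(0.5, 0.75))
# Heterozygous in a subclone assumed to make up 18% of the sample.
print(sample_allele_frequency(0.5, 0.18))
```

This is why somatic variants can sit at allele frequencies far from the 0.0, 0.5 and 1.0 expected for germline variants in a diploid genome.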

## Types of evidence

Maximum likelihood allele frequency:

healthy: 1/2

tumor: 4/7

Internal segment:

Uncertainties:

• alignment: correct locus?
• typing: supports variant?
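If both uncertainties are ignored and every read is simply counted as supporting or not supporting the variant, the maximum likelihood allele frequency is the fraction of supporting reads. A minimal sketch follows; the read vectors are hypothetical and chosen only to reproduce the 1/2 and 4/7 shown above.

```python
from fractions import Fraction
from typing import Sequence


def ml_allele_frequency(supports_variant: Sequence[bool]) -> Fraction:
    """Maximum likelihood estimate of theta for i.i.d. Bernoulli(theta) reads.

    With k supporting reads out of n, the likelihood theta^k (1 - theta)^(n - k)
    is maximized at theta = k / n.
    """
    if len(supports_variant) == 0:
        raise ValueError("need at least one read")
    return Fraction(sum(supports_variant), len(supports_variant))


# Hypothetical read sets: True means the read supports the variant.
healthy = [True, False]
tumor = [True, True, True, True, False, False, False]

print(ml_allele_frequency(healthy))
print(ml_allele_frequency(tumor))
```

This point estimate treats every alignment and every typing decision as certain, which is exactly what the two uncertainties listed above call into question.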

Calculate:

likelihood of allele frequency while considering uncertainties

• probability that alignment is correct (MAPQ)
• probability that alignment is associated with variant

Naive solution:

sum over all possible combinations ($3^{|D|}$ summands)

## Idea

## Latent variable model

Alignment:

$\omega_i \sim \text{Bernoulli}(\pi_i)$

Typing:

$\xi_i \mid \theta \sim \text{Bernoulli}(\theta)$

Observation:

$X_i \mid \omega_i = 1, \xi_i = 0 \sim f_\mu(\cdot)$

$X_i \mid \omega_i = 1, \xi_i = 1 \sim f_{\mu + \delta}(\cdot)$

$Y_i \mid \omega_i = 1, \xi_i = 0 \sim \text{Bernoulli}(\epsilon_0)$

$Y_i \mid \omega_i = 1, \xi_i = 1 \sim \text{Bernoulli}(1 - \epsilon_1)$

$\Pr(X, Y \mid \theta = f) = \prod_i \Pr(X_i \mid \theta = f) \prod_j \Pr(Y_j \mid \theta = f)$
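Because the likelihood factorizes over reads, the latent variables can be summed out one read at a time instead of over all joint combinations. The sketch below checks this numerically for the Bernoulli observations $Y$ only. It assumes three latent states per read (misaligned; aligned from the reference allele; aligned from the variant allele), and the error rates `EPS0`, `EPS1` and the misaligned-read distribution `Q_MIS` are hypothetical values, since the slides do not specify them.

```python
from itertools import product
from math import isclose, prod

# Hypothetical parameters, not taken from the slides.
EPS0, EPS1 = 0.01, 0.05   # typing error rates epsilon_0 and epsilon_1
Q_MIS = 0.5               # assumed Pr(Y = 1) for a misaligned read


def bernoulli(y: int, p: float) -> float:
    """Pr(Y = y) for Y ~ Bernoulli(p), y in {0, 1}."""
    return p if y == 1 else 1.0 - p


def state_terms(y: int, pi: float, theta: float) -> tuple:
    """Pr(state) * Pr(y | state) for the three latent states of one read."""
    return (
        (1.0 - pi) * bernoulli(y, Q_MIS),             # omega = 0: misaligned
        pi * (1.0 - theta) * bernoulli(y, EPS0),      # omega = 1, xi = 0
        pi * theta * bernoulli(y, 1.0 - EPS1),        # omega = 1, xi = 1
    )


def naive_likelihood(reads, theta: float) -> float:
    """Sum over all 3^n joint latent assignments (exponential in n)."""
    terms = [state_terms(y, pi, theta) for y, pi in reads]
    return sum(
        prod(t[s] for t, s in zip(terms, states))
        for states in product(range(3), repeat=len(terms))
    )


def factorized_likelihood(reads, theta: float) -> float:
    """Product over reads of the per-read marginal (linear in n)."""
    return prod(sum(state_terms(y, pi, theta)) for y, pi in reads)


# Each read is (y, pi): observed evidence and probability of correct alignment.
reads = [(1, 0.99), (0, 0.9), (1, 0.999), (0, 0.5), (1, 0.95)]
assert isclose(naive_likelihood(reads, 0.375), factorized_likelihood(reads, 0.375))
```

Both functions return the same value by distributivity, but the naive version evaluates $3^5 = 243$ products for five reads while the factorized one evaluates five sums of three terms.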

## Joint model

$\Pr(\text{somatic} \mid Z^h, Z^c) \propto \int_0^1 \Pr(Z^h, Z^c \mid \theta_h = 0, \theta_c = f) \, df$

tumor purity
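The integral in the joint model can be approximated numerically. The sketch below is a simplified illustration, not the authors' implementation: it assumes the two samples are independent given their allele frequencies, uses only Bernoulli read evidence with hypothetical error rates, and leaves tumor purity out entirely.

```python
from math import isclose, prod

# Hypothetical parameters, not taken from the slides.
EPS0, EPS1, Q_MIS = 0.01, 0.05, 0.5


def read_likelihood(y: int, pi: float, theta: float) -> float:
    """Pr(Y = y | theta) for one read, latent alignment and typing summed out."""
    def emit(p: float) -> float:
        return p if y == 1 else 1.0 - p

    aligned = (1.0 - theta) * emit(EPS0) + theta * emit(1.0 - EPS1)
    return (1.0 - pi) * emit(Q_MIS) + pi * aligned


def sample_likelihood(reads, theta: float) -> float:
    """Likelihood of all reads of one sample; reads are (y, pi) pairs."""
    return prod(read_likelihood(y, pi, theta) for y, pi in reads)


def somatic_score(healthy_reads, tumor_reads, steps: int = 1000) -> float:
    """Unnormalized score: L_h(0) times the midpoint-rule integral of L_c(f) over [0, 1]."""
    width = 1.0 / steps
    integral = width * sum(
        sample_likelihood(tumor_reads, (k + 0.5) * width) for k in range(steps)
    )
    return sample_likelihood(healthy_reads, 0.0) * integral


tumor = [(1, 0.99)] * 3 + [(0, 0.99)] * 3
clean_healthy = [(0, 0.99)] * 4
mixed_healthy = [(1, 0.99)] * 2 + [(0, 0.99)] * 2

# Variant evidence in the healthy sample lowers the somatic score.
assert somatic_score(clean_healthy, tumor) > somatic_score(mixed_healthy, tumor)
```

The score is only defined up to a constant, so it becomes a probability only after normalizing against the scores of the competing hypotheses (for example germline or absent), which are not sketched here.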

## Simulation study

Healthy:

Venter's genome (30x)

Tumor:

Venter's genome + somatic variants (40x)

## Conclusion

A latent variable model for calling somatic insertions and deletions that considers

• segment and split read evidence,
• alignment uncertainty,
• typing uncertainty,
• tumor purity.

Benefits:

• Assess significance of somatic variants.
• Better recall and precision.
• Estimate allele frequency.

https://prosic.github.io

https://bioconda.github.io

## Acknowledgements

Louis Dijkstra

Tobias Marschall

Alexander Schönhuth