Accurate estimation of molecular counts in droplet-based single-cell RNA-seq experiments

Student: Viktor Petukhov

PI: Peter Kharchenko

Background

2011-2015: South Ural State University

Applied Mathematics and Informatics

2015-2017: St. Petersburg Polytechnic University

Applied Mathematics and Informatics:

Bioinformatics

Summer 2016: Harvard Medical School

Human Science and Technology

summer institute

Introduction

*http://thescienceexplorer.com/brain-and-body/fat-tissue-shields-cancer-cells-chemo-study-finds

**https://support.10xgenomics.com/single-cell-gene-expression/instrument/doc/user-guide-chromium-single-cell-controller-readiness-test

 

Fastq files

with

barcoded

reads

(CB + UMI)

 

**

Single cells

Sequencing

*

Barcoding system

Introduction

 - Differential expression analysis

 - Cell-to-cell interactions

 - Cell type annotation

 - Pseudotime ordering

 - Gene correlation networks

*http://thescienceexplorer.com/brain-and-body/fat-tissue-shields-cancer-cells-chemo-study-finds

**https://support.10xgenomics.com/single-cell-gene-expression/instrument/doc/user-guide-chromium-single-cell-controller-readiness-test

 

Fastq files

with

barcoded

reads

(CB + UMI)

 

**

Single cells

Sequencing

*

Problem 1. Three protocols, but no universal pipeline

10x fastq

inDrop fastq

Drop-seq fastq

10x CellRanger

inDrop scripts

Drop-seq tools

Fastq files

Tools

Gene count matrix

Problem 2. High throughput leads to high level of noise

  UMI collisions

  UMI barcode errors

  Cell barcode errors

  Background cells

+

DropEst pipeline

10x fastq

inDrop fastq

Drop-seq fastq

???

 

dropEst

pipeline

(C++)

 

Protocols

Gene count matrix

UMI collisions

UMI N

...

UMI 6

UMI 5

UMI 4

UMI 3

UMI 2

UMI 1

UMI 4

UMI 7

UMI 1

Gene 1

UMI pool:

UMI 5

Gene 1

UMI 4

UMI 7

UMI 1

: x10

: x3

: x15

28 reads; 3 UMIs

Expression level: 3

UMI 4

UMI 7

UMI 1

UMI 3

UMI 9

UMI 9

: x1

: x2

: x8

28 reads; 3 UMIs

Expression level: 3

UMI 2

: x4

UMI 2

Gene 1

UMI 3

UMI 9

: x1

: x10

: x4

Cell 1

Cell 2

Cell 2

Adjustment of UMI collisions

Estimate total number of UMIs given number of different UMIs:

E(\#UMIs | \#different\ UMIs)
k=-m*(1-e^{-\frac{n}{m}})
n=-m*ln(1-\frac{k}{m})

k - #observed UMIs
n - #UMIs
m - pool size

Assume uniform distribution:

*Fu, G.K., et al., Counting individual DNA molecules by the stochastic attachment
of diverse labels. Proc Natl Acad Sci U S A, 2011. 108(22): p. 9026-31.

*

Real distribution:

Modeling of UMI collisions

UMI errors

Gene 1

Gene 1

 UMI 3: x15

UMI 2: x1

UMI 1: x3

UMI 3: x9

UMI 1: x5

 GTTA: x15

GGAC: x1

AATT: x3

UMI 4: x3

UMI 5: x1

GTTA: x9

AATT: x5

AGAG: x3

GTAA: x1

GTAA: x1

Cell 1

Cell 2

Expression

level: 3

Expression

level: 4

Real expression

level: 3

Real expression

level: 3

UMI errors. Simple approach

 GTTA: x13

TGAC: x1

AATT: x3

GGAC: x1

 GGTA: x8

 GTCA: x4

 GCCA: x2

 GTTT: x15

Total number of molecules: 8

UMI errors. Simple approach

 GTTA: x13

TGAC: x1

AATT: x3

GGAC: x1

 GGTA: x8

 GTCA: x4

 GCCA: x2

 GTTT: x15

Total number of molecules: 3

UMI errors. PCR amplification errors

 GTTA: x13

TGAC: x1

AATT: x3

GGAC: x1

 GGTA: x8

 GTCA: x4

 GCCA: x2

 GTTT: x15

Total number of molecules: 8

UMI errors. PCR amplification errors

 GTTA: x13

TGAC: x1

AATT: x3

GGAC: x1

 GGTA: x8

 GTCA: x4

 GCCA: x2

 GTTT: x15

Total number of molecules: 6

UMI errors. Bayesian approach

 GTTA: x13

TGAC: x1

AATT: x3

GGAC: x1

 GGTA: x8

 GTCA: x4

 GCCA: x2

S_g=7
U=\textbf{GTTA}
u_1,\ u_2,\ u_3
R=13
r_1,\ r_2,\ r_3
N_S=3, N_L=1

 GTTT: x15

q_1,\ q_2,\ q_3

 CTTA: x3

UMI sequence

Num. of reads

Quality

#Smaller/larger adjacent UMIs

#UMIs per gene

UMI errors. Information about target UMI

S_g=7
U=\textbf{GTTA}
R=13
N_S=3
p(U )
N_L=1
\ P(N_S | N_L, S_g, U)\

Probability to have such #adjacent UMIs given #adjacent UMIs with larger number of reads, #UMIs per gene and UMI probability

UMI errors. Information about source UMIs

u_1=\textbf{CTTA},\ u_2=\textbf{GGTA},\ u_3=\textbf{GTCA}
r_1=3,\ r_2=8,\ r_3=4
q_1,\ q_2,\ q_3
,\ R=13
\ p(Error_1 | r_1, R, p_{Err}) \sim Binom(r_1 | N=R + r_1, p = p_{err})\
,\ p_{err}
\ p(q_1|Err_1),\ p(q_2|Err_2)\ ,\ p(q_3|Err_3)

Sequence:

#Reads:

Quality:

Probability to make an error in a read

\ p(q_1|\neg Err_1),\ p(q_2|\neg Err_2),\ p(q_3|\neg Err_3)

Probability to have observed number of erroneous reads

Probability to have observed quality in the reads depends if are real or erroneous

Optimal separation on real / erroneous UMIs

Probability of error
0  -
3
4
8
q_1
q_3
q_2
r\uparrow
q\uparrow
p(Err_1) * p(\neg Err_3) * p(\neg Err_2)
p(Err_1) * p(Err_3) * p(\neg Err_3)
p(Err_1) * p(Err_2) * p(Err_3)
p(\neg Err_1) * p(\neg Err_3) * p(\neg Err_2)
p(Error_i) = p(Error_i | N_L, S_g, U, r_i, R, p_{Err}, q_i) \sim
Binom(r_i | N = (R + r + \sum_{r: r< r_i}r), p = p_{err}) *
* p(q_i) * P(N_S | N_L, S_g, U)

Maximum likelihood separation on real and erroneous reads

1'st UMI is erroneous

1'st and 3'rd UMIs are erroneous

All adjacent UMIs are erroneous

All adjacent UMIs are real

Validation of UMI correction

Probability to observe adjacent UMI depends on total #UMIs and sequencing depth:

Problem: How to know the real answer?

UMI trimming

Problem: How to know the real answer?

Idea: Let's trim UMIs and correct them.

GTTAAATTAG: x9

AATTTGCCTA: x5

GTTAAACCGT: x3

Original

AATTTG: x14

GTTAAA: x3

Trimmed

GTTAAATTAG: x9

AATTTGCCTA: x5

GTTAAACCGT: x3

Merged

("real" answer)

GATAAACCGT: x1

GATAAACCGT: x1

GATAAA: x1

Original

GTTAAATTAG: x9

AATTTGCCTA: x5

GTTAAACCGT: x3

GATAAACCGT: x1

GTTAAATTAG: x9

AATTTGCCTA: x5

GTTAAACCGT: x4

Correction

UMI trimming

Problem: How to know the real answer?

Idea: Let's trim UMIs and correct them.

Distribution of edit distances

  1. Sample pairs of UMIs.
  2. Estimate edit distances between UMIs within each pair.
  3. Estimate all pairwise distances in the dataset after different corrections.

Cell barcode errors

Gene 1

Gene 2

Gene 3

 UMI 3: x15

UMI 2: x1

UMI 1: x3

UMI 4: x2

UMI 4: x1

UMI 5: x3

AAAATTTGGG

Gene 1

Gene 2

Gene 3

UMI 3: x5

UMI 1: x1

UMI 5: x1

AAAACTTGGG

Gene 1

Gene 2

Gene 3

 UMI 3: x10

UMI 2: x1

UMI 1: x2

UMI 4: x2

UMI 4: x1

UMI 5: x2

AAAATTTGGG

List of real cell barcodes

Intersection of gene+UMI pairs

Gene 1

Gene 2

Gene 3

UMI 3: x5

UMI 1: x1

UMI 5: x1

AAAACTTGGG

Gene 1

Gene 2

Gene 3

 UMI 3: x10

UMI 2: x1

UMI 1: x2

UMI 4: x2

UMI 4: x1

UMI 5: x2

AAAATTTGGG

3 UMIs

5 UMIs

4 UMIs

S_1
S_2
S_I
P(S_I|S_1,S_2) \sim Poisson(S_I|\lambda=E(S_I|S_1,S_2))

Validation on 10x human/mouse mixture

Merge type
#Merges
Fraction of mixed merges Similarity to merge with barcodes
Poisson 8999
0.58% 99.74%
Known barcodes 8985
0.62%
100%
Merge all adjacent
21827
32.96% 20.67%

Background cells

10x 8k Human PBMCs dataset

Origin of background cells

Droplets

Cells

Barcoded cells, isolated within droplets

Background cells

10x 8k Human PBMCs dataset

Background cells. Used features

  1. Mean #reads per UMI
  2. Mean #UMI per gene
  3. Frac. of low-expressed genes
  4. Frac. of intergenic reads
  5. Frac. of not-aligned reads
  6. Frac. of mitochondrial reads

Background cells. PCA

Kernel Density Estimation

- Probabilistic interpretation

- Has single parameter (bandwidth)

Quality Scores

Filtration of background cells

Filtration of background cells

Results

Correction of

UMI and cell

barcodes

Adjustment of

UMI collisions

Filtration of background cells

10x

inDrop

Drop-seq

 

dropEst

pipeline

(C++)

 

 

dropestr

R package

(Rcpp)

 

Analysis

dropEst pipeline: accurate estimation of gene counts

Thank you for your attention!

Copenhagen interview

By Viktor Petukhov

Copenhagen interview

  • 68