Baysor: segmentation of spatial transcriptomics data

Viktor Petukhov (1,2), Peter Kharchenko (2,3)

  1. University of Copenhagen, BRIC
  2. Harvard Medical School, DBMI
  3. Harvard Stem Cell Institute

https://bit.ly/2WRTWg5

Problem

How to map expression to space?

Spatial gene expression patterns

pciSeq [1]

MERFISH [2]

1. X. Qian, K.D. Harris, T. Hauling, D. Nicoloutsopoulos, A.M. Manchado, N. Skene, J. Hjerling-Leffler, M. Nilsson, bioRxiv 2018, 431957276097

2. Moffitt, J. R. et al. Molecular, spatial and functional single-cell profiling of the hypothalamic preoptic region. Science 362 (2018)

Protocols

Spatial protocols based on RNA-FISH or in situ sequencing

  • MERFISH
  • ISS
  • osmFISH
  • smFISH
  • BaristaSeq
  • DARTFISH
  • ex-FISH
  • StarMAP
  • seq-FISH
  • FISSEQ

Data

  • Up to 10^6 cells
  • Up to 10^7 transcripts
  • Up to 10000 genes

Segmentation problems

Molecules

DAPI

*Moffitt, J. R. et al. Molecular, spatial and functional single-cell profiling of the hypothalamic preoptic region. Science 362 (2018)

Segmentation problems

MERFISH [1]

osm-FISH [2]

1. Moffitt, J. R. et al. Molecular, spatial and functional single-cell profiling of the hypothalamic preoptic region. Science 362 (2018)

2. Simone Codeluppi, Lars E. Borm, Amit Zeisel, Gioele La Manno, Josina A. van Lunteren, Camilla I. Svensson & Sten Linnarsson. Nature Methods 15, 932–935 (2018)

Methods

Preparation: local gene expression

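Gene coloring (axes: X, Y)

Local expression coloring (axes: X, Y)

  1. For each molecule, take its k nearest neighbors
  2. Count the neighbors per gene (Gene 1 ... Gene K: N_1 ... N_K) to get the local expression vector LE
  3. Embed LE into the 3D CIELAB color space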
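The local expression vector can be sketched in a few lines of Python. This is a minimal illustration, not the Baysor implementation: the function name `local_expression`, the brute-force neighbor search, and the choice to count a molecule among its own neighbors are all assumptions. The CIELAB embedding step is not covered.

```python
import math
from collections import Counter

def local_expression(molecules, k):
    """For each molecule (x, y, gene), count genes among its k nearest molecules.

    Assumption: the molecule itself is included in its own neighborhood.
    Brute-force search, O(n^2 log n); fine for a toy example only.
    """
    vectors = []
    for xu, yu, _ in molecules:
        nearest = sorted(
            molecules, key=lambda m: math.hypot(m[0] - xu, m[1] - yu)
        )[:k]
        # N_1 ... N_K: number of neighbors per gene
        vectors.append(dict(Counter(gene for _, _, gene in nearest)))
    return vectors

# Hypothetical toy data: two spatially separated groups of molecules
molecules = [
    (0.0, 0.0, "A"), (0.1, 0.0, "A"), (0.0, 0.1, "B"),
    (5.0, 5.0, "B"), (5.1, 5.0, "B"), (5.0, 5.1, "B"),
]
le = local_expression(molecules, k=3)
```

For real data with millions of molecules a spatial index (for example a k-d tree) would replace the sort.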

Baysor: segmentation of spatial transcriptomics data

  • Transcript data: table with columns X, Y, Gene
  • Expected cell size
  • Optional: DAPI or poly-A staining

Algorithm: toy example

  • Gene 1: 20%, Gene 2: 80%
  • Gene 1: 80%, Gene 2: 20%

We know: X, Y, Gene for each molecule

What's the source?
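One way to read the toy example is as a Bayes-rule question: given only the gene of a molecule and the two gene compositions, which source is more likely? The sketch below assumes the two compositions belong to two candidate sources with equal prior probability and ignores the spatial coordinates; the names `source_posterior`, `source_1`, and `source_2` are hypothetical.

```python
def source_posterior(gene, compositions, priors):
    """Posterior probability of each source given the observed gene (Bayes rule)."""
    joint = {s: priors[s] * comp.get(gene, 0.0) for s, comp in compositions.items()}
    total = sum(joint.values())  # assumed > 0, i.e. some source can emit the gene
    return {s: p / total for s, p in joint.items()}

# Compositions from the toy example; equal priors are an assumption
compositions = {
    "source_1": {"Gene 1": 0.2, "Gene 2": 0.8},
    "source_2": {"Gene 1": 0.8, "Gene 2": 0.2},
}
priors = {"source_1": 0.5, "source_2": 0.5}
post = source_posterior("Gene 1", compositions, priors)
```

Under these assumptions a "Gene 1" molecule is four times more likely to come from the second source than from the first.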

Baysor model

Cell as a distribution

Non-conjugate, but has a good parametrization (mean and std instead of the number of degrees of freedom for Inverse Gamma)

Doesn't work yet

Baysor model

Molecules as a random field

X

Y

  • Points are molecules
  • Point colors are transcript types (i.e. genes)
  • Lines are the random field edges
  • Background colors are some cell segmentation

Triangulation

w_{u,v} = p(cell(m_u) = cell(m_v) | m_u, m_v) \sim \frac{cor(LE_u, LE_v)}{\sqrt{(x_u - x_v)^2 + (y_u - y_v)^2}}

Here w_{u,v} is the edge weight and LE_u is the local expression vector of molecule u.
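The edge weight above can be computed directly for one pair of adjacent molecules. The sketch assumes the edges come from a triangulation built elsewhere, that the two molecules have distinct positions, and that negative correlations are clipped to zero so the weight stays non-negative (the slide does not say how they are handled). `pearson` and `edge_weight` are illustrative names.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length vectors; 0.0 if either is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)

def edge_weight(pos_u, pos_v, le_u, le_v):
    """Unnormalized weight: cor(LE_u, LE_v) divided by the Euclidean distance."""
    dist = math.hypot(pos_u[0] - pos_v[0], pos_u[1] - pos_v[1])
    # Assumption: negative correlations give zero weight
    return max(pearson(le_u, le_v), 0.0) / dist

# Same local composition at distance 5, and opposite composition at distance 5
w_similar = edge_weight((0.0, 0.0), (3.0, 4.0), [3, 1, 0], [6, 2, 0])
w_different = edge_weight((0.0, 0.0), (3.0, 4.0), [3, 1, 0], [0, 1, 3])
```

Molecules with similar neighborhoods that lie close together get a large weight, so the random field favors putting them in the same cell.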

Baysor: Dirichlet Mixture Models

  1. Initialize cells from some clustering (K-Shift is used)
  2. Estimate the probability that each molecule belongs to each cell (E-step)
  3. Stochastically assign molecules to the cells (S-step)
  4. Maximize parameters of the cell distributions (M-step)
  5. Optionally: update priors
  6. Sample new cells from Dirichlet prior (key difference from SEM algorithm)
  7. Go to 2
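The loop in steps 2 to 4 can be illustrated on a deliberately simplified problem. The sketch below is a plain stochastic EM for a 1D Gaussian mixture with a fixed standard deviation; it omits the prior update, the sampling of new cells (steps 5 and 6), the gene composition, and the random field, so it is not the Baysor algorithm. All names and the toy data are assumptions.

```python
import math
import random

def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def stochastic_em(points, means, sd=1.0, n_iters=10, seed=0):
    rng = random.Random(seed)
    means = list(means)
    sizes = [1] * len(means)  # molecules per component, uniform at the start
    assignment = [0] * len(points)
    for _ in range(n_iters):
        # E-step: probability of each point under each component,
        # weighted by the current component size
        probs = []
        for x in points:
            dens = [n * normal_pdf(x, m, sd) for n, m in zip(sizes, means)]
            total = sum(dens)
            probs.append([d / total for d in dens])
        # S-step: sample one component per point from these probabilities
        assignment = [rng.choices(range(len(means)), weights=p)[0] for p in probs]
        # M-step: re-estimate component parameters from the sampled assignment
        for c in range(len(means)):
            members = [x for x, a in zip(points, assignment) if a == c]
            sizes[c] = len(members)
            if members:
                means[c] = sum(members) / len(members)
    return means, assignment

points = [-0.5, 0.0, 0.5, 19.5, 20.0, 20.5]
means, assignment = stochastic_em(points, means=[2.0, 18.0])
```

A component that loses all its points keeps size zero and never recovers in this sketch, which is one reason an extra step for sampling new components is useful.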

Baysor: one step example

E-step:

f_c(m_t) = \#molecules(c) * N(x_t, y_t | \mu_c, \Sigma_c) * Cat(g_t | G_c)

Distribution for S-step:

p(m_t \in c) = \frac{\sum_{v \in adj(t) : cell(v) = c} w_{v,t} f_c(m_t)}{\sum_{v \in adj(t)} w_{v,t} f_{cell(v)}(m_t)}
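The two formulas of this step can be written out as a short Python sketch. It assumes each cell is described by a molecule count, a 2D mean, a 2x2 covariance matrix, and a gene probability table, and that the neighbors of molecule t are given as (cell id, edge weight) pairs. The dictionary layout and function names are hypothetical.

```python
import math

def gaussian_2d(x, y, mu, cov):
    """Density of a bivariate normal with a full 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = x - mu[0], y - mu[1]
    # Quadratic form with the inverse covariance matrix
    q = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * q) / (2 * math.pi * math.sqrt(det))

def component_density(molecule, cell):
    """f_c(m_t) = #molecules(c) * N(x_t, y_t | mu_c, Sigma_c) * Cat(g_t | G_c)."""
    x, y, gene = molecule
    spatial = gaussian_2d(x, y, cell["mu"], cell["cov"])
    return cell["n_molecules"] * spatial * cell["gene_probs"].get(gene, 0.0)

def assignment_probs(molecule, neighbors, cells):
    """p(m_t in c): densities summed over adjacent molecules, weighted by edge weight."""
    scores = {}
    for cell_id, weight in neighbors:
        f = component_density(molecule, cells[cell_id])
        scores[cell_id] = scores.get(cell_id, 0.0) + weight * f
    total = sum(scores.values())  # assumed > 0
    return {c: s / total for c, s in scores.items()}

identity = ((1.0, 0.0), (0.0, 1.0))
cells = {
    1: {"n_molecules": 10, "mu": (-1.0, 0.0), "cov": identity,
        "gene_probs": {"A": 0.8, "B": 0.2}},
    2: {"n_molecules": 10, "mu": (1.0, 0.0), "cov": identity,
        "gene_probs": {"A": 0.2, "B": 0.8}},
}
# A molecule of gene "A" halfway between the cells, one neighbor in each cell
probs = assignment_probs((0.0, 0.0, "A"), [(1, 1.0), (2, 1.0)], cells)
# Same molecule with two neighbors in cell 2
probs_two = assignment_probs((0.0, 0.0, "A"), [(1, 1.0), (2, 1.0), (2, 1.0)], cells)
```

Only cells that contain at least one adjacent molecule receive a non-zero probability, which keeps the S-step local.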

Baysor: Algorithm demonstration

Results

Problem: validation of segmentation

  1. Number of cells
  2. Fraction of molecules assigned to noise
  3. Number of doublets based on expression markers
  4. Contamination level based on segmentation-free cell type assignment
  5. Detailed comparison with manual segmentation of Allen smFISH

*3 and 4 are probably the same

Segmentation results

Local gene composition

Cell type

Segmentation results

Number of segmented cells

Protocol            Baysor   Staining
osm-FISH            10059    4572
Allen sm-FISH       4435     2525
MERFISH (subset)    9279     6119
pciSeq              2547     3413

% of assigned molecules

Protocol            Baysor   Staining
osm-FISH            87.4     44.1
Allen sm-FISH       79.6     61.6
MERFISH (subset)    75.5     47.4
pciSeq              25.7     25.8

Reducing expression contamination

[1]

1. Simone Codeluppi, Lars E. Borm, Amit Zeisel, Gioele La Manno, Josina A. van Lunteren, Camilla I. Svensson & Sten Linnarsson. Nature Methods 15, 932–935 (2018)

What is contamination?

Low expression / false positive

Contamination

Solution: segmentation-free type assignment

Local gene composition + Cell type

Step 1: extract markers

osmFISH paper annotation

>Inhibitory
expressed: Gad2, Pthlh, Crh
not expressed: Tbr1, Rorb, Mfge8, Cpne5

>Excitatory
expressed: Tbr1, Lamp5, Rorb, Syt6
not expressed: Mfge8, Gad2, Mrc1

>Astrocytes
expressed: Aldoc, Gfap, Serpinf1, Mfge8
not expressed: Hexb, Lamp5, Mrc1, Gad2, Sox10, Rorb, Tbr1, Syt6, Plp1

>Oligodendrocytes
expressed: Sox10, Plp1, Pdgfra, Tmem6, Itpr2, Ctps, Bmp4, Anln
not expressed: Hexb, Mrc1, Aldoc, Gfap, Gad2, Tbr1

>Microglia
expressed: Hexb
not expressed: Gad2, Tbr1, Gfap, Mfge8

>Macrophages
expressed: Mrc1
not expressed: Rorb, Lamp5, Syt6, Cpne5, Gfap, Mfge8, Plp1

>Vasculature
expressed: Flt1, Apln, Vtn, Acta2
not expressed: Lamp5, Rorb, Sox10, Gad2, Syt6, Crh

>Ventricle
expressed: Ttr, Foxj1
not expressed: Gad2, Cpne5

>Hippocampus
expressed: Kcnip
not expressed: Gad2, Tbr1, Lamp5, Rorb, Slc32a1

## Inhibitory

>Inh Crhbp
expressed: Crhbp
subtype of: Inhibitory

>Inh Cnr1
expressed: Cnr1
subtype of: Inhibitory

>Inh Kcnip
expressed: Kcnip
subtype of: Inhibitory

>Inh Pthlh
expressed: Pthlh
subtype of: Inhibitory

>Inh Vip
expressed: Vip
subtype of: Inhibitory

>Inh Crh
expressed: Crh
not expressed: Vip
subtype of: Inhibitory

## Vasculature

>Vasc Flt1
expressed: Flt1
subtype of: Vasculature

>Vasc Vtn
expressed: Vtn
subtype of: Vasculature

>Vasc Apln
expressed: Apln
subtype of: Vasculature

>Vasc Acta2
expressed: Acta2
subtype of: Vasculature

## Excitatory

>Ex Rorb
expressed: Rorb
subtype of: Excitatory

>Ex Syt6
expressed: Syt6
subtype of: Excitatory

>Ex Tbr1
expressed: Tbr1
not expressed: Syt6, Rorb
subtype of: Excitatory

>Ex Lamp5
expressed: Lamp5
not expressed: Syt6, Rorb
subtype of: Excitatory

## Oligodendrocytes

>Oligo Cop
expressed: Bmp4
subtype of: Oligodendrocytes

>Oligo MF
expressed: Ctps
subtype of: Oligodendrocytes

>Oligo NF
expressed: Itpr2
subtype of: Oligodendrocytes

>Oligo Precursors
expressed: Pdgfra
subtype of: Oligodendrocytes

>Oligo Mature
expressed: Plp1, Anln
not expressed: Itpr2, Ctps, Bmp4
subtype of: Oligodendrocytes

## Ventricle

>Ependymal
expressed: Foxj1
subtype of: Ventricle

>C. Plexus
expressed: Ttr
subtype of: Ventricle

## Astrocytes

>Astro Mfge8
expressed: Mfge8
subtype of: Astrocytes

>Astro Gfap
expressed: Gfap
not expressed: Mfge8
subtype of: Astrocytes

Extracted markers
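The marker definitions above follow a simple record layout: a `>` line names a cell type, followed by `expressed:`, `not expressed:`, and `subtype of:` fields, with `##` lines acting as section comments. A minimal parser for that layout, assuming exactly these conventions, might look as follows; `parse_markers` is a hypothetical helper, not part of any tool named in the slides.

```python
def parse_markers(text):
    """Parse marker records into {cell_type: {field: value}}."""
    cell_types = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and '## Section' comments
        if line.startswith(">"):
            current = {"expressed": [], "not expressed": [], "subtype of": None}
            cell_types[line[1:].strip()] = current
        elif ":" in line and current is not None:
            key, value = (part.strip() for part in line.split(":", 1))
            if key == "subtype of":
                current[key] = value
            elif key in current:
                current[key] = [g.strip() for g in value.split(",") if g.strip()]
    return cell_types

example = """
## Inhibitory

>Inh Crh
expressed: Crh
not expressed: Vip
subtype of: Inhibitory

>Microglia
expressed: Hexb
not expressed: Gad2, Tbr1, Gfap, Mfge8
"""
markers = parse_markers(example)
```

Top-level types keep `subtype of` as `None`, so the two annotation levels can be separated by filtering on that field.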

Step 1: extract markers

osmFISH paper annotation

Pagoda embedding, same annotation

Step 1: extract markers

New annotation, level 1

New annotation, level 2

Step 2: extract local vectors

Problems:

  • 1976659 pseudo-cells
  • Expression is very sparse (10 reads per cell)

Result:

  • No graph
  • No embeddings

Step 2: extract local vectors

New annotation, level 1

New annotation, level 2

Step 3: estimate fraction of the most represented type per cell
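Step 3 reduces to a per-cell count. The sketch below assumes every molecule already carries a segmentation-free cell type label and a cell id from the segmentation, and that noise molecules were filtered out beforehand; the function name and toy labels are illustrative.

```python
from collections import Counter, defaultdict

def max_type_fraction(cell_ids, molecule_types):
    """For each cell, return (most represented type, its fraction of the cell's molecules)."""
    per_cell = defaultdict(Counter)
    for cell, mol_type in zip(cell_ids, molecule_types):
        per_cell[cell][mol_type] += 1
    result = {}
    for cell, counts in per_cell.items():
        top_type, top_count = counts.most_common(1)[0]
        result[cell] = (top_type, top_count / sum(counts.values()))
    return result

cell_ids = [1, 1, 1, 1, 2, 2]
molecule_types = ["Inh", "Inh", "Inh", "Ex", "Astro", "Astro"]
fractions = max_type_fraction(cell_ids, molecule_types)
# {1: ('Inh', 0.75), 2: ('Astro', 1.0)}
```

A fraction close to 1 means the cell is dominated by one type; lower values point to contamination or a doublet.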

Validation of the approach

Paper

Baysor

Validation of the approach

Cell type

Max. fraction

Validation: zoom in

Validation: add polyT

Paper

Baysor

Validation: summary

We want to improve this plot

Next steps

Cell type expression prior

Idea:

Aggregate expression over similar cells

Problems:

  • Nearest neighbors depend on the distance measure, and there is no way to choose a good one a priori
  • Contamination has its own patterns, so similar cells are simply contaminated in the same manner

Example cell

Nearest neighbors

Cell sampling prior

  • "Contamination" regions are too dense to be noise, but the probability of forming a new cell is too small

Cell sampling prior

  1. Initialize cells from some clustering
  2. Estimate the probability that each molecule belongs to each cell (E-step)
  3. Stochastically assign molecules to the cells (S-step)
  4. Maximize parameters of the cell distributions (M-step)
  5. Optionally: update priors
  6. Sample new cells from Dirichlet prior
  7. Go to 2

Split-merge algorithm

Chinese restaurant processes

Segmentation-free DAPI processing

Transcript info only:

w_{u,v} = p(cell(m_u) = cell(m_v) | m_u, m_v) \sim \frac{cor(LE_u, LE_v)}{\sqrt{(x_u - x_v)^2 + (y_u - y_v)^2}}

Transcript info and staining info:

w_{u,v} = p(cell(m_u) = cell(m_v) | m_u, m_v) \sim \frac{cor(LE_u, LE_v)}{\sqrt{(x_u - x_v)^2 + (y_u - y_v)^2}} * F(brightness)

The fraction carries the transcript info, and F(brightness) carries the staining info.

Segmentation of "bulk" data

Slide-Seq: 10μm beads

500μm

HDST: 2μm wells

  1. Rodriques S.G., et al. Slide-seq: A scalable technology for measuring genome-wide expression at high spatial resolution
  2. Vickovic S., et al. High-definition spatial transcriptomics for in situ tissue profiling

Baysor, PM 06 Nov 2019

By Viktor Petukhov

PKLab progress meeting presentation