Baysor: segmentation of spatial transcriptomics data

Viktor Petukhov, Peter Kharchenko



  1. University of Copenhagen, BRIC
  2. Harvard Medical School, DBMI
  3. Harvard Stem Cell Institute


How to map expression to space?

Spatial gene expression patterns



1. X. Qian, K.D. Harris, T. Hauling, D. Nicoloutsopoulos, A.M. Manchado, N. Skene, J. Hjerling-Leffler, M. Nilsson, bioRxiv 2018, 431957276097

2. Moffitt, J. R. et al. Molecular, spatial and functional single-cell profiling of the hypothalamic preoptic region. Science 362 (2018)



  • ISS
  • osmFISH
  • smFISH
  • BaristaSeq
  • Up to 10  cells
  • Up to 10  transcripts
  • Up to 10000 genes

Spatial protocols based on RNA-FISH or in situ sequencing

  • ex-FISH
  • StarMAP
  • seq-FISH





Segmentation problems



*Moffitt, J. R. et al. Molecular, spatial and functional single-cell profiling of the hypothalamic preoptic region. Science 362 (2018)

Segmentation problems





1. Moffitt, J. R. et al. Molecular, spatial and functional single-cell profiling of the hypothalamic preoptic region. Science 362 (2018)

2. Simone Codeluppi, Lars E. Borm, Amit Zeisel, Gioele La Manno, Josina A. van Lunteren, Camilla I. Svensson & Sten Linnarsson. Nature Methods 15, 932–935 (2018)


Preparation: local gene expression

Gene coloring



Local expression coloring



k nearest neighbors

Gene 1 ... Gene K
N1 ... N_K

Local expression vector LE

Embed to 3d CIELAB colorspace
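
A minimal sketch of the local expression vector described on this slide: for every molecule, count the genes among its k nearest neighbors. The function name `local_expression`, the brute-force neighbor search, and counting the molecule itself as one of its own neighbors are assumptions made for illustration. The embedding into CIELAB color space is not shown.

```python
from collections import Counter
from math import dist


def local_expression(molecules, genes, k):
    """molecules: list of (x, y, gene). Returns one gene-count vector per molecule."""
    vectors = []
    for x, y, _ in molecules:
        # Brute-force search for the k nearest molecules (the molecule itself included).
        nearest = sorted(molecules, key=lambda m: dist((x, y), m[:2]))[:k]
        counts = Counter(gene for _, _, gene in nearest)
        # One entry per gene, in the order given by `genes`.
        vectors.append([counts[g] for g in genes])
    return vectors


molecules = [
    (0.0, 0.0, "A"), (0.1, 0.0, "A"), (0.0, 0.1, "B"),
    (5.0, 5.0, "B"), (5.1, 5.0, "B"), (5.0, 5.1, "A"),
]
le = local_expression(molecules, genes=["A", "B"], k=3)
```

For real data the quadratic neighbor search would need to be replaced by a spatial index, and the counts can be divided by k if fractions are preferred.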

Baysor: segmentation of spatial transcriptomics data

X Y Gene
... ... ...


cell size

Transcript data














Algorithm: toy example

Gene 1: 20%

Gene 2: 80%

Gene 1: 80%

Gene 2: 20%

What's the source?
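
The toy question can be worked through with Bayes' rule. This is only a sketch: the names `cell_1` and `cell_2` and the equal priors are assumptions, since the slide gives only the two gene compositions.

```python
def source_posterior(gene, compositions, priors):
    """Posterior probability of each candidate source for one observed gene."""
    joint = {c: priors[c] * comp[gene] for c, comp in compositions.items()}
    total = sum(joint.values())
    return {c: p / total for c, p in joint.items()}


compositions = {
    "cell_1": {"Gene 1": 0.2, "Gene 2": 0.8},
    "cell_2": {"Gene 1": 0.8, "Gene 2": 0.2},
}
priors = {"cell_1": 0.5, "cell_2": 0.5}  # assumption: both sources equally likely a priori
posterior = source_posterior("Gene 1", compositions, priors)
```

With equal priors, a Gene 1 molecule is four times more likely to come from the source that is 80% Gene 1. Position and cell size change the priors and therefore the answer.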

X Y Gene
... ... ...

We know

Baysor model

Cell as a distribution

Non-conjugate, but has good parametrization

(mean and std instead of #degrees of freedom for Inverse Gamma)

Doesn't work yet

Baysor model

Molecules as a random field



  • Points are molecules
  • Point colors are transcript types (i.e. genes)
  • Lines are the random field edges
  • Background colors are some cell segmentation


w_{u,v} = p(cell(m_u) = cell(m_v) | m_u, m_v) \sim \\ \sim \frac{cor(LE_u, LE_v)}{\sqrt{(x_u - x_v)^2 + (y_u - y_v)^2}}

edge weight

local expression
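
The edge weight formula can be sketched as follows. The slide only states proportionality, so the value is unnormalized. Clipping negative correlations to zero, returning zero for constant vectors, and the small floor on the distance are assumptions added to keep the weight finite and non-negative.

```python
from math import hypot, sqrt


def pearson(a, b):
    """Pearson correlation of two equally long vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # assumption: treat an undefined correlation as 0
    return cov / sqrt(var_a * var_b)


def edge_weight(pos_u, le_u, pos_v, le_v, min_distance=1e-9):
    """Unnormalized w_{u,v}: cor(LE_u, LE_v) divided by the Euclidean distance."""
    distance = hypot(pos_u[0] - pos_v[0], pos_u[1] - pos_v[1])
    # Assumptions: clip negative correlations, guard against coincident molecules.
    return max(pearson(le_u, le_v), 0.0) / max(distance, min_distance)


w = edge_weight((0.0, 0.0), [3, 1, 0], (3.0, 4.0), [6, 2, 0])
```

Here the two local expression vectors are perfectly correlated and the molecules are 5 units apart, so the weight is about 0.2.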

Baysor: Dirichlet Mixture Models

  1. Initialize cells from some clustering (K-Shift is used)
  2. Estimate the probabilities that molecules belong to the cells (E-step)
  3. Stochastically assign molecules to the cells (S-step)
  4. Maximize parameters of the cell distributions (M-step)
  5. Optionally: update priors
  6. Sample new cells from Dirichlet prior (key difference from SEM algorithm)
  7. Go to 2
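
A heavily simplified skeleton of the E/S/M loop, for orientation only. It is not the Baysor implementation: the random initialization, the fixed isotropic `sigma`, and the add-one smoothing are assumptions, and the random field weights, the prior updates (step 5), and the sampling of new cells (step 6) are left out.

```python
import random
from math import exp


def stochastic_em(molecules, n_cells, genes, n_iters=20, sigma=1.0, seed=0):
    """molecules: list of (x, y, gene). Returns one cell label per molecule."""
    rng = random.Random(seed)
    # Step 1 (assumption): random initial labels instead of a real clustering.
    labels = [rng.randrange(n_cells) for _ in molecules]
    for _ in range(n_iters):
        # M-step: size, center and gene composition of every non-empty cell.
        params = []
        for c in range(n_cells):
            members = [m for m, lab in zip(molecules, labels) if lab == c]
            if not members:
                params.append(None)
                continue
            n = len(members)
            mean = (sum(m[0] for m in members) / n, sum(m[1] for m in members) / n)
            # Add-one smoothing keeps every gene probability positive.
            comp = {g: (sum(m[2] == g for m in members) + 1) / (n + len(genes))
                    for g in genes}
            params.append((n, mean, comp))
        # E-step and S-step: unnormalized membership scores, then a random draw.
        for i, (x, y, gene) in enumerate(molecules):
            scores = []
            for p in params:
                if p is None:
                    scores.append(0.0)
                    continue
                n, (mx, my), comp = p
                d2 = (x - mx) ** 2 + (y - my) ** 2
                scores.append(n * exp(-d2 / (2 * sigma ** 2)) * comp[gene])
            labels[i] = rng.choices(range(n_cells), weights=scores)[0]
    return labels


molecules = [(0.0, 0.0, "A"), (0.5, 0.2, "A"), (0.2, 0.6, "B"),
             (10.0, 10.0, "B"), (10.4, 9.8, "B"), (9.7, 10.3, "A")]
labels = stochastic_em(molecules, n_cells=2, genes=["A", "B"])
```

The loop starts with the M-step because the initial labels play the role of the initial clustering. After that the order is the same cycle as in the list above.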

Baysor: one step example

Distribution for S-step:

f_c(m_t) = \#molecules(c) * N(x_t, y_t | \mu_c, \Sigma_c) * Cat(g_t | G_c)
p(m_t \in c) = \frac{\sum_{v \in adj(t) : cell(v) = c} w_{v,t} f_c(m_t)}{\sum_{v \in adj(t)} w_{v,t} f_{cell(v)}(m_t)}
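
A sketch of how the S-step distribution on this slide could be evaluated for one molecule. The dictionary layout of a cell (`n_molecules`, `mean`, `cov`, `gene_probs`) and the representation of the neighborhood as (cell id, edge weight) pairs are assumptions made for the example.

```python
import random
from math import exp, pi, sqrt


def normal_pdf_2d(x, y, mean, cov):
    """Density of a bivariate normal with a 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = x - mean[0], y - mean[1]
    # Quadratic form with the inverse of the covariance matrix.
    q = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return exp(-0.5 * q) / (2 * pi * sqrt(det))


def component_density(molecule, cell):
    """f_c(m_t) = #molecules(c) * N(x_t, y_t | mu_c, Sigma_c) * Cat(g_t | G_c)."""
    x, y, gene = molecule
    return (cell["n_molecules"]
            * normal_pdf_2d(x, y, cell["mean"], cell["cov"])
            * cell["gene_probs"].get(gene, 0.0))


def assignment_probs(molecule, neighbors, cells):
    """neighbors: (cell_id, edge_weight) for every molecule adjacent to `molecule`."""
    scores = {}
    for cell_id, weight in neighbors:
        density = component_density(molecule, cells[cell_id])
        scores[cell_id] = scores.get(cell_id, 0.0) + weight * density
    total = sum(scores.values())  # assumed positive; all-zero scores need extra handling
    return {c: s / total for c, s in scores.items()}


identity = ((1.0, 0.0), (0.0, 1.0))
cells = {
    "A": {"n_molecules": 10, "mean": (0.0, 0.0), "cov": identity,
          "gene_probs": {"g1": 0.8, "g2": 0.2}},
    "B": {"n_molecules": 10, "mean": (3.0, 0.0), "cov": identity,
          "gene_probs": {"g1": 0.2, "g2": 0.8}},
}
probs = assignment_probs((0.5, 0.0, "g1"), [("A", 1.0), ("A", 0.5), ("B", 1.0)], cells)
# S-step: draw the cell of the molecule from the computed distribution.
sampled = random.Random(0).choices(list(probs), weights=list(probs.values()))[0]
```

Only cells that appear among the neighbors of a molecule receive a non-zero probability, which is what keeps the assignment spatially local.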

Baysor: Algorithm demonstration


Problem: validation of segmentation

  1. Number of cells
  2. Fraction of molecules assigned to noise
  3. Number of doublets based on expression markers
  4. Contamination level based on segmentation-free cell type assignment
  5. Detailed comparison with manual segmentation of Allen smFISH

*3 and 4 are probably the same

Segmentation results

Local gene composition

Cell type

Segmentation results

Number of segmented cells

Protocol        Baysor   Staining
osm-FISH        10059    4572
Allen sm-FISH   4435     2525
(unlabeled)     9279     6119
pciSeq          2547     3413

% of assigned molecules

Protocol        Baysor   Staining
osm-FISH        87.4     44.1
Allen sm-FISH   79.6     61.6
(unlabeled)     75.5     47.4
pciSeq          25.7     25.8

Reducing expression contamination


1. Simone Codeluppi, Lars E. Borm, Amit Zeisel, Gioele La Manno, Josina A. van Lunteren, Camilla I. Svensson & Sten Linnarsson. Nature Methods 15, 932–935 (2018)

What is contamination?

Low expression / false positive


Solution: segmentation-free type assignment

Local gene composition

Cell type


Step 1: extract markers

osmFISH paper annotation

expressed: Gad2, Pthlh, Crh
not expressed: Tbr1, Rorb, Mfge8, Cpne5

expressed: Tbr1, Lamp5, Rorb, Syt6
not expressed: Mfge8, Gad2, Mrc1

expressed: Aldoc, Gfap, Serpinf1, Mfge8
not expressed: Hexb, Lamp5, Mrc1, Gad2, Sox10, Rorb, Tbr1, Syt6, Plp1

expressed: Sox10, Plp1, Pdgfra, Tmem6, Itpr2, Ctps, Bmp4, Anln
not expressed: Hexb, Mrc1, Aldoc, Gfap, Gad2, Tbr1

expressed: Hexb
not expressed: Gad2, Tbr1, Gfap, Mfge8

expressed: Mrc1
not expressed: Rorb, Lamp5, Syt6, Cpne5, Gfap, Mfge8, Plp1

expressed: Flt1, Apln, Vtn, Acta2
not expressed: Lamp5, Rorb, Sox10, Gad2, Syt6, Crh

expressed: Ttr, Foxj1
not expressed: Gad2, Cpne5

expressed: Kcnip
not expressed: Gad2, Tbr1, Lamp5, Rorb, Slc32a1

## Inhibitory

>Inh Crhbp
expressed: Crhbp
subtype of: Inhibitory

>Inh Cnr1
expressed: Cnr1
subtype of: Inhibitory

>Inh Kcnip
expressed: Kcnip
subtype of: Inhibitory

>Inh Pthlh
expressed: Pthlh
subtype of: Inhibitory

>Inh Vip
expressed: Vip
subtype of: Inhibitory

>Inh Crh
expressed: Crh
not expressed: Vip
subtype of: Inhibitory

## Vasculature

>Vasc Flt1
expressed: Flt1
subtype of: Vasculature

>Vasc Vtn
expressed: Vtn
subtype of: Vasculature

>Vasc Apln
expressed: Apln
subtype of: Vasculature

>Vasc Acta2
expressed: Acta2
subtype of: Vasculature

## Excitatory

>Ex Rorb
expressed: Rorb
subtype of: Excitatory

>Ex Syt6
expressed: Syt6
subtype of: Excitatory

>Ex Tbr1
expressed: Tbr1
not expressed: Syt6, Rorb
subtype of: Excitatory

>Ex Lamp5
expressed: Lamp5
not expressed: Syt6, Rorb
subtype of: Excitatory

## Oligodendrocytes

>Oligo Cop
expressed: Bmp4
subtype of: Oligodendrocytes

>Oligo MF
expressed: Ctps
subtype of: Oligodendrocytes

>Oligo NF
expressed: Itpr2
subtype of: Oligodendrocytes

>Oligo Precursors
expressed: Pdgfra
subtype of: Oligodendrocytes

>Oligo Mature
expressed: Plp1, Anln
not expressed: Itpr2, Ctps, Bmp4
subtype of: Oligodendrocytes

## Ventricle

expressed: Foxj1
subtype of: Ventricle

>C. Plexus
expressed: Ttr
subtype of: Ventricle

## Astrocytes

>Astro Mfge8
expressed: Mfge8
subtype of: Astrocytes

>Astro Gfap
expressed: Gfap
not expressed: Mfge8
subtype of: Astrocytes

Extracted markers
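
The marker definitions above follow a simple line-based layout, which could be read with a parser along these lines. This is a sketch: `parse_markers` is a hypothetical helper, and treating lines that start with `#` as comments is an assumption.

```python
def parse_markers(text):
    """Parse '>Name' blocks with 'expressed', 'not expressed' and 'subtype of' lines."""
    cell_types, current = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # assumption: blank lines and '#' lines carry no marker data
        if line.startswith(">"):
            current = {"expressed": [], "not expressed": [], "subtype of": None}
            cell_types[line[1:].strip()] = current
            continue
        key, _, value = line.partition(":")
        key = key.strip()
        if current is None or key not in current:
            raise ValueError(f"unexpected line: {raw!r}")
        if key == "subtype of":
            current[key] = value.strip()
        else:
            current[key] = [g.strip() for g in value.split(",") if g.strip()]
    return cell_types


sample = """## Inhibitory

>Inh Crh
expressed: Crh
not expressed: Vip
subtype of: Inhibitory

>Oligo Mature
expressed: Plp1, Anln
not expressed: Itpr2, Ctps, Bmp4
subtype of: Oligodendrocytes
"""
markers = parse_markers(sample)
```

Entries whose `>Name` line did not survive on the slide cannot be parsed this way, because the parser attaches every attribute line to the most recent name.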

Step 1: extract markers

osmFISH paper annotation

Pagoda embedding, same annotation

Step 1: extract markers

New annotation, level 1

New annotation, level 2

Step 2: extract local vectors


  • 1976659 pseudo-cells
  • Expression is very sparse (10 reads per cell)


  • No graph
  • No embeddings

Step 2: extract local vectors

New annotation, level 1

New annotation, level 2

Step 3: estimate fraction of the most represented type per cell
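
Step 3 could look roughly like this, assuming every molecule already carries a segmented cell id and a segmentation-free cell type label. The function name `max_type_fraction` and the input layout are hypothetical.

```python
from collections import Counter, defaultdict


def max_type_fraction(cell_ids, molecule_types):
    """For each cell, return the most represented type and its fraction of molecules."""
    per_cell = defaultdict(Counter)
    for cell, cell_type in zip(cell_ids, molecule_types):
        per_cell[cell][cell_type] += 1
    result = {}
    for cell, counts in per_cell.items():
        top_type, top_count = counts.most_common(1)[0]
        result[cell] = (top_type, top_count / sum(counts.values()))
    return result


cell_ids = [1, 1, 1, 1, 2, 2]
molecule_types = ["Inh", "Inh", "Inh", "Ex", "Astro", "Astro"]
fractions = max_type_fraction(cell_ids, molecule_types)
```

A fraction well below 1 means that molecules of several types were merged into one cell, which is the contamination signal this validation relies on.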

Validation of the approach

Cell type

Max. fraction

Validation: zoom in

Validation: add polyT



Validation: summary

We want to improve this plot

Next steps

Cell type expression prior


Aggregate expression over similar cells


  • Nearest neighbors depend on the distance measure, and there is no way to choose a good one a priori
  • Contamination has its own patterns, so similar cells are simply contaminated in the same manner

Example cell



Cell sampling prior

  • "Contamination" regions are too dense to be noise, and the probability of forming a new cell is too small

Cell sampling prior

  1. Initialize cells from some clustering
  2. Estimate the probabilities that molecules belong to the cells (E-step)
  3. Stochastically assign molecules to the cells (S-step)
  4. Maximize parameters of the cell distributions (M-step)
  5. Optionally: update priors
  6. Sample new cells from Dirichlet prior
  7. Go to 2

Split-merge algorithm

Chinese restaurant processes
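
For reference, the standard Chinese restaurant process rule: a molecule joins an existing cell with probability proportional to the cell size and opens a new cell with probability proportional to a concentration parameter. The sketch below shows only this textbook rule. The name `alpha` and its value are assumptions, and how the rule would be combined with the spatial model is left open.

```python
def crp_probabilities(cell_sizes, alpha):
    """Return (probabilities of joining each existing cell, probability of a new cell)."""
    total = sum(cell_sizes) + alpha
    existing = [n / total for n in cell_sizes]
    return existing, alpha / total


existing, new_cell = crp_probabilities([3, 1], alpha=1.0)
```

With cell sizes 3 and 1 and `alpha = 1`, the probabilities are 0.6, 0.2 and 0.2 for a new cell. A larger `alpha` makes new cells more likely, which is the lever for the dense "contamination" regions mentioned earlier.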

Segmentation-free DAPI processing

w_{u,v} = p(cell(m_u) = cell(m_v) | m_u, m_v) \sim \\ \sim \frac{cor(LE_u, LE_v)}{\sqrt{(x_u - x_v)^2 + (y_u - y_v)^2}}
w_{u,v} = p(cell(m_u) = cell(m_v) | m_u, m_v) \sim \\ \sim \frac{cor(LE_u, LE_v)}{\sqrt{(x_u - x_v)^2 + (y_u - y_v)^2}} * F(brightness)

Transcript info

Staining info

Segmentation of "bulk" data

Slide-Seq: 10μm beads


HDST: 2μm wells

  1. Rodriques S.G. et al. Slide-seq: A scalable technology for measuring genome-wide expression at high spatial resolution
  2. Vickovic S. et al. High-definition spatial transcriptomics for in situ tissue profiling

Baysor, PM 06 Nov 2019

By Viktor Petukhov

PKLab progress meeting presentation
