Studying Paramecium's epigenetics with PacBio sequencing

DELEVOYE Guillaume - M2 bioinformatics

Supervisors: Eric Meyer, Mathieu Bahin, Auguste Genovesio

A 340 years old story

Christopher Hooke

1665 - "Micrographia"

30x microscope

First Cell description

Antoni Van Leeuwenhoek

1678 - "Animalcules"

First observation of single-cell organisms

Infusoria, Paramecium

CIliates: Main discoveries

 

1862 - Pasteur: Refutation of the spontaneous generation theory with infusoria

1937 - Sonneborn: Non-mendelian inheritance of sexual type in paramecium

Elizabeth blackburn & Carol Greider: 1985 - Telomeres and telomerases on Tetrahymena (Nobel Prize 2009)

Eric Meyer, Sandra Duharcourt

(IBENS, I. Jacques Monod 2014)

 

Sexual type in paramecium are transmitted by maternal RNAs, not by DNA

How do we see paramecium in 2018 ?

  • Eucaryote, Ciliate
  • 3 nuclei:
    • 2xMIC nuclei (2n)
      • Reproduction
    • 1xMAC nucleus (up to 800n)
      • Transcription
      • Partial MIC

 

 

 

 

Paramecium's history

3 important events

  1. Whole genome copy (x3)
  2. MIC/MAC Differenciation
  3. PiggyMac Transposase domestication

Survival

MIC/MAC Differenciation

Transposable elements are suppressed in the MAC, where the transcription occurs

 

Avoids the negative effect of TE

PiggyMac

In 2018, question is...

...How does the cell know where to cut ?

And even more than that

  • WGD potentially implies a lot of ohnologous TE

  • How would the cell still recognize them over the time ?

Genetic drift...

Answer:

It's not fully understood

Maternal inheritance

The old MAC (maternal) drives everything during the formation of the new one:

 

  • Active transcription
  • Sexual determination (RNA)
  • Excision of IES

 

= Non-mendelian / Cytoplasmic heridity

What we already know about excision

~99% TA-boundary

 

~ 100% PGM-dependant

"Domesticated enzyme to fight transposons"

Clean cut&Paste mechanism

 

60% SCAN-RNA Pathway

IES excision (3/3): The scanRNA pathway

  • SCAN-RNA --> Excision of 60% of IES max  (shown by DICER-like silencing)

 

  • Piwi shuttle

 

  • What about the 40% remaining ?

DNA Methylation ?

So, recently in Eric's lab...

Modified bases play an important role in:

  • Procaryote's DNA/RNA
  • Eucaryote's RNA/DNA

Lots of "orphan-MTases"

Remains highly misunderstood

 

Seeking 6-mA recognition domain

Protein identification

RNA silencing

Death

--> Role in IES excision ?

Currently being studied

Hypothesis

Please cut me !

Don't cut me !

Modified bases: A large landscape of possibilities

Thymine = 5-methyl-uracile ?

And many others in RNA...

Pseudouridine (ψ) Dihydrouridine (D) m1A m2A m6A m62A i6A t6A Am Ar(p) DHT m3U mo5U s2U s4U iG m1G m2G m22G Gm k2C iC ψC ψiC m3C m4C m5C m42C Cm ac4C s2C m1I Im...

<-- 3 known most frequent

methylations in DNA

Challenges

  • Sequencing : Sorting MIC and MAC

    • Not experimentally possible
      • Or we loose the methylation information...
  • 98% of IES are < 200 pb: "Long read" needed (to sort)

 

  • MIC: 1/400 to 1/800 sequences only !
  • Resolution needs:
    • Single molecule
    • Single strand
    • Single nucleotide
  • Identification of modified bases, including unknown ones
  • 3 Whole genome duplication ! (mapping)

 

PacBio sequencing meets all expectations

99% accuracy

Max

75% accuracy

Methylation analysis needs local > 25X

Trained model (ML) allows detection of suspect downturns of polymerase (function of the -3/+8 nt context) --> IPD are captured

SMRTLink

A very user-friendly browser interface

Advanced analysis are impossible

--> Command-line tools required

Theory: Experimental strategy

Classical PacBio Approach: Higher depth by overlapping the holes

Our approach: Shorter, real single hole analysis, much more passes

We ~always have either 0X or >>> 25X

But reality can be cruel...

Murphy's Law statistically hits a lot if you're trying 500.000 times

Other cases:

  • Some IES are excised, some others not
  • Partial excision...
  • Multiple mapping that are ~ equivalent
  • Not the same mapping position between MAC and MIC
  • etc...

Step 1: Sorting the sequences

Create and align the consensus

0 - Quality filter (Z-score)

 

1 - Create the consensus (=CCS)

 

2 - Map the CCS on MAC / MAC+IES (BLASR)

--> Only the best alignment reported (forced)

 

3 - Filter 90% identity for either MAC or MAC+IES

 

4 - Compare the mapping

Reminder: CCS are expected to be somewhat around 99% accuracy

Differential mapping

Sequences that come from MIC

N changes

"=" becomes "I"

Threshold N = 27

At least N = 27 positions map on IES

Diffential mapping

Sequences that come from MAC

A least 2 positions have to be on a junction

N > 27 changes

"D doesn't match I on expected IES"

Other cases:

  • Put in "unknown" group:

    • Low identity CCS
    • N number of changes under threshold (27)
    • CCS mapping on totally different scaffolds, with about the same scores/identity
    • CCS that didn't map on any IES or IES junction

 

Everything else:

"Trash"

 

Visual inspection

  • First intuitive checking
  • Stringent --> Better so ! (FP are avoided)

Rigorous Benchmarking

  • For now mostly visual checking, but:
    • Automatic checking of presence/absence of IES/junction
    • Coherent number of sorted/unsorted DNA
  • Soon a rigorous benchmarking (TP/FP/TN/FN): PBSIM

Step 2

Recreating the corresponding .bam files

  • Can need a lot of RAM for the hashed lookup tables (python dicts)
  • Ok on the cluster with higher RAM usage
  • 218 holes of MIC inserts that cross an IES
    • All greatly covered (>25X to 100X)

Step 3:

Single hole Analysis of methylation

Remember

Everyone else's PacBio usage:

Eric, a weird scientist:

Pauses of the polymerase:

A statistical problem

Sometimes the polymerase is "Lazy" and stops for a VERY long time --> Really annoying (methylation analysis)

Outliers:

  •  Very big compared to others
  •  Really disturb all computations related to the mean:

 

Modification detection, modification identification, p-value, PHRED-score...

Outliers: How to deal with them ?

  • We want to eliminate very big values
  • We don't want to loose high IPDs correlated with methylation
  • Trade-off

What's in PacBio's statistical blackbox

How PacBio handles it

--->> Values are "capped":

cappingValue = max(99th chunk, 4* modele, 75th local percentile)

  • We need the mean
    • p-value, PHRED... Rely on the mean
    • We cannot take the median
  • Some context of nucleotides are naturally slower than others
    • Outliers are relative !!
    • Context is important
    • They should represent far less than 1% of all IPDs values

For every position in the reference:

What if we analyze hole/hole ?

cappingValue = max(99th chunk, 4* modele, 75th local percentile)

Probably it doesn't change much

But it's rigorously not the same

Someone already worked on it over the last months

 

Unfortunately

  • Old .h5 format
  • No p-value, no PHRED-score, no identification
  • Many statistical refinments, but not those we need

The solutions

  • Median <-> Maybe not ideal ?

 

 

  • Let's pretend we're doing it normally
  • Pre-compute all the capping values
  • Re-use those capping values in a true single-hole analysis

 

>>> This is also how the recently publication of Beaulaurier works (with hd5)

Hacking PacBio

PacBio "kineticsTools" could work only:

  • On whole genome
  • Without giving all the values it computed

 

Now it's properly hacked:

  • Spits every capping values so that we can study them
  • Works on little chunks / Single hole (much faster)
  • Properly hacked to work in Parallel (Multiprocessing)
  • Not Condor ready
  • Up to 80 processors on Bioclusts02
  • ~ 10s/hole on Bioclusts02

A first try

Strand issue (BLASR) to fix: Only strand 0 is available

Quality threshold

Real results

What remains to do

  • Annoying behaviours of BLASR to fix
  • BLASR doesn't perform great for "one position only"
    • Alternatives exist... I didn't investigate it yet
  • Benchmark the sorting
  • Compare modified-capping VS naive approach
  • Motif analysis
  • Stats SCAN-RNA vs independant ++
  • Other samples
  • Nextflow for reproductibility

Studying Paramecium's methylation with PacBio sequencing

By biocompibens

Studying Paramecium's methylation with PacBio sequencing

Lab meeting - 19/06/18

  • 98