Role of DNA methylation in the genomic rearrangements in P. tetraurelia

PhD Defense

DELEVOYE Guillaume

09/06/2022

Supervisor : Dr MEYER Eric

Jury members : Dr DUHARCOURT Sandra, Dr CHEN Chunlong, Dr DURET Laurent

Introduction

P. tetraurelia

Studying ciliates

A 350-year-old story

A. van Leeuwenhoek (1668)

"Animalcules"

Pasteur (1862)

Spontaneous generation

HS Jennings (~ 1900)

Paramecium as a model

T. Sonneborn (1937)

Non-Mendelian inheritance of mating type in Paramecium

Carol Greider & Elizabeth Blackburn (1985)

Nobel prize (telomeres in Tetrahymena)

Teams of E. Meyer and S. Duharcourt (2014)

Mating type determined not by DNA but by maternal RNAs, in Paramecium

There is more...

  • First known organisms that do not use the "universal" genetic code
    • Paramecium (Caron and Meyer 1985)
    • Tetrahymena (Preer et al. 1985)
  • Histone acetyltransferases (HAT)
    • Tetrahymena (Brownell et al. 1996)
  • Self-splicing introns (ribozymes)
    • Tetrahymena
  • Tubulin post-translational modifications

> The genome-wide programmed rearrangements <

P. tetraurelia: Nuclear dimorphism

Unicellular eukaryote with 3 nuclei:

  • 2xMICronuclei (2n)
    • Germline nucleus
    • Contains: Transposons + Repeated sequences + 49,260 Internal Eliminated Sequences (IES)
    • No transcription outside meiosis

DNA ratio: 1 MIC for 200 MAC

  • 1xMACronucleus (800n)
    • Somatic nucleus
    • Derived from the MIC genome
      • Amplified
      • Free from TEs, repeated sequences and IESs
    • Transcriptionally active
    • Rebuilt after each sexual process, under the control of the maternal MAC (hence frequent non-mendelian inheritance)

Programmed rearrangements

After sexual processes, a new MAC is formed, with important genome re-arrangements

Results in a MAC DNA almost purely made of coding sequences

Coyne et al. 2012

Profiling IESs (1/2)

  • Non-coding
  • Remnants of Tc1/Mariner
  • All excised by Pgm
    • Excised with single-nucleotide precision
    • Life or death issue
      • Genes interrupted
    • IES excision is exapted for mating type determination
  • Size shrinks with age
  • Most IES are very short (26-150bp)

 

49,260 unique sequences

O. Arnaiz et al 2012

PiggyMac (Pgm)

"Invade Bloom Abdicate Fade" model

Adapted from Glen Arthur Herrick 1997

100%

Profiling IESs (2/2)

  • 100% TA-Bounded
  • Weak consensus TAYAG
    • ~ Tc1/Mariner
  • Periodic size distribution (period ~10 bp)

O. Arnaiz et al. 2012

Not sufficient to distinguish IESs from the rest of the genome

IES recognition: scnRNA pathway

S. E. Allen and M. Nowacki - 2017

If not in the maternal MAC : Recognized and excised

Research question

Inactivation of scnRNA and iesRNA pathways:

  • ~30% of IESs are retained
  • Their retention is not even complete
  • Oldest IESs = more independent of small RNAs
  • IES features = insufficient to explain the recognition

All IESs may be recognized through the small RNAs

... but is there a redundant system for the oldest/shortest ones ?

Research question ~ Self vs. non-self recognition

Hypothesis : DNA methylation

5th base <-> PERMANENT

The DNA methylation hypothesis

N6-methyladenine (6mA) abundant in Paramecium:

  • 2.1-2.5% in the MAC and MIC of P. aurelia (Cummings et al. 1975)
  • Detection by SMRT in the MAC (A. Hardy et al. 2020)
    • 0.8% and 1.6% of adenines
    • 81.5% are located in AT sites
    • Enriched downstream of the Transcription Start Sites (TSS)
  • Detection in Oxytricha by L. Landweber et al. (2019)

?

Other:

  • 4mC ?
  • No 5mC in the MAC

If maintained throughout the cell cycle in the MAC, 6mA could explain:

  • Conservation through replication
  • Single-nucleotide precision

% Tetrahymena, Oxytricha

Tetrahymena: no 6mA in the MIC

Point out that it is a palindrome

Link with a maintenance methylase (for the discussion)

CpG DNMT1

- see if there is room & time

Transient?

Other possible role of DNA modifications

DNA modifications could also play a role in the new MAC in formation (transiently)

 

Part of the scnRNA pathway

[ QUITTE A ... ] Do scnRNAs generate the methylation to guide more precisely?

Not scnRNA

The DNA methylation hypothesis

And many other possibilities...

Maintenance methylase (if palindromic)

Methylase candidates

  • In 2015, DAMT-1 in C. elegans (6mA) - Greer et al.
  • MT-A70 domain of DAMT-1 identified in P. tetraurelia too
    • Silenced by RNA interference (Grouped by homology)
    • Reduction of 6mA (southwestern blot)

> Sequenced with PacBio SMSN sequencing

WT Veg

Control

silencing

T=2h

T=6h

RNA interference

Candidate methylases

Reduction 6mA

Southwestern blot

Total DNA

1:200 MIC !!!

LIST THE SAMPLES

First methylase != first 6mA

Southwestern: 90%. Why not us?

MTA1 -- ortholog 4-9-10

MTA9 -- Not catalytic in Tetrahymena [..] --> MT1A1B2

Notes


PacBio SMSN sequencing

$$ipdRatio = \frac{MeanIPD_{experiment}}{MeanIPD_{unmethylated\ control}}$$

~ 85% accuracy

~ 100% accuracy

Kinetic signatures

$$\text{relevant only if} \geq 25 \text{ measurements}$$
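The ipdRatio definition and its minimum-measurement rule can be illustrated with a minimal sketch. The function name `ipd_ratio` and its arguments are hypothetical, and the control value is assumed to be the mean IPD of an unmethylated control in the same nucleotide context.

```python
from statistics import mean

def ipd_ratio(sample_ipds, control_mean_ipd, min_measurements=25):
    """Mean IPD of the sample divided by the mean IPD of the unmethylated control.

    Returns None when fewer than `min_measurements` IPD values are available,
    since the ratio is then considered not relevant.
    """
    if len(sample_ipds) < min_measurements or control_mean_ipd <= 0:
        return None
    return mean(sample_ipds) / control_mean_ipd

# Made-up IPD values, for illustration only.
enough = ipd_ratio([2.0] * 25, control_mean_ipd=1.0)
too_few = ipd_ratio([2.0] * 24, control_mean_ipd=1.0)
print(enough, too_few)
```

Returning `None` rather than a number keeps under-covered positions out of any downstream thresholding.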

Global principle

Kinetic signatures depend on: nucleotide context, DNA modifications (6mA)

12 nucleotides in the channel

1 sequence of 12 = one speed

Four variants of PacBio

Only SMSN + in silico are compatible with our strategy

No analysis pipeline existed

> 25 measures

SMSN: same molecule, measured multiple times

AggSN: aggregation of distinct molecules

$$ipdRatio = \frac{MeanIPD_{experiment}}{MeanIPD_{unmethylated\ control}}$$

Whole Genome Amplified DNA (WGA)

Machine learning

"in silico"

Overview

  1. Identify MIC vs MAC molecules
  2. Analyze methylation
    • No pipeline !
      •  Implement and test it
  3. 6mA MAC
  4. 6mA MIC

Isolate the MIC

molecules

The random sampling strategy

IES

Other MIC

Other MIC

IES

MAC-Destined Sequences (MDS)

MAC

TA Junction

  • We work on total DNA :
    • ~ 1 molecule out of 200 comes from the MIC
    • Sometimes the physical origin of the molecule can be guessed
      • ~ 100% accurate PacBio consensus (CCS) !
    • MIC regions that are far from the IESs and the MIC-specific sequences cannot be studied

Mention the small inserts

Expected number of IES+ sequences

  • 1 molecule out of 200 comes from the MIC
    • A bit less due to contaminants
  • ~ 1/6 of MIC inserts will carry an IES
  • Number of PacBio consensus (CCS) per sample:
    • > 150,000 (multiplexed)
    • ~ 350,000 PacBio CCS (not multiplexed)

That is,

  • Expected ~100 to 300 IES+ sequences per experiment
    • Got 49 to 310
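The order of magnitude can be checked with a few lines of arithmetic. This is a minimal sketch: the function name `expected_ies_plus` is hypothetical, and the 1/200 and 1/6 fractions are the approximations quoted on this slide (contaminants are ignored).

```python
def expected_ies_plus(n_ccs, mic_fraction=1 / 200, ies_fraction=1 / 6):
    """Expected number of CCS reads that come from the MIC and carry an IES."""
    return n_ccs * mic_fraction * ies_fraction

# CCS counts per sample quoted above (multiplexed and not multiplexed).
for n_ccs in (150_000, 350_000):
    print(n_ccs, round(expected_ies_plus(n_ccs)))
```

With these inputs the expectation is about 125 and 292 IES+ sequences, in line with the ~100 to 300 range.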

Orders of magnitude:

This is not much, but if we are right, 100% of the scnRNA-independent IESs could be methylated

Raw numbers of PacBio consensus (CCS) per sample and category

IES retention

  • IESs are sometimes retained in the MAC

 

P(R)

1 - P(R)

  • Quantification: "IES Retention Score" (IRS)
    • MIRET : Cyril Denby Wilkes, Olivier Arnaiz, Linda Sperling 2016, eg:

5 reads IES -

1 read IES +

2 reads IES +

$$IRS_L = \frac{2}{2+5} \approx 29\%$$

$$IRS_R = \frac{1}{1+5} \approx 17\%$$

The higher the IRS, the higher the retention.
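The example can be reproduced with a minimal sketch of the per-boundary score. The function name `ies_retention_score` is hypothetical; it is not the MIRET implementation, only the ratio shown in the example.

```python
def ies_retention_score(ies_plus_reads, ies_minus_reads):
    """IRS at one IES boundary: IES+ reads / (IES+ reads + IES- reads)."""
    total = ies_plus_reads + ies_minus_reads
    if total == 0:
        raise ValueError("no informative reads at this boundary")
    return ies_plus_reads / total

irs_left = ies_retention_score(2, 5)   # 2 / 7
irs_right = ies_retention_score(1, 5)  # 1 / 6
print(f"IRS_L = {irs_left:.1%}, IRS_R = {irs_right:.1%}")
```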

IES retention

Pitfall

e.g.

MIC = 4n, MAC = 800n, R = 0.005 , N = 100 NGS reads

$$\mathbb{E}(IRS) = \frac{4 + 0.005 \times 800}{804} \approx 0.01$$

$$P(MIC|IES+) = 50\%$$

Let's just keep all IESs with an IRS = 0?

No!

Even a low IRS can be problematic for us!

When N is small (~100), it is simply impossible to detect low retention levels

Due to the MAC ploidy, even a retention above 0.5% (for MIC = 4n, MAC = 800n) leads to $$P(MIC|IES^+) < P(MAC|IES^+)$$
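The pitfall can be made concrete with a minimal sketch using the example values (MIC = 4n, MAC = 800n, R = 0.005, N = 100). The function names are hypothetical, and reads are assumed to be sampled independently and in proportion to ploidy.

```python
def expected_irs(k_mic, k_mac, retention):
    """E(IRS) = P(MIC) + P(R) * P(MAC), with P(MIC) = k_mic / (k_mic + k_mac)."""
    p_mic = k_mic / (k_mic + k_mac)
    return p_mic + retention * (1 - p_mic)

def p_mic_given_ies_plus(k_mic, k_mac, retention):
    """Probability that an IES+ molecule comes from the MIC."""
    return k_mic / (k_mic + retention * k_mac)

def p_zero_ies_plus_reads(k_mic, k_mac, retention, n_reads):
    """Probability of observing IRS = 0, i.e. no IES+ read among n_reads."""
    return (1 - expected_irs(k_mic, k_mac, retention)) ** n_reads

print(expected_irs(4, 800, 0.005))                 # about 0.01
print(p_mic_given_ies_plus(4, 800, 0.005))         # 0.5
print(p_zero_ies_plus_reads(4, 800, 0.005, 100))   # roughly 0.37
```

Under these assumptions an observed IRS of 0 happens about one time in three, even though half of the IES+ molecules would come from the MAC.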

IES retention

Proposed approach

??

Do it for each IES

Implicitly (pooled: depends on the IES more than on the replicate)

IES retention

Four options to estimate R :

  • Bayesian approaches (credible intervals)
    • Hamiltonian Monte Carlo
    • Inverse transform (calculus)
    • Rejection sampling Monte Carlo
  • Frequentist approach (confidence interval)

Criteria: computation time, hard to implement

Expected to give similar results

$$\mathbb{E}(IRS) = P(MIC) + P(R) \cdot P(MAC) $$
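As an illustration of the Bayesian route, here is a minimal sketch that approximates the posterior of R on a grid. It is not one of the four implementations compared here. It assumes a uniform prior on R, a binomial likelihood with success probability E(IRS), and MIC = 4n, MAC = 800n; all function names are hypothetical.

```python
import math

def posterior_r_grid(n_pos, n_reads, k_mic=4, k_mac=800, steps=2001):
    """Posterior of the retention rate R on a regular grid (uniform prior)."""
    p_mic = k_mic / (k_mic + k_mac)
    grid = [i / (steps - 1) for i in range(steps)]
    log_w = []
    for r in grid:
        p = p_mic + r * (1 - p_mic)            # E(IRS) for this value of R
        p = min(max(p, 1e-12), 1 - 1e-12)      # keep the logarithms finite
        log_w.append(n_pos * math.log(p) + (n_reads - n_pos) * math.log(1 - p))
    top = max(log_w)
    weights = [math.exp(x - top) for x in log_w]
    total = sum(weights)
    return grid, [w / total for w in weights]

def credible_interval(grid, probs, alpha=0.05):
    """Equal-tailed credible interval read from the cumulative posterior."""
    cum, low, high = 0.0, None, None
    for r, p in zip(grid, probs):
        cum += p
        if low is None and cum >= alpha / 2:
            low = r
        if cum >= 1 - alpha / 2:
            high = r
            break
    return low, high

def p_mic_given_ies_plus(r, k_mic=4, k_mac=800):
    """P(MIC | IES+) for a given retention rate R."""
    return k_mic / (k_mic + r * k_mac)

grid, probs = posterior_r_grid(n_pos=0, n_reads=150)
r_low, r_high = credible_interval(grid, probs)
print(p_mic_given_ies_plus(r_high), p_mic_given_ies_plus(r_low))
```

The bounds on R map to bounds on P(MIC|IES+) in reverse order, because a higher retention means a lower chance that an IES+ molecule is from the MIC. The numbers depend on the prior, the grid resolution and the assumed ploidy.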

IES retention

  • The 3 Bayesian methods are equivalent
  • Bayesian ~ Frequentist
  • True value in confidence/credible interval 95% of the time

Comparison / Benchmarks

$$P(MIC|IES+) \in [9.5\% - 93.7\%], \alpha = 5\%$$

Problem: the confidence intervals are very wide

  • e.g. 150 NGS reads, 0 IES+

For most IESs, we will simply not be able to tell whether a molecule comes from the MAC or the MIC

IES retention

Workaround : Pooling samples

Rare picture of Eric, doing some archaeology to find more samples to pool and gain coverage (circa 2022, colourized)

MITO --> Use to conclude: 50% retained when IES

Implicitly (pooled: depends on the IES more than on the replicate)

State explicitly that these are sequencing runs of total DNA from vegetative cells

REMOVE THE HISTOGRAM

Mac ploidy : 800n ?

If MAC ploidy = 800n, then without retention:

$$E(IRS) = \frac{4}{800+4} \approx 0.005$$

With retention:

$$E(IRS) \gg \frac{4}{800+4}$$

Something is odd !

0.002-0.003

Mac ploidy ... At least 1600n ?
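The reasoning can be sketched by inverting the no-retention formula E(IRS) = 4 / (K + 4) to get the MAC ploidy K implied by an IRS value. This is a minimal sketch: the function name is hypothetical, and 0.002-0.003 is assumed to be the range of observed IRS values. Since any retention only raises the IRS, the result is a lower bound on K.

```python
def min_mac_ploidy(observed_irs, k_mic=4):
    """Smallest MAC ploidy compatible with an observed IRS (no retention)."""
    return k_mic * (1 / observed_irs - 1)

for irs in (0.002, 0.0025, 0.003):
    print(irs, round(min_mac_ploidy(irs)))
```

An IRS of 0.0025 corresponds to about 1600n, twice the 800n usually assumed.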


IES retention

Pooling samples is not sufficient !

On average, we will have only very vague estimates of P(MIC|IES+) !

Computed with Kmac = 1600n

SMSN pipeline

A few months of plumbing later...

Using E. coli DNA

  • Nearly 100% 6mA (symmetrical):
    • GATC +++
    • EcoK
    • A few others
      • Depends a lot on the strain
         
  • Nearly 0% 6mA :
    • Everything else

E. coli is used to feed Paramecium (contaminants)

So we can use it to test the pipeline

ipdRatio in E. coli (1/2)

  • The nucleotides we expect to be methylated have a high ipdRatio
  • Slight changes between motifs
  • Some exceptions: truly unmethylated?

ipdRatio in E. coli (2/2)

Separability and coverage are correlated

How to binarize the ipdRatio ?

Either a nucleotide is methylated, or it is not :

  • We need to use a threshold on the ipdRatio to call modified nucleotides
  • This threshold has to take the coverage effect into account
  • No optimal solution anyway

Our pragmatic solution: an arbitrary linear threshold
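A coverage-dependent linear threshold could look like the minimal sketch below. The `intercept` and `slope` values are hypothetical placeholders, not the coefficients used in this work, and the direction of the slope is an assumption. The minimum coverage of 25 reuses the rule that an ipdRatio is only relevant with at least 25 measurements.

```python
def call_modified(ipd_ratio, coverage, intercept=3.0, slope=-0.01, min_coverage=25):
    """Binarize an ipdRatio with a threshold that is linear in coverage.

    Returns True or False for a call, or None when the coverage is too low
    for the ipdRatio to be considered relevant.
    """
    if coverage < min_coverage:
        return None
    threshold = intercept + slope * coverage  # placeholder coefficients
    return ipd_ratio >= threshold

print(call_modified(5.0, 30))  # above the threshold
print(call_modified(1.0, 30))  # below the threshold
print(call_modified(5.0, 10))  # not enough measurements
```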

Benchmark (6mA)

  • ~92% of 6mA in EcoK and GATC
  • ~99.8% of non-6mA elsewhere

If we make the simplifying assumption that all GATC/EcoK sites are methylated and that 6mA is only present there:

$$Sensitivity = P(D|M)$$

$$Se = 92\%$$

$$Specificity = P(\overline{D}|\overline{M})$$

$$Sp = 99.8\%$$

But:

  • Some GATC/EcoK sites are unmethylated
    • The real Se is actually better than 92%
  • A small amount of 6mA lies outside GATC/EcoK sites
    • The real Sp is actually better than 99.8%
  • Se = 92% and Sp = 99.8% are worst-case estimates
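Under the simplifying assumption, Se and Sp reduce to two ratios of counts. The sketch below uses made-up counts chosen only to reproduce 92% and 99.8%; they are not the real numbers of adenines in the benchmark.

```python
def sensitivity(true_positives, false_negatives):
    """Se = P(D|M): fraction of methylated adenines that are detected."""
    return true_positives / (true_positives + false_negatives)

def specificity(true_negatives, false_positives):
    """Sp = P(not D | not M): fraction of unmethylated adenines not detected."""
    return true_negatives / (true_negatives + false_positives)

# Hypothetical counts: 1,000 adenines in GATC/EcoK sites, 100,000 elsewhere.
se = sensitivity(true_positives=920, false_negatives=80)
sp = specificity(true_negatives=99_800, false_positives=200)
print(f"Se = {se:.1%}, Sp = {sp:.1%}")
```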

Benchmark

(other modifications)

PacBio sequencing was already known for its propensity to generate false positives for 4mC (K. O’Brown et al. 2014)

 

  • It is very likely that most of these detections are false positives

Qv30

Is it the same for PCR-amplified DNA?

DNA modifications

In the MAC

Hallmarks in HTVEG


  • Between 1.25% and 1.45% of 6mA in the MAC
  • Between 97.39% and 100% of them are located in AT sites

Taking into account the uncertainty on Se and Sp:

Problem: some results will vary greatly depending on Se and Sp!

After applying the corrections

The 4 scenarios for Se and Sp

  • We don't care about the exact values of Se and Sp
  • We care about the fact that possible mis-estimations of them do not really change anything

Add colors

Perfect

Under-estimation

Over-estimation

Some confusions

Global level of 6mA in the MAC

-50% NM4+9+10

>> Bulk of 6mA in the MAC

Other candidates too

Same as in Tetrahymena

Role of our candidates

Symmetrical methylation of hemi-methylated AT sites

Rise in hemi-methylation, whose magnitude depends strongly on how well Se and Sp are estimated

Role of our candidates

De novo methylation of unmethylated AT sites

Role of our candidates

The capacity to make symmetrical methylation is never abolished completely

Outside of AT sites

Predicted FDR : 100%

But likely detection outside of AT sites too

 

AGAA and GAGG motifs are documented as methylated sites (6mA) in C. elegans too (Greer et al. 2015)

DNA modifications

In the MIC

A ruthless data shrinkage

Number of molecules with at least one exploitable adenine

- several IESs

- variable MAC regions

- extremity outliers

VARIABLE REGIONS!

P(MIC|IES+) among the surviving molecules

The vast majority of IES+ molecules actually come... from the MAC!

A few MAC molecules concentrate the majority of 6mA (1/2)

A few MAC molecules concentrate the majority of 6mA (2/2)

We can never exclude the hypothesis that 6mA comes from the MAC

The 4 scenarios for Se and Sp

  • We don't care about the exact values of Se and Sp
  • We care about the fact that possible mis-estimations of them do not really change anything

Finding unbiased estimators for $$\pi$$ and FDR

If p is the number of positive detections among N tests:

p = FP + TP

So,

Which means

And:
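One standard way to obtain such estimators is the Rogan-Gladen correction, sketched below. It is not necessarily the exact derivation used here. It assumes that $$\pi$$ is the true fraction of modified sites, that Se and Sp are known, and that E(p/N) = π·Se + (1 - π)·(1 - Sp). Function names are hypothetical.

```python
def estimate_prevalence(p, n, se, sp):
    """Estimate pi from p positive detections among n tests (clamped to [0, 1])."""
    raw = (p / n + sp - 1) / (se + sp - 1)
    return min(1.0, max(0.0, raw))

def estimate_fdr(p, n, se, sp):
    """Estimate FDR = FP / p, using E(FP) = n * (1 - pi) * (1 - sp)."""
    if p == 0:
        raise ValueError("FDR is undefined without any positive detection")
    pi_hat = estimate_prevalence(p, n, se, sp)
    expected_fp = n * (1 - pi_hat) * (1 - sp)
    return min(1.0, expected_fp / p)

# Made-up counts: 1,118 detections among 100,000 adenines.
print(estimate_prevalence(1118, 100_000, se=0.92, sp=0.998))
print(estimate_fdr(1118, 100_000, se=0.92, sp=0.998))
# Detection rate below 1 - Sp: every call is compatible with a false positive.
print(estimate_fdr(100, 100_000, se=0.92, sp=0.998))
```

When the detection rate does not exceed 1 - Sp, the estimated prevalence is 0 and the predicted FDR is 100%.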

What it gives in Paramecium

Methodological development to correct hemi-methylation detection (1/2)

Let FD1 and FD2 be, respectively:

  • Fraction of AT sites detected as hemi-methylated
  • Fraction of AT sites detected as symmetrically methylated

    PZ0, PZ1, PZ2: unbiased estimators of the fractions of non-, hemi- and symmetrically methylated AT sites

Then:

With

Methodological development to correct hemi-methylation detection (2/2)

We can also find the number of hemi-methylated sites detected as such, and the proportion of sites detected as hemi-methylated that are really hemi-methylated. This is possible because we now approximately know PZ0, PZ1 and PZ2, and P(D|Z) is easy to determine:

Then, P(Z|D) can be determined through Bayes' theorem using P(D|Z), P(Z) and P(D) (which are all known)

P(Z=1|D=1) is our case of interest
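The correction can be sketched as follows, under one possible model. Z and D are assumed to be the true and detected numbers of methylated adenines at an AT site (0, 1 or 2), and detection errors on the two strands are assumed independent with the same Se and Sp. Function names are hypothetical.

```python
def detection_matrix(se, sp):
    """P(D|Z): rows are detected states D = 0, 1, 2; columns are true states Z."""
    return [
        [sp * sp, (1 - se) * sp, (1 - se) ** 2],
        [2 * sp * (1 - sp), se * sp + (1 - se) * (1 - sp), 2 * se * (1 - se)],
        [(1 - sp) ** 2, se * (1 - sp), se * se],
    ]

def solve3(a, b):
    """Solve a 3x3 linear system by Gauss-Jordan elimination with pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][3] / m[i][i] for i in range(3)]

def true_fractions(observed, se, sp):
    """Corrected P(Z) from the observed fractions P(D), since P(D) = P(D|Z) P(Z)."""
    return solve3(detection_matrix(se, sp), observed)

def p_hemi_given_detected_hemi(observed, se, sp):
    """Bayes' theorem: P(Z=1|D=1) = P(D=1|Z=1) * P(Z=1) / P(D=1)."""
    pz = true_fractions(observed, se, sp)
    return detection_matrix(se, sp)[1][1] * pz[1] / observed[1]

# Made-up observed fractions of AT sites detected as non-, hemi-, sym-methylated.
observed = [0.90, 0.04, 0.06]
print(true_fractions(observed, se=0.92, sp=0.998))
print(p_hemi_given_detected_hemi(observed, se=0.92, sp=0.998))
```

Because 2·Se·(1 - Se) is not negligible, many sites detected as hemi-methylated can be symmetrically methylated sites with one missed strand, which is why the raw hemi-methylated fraction needs correcting.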

What it gives in Paramecium

2.1 Reverse engineering

The capping of IPDs


 

  • modelPrediction is the IPD value predicted by the model for the nucleotide context at this position

  • globalIPD is the mean of all the IPD values of the read

  • localIPD represents all IPDs that have been mapped at a given position in the genome, including those from other sequences

Conclusion on the capping

  • Is not implemented as advertised by PacBio
  • The way it is implemented for AggSN is problematic and does not really make sense
  • Paradoxically, it should be more relevant for our approach than for the default one
  • We expect no methylation to be undetected due to the capping

 

Laura Landweber 2020

Oxytricha trifallax

A outside AT, score 20, idQv20 (812 seq)

A outside AT, score 20, idQv20 + strong BH correction (176 seq)
