Role of DNA methylation in the genomic rearrangements in P. tetraurelia

PhD Defense

DELEVOYE Guillaume

09/06/2022

Supervisor : Dr MEYER Eric

Jury members : Dr DUHARCOURT Sandra, Dr CHEN Chunlong, Dr DURET Laurent

Introduction

P. tetraurelia

P. tetraurelia: Nuclear dimorphism

Unicellular eucaryote (ciliate) with 3 nuclei:

  • 2xMICronuclei (2n)
    • Germline nucleus
    • Contains: Transposons + 49.260 Internal Excised Sequences (IES)
    • No transcription outside of meiosis

DNA ratio: 1 MIC for 200 MAC

  • 1xMACronucleus (800n)
    • Somatic nucleus
    • Derived from the MIC genome
      • Amplified
      • Free from TEs and IESs
    • Transcriptionnally active
    • Rebuilt after each sexual process

Programmed rearrangements

After sexual processes, a new MAC is formed, with important genome re-arrangements

Results in a MAC DNA almost purely made of coding sequences

Coyne et al. 2012

Profiling IESs (1/2)

  • Non-coding
  • Remnants of Tc1/Mariner ?
  • All excised by Pgm
    • Excised  with a single-nucleotide precision
    • Life or death issue : Genes interrupted
  • IES excision was exapted many times
    • e.g The mating-type !
  • Size shrinks with age, most IESs are very short (26-150bp)

49.260 unique sequences

O. Arnaiz et al 2012

"Invade Bloom Abdicate Fade" model

Adapted from Glen Arthur Herrick 1997

Excision = 100% PiggyMac (Pgm)

Profiling IESs (2/2)

  • 100% TA-Bounded
  • Weak consensus TAYAG
    • ~ Tc1/Mariner

       


       
  • Periodic size distribution ~10bp

O. Arnaiz et al. 2012

Not sufficient to distinguish IESs from the rest of the genome

IES recognition: scnRNA pathway

S. E. Allen and M. Nowacki - 2017

If not in the maternal MAC : Recognized and excised

Problematic

Problematic

Inactivation of scnRNA and iesRNA pathways:

  • ~30% of IESs are retained
  • Their retention is not even complete
  • Oldest = More independent to small RNAs
  • IES features = insufficient to explain the recognition

All IESs may be recognized through the small RNAs

... but is there a redundant system for the oldest/shortest ones ?

Problematic ~ Self VS non-self recognition

Hypothesis : Role of DNA modifications

Two possible roles of DNA methylation

DNA modifications could play a role :

  • In the recognition of IESs +++
    • i.e : It is permanently present in the MIC
  • In their excision
    • i.e : It is transiently present in the new forming MAC, right when the IESs are excised.
      • ~ Actor of the scnRNA pathway

Hypothesis

The fifth element

  • 2.1-2.5% in the MAC and MIC of P. aurelia (Cummings et al. 1975)

  • Detection by SMRT in the MAC (A. Hardy et al. 2020)
    • 0.8% and 1.6% of adenines
    • 81.5% are located in AT sites
    • Enriched downstream the Transcription Start Sites (TSS)
  • In other ciliates :
    • AT sites ++ and TSS ++ :
      • Oxytricha by L. Landweber et al. (2019)
      • Tetrahymena (No 6mA in MIC)

 

... Could be N6-methyladenine (6mA)

?

Other:

  • 4mC ?
  • No 5mC in the MAC

Police-sketch of the methylation pattern

If the pattern allows the recognition of IESs, it must allow :

  1. Single-nucleotide precision
  2. Distinction between a TA of an IES, from another TA elsewhere
  3. Conservation through replication

A typical example : DNMT1-Like system

The DNA methylation hypothesis

Examples of possible patterns

And many other possibilities...

Maintenance

through replication ?

Single-nucleotide precision OK

Maintenance through replication : OK if DNMT1-Like

Single-nucleotide precision ?

Experimental approach

Long before I arrived in the lab !

Methylase candidates

  • DAMT-1 in C. elegans (6mA) Greer et al. 2015
     
  • MTA-70 domain of DAMT-1 identified in P. tetraurelia too

Expr. Sexual events (Collaborators)

RNA methyltransferases ?

Our 6 proteins

 

  • NM4
  • NM9
  • NM10
  • MT1A
  • MT1B
  • MT2

Sequencing strategy

WT Veg

Control

silencing

T=2h

T=6h

RNA interference

Candidate methylases

Reduction 6mA

Southwesternblot

up to 90%

Total DNA

1:200 MIC !!!

Protocol :

  1. Sequence vegetative and autogamous cells
  2. Silence the methylase candidates by RNA interference
  3.  Sequence with PacBio SMRT sequencing

Autogamy

PacBio SMRT sequencing

~ 85% accuracy

~ 100% accuracy

if many passes

10 – 15 kb

IPD

Nucleotide context (-3/+8nt)

Kinetic signatures

Depends on

 

DNA modifications

ipdRatio

$$ipdRatio= \frac{MeanIPD_{experience}}{unmethylated\ control}$$

(Relevant if >25 measures)

IPD

Two types of control :

  • PCR
  • In-silico (machine-learning)

Ideal approach

  1. Purify MIC DNA
  2. Make libraries of long inserts from it
    • Several kb
  3. Sequence it with PacBio SMRT
  4. Compare with PCR of full MIC genome

e.g

~O(20) to ~O(100) molecules

[...]

  • Position n°7260 in the genome
  • Strand +
  •  52 molecules methylated out of 100

Problem = Purification of MIC DNA

(min.  25X)

Transcient ?

  • Fish IES+ molecules in a sea of MAC molecules
  • Short inserts (~350bp)
  • Same molecule sequenced multiple times

Random sampling strategy

each molecule

~O(20) to ~O(100) measures

(MIC)

Workaround : work on total DNA instead

  • Molécule n°49256
  • Nucléotide n°7, strand + (Adénine)
  • ipdRatio : 20x

But purifying the MIC is problematic

(min.  25X)

Look at methylation (only) around (some) IESs

in-silico control required

My job starts here !

NM4.bam

NM9_10.bam

MT1A_1B.bam

MT1A_1B_2.bam

MT2.bam

 

HTVEG.bam

MAB.bam

HT2.bam

HT6.bam

NM4_9_10.bam

Challenge = Sorting + No pipeline

The guiding thread

  1. Identify MIC vs MAC molecules
  2. DNA methylation detection :
    • No pipeline !
    • Implementation and test
  3. 6mA MAC
  4. 6mA MIC

Identify MIC vs MAC molecules

Sorting (overview)

~ 85% accuracy

~ 100% accuracy

10 – 15 kb

~350bp

Alignment

MAC

MAC+IES

MIC

The random sampling strategy

IES

Other MIC

Other MIC

IES

Mac Destinated Sequences (MDS)

MAC

TA Junction

  • Most sequences are Mac Destinated Sequences (MDS) : We cannot guess their nuclear origin
    • A vast majority of them (>99.5%) comes from the MAC

Expected number of IES+ sequences

  • 1 molecule out of 200 comes from the MIC
    • A bit less due to contaminants
  • ~ 1/6 of MIC inserts will carry an IES
  • # of PacBio consensus (CCS) per sample :
    • > 150.000 (multiplexed)
    • ~ 350.000 PacBio CCS (not multiplexed)

That is,

  • Expected ~100 to 300 IES+ sequences per experiment
    • Got 49 to 310

Orders of magnitude :

This is not much, but if we are right 100% of the scnRNA independent IESs could be methylated

Raw numbers of PacBio consensus (CCS) per sample and category

IES retention

  • IESs are sometimes retained in the MAC

 

P(R)

1 - P(R)

  • Quantification: "IES Retention Score" (IRS)
    • MIRET : Cyril Denby Wilkes, Olivier Arnaiz, Linda Sperling 2016, eg:

5 reads IES -

1 read IES +

2 reads IES +

$$IRS_L = \frac{2}{2+5} \approx 27\%$$

$$IRS_R = \frac{1}{1+5} \approx 16\%$$

The higher the IRS, the higher the retention.

Danger !

IES retention

Pitfall

e.g

MIC = 4n, MAC = 800n, R = 1/200 = 0.005 , N = 100 NGS reads

$$\mathbb{E}(IRS)= 0$$

$$P(MIC|IES+) = 50\%$$

No !

Even a low IRS can be problematic for us !

When the N is small (~100), it's just impossible to see small retention levels

Due to the MAC ploidy, even the slightest retention leads to $$P(MIC|IES^+) < P(MAC|IES^+)$$

Let's just keep all IESs with an IRS = 0 ?

IES retention

Proposed approach

??

  1. Possible to compute R for each IES
    • Use it to compute P(MIC|IES+)
  2. Give a confidence interval to both R and P(MIC|IES+)

IES retention

Comparison / Benchmarks

$$P(MIC|IES+) \in [9.5\% - 93.7\%], \alpha = 5\%$$

Problem : The size of confidence intervals is very big

  • e.g 150 NGS reads, 0 IES+

For most IESs, we will simply not be able to tell wether it comes from the MAC or the MIC

IES retention

Workaround : Pooling samples

Rare picture of Eric, doing some archeology to find more samples to pool and gain coverage (circa 2022, colorized)

  • 12 vegetative samples
  • Coverage reached : 1160 - 1660X !
  • Implicit assumption : R is mostly IES-dependent

Mac ploidy : 800n ?

If MAC ploidy = 800n than without retention :

$$E(IRS) = \frac{4}{800+4} \approx 0.005$$

If retention :

$$E(IRS) >> \frac{4}{800+4}$$

Something is odd !

0.002-0.003

Mac ploidy ... At least 1600n ?

Mac ploidy ... At least 1600n ?

For our calculus, we used K = 1600n

Numerical application

When using :

  • Kmac = 1600n
  • 12 vegetative samples to quantify R and P(MIC|IES+)
  • ...

On average, we will still have only very vague estimates of P(MIC|IES+) !

On average on all the IESs without retention :

$$R = 0 \implies P(MIC|IES^+) \in [33\% - 100\%]$$

CONCLUSION :

Numerical application

On a fake dataset (simulated retention)

Real retention R (simulated)

P(MIC|IES+)

Summary

Can we identify the MIC sequences ?

  • Very hard to have a correct estimate of P(MIC|IES+)
  • When we will, it will mostly be for heavily retained IESs anyway
  • With even the slightest retention :

     
  • Only works for IESs, not other MIC
  • Only works when the ploidy is well caracterized

> We should expect lots of our MIC data to be impossible to use

$$R > 0 \implies P(MIC|IES^+) \approx 0$$

The guiding thread

  1. Identify MIC vs MAC molecules


     
  2. DNA methylation detection :
    • Implementation and test of an analysis pipeline
  3. 6mA MAC
  4. 6mA MIC

(+ Found a new way to quantify IES retention)

(+ re-estimated the MAC ploidy)

DNA methylation analysis pipeline

A few months of plumbing later...

Outputs scores :

"ipdRatio": 2, 1, 4...

"ModificationQv" : 0,  30,  25...

"identificationQv": 1,  0, 50...

When do we call a nucleotide methylated ?

Pipeline output

Using E. coli DNA

  • Nearly 100% 6mA (symmetrical):
    • GATC +++
    • EcoK
    • A few others
      • Depends a lot of the strain

  • Nearly 0% 6mA :
    • Everything else

E.coli is used to feed paramecium (contaminants)

> Can be used to benchmark our pipeline's ability to detect 6mA

ipdRatio in E. coli

  • The nucleotides we expect to be methylated have a high ipdRatio
  • Slight changes between motifs
  • Some exceptions : really not-methylated ?

Motif effect

Coverage

L. methylated

L. unmethylated

> 50

35-50

25-35

15-25

0-15

ipdRatio in E. coli

Coverage effect

How to binarize the ipdRatio ?

Either a nucleotide is methylated, or it is not :

  • We need to use a threshold on the ipdRatio to call modified nucleotides
  • This threshold has to take account of the coverage effect
  • No optimal solution anyway

Our pragmatical solution : An arbitrary linear threshold

Benchmark (6mA)

  • ~92% of 6mA in EcoK and GATC
  • ~99.8% of non-6mA elsewhere

If we make the simplification that all GATC/EcoK sites are methylated and that 6mA is only present there   :

$$Sensitivity = P(D|M)$$

$$Se  = 92\%$$

But :

  • Some GATC/EcoK are unmethylated
    • The real Se is actually better than 92%
  •  A few amount of 6mA outside of GATC/EcoK site
    • The real Sp is actually better than 99.8%
  • Se = 92% and Sp = 99.8% are worst case estimates

$$Specificity = P(\overline{D}|\overline{M})$$

$$Sp = 99.8\%$$

Remark on hemi-methylation

We can easily have more than 50% of so-called hemi-methylated sites that are actually not hemi-methylated

Quantifying hemi-methylation is tricky if $$Se < 100\%\ and\ Sp\ < 100\%$$

Benchmark

(other modifications)

PacBio sequencing was already known for its propensity to generate false positives for 4mC (K. O’Brown et al. 2014)

 

Qv30

  • Either false positives, or a Nature paper
  • "Killer experiment": See with amplified DNA
  • For now, we ignored it

Analysis Pipeline

summary

Great to detect 6mA

Other ??

Potential problems when quantifying hemi-methylation

The guiding thread

  1. Identify MIC vs MAC molecules


     
  2. DNA methylation detection :
    • Implementation and test of an analysis pipeline
  3. 6mA MAC
  4. 6mA MIC

(+ Found a new way to quantify IES retention)

(+ re-estimated the MAC ploidy)

DNA modifications

In the MAC

Hallmarks in HTVEG


• Between 1.25% and 1.45% of 6mA in the MAC
• Between 97.39 and 100% of them are located in
AT sites

Taking account of the uncertainty of Se and Sp :

Problem : Some results will vary greatly depending on Se and Sp !

The 4 scenarii for Se and Sp

  • We don't care about Se and Sp
  • Only thing that matters : Does it impact the result ?
Se Sp Interpretation Scenario Number
100% 100% Perfect 1
 
100% 99.8% Sometimes it misses 2
 
92% 100% Sometimes it invents 3
 
92% 99.8% A few confusions here and there 4

Global level of 6mA in the MAC

 > Implied in the bulk of 6mA in the MAC

Reduction of 6mA

P. tetraurelia

Mitochondrial

Total BET

-80 to -90% NM proteins

-60 to 70% MT proteins

We expected -90%

Get only -50% ??

Less than predicted by southwestern blot

Role of our candidates

Raise of hemi-methylation, whose intensity depends importantly on how well Se and Sp are well estimated or not

Fraction of AT sites that are hemi-methylated

Role of our candidates

Role of our candidates

The capacity to make symmetrical methylation is never abolished completely

Role of our candidates

De novo methylation of unmethylated AT sites :

Unchanged NM4.

Drops everywhere else

Proposed interpretation/Hypothesis

Role of our candidates

All weakly implied except NM4 (not implied)

All strongly implied

Never abolished

Function = Symmetry +++

Outside of AT sites

Predicted FDR : 100%

But likely detection outside of AT sites too

never erased

 

AGAA and GAGG motif
are documented as methylated sites (6mA) in C. el-
egans too (Greer et al. 2015)

The guiding thread

  1. Identify MIC vs MAC molecules


     
  2. DNA methylation detection :
    • Implementation and test of an analysis pipeline
  3. 6mA MAC
  4. 6mA MIC

(+ Found a new way to quantify IES retention)

(+ re-estimated the MAC ploidy)

DNA modifications

In the MIC

A ruthless data shrinkage

Number of molecules with at least one exploitable adenine

- several IESs

- variable MAC regions

- extremity outliers

Is our calculus valid for several IESs ?

Remaining with computable P(MIC|IES+)

P(MIC|IES+) among the surviving molecules

The vast majority of IES+ molecules come actually... From the MAC !!!

A few MAC molecules concentrate the majority of 6mA

A few MAC molecules concentrate the majority of 6mA (2/2)

We can never exclude the hypothesis that 6mA comes from the MAC

Conclusion on the MIC

  • We cannot confirm of invalidate the hypothesis that there is 6mA in the MIC
  • Therefore, we cannot conclude about the role of 6mA as a permanent marker of IESs in the MIC of P. tetraurelia

Summary

Methodological developments :

  • Analysis pipeline
  • Retention / P(MIC|IES+)
  • MAC ploidy > 1600n
  • [ Hemi-methylation ]

 

On our MAC data :

  • Methylase candidates = active in the bulk of methylation in the MAC
  • Symmetrical methylation of AT sites + a few de novo hemi-methylation
  • AGAA / GAGG outside of AT sites

 

On our MIC data:

We cannot exclude the hypothesis that all 6mA comes from the MAC

Perspectives

  • P(MIC|IES+) when > 1 IES ?
    • Could invalidate the hypothesis definitely
  • Deeper analysis of HT2 and HT6
    • P(MIC|IES+) is irrelevant here
    • Still possible to find 6mA or other modifications around the IES very transiently
  • Check 6mA / TSS
  • Hemi-methylation :
    • TSS ?
    • Maintained through replication ?

 

On these data :

For the future :

  • MIC purification +++
  • Nanopore ?
  • Partial purification ?
  • The random sampling is doomed to fail in P. tetraurelia

MERCI !

Team Meyer

Team Genovesio

SPIBENS

Les infos et admin.

Jury: Chunlong, Sandra, Eric, Laurent

Comité de thèse: Linda, Mireille, Chunlong

La communauté paramécie

Les petites mains derrière les données

Mes collègues de pause

Toi public

 

Mentions spéciales contributions directes :

Eric Meyer

France Rose

Mathieu Bahin

O. Arnaiz

Leandro

MTA1 -- orthologue 4-9-10

MTA9 -- Pas catalytique chez Tetrahymena

[..] --> MT1A1B2

Pas 6mA tetrahymena MIC

MTA1 -- orthologue 4-9-10

MTA9 -- Pas catalytique chez Tetrahymena

[..] --> MT1A1B2

Pas 6mA tetrahymena MIC

The 4 scenarii for Se and Sp

  • We don't care about Se and Sp
     
  • We care about the fact that eventual mis-estimations of them doesn't really change anything

Finding unbiased estimators for    and FDR

If p number of positive detections among N tests:

 

p = FP + TP

$$\pi$$

So,

Which means

And:

What it gives in Paramecium

Methodological development to correct hemi-methylation detection (1/2)

Let FD1 and FD2 be resp:

  • Fraction of hemi-methylated AT sites
  • Fraction of sym-methylated AT sites

    PZ0, PZ1, PZ2:  unbiased estimators of non, hémi, symetrically methylated AT sites

Then:

With

Methodological development to correct hemi-methylation detection (2/2)

We can also find the number of hemi-methylated sites being detected as such, and the proportion of sites detected as hemi-methylated that are really hemi-methylated. This is possible because we now approximately know PZ0, PZ1 and PZ2, and P(D|Z) is easy to determine:

Then, P(Z|D) can be determined through Bayes theorem using P(D|Z), P(Z) and P(D) (which are all known)

P(Z=1|D=1)

is our case of interest

What it gives in Paramecium

2.1 Retroingineering

The capping of IPDs


 

  • modelPrediction is the predicted IPD value by the model in a given context of nucleotides at this position

  • globalIPD is the mean of all the IPD values of the read.

  • localIPD represents all IPDs that have been mapped at a given position in the genome, including those from other sequences

Conclusion on the capping

  • Isn't coded as advertized by PacBio
  • The way it's implemented for AggSN is problematic and doesn't really make sense
  • Paradoxally, it should be more relevant for our approach than for the default one
  • We expect no methylation to be undetected due to the capping

 

Laura landwebehr 2020

Oxytrichia trifallax

A outAT score 20 isQv20 (812 seq)

A outAT score20 idQv20 + Strong BH correction (176 seq)

PhD defense - DDay

By biocompibens

PhD defense - DDay

28/02/19

  • 103