Role of DNA methylation in the genomic rearrangements in P. tetraurelia

PhD Defense

DELEVOYE Guillaume

09/06/2022

Supervisor : Dr MEYER Eric

Jury members : Dr DUHARCOURT Sandra, Dr CHEN Chunlong, Dr DURET Laurent

Introduction

Part 1 : Transposable Elements (TEs)

1948 : Jumping genes

Barbara McClintock discovers the Ac/Ds elements in maize

 

Nobel prize (1983)

 Ds jumps in presence of Ac (non-autonomous)

Historical classification

(Finnegan)

"Copy paste" versus "Cut and paste"

Historical understanding

Junk/Selfish DNA

TEs were first considered as :

  • "Selfish DNA"
    • They do not perform any "function" for their host
  • Neutral or mildly deleterious
    • That is, evolutionary burdens
 

TEs are conserved not because they increase the fitness of their host, but despite the fact that they do not. This is non-phenotypic selection.

Doolittle and Sapienza; Orgel and Crick (1980)

Both papers were published the same day in Nature

Modern taxonomy

Wicker et al. 2007

  • 3 classes
  • 9 orders
  • 29 superfamilies

Based on:

  • Mechanistic / Enzymatic criteria
  • Structural data

The cut-paste versus copy-paste comparison turned out to be less relevant over time

TEs are ubiquitous

 

Modern understanding

  • TEs generate selection :
    •  Purifying:           growth rate = Genome-wide invasion
    •  Adaptive: exon shuffling, transcription regulation ++,
        exaptation (e.g. Rag1 & Rag2 in mammals), etc.
  • Are transferred vertically (++) and horizontally (HTT)
    • Several thousand known HTT events
    • The P-element invaded Drosophila worldwide in less than 100 years !
  • Are present in all cellular organisms
  • Have probably existed since approximately the beginning of cellular life

 

The paradigm is currently shifting from "Junk DNA" to "Major actors of evolution".

$$\sim2^N$$

Regulation mechanisms

> Any particularly vulnerable organism has been wiped out in the past

> Logical conclusion :  All remaining life forms have some kind of resilience towards TEs.

> All virulent TEs have wiped out their host (and disappeared with them)

Many epigenetic regulations exist :

  • 5-methylcytosine (5mC) silences TEs in H. sapiens
  • 6mA in Drosophila
  • piRNA in animals
  • 5mC and H1 Histone methylation in Arabidopsis
  • ...

P. tetraurelia has an original way of dealing with TEs

Introduction

Part 2 : P. tetraurelia

Studying ciliates

A 350-year-old story

A. van Leeuwenhoek (1668)

"Animalcules"

Pasteur (1862)

Spontaneous generation

HS Jennings (~ 1900)

Paramecium as a model

T. Sonneborn (1937)

Non-Mendelian inheritance of mating type in Paramecium

Carol Greider & Elizabeth Blackburn

Telomeres (1985) - Nobel prize

Meyer and Duharcourt (2014)

Mating type is inherited via maternal RNAs in Paramecium

There is more...

  • First known organisms that do not use the "universal" genetic code
    • Paramecium (Caron and Meyer 1985)
    • Tetrahymena (Preer et al. 1985)
  • Histone acetyltransferases (HAT)
    • Tetrahymena (Brownell et al. 1996)
  • Self-splicing introns (ribozymes)
    • Tetrahymena
  • Tubulin post-translational modifications

> The genome-wide programmed rearrangements <

P. tetraurelia: Nuclear dimorphism

Unicellular eukaryote with 3 nuclei:

  • 2xMIC nuclei (2n)
    • Germline nucleus
    • Contains: TEs + 49,260 Internal Eliminated Sequences (IES)
    • No transcription outside meiosis
  • 1xMAC nucleus (up to 800n)
    • Somatic nucleus
    • Amplified version of the MIC
    • Free from TEs and IESs
    • Transcriptionally active
    • Rebuilt from a MIC after each sexual process, under the control of the maternal MAC (hence frequent non-Mendelian inheritance)

DNA ratio: 1 MIC for 200 MAC
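
One way to read this ratio, assuming both MICs are diploid (2n) and taking the MAC at 800n as listed above:

$$\frac{MIC\ DNA}{MAC\ DNA} = \frac{2 \times 2n}{800n} = \frac{1}{200}$$

Under the same reading, a MAC ploidy closer to 1600n would bring the ratio down to about 1/400.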

Programmed rearrangements

A new MAC is formed from a MIC, with extensive genome rearrangements

The resulting MAC DNA is made almost purely of coding sequences

Coyne et al. 2012

Profiling IESs (1/2)

  • Non-coding
  • Excised after sexual processes with single-nucleotide precision
  • Relationship with TEs :
    • IESs = Degenerate remnants of Tc1/Mariner elements
    • Excised by a domesticated PiggyBac transposase (Pgm)
  • Most recent IESs are very short (< 27 bp) and make up the majority of IESs

 

49,260 unique sequences

O. Arnaiz et al. 2012

PiggyMac (Pgm)

Profiling IESs (2/2)

  • 100% TA-bounded
  • Weak consensus TAYAG
    • Degenerate Tc1/Mariner TE insertion site
  • Periodic size distribution

O. Arnaiz et al. 2012

This is not sufficient for the cell to distinguish IESs from the rest of the genome

IES recognition: scnRNA pathway

E. Allen and M. Nowacki - 2017

Research question

The research question in a nutshell

  • Only ~30% of IESs are small-ncRNA dependent
    • What about the remaining majority ?
  • How does the cell recognize IESs ?
    • Especially the scnRNA-independent ones
    • ~ Self vs non-self recognition
  • MIC/MAC ploidy ratio = Challenge +++

 

 

 

 

 

The DNA methylation hypothesis

6mA abundant in Paramecium:

 

  • 2.5% in the MAC and MIC of P. aurelia (Cummings et al. 1975)

  • Detection by SMRT in the MAC (Hardy et al.)
  • Detection in Oxytricha by L. Landweber et al. (2019)
     
  • no 5mC, 4mC in the MAC a priori

 

1) Constant pattern in the MIC

2) In the newly forming MAC

Transient ?

The DNA methylation hypothesis

And many other possibilities...

Methylase candidates

  • In 2015, DAMT-1 in C. elegans (6mA) - Greer et al.
  • MT-A70 domain of DAMT-1 identified in P. tetraurelia too
    • Silenced by RNA interference (grouped by homology)
    • Sometimes, reduction of 6mA (southwestern blot)

| Silencing | Objective | Target of interest | Location of interest |
|---|---|---|---|
| None | WT methylation (MIC and MAC) | 6mA + ? | MIC and MAC |
| Control gene | Control | 6mA + ? | MIC and MAC |
| None | Pattern right before the excision ? | 6mA + ? | Newly forming MAC |
| None | Pattern right before the excision ? | 6mA + ? | Newly forming MAC |
| NM4 | Bulk of 6mA | 6mA | MAC ++ |
| NM9 + NM10 | Bulk of 6mA | 6mA | MAC ++ |
| NM4 + NM9 + NM10 | Bulk of 6mA | 6mA | MAC ++ |
| MT1A | Permanent pattern erased ? | ? | MIC |
| MT1A + MT1B | Permanent pattern erased ? | ? | MIC |
| MT1A + MT1B + MT2 | Permanent pattern erased ? | ? | MIC |

> Sequenced with PacBio SMSN sequencing

Objectives and overview

  1. Code the analysis pipeline
    • Benchmark it (E. coli)
  2. Sort what comes physically from the MAC and MIC
    • Take account of IES retention
    • Re-estimation of the MAC ploidy ~ O(1600n)
  3. Analyze the DNA modifications in the MAC
    • Statistical pitfall when studying hemi-methylation
  4. Analyze the DNA modifications in the MIC
  5. Conclude on the role of our methylase candidates
  6. Conclude on the role of DNA methylation in the recognition of the small-ncRNA independent IESs

Methods vs Application

Methodological results

Part I : Analysis pipeline

PacBio SMSN sequencing

$$ipdRatio= \frac{MeanIPD_{experiment}}{MeanIPD_{unmethylated\ control}}$$

~ 85% accuracy

~ 100% accuracy

Slowing around modified nucleotides (~ time x100)

Relevant only if ≥ 25 IPDs
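
The ratio and its coverage condition can be sketched as follows. This is a minimal illustration, assuming the IPDs of one position are available as a plain list of floats; the function name `ipd_ratio` and the `min_ipds` parameter are hypothetical and not part of any PacBio tool.

```python
from statistics import mean
from typing import Optional, Sequence

MIN_IPDS = 25  # condition quoted on the slide: relevant only if >= 25 IPDs


def ipd_ratio(sample_ipds: Sequence[float],
              control_mean_ipd: float,
              min_ipds: int = MIN_IPDS) -> Optional[float]:
    """Return mean(sample IPDs) / control mean IPD for one position.

    Returns None when fewer than `min_ipds` IPDs are available or when the
    control value is not strictly positive (ratio undefined or unreliable).
    """
    if len(sample_ipds) < min_ipds or control_mean_ipd <= 0:
        return None
    return mean(sample_ipds) / control_mean_ipd
```

Returning `None` rather than a number keeps under-covered positions out of any downstream thresholding.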

Global principle

PacBio SMSN sequencing

Expected output

An analysis for each nucleotide, on each strand, of each molecule (SMSN = Single-Molecule Single Nucleotide)

Possible detection : 4mC, 5mC, 6mA, "other"

SMSN-seq is like mixing milk and cereal : there are two ways of doing it

$$ipdRatio= \frac{MeanIPD_{experiment}}{MeanIPD_{control}}$$

A) Control = Whole Genome Amplified (WGA)  DNA

  • Real-world values
  • Analysis pipeline exists   (J. Beaulaurier et al. 2015)

B) Control = Machine-learning (nucleotide context)

  • Invented values
  • No pipeline exists

 

Then

First

a.k.a The best way ™

Then

First

A few months of plumbing later...

Using E. coli DNA

  • Nearly 100% 6mA (symmetrical):
    • GATC +++
    • EcoK
    • A few others
      • Depends a lot on the strain
         
  • Nearly 0% 6mA :
    • Everything else

E. coli is used to feed Paramecium (so it is present as a contaminant)

ipdRatio in E. coli (1/2)

  • The nucleotides we expect to be methylated have a high ipdRatio
  • Some exceptions : false negatives or truly unmethylated ?

ipdRatio in E. coli (2/2)

Separability and coverage are correlated

How to binarize the ipdRatio ?

Either a nucleotide is methylated, or it is not :

  • We need to use a threshold on the ipdRatio to call modified nucleotides
  • This threshold has to take account of the coverage effect
  • No optimal solution anyway

Our pragmatic solution : An arbitrary linear threshold
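
A coverage-dependent linear threshold could look like the sketch below. The coefficients are placeholders chosen only so the example runs; the actual intercept and slope (including the sign of the slope) are not given here and are assumptions, as is the function name.

```python
def is_called_modified(ipd_ratio: float, coverage: int,
                       intercept: float, slope: float) -> bool:
    """Call a nucleotide modified when its ipdRatio exceeds a threshold
    that varies linearly with coverage: intercept + slope * coverage."""
    threshold = intercept + slope * coverage
    return ipd_ratio > threshold


# Placeholder coefficients for illustration only (not the values of the thesis).
EXAMPLE_INTERCEPT = 4.0
EXAMPLE_SLOPE = -0.01

# (ipdRatio, coverage) pairs for three hypothetical nucleotides.
calls = [is_called_modified(r, c, EXAMPLE_INTERCEPT, EXAMPLE_SLOPE)
         for r, c in [(5.2, 40), (1.1, 40), (3.7, 100)]]
```

Keeping the intercept and slope as explicit arguments makes the arbitrariness of the choice visible and easy to vary in a sensitivity analysis.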

Benchmark (6mA)

  • ~92% of 6mA in EcoK and GATC
  • ~99.8% of non-6mA elsewhere

If we make the simplification that all GATC/EcoK sites are methylated and that 6mA is only present there (D : detected as 6mA, M : truly methylated) :

$$Sensitivity = P(D|M)$$

$$Se  = 92\%$$

$$Specificity = P(\overline{D}|\overline{M})$$

$$Sp = 99.8\%$$

But :

  • Maybe some GATC/EcoK sites are truly unmethylated
    • In this case, the real Se is actually better than 92%
  • Maybe there is a small amount of 6mA outside of GATC/EcoK sites
    • In this case, the real Sp is actually better than 99.8%
  • Se = 92% and Sp = 99.8% are worst-case estimates
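
As a small sketch of how Se and Sp follow from benchmark counts: the counts below are invented to reproduce the quoted 92% and 99.8%, they are not the real E. coli numbers, and the function name is hypothetical.

```python
from typing import Tuple


def sensitivity_specificity(tp: int, fn: int, tn: int, fp: int) -> Tuple[float, float]:
    """Se = P(D|M) = TP / (TP + FN); Sp = P(not D | not M) = TN / (TN + FP)."""
    se = tp / (tp + fn)
    sp = tn / (tn + fp)
    return se, sp


# Invented counts: 1,000 adenines in GATC/EcoK sites (assumed all methylated)
# and 10,000 adenines elsewhere (assumed all unmethylated).
se, sp = sensitivity_specificity(tp=920, fn=80, tn=9980, fp=20)
```

Under the simplification above, every undetected GATC/EcoK adenine counts as a false negative and every detection elsewhere as a false positive, which is why both figures are worst-case estimates.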

Benchmark

(other modifications)

PacBio sequencing was already known for its propensity to generate false positives for 4mC (K. O’Brown et al. 2014)

 

  • It is very likely that most of these detections are false positives

Objectives and overview

  1. Code the analysis pipeline
    • Benchmark it (E. coli)
  2. Sort what comes physically from the MAC and MIC
    • Take account of IES retention
    • Re-estimation of the MAC ploidy ~ O(1600n)
  3. Analyze the DNA modifications in the MAC
    • An important statistical pitfall when studying hemi-methylation
  4. Analyze the DNA modifications in the MIC
  5. Conclude on the role of our methylase candidates
  6. Conclude on the role of DNA methylation in the recognition of the small-ncRNA independent IESs

Methods vs Application

Methodological results

Part II : Sorting + IES retention

The random sampling strategy

[Figure: MIC molecule carrying IESs and other MIC-specific sequences between MAC-Destined Sequences (MDS); MAC molecule where each excised IES leaves a TA junction]

  • We work on total DNA :
    • ~ 1 molecule out of 200 comes from the MIC
    • Sometimes the physical origin of the molecule can be guessed
    • MIC regions that are far from the IESs and the MIC-specific sequences cannot be studied

Expected number of IES+ sequences

  • 1 molecule out of 200 comes from the MIC
    • A bit less due to contaminants
  • ~ 1/6 of MIC inserts will carry an IES
  • # of PacBio consensus sequences (CCS) per sample :
    • > 150,000 (multiplexed)
    • ~ 350,000 (not multiplexed)

Orders of magnitude : ~100 to 300 IES+ sequences per experiment

  • We got 49 to 310
  • If our hypothesis is true, 100% of them are methylated
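
The order-of-magnitude estimate can be reproduced with a few lines. This sketch only multiplies the fractions quoted on the slide (1 molecule in 200 from the MIC, about 1 MIC insert in 6 carrying an IES); the function name is hypothetical and contaminants are ignored.

```python
def expected_ies_plus(n_ccs: int,
                      mic_fraction: float = 1 / 200,
                      ies_fraction: float = 1 / 6) -> float:
    """Expected number of IES+ sequences among n_ccs consensus sequences."""
    return n_ccs * mic_fraction * ies_fraction


multiplexed = expected_ies_plus(150_000)      # about 125
not_multiplexed = expected_ies_plus(350_000)  # about 292
```

Both values fall in the ~100 to 300 range, and contaminants push the real numbers slightly lower.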

IES retention

Methodological results

Part III : The MAC ploidy

Is the MAC ploidy 800n ?

MAC ploidy

The 4 scenarios for Se and Sp

  • The exact values of Se and Sp are not what matters
     
  • What matters is that possible mis-estimations of them do not really change the conclusions

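Finding unbiased estimators for π and FDR

Let π be the true fraction of methylated sites. If p is the number of positive detections among N tests:

p = FP + TP

So, in expectation,

$$\frac{p}{N} = Se \cdot \pi + (1 - Sp)(1 - \pi)$$

Which means

$$\hat{\pi} = \frac{\frac{p}{N} - (1 - Sp)}{Se + Sp - 1}$$

And:

$$\widehat{FDR} = \frac{FP}{p} = \frac{(1 - Sp)(1 - \hat{\pi})\,N}{p}$$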
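
A minimal sketch of such estimators, assuming the usual correction of an observed positive rate by known Se and Sp (often called the Rogan-Gladen estimator). The function name and the example counts are hypothetical; the clipping to [0, 1] is a practical guard and, strictly speaking, breaks unbiasedness at the boundaries.

```python
from typing import Tuple


def estimate_pi_and_fdr(p: int, n: int, se: float, sp: float) -> Tuple[float, float]:
    """Estimate the true methylated fraction (pi) and the FDR.

    Uses E[p/N] = Se*pi + (1-Sp)*(1-pi), solved for pi, then
    FDR = expected false positives / observed positives.
    """
    if n <= 0 or p <= 0 or p > n:
        raise ValueError("need 0 < p <= n")
    informativeness = se + sp - 1  # must be > 0 for the test to carry information
    if informativeness <= 0:
        raise ValueError("Se + Sp must exceed 1")
    pi_hat = (p / n - (1 - sp)) / informativeness
    pi_hat = min(max(pi_hat, 0.0), 1.0)  # guard against sampling noise
    expected_fp = (1 - sp) * (1 - pi_hat) * n
    fdr_hat = min(expected_fp / p, 1.0)
    return pi_hat, fdr_hat


# Hypothetical example: 1,118 detections among 100,000 adenines.
pi_hat, fdr_hat = estimate_pi_and_fdr(p=1118, n=100_000, se=0.92, sp=0.998)
```

With these invented counts the estimate is about 1% methylated sites and an FDR of roughly 18%, which illustrates how a small false-positive rate still weighs heavily when true positives are rare.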

Results in Paramecium

Methodological development to correct hemi-methylation detection (1/2)

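Let FD1 and FD2 be, respectively:

  • Fraction of AT sites detected as hemi-methylated
  • Fraction of AT sites detected as symmetrically methylated

    PZ0, PZ1, PZ2: unbiased estimators of the fractions of non-, hemi- and symmetrically methylated AT sites

Then:

$$FD_d = \sum_{z=0}^{2} P(D=d|Z=z)\,PZ_z \qquad (d = 1, 2)$$

With

$$PZ_0 + PZ_1 + PZ_2 = 1$$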

Methodological development to correct hemi-methylation detection (2/2)

We can also find the number of hemi-methylated sites detected as such, and the proportion of sites detected as hemi-methylated that really are hemi-methylated. This is possible because we now approximately know PZ0, PZ1 and PZ2, and P(D|Z) is easy to determine from Se and Sp.

Then, P(Z|D) can be determined through Bayes' theorem using P(D|Z), P(Z) and P(D) (which are all known):

$$P(Z|D) = \frac{P(D|Z)\,P(Z)}{P(D)}$$

P(Z=1|D=1) is our case of interest
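
The whole correction can be sketched as below. This is an illustration under a strong assumption that is not stated on the slides: each strand of an AT site is called independently, with the same Se and Sp. Z is the true number of methylated strands (0, 1, 2) and D the detected number. All function names and the example fractions are hypothetical.

```python
from typing import List, Sequence


def detection_matrix(se: float, sp: float) -> List[List[float]]:
    """M[d][z] = P(D=d | Z=z), assuming independent calls on the two strands."""
    return [
        [sp * sp, (1 - se) * sp, (1 - se) ** 2],
        [2 * sp * (1 - sp), se * sp + (1 - se) * (1 - sp), 2 * se * (1 - se)],
        [(1 - sp) ** 2, se * (1 - sp), se * se],
    ]


def _det3(m: Sequence[Sequence[float]]) -> float:
    """Determinant of a 3x3 matrix."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def corrected_fractions(observed: Sequence[float], se: float, sp: float) -> List[float]:
    """Solve observed = M * true for (PZ0, PZ1, PZ2) with Cramer's rule."""
    m = detection_matrix(se, sp)
    det = _det3(m)
    if abs(det) < 1e-12:
        raise ValueError("detection matrix is singular")
    fractions = []
    for z in range(3):
        replaced = [list(row) for row in m]
        for d in range(3):
            replaced[d][z] = observed[d]
        fractions.append(_det3(replaced) / det)
    return fractions


def posterior_hemi(observed: Sequence[float], se: float, sp: float) -> float:
    """P(Z=1 | D=1) = P(D=1|Z=1) * P(Z=1) / P(D=1) (Bayes' theorem)."""
    m = detection_matrix(se, sp)
    pz = corrected_fractions(observed, se, sp)
    return m[1][1] * pz[1] / observed[1]


# Hypothetical true fractions: 90% unmethylated, 2% hemi, 8% symmetric.
SE, SP = 0.92, 0.998
true_pz = [0.90, 0.02, 0.08]
m = detection_matrix(SE, SP)
observed = [sum(m[d][z] * true_pz[z] for z in range(3)) for d in range(3)]
recovered = corrected_fractions(observed, SE, SP)
hemi_posterior = posterior_hemi(observed, SE, SP)
```

With these invented fractions, only about 54% of the sites detected as hemi-methylated really are hemi-methylated: most of the rest are symmetric sites with one missed strand, which is the pitfall this correction addresses.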

Results in Paramecium

2.1 Reverse engineering

The capping of IPDs


 

  • modelPrediction is the IPD value predicted by the model for the nucleotide context at this position

  • globalIPD is the mean of all the IPD values of the read.

  • localIPD represents all IPDs that have been mapped at a given position in the genome, including those from other sequences

Conclusion on the capping

  • It is not coded as advertised by PacBio
  • The way it is implemented for AggSN is problematic and does not really make sense
  • Paradoxically, it should be more relevant for our approach than for the default one
  • We do not expect any methylation to go undetected because of the capping

 

Laura Landweber 2020

Oxytricha trifallax

A outAT score 20 isQv20 (812 seq)

A outAT score20 idQv20 + Strong BH correction (176 seq)
