Studying DNA-6mA in Paramecium tetraurelia with PacBio sequencing

Guillaume DELEVOYE

E. Meyer Lab

P. tetraurelia: Genomic architecture

Unicellular eucaryote with 3 nuclei:

2xMIC nuclei (2n)
- Germline nucleus
- Contains: TE & IES
  - No transcription outside meiosis
1xMAC nucleus (up to 800n)
- Somatic nucleus
- Amplified and "fixed" version of the MIC
- Free from TE and IES
- Transcriptionnally active

DNA ratio: 1 MIC for 200 MAC

+ DIfficult to purify the MIC DNA

Profiling the IESs

Non-coding
Excised after sexual processes
Remnant of TE ?
Most recent IES are very short (< 27bp), and are the majority of IESs

45.000 Unique sequences

How are IESs recognized ?

(1/2)

100% TA-Bounded
Weak consensus TAYAG
- Degenerated TC1-Mariner TE insertion site
- Not sufficient
Periodic size distribution

30% of IESs only are small-ncRNA dependant (shown by DICER-like2-3 silencing)

What about the majority remaining ?

How are IES recognized ? (2/2)

The sc-RNA pathway

IES recognition:

The DNA methylation hypothesis

6mA likely to be abundant in Paramecium:

Suspected 2.5% in the MAC of P. aurelia by Cummings et Al (1975)
- Also documented 6mA in the MIC
Detected methylation by SMRT in the MAC in our lab
Detection by SMRT in the MAC by Sandra Duharcourt's lab
Detection in Oxytricha by L. Landweber et Al (2019)
no 5mC, 4mC in the MAC a priori

2) In the new forming MAC

1) Constant pattern in the MIC

Transcient ?

PacBio SMRT principle

Methylation analysis needs local coverage> 25X

Kinetic signature of n6mA
InterPulse Duration (IPD)
Compared
- either to WGA
- or Trained model (ML - "in-sillico control") - SImulation from context of nucleotides

Reads up to 80 kbp

We use the isn-sillico model

ENLEVER 10-20 KB

Methylase candidates

Candidates:

7 "DPPW motif IV" genes
- NM4, NM9, NM10
Another family: MT1a, MT1b, MT2

2) PacBio sequencing with short inserts | 6mA ++

Also at hand:

Vegetative WT cells (HTVEG)
Vegetative cells with a silencing of a control gene (MAB)
Cells that started autogamy (HT2 - 2 hour* , HT6 - 6 hour*)

1) Grouped silencings by sequence homology

* t0 is defined as the time where 50% of cells entered meiosis

Step 1: Consensus

99% accuracy

Max

75% accuracy

Because our inserts are circular and shorts, we can make CCS of high accuracy despite a 15% error-rate in the subreads

Changer : 15% d'erreur

Step 2: Sorting (1/2)

Deduced origin

MIC DNA

Alignment of consensus

Step 2: Sorting (2/2)

Quite rare
When it happens, can be amplified up to 400x
We addressed it by removing separating "True IESs" and "others"
Filtering based on the number of reads mapped on each IES, in 4 replicated vegetative samples

A few words about IES retention in the MAC:

ADDRESSED UP TO 400X --> NON

Step 3 : DNA methylation analysis

Single-molecule
Single-nucleotide resolution
- "SMsn" approach
Independant yet pairable analysis on both strands

Some more details:

the random sampling approach

Deduced origin

MIC DNA

Alignment of consensus

Only a few remaining: ~ 10 to ~200 sequences

100% should carry a methylation pattern in our first work hypothesis

1 out of 200 comes from the MIC
1/6 of MIC inserts will carry an IES
1/2 of IESs will be removed by our safety filtering
- (Eliminate potential retention)
~7-10% of the remnants are small RNA dependant (and most of them are only partly dependant)

PLOIDIE : Pas ~100% mais 99.5%

préciser proportion de séquences

Final categories

MDS
- = "MAC" for 99% of sequences
MAC_IES
- "true" MAC_IES (never seens retained)
- other MAC_IES (sometimes retained)
MIC
- Other MIC specific (TE, repeated sequences)
MAC (TA Junction)
- Overlap a TA junction of excision
rDNA
mtDNA
Other:
- Contaminants
- Alternative excision boundaries ("LOWID")
- low identity consensus ("Trash")

"genome"

Total Paramecium

Total sequencing

Orders of magnitude

1M holes available in the sequencer
~30% will have enough subreads to generate a CCS

At the end of the day:

100K to 250K reads with decent and exploitable consensus

~ 1.000 pure MIC + ~ 100-200 MIC with IES

Among the remainings we also find:

Misligations and chimeric reads
Adapter missed by the sequencer
Exceptionnaly low identity subreads
Contaminant, etc

MDS and MAC_TA_Junction represent a vast majority

MIC specific and IES are as rare as expected

+ Applying the retention filter divides the number of "MAC_IES" approximately by 50%

Detecting m6A in details

Detection of n6mA

"positive detection"

score > linear thershold, function of coverage

How good is it ??

Définir QV

ipdRatio

Detecting m6A optimally

H0: ipdRatio of umethylated-Adenines

H1: ipdRatio of 6mA

S: Threshold on the pvalue

--> Specificity

--> Sensitivity

$$\alpha$$

$$1 - \beta $$

Adenine

6mA

ln(ipdRatio) ~ N(0,1)

log(ipdRatio)

A pvalue is just the likelihood that log(ipdRatio) is in the tail of H0 when H0 is true

How log(ipdRatios) look in E.coli

Separability raises with coverage, which is expected

PacBio pvalues do have some biological meaning

PacBio pvalues do have a meaning...

...but are not ideally distributed

Ideal pvalues

--> Allows magic !

PacBio's

Allow estimation of
Allows optimal, adaptative FDR control

$$\pi_{0}, \pi_{1}$$

Just don't
Obvious point n°1:
- log(ipdRatio < 1) -> -inf

Normality assumptions under H0 are broken under high coverages

The higher the coverage, the worse (here, >40x)
Will cause bell-shaped pvalues under null
- Hidden phenomena on the previous curve

Normality assumption matters

All Adenines' pvalues [E.coli] coverage > 40X

Other PBio's scores for m6A

For n6mA, PacBio produces:

A modification score --> Slowing of the polymerase
- pvalue against H0 only (the one we presented earlier)
An identification score --> Kinetic signature of a modification
- loglikelihood between H0 and H1 (H1 = Other modifications or secondary peaks)

They are PHRED-transformed p-values of two different statistical tests, that rely on the mean of the IPDs

PHRED scores

The scores (Qv) are PHRED-transformed p-values

Typical covscore plot

Modification score / coverage

Using flat threshold on modification score = Hudge lack of power

Using E.coli as ground truth

E.coli:

Feeds paramecium: contaminants+++
In theory
- GATC
- EcoK
- Few others
- ... All contain n6mA on both strands
Outside of GATC and EcoK: very low levels of m6A

Long story short

How good (or bad) is our method ?

$$Se = P(D^+|M) ~ 92\%$$

$$Sp = P(D^-|NM) ~ 99.9\%$$

...But only If we suppose that:

All GATC and EcoK carry symmetrical n6mA
No other motif contains n6mA
GATC/EcoK will display similar scores as other motifs (--> +/- True)

How to be sure about our Se and Sp ?

We can't
We don't need to

The 4 scenarii for Se and Sp

We don't care about Se and Sp
We care about the fact that eventual mis-estimations of them doesn't really change anything

Finding unbiased estimators for and FDR

If p number of positive detections among N tests:

p = FP + TP

$$\pi$$

So,

Which means

And:

What it gives in Paramecium

Debiased levels of m6A in the MAC_TA inserts

Details in MAC HTVEG

~95% of the methylation locates in AT dinucleotides in the MAC
- True in any condition
75% of the methylation in an AT dinucleotide is actually symetrically modified

Kept in MIC and MAC

All conditions

95% --> C'est faux ?

MAC_TA outside of AT sites

Detections outside AT sites are likely to be at least partly true positives

Present in all samples

Never erased

Conclusion: We are either in scenario 1 or 3

(Sp largely underestimated)

This allows us to announce...

It is very, very likely that 1.4 to 1.6% of Adenines in the MAC are n6mA
~0.1% of the adenines located outside of AT sites might be n6mA too

The problem of hemi-methylation

How can we estimate P(M=i | D=j) ?

Methodological development to correct hemi-methylation detection

Let FD1 and FD2 be resp:

Fraction of hemi-methylated AT sites
Fraction of sym-methylated AT sites

PZ0, PZ1, PZ2: unbiased estimators of non, hémi, symetrically methylated AT sites

Then:

With

Methodological development to correct hemi-methylation detection (2/2)

We can also find the number of hemi-methylated sites being detected as such, and the proportion of sites detected as hemi-methylated that are really hemi-methylated. This is possible because we now approximately know PZ0, PZ1 and PZ2, and P(D|Z) is easy to determine:

Then, P(Z|D) can be determined through Bayes theorem using P(D|Z), P(Z) and P(D) (which are all known)

P(Z=1|D=1)

is our case of interest

What it gives in Paramecium

Debiased Hemi-methylation in MAC_TA

No FP

Some n6mA are missed

Perfect detections

Same conclusion !

No FP

Some n6mA are missed

Perfect detections

What about IESs ?

MAC IES (scenario 1 - perfection)

~ 0.4% in the vegetative MIC

MAC IES (scenario 3)

A few missed
No FP

~ 0.4% in the vegetative MIC

Should we get excited or not about those 0.4% ?

The repartition of n6mA accross molecules

Number n6mA

If we look at some of them:

* Located in AT

* Symetrical, except in NM4-9-10

Some/Many sequences are probably comming from the MAC

However,

It's still possible that Scan-RNA IES are 100% methylated, even if all other sequences with n6mA actually come from the MAC

About pure MIC sequences ?

Still ongoing

Conclusion

In the vegetative MAC

Confirmation of previous studies
New robust motif outside AT sites

Effects of NM4-9-10 on hemi-methylation in the MAC :

This tells an interesting story about the NM family, and the possible deposition mechanisms of n6mA

In the vegetative MIC

Impossible for now to know whether there is n6mA or not
If there is, it must be far lesser than in the MAC

During autogamy

We didn't even start to investigate

Perspectives

I have plenty perspectives in mind we can discuss, but I'd be glad to hear your ideas about it too :)

Additional

Title Text

Subtitle

Title Text

Subtitle

Title Text

Subtitle

Title Text

MAC IES (scenario 4)

MAC IES (scenario 3)

n6mA in the MIC (preliminary)

Pure MIC sequences: Cannot yet be trusted (error in the pipeline)

Coverage >= 40X made us lose too many materials, should restart with >=20X

2.1 Retroingineering

The capping of IPDs

modelPrediction is the predicted IPD value by the model in a given context of nucleotides at this position
globalIPD is the mean of all the IPD values of the read.
localIPD represents all IPDs that have been mapped at a given position in the genome, including those from other sequences

Conclusion on the capping

Isn't coded as advertized by PacBio
The way it's implemented for AggSN is problematic and doesn't really make sense
Paradoxally, it should be more relevant for our approach than for the default one
We expect no methylation to be undetected due to the capping