Guillaume DELEVOYE
E. Meyer Lab
Unicellular eukaryote with 3 nuclei:
DNA ratio: 1 MIC for 200 MAC
+ Difficult to purify the MIC DNA
45,000 unique sequences
Only 30% of IESs are small-ncRNA dependent (shown by DICER-like2-3 silencing)
6mA likely to be abundant in Paramecium:
Suspected to be 2.5% in the MAC of P. aurelia by Cummings et al. (1975)
Also documented 6mA in the MIC
1) Constant pattern in the MIC
2) In the newly forming MAC
Transient?
Methylation analysis needs local coverage > 25X
Reads up to 80 kbp
REMOVE 10-20 kb
Candidates:
2) PacBio sequencing with short inserts | 6mA ++
Also at hand:
1) Grouped silencings by sequence homology
* t0 is defined as the time at which 50% of cells have entered meiosis
99% accuracy
Max
75% accuracy
Because our inserts are circular and short, we can build high-accuracy CCS reads despite a 15% error rate in the subreads
Change: 15% error rate
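To give a feel for why many passes over a short circular insert can compensate for noisy subreads, here is a minimal sketch. It assumes a plain majority vote over independent passes and treats every error as pointing to the same wrong base (a pessimistic simplification). The real CCS algorithm is more elaborate, and the function name `consensus_error` is hypothetical.

```python
from math import comb

def consensus_error(n_passes: int, subread_error: float = 0.15) -> float:
    """Probability that a simple majority vote over n_passes is wrong.

    Assumption: passes are independent and each is wrong with
    probability subread_error (15% is the figure quoted in the text).
    """
    if n_passes < 1 or n_passes % 2 == 0:
        raise ValueError("use a positive, odd number of passes to avoid ties")
    k_min = n_passes // 2 + 1  # smallest number of wrong passes that wins the vote
    return sum(
        comb(n_passes, k)
        * subread_error ** k
        * (1 - subread_error) ** (n_passes - k)
        for k in range(k_min, n_passes + 1)
    )

if __name__ == "__main__":
    for n in (1, 3, 5, 15):
        print(n, consensus_error(n))
```

Under these assumptions the consensus error falls quickly as the number of passes grows, which is the intuition behind using short inserts.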
Deduced origin
MIC DNA
Alignment of consensus
TA
TA
TA
ADDRESSED UP TO 400X --> NO
Only a few remaining: ~ 10 to ~200 sequences
Under our first working hypothesis, 100% should carry a methylation pattern
PLOIDY: not ~100% but 99.5%
Specify the proportion of sequences
"genome"
Total Paramecium
Total sequencing
At the end of the day:
100K to 250K reads with decent and exploitable consensus
~1,000 pure MIC + ~100-200 MIC with IES
Among the remaining reads we also find:
+ Applying the retention filter reduces the number of "MAC_IES" by approximately 50%
"positive detection"
=
score > linear threshold (a function of coverage)
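The detection rule above can be sketched as follows. The slope and intercept are placeholders chosen only for illustration; the values actually used are not given here, and the function name `is_positive` is hypothetical.

```python
def is_positive(score: float, coverage: float,
                slope: float = 0.5, intercept: float = 20.0) -> bool:
    """Call a site positive when its modification score exceeds a
    threshold that grows linearly with local coverage.

    slope and intercept are hypothetical placeholder values.
    """
    threshold = slope * coverage + intercept
    return score > threshold

if __name__ == "__main__":
    # The same score can be a detection at low coverage and a rejection at high coverage.
    print(is_positive(45.0, coverage=40.0))   # threshold is 40.0
    print(is_positive(45.0, coverage=100.0))  # threshold is 70.0
```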
How good is it?
Define QV
ipdRatio
H0: ipdRatio of unmethylated adenines
H1: ipdRatio of 6mA
S: Threshold on the p-value
--> Specificity
--> Sensitivity
$$\alpha$$
$$1 - \beta $$
Adenine
6mA
ln(ipdRatio) ~ N(0,1)
log(ipdRatio)
A p-value is just the probability, when H0 is true, of observing a log(ipdRatio) at least as far in the tail of H0 as the one measured
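As a small illustration of that definition, the sketch below computes a one-sided upper-tail p-value under the idealised H0 stated above, ln(ipdRatio) ~ N(0,1). The one-sided choice is an assumption (methylation is expected to increase the IPD), and `pvalue_h0` is a hypothetical name, not PacBio's implementation.

```python
import math

def pvalue_h0(ipd_ratio: float) -> float:
    """Upper-tail p-value of ln(ipdRatio) under H0: ln(ipdRatio) ~ N(0, 1)."""
    if ipd_ratio <= 0:
        raise ValueError("ipdRatio must be strictly positive")
    z = math.log(ipd_ratio)
    # Survival function of the standard normal, written with erfc.
    return 0.5 * math.erfc(z / math.sqrt(2.0))

if __name__ == "__main__":
    for ratio in (1.0, 2.0, 5.0):
        print(ratio, pvalue_h0(ratio))
```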
Separability increases with coverage, as expected
...but are not ideally distributed
Ideal pvalues
--> Allows magic !
PacBio's
$$\pi_{0}, \pi_{1}$$
p-values of all adenines [E. coli], coverage > 40X
For n6mA, PacBio produces:
They are PHRED-transformed p-values from two different statistical tests, both of which rely on the mean of the IPDs
The scores (Qv) are PHRED-transformed p-values
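For reference, the PHRED transform maps a p-value p to Q = -10 log10(p), so Qv 20 corresponds to p = 0.01. A minimal sketch (the function names are illustrative):

```python
import math

def phred(p: float) -> float:
    """PHRED-transform a p-value: Q = -10 * log10(p)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -10.0 * math.log10(p)

def pvalue_from_phred(q: float) -> float:
    """Inverse transform: p = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)

if __name__ == "__main__":
    print(phred(0.01))            # Qv for p = 0.01
    print(pvalue_from_phred(30))  # p-value behind Qv 30
```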
Typical covscore plot
Modification score / coverage
Using a flat threshold on the modification score = huge lack of power
E. coli:
How good (or bad) is our method?
$$Se = P(D^+ \mid M) \approx 92\%$$
$$Sp = P(D^- \mid NM) \approx 99.9\%$$
...but only if we suppose that:
If p is the number of positive detections among N tests:
p = FP + TP
So,
Which means
And:
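One standard way to carry this decomposition through, given here as a sketch rather than as the derivation used in this work: if M of the N tested adenines are truly methylated, the expected number of positives is p = Se·M + (1 - Sp)·(N - M), which can be solved for M. The function name and the numbers in the usage example are hypothetical.

```python
def estimate_methylated(p: float, n: float, se: float, sp: float) -> dict:
    """Solve p = Se*M + (1 - Sp)*(N - M) for M, then split p into TP and FP.

    Assumption: Se and Sp are known and identical for every tested site.
    """
    youden = se + sp - 1.0
    if youden <= 0:
        raise ValueError("Se + Sp must exceed 1 for the estimate to be defined")
    m = (p - (1.0 - sp) * n) / youden
    m = min(max(m, 0.0), float(n))  # clamp to the feasible range [0, N]
    tp = se * m                     # expected true positives
    fp = p - tp                     # the rest of the detections are false positives
    fdr = fp / p if p else 0.0
    return {"M": m, "TP": tp, "FP": fp, "FDR": fdr}

if __name__ == "__main__":
    # Hypothetical numbers: 10,190 detections among 1,000,000 tested adenines.
    print(estimate_methylated(10190, 1_000_000, se=0.92, sp=0.999))
```

Even with Sp close to 99.9%, false positives can make up a noticeable share of the detections when true sites are rare, which is why the assumptions behind Se and Sp matter.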
~95% of the methylation locates in AT dinucleotides in the MAC
True in any condition
75% of the methylation in AT dinucleotides is actually symmetrical (both strands modified)
Kept in MIC and MAC
All conditions
95% --> Is this wrong?
Present in all samples
Never erased
Conclusion: We are either in scenario 1 or 3
(Sp largely underestimated)
~0.1% of the adenines located outside of AT sites might be n6mA too
How can we estimate P(M=i | D=j) ?
Let FD1 and FD2 be, respectively:
Then:
With
We can also find the number of hemi-methylated sites being detected as such, and the proportion of sites detected as hemi-methylated that are really hemi-methylated. This is possible because we now approximately know PZ0, PZ1 and PZ2, and P(D|Z) is easy to determine:
Then, P(Z|D) can be determined through Bayes' theorem using P(D|Z), P(Z) and P(D) (which are all known)
P(Z=1|D=1)
is our case of interest
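The Bayes step can be sketched as below. Here Z is the true number of methylated strands at an AT site (0, 1 or 2) and D the number of strands detected as methylated. The sketch assumes the two strands are detected independently, each with the same Se and Sp; the prior values in the usage example are placeholders, not the estimated PZ0, PZ1 and PZ2.

```python
from math import comb

def likelihood_table(se: float, sp: float) -> dict:
    """P(D=d | Z=z) for d, z in {0, 1, 2}, assuming independent strands."""
    fpr = 1 - sp
    table = {}
    for z in (0, 1, 2):
        for d in (0, 1, 2):
            total = 0.0
            # tp: detections among the z methylated strands,
            # fp: detections among the 2 - z unmethylated strands.
            for tp in range(z + 1):
                fp = d - tp
                if 0 <= fp <= 2 - z:
                    total += (comb(z, tp) * se ** tp * (1 - se) ** (z - tp)
                              * comb(2 - z, fp) * fpr ** fp
                              * (1 - fpr) ** (2 - z - fp))
            table[(d, z)] = total
    return table

def posterior(d: int, prior: list, se: float, sp: float) -> list:
    """P(Z=z | D=d) for z = 0, 1, 2 via Bayes' theorem."""
    lik = likelihood_table(se, sp)
    joint = [lik[(d, z)] * prior[z] for z in (0, 1, 2)]
    p_d = sum(joint)  # P(D=d), the normalising constant
    return [j / p_d for j in joint]

if __name__ == "__main__":
    prior = [0.98, 0.005, 0.015]  # hypothetical PZ0, PZ1, PZ2
    # posterior(...)[1] is P(Z=1 | D=1), the case of interest.
    print(posterior(1, prior, se=0.92, sp=0.999))
```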
No FP
Some n6mA are missed
Perfect detections
~ 0.4% in the vegetative MIC
Number of n6mA
If we look at some of them:
* Located in AT
* Symmetrical, except in NM4-9-10
It's still possible that Scan-RNA IES are 100% methylated, even if all other sequences with n6mA actually come from the MAC
Still ongoing
In the vegetative MAC
Effects of NM4-9-10 on hemi-methylation in the MAC:
In the vegetative MIC
During autogamy
We have not even started to investigate
I have plenty of perspectives in mind that we can discuss, but I'd be glad to hear your ideas about it too :)
Pure MIC sequences: Cannot yet be trusted (error in the pipeline)
Requiring coverage >= 40X made us lose too much material; we should restart with >= 20X
modelPrediction is the IPD value predicted by the model for the nucleotide context at this position
globalIPD is the mean of all the IPD values of the read.
localIPD represents all IPDs that have been mapped at a given position in the genome, including those from other sequences
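To show how these three quantities could fit together, here is a rough sketch. It assumes that each read's IPDs are first divided by that read's globalIPD, that the normalised localIPD values at a position are optionally capped, and that ipdRatio is their mean divided by modelPrediction. The cap value and the function names are hypothetical, and the actual pipeline may differ.

```python
from statistics import mean

def normalise(read_ipds: list) -> list:
    """Divide every IPD of a read by that read's mean IPD (its globalIPD)."""
    global_ipd = mean(read_ipds)
    return [x / global_ipd for x in read_ipds]

def ipd_ratio(local_ipds: list, model_prediction: float, cap=None) -> float:
    """Mean of the (optionally capped) IPDs mapped at one position,
    divided by the model's predicted IPD for that sequence context."""
    if cap is not None:
        # Capping limits the influence of rare, very long polymerase pauses.
        local_ipds = [min(x, cap) for x in local_ipds]
    return mean(local_ipds) / model_prediction

if __name__ == "__main__":
    print(normalise([1.0, 3.0]))
    print(ipd_ratio([1, 2, 30], model_prediction=2.0))         # one long pause dominates
    print(ipd_ratio([1, 2, 30], model_prediction=2.0, cap=3))  # same data, capped
```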
Conclusion on the capping
Laura Landweber 2020
Oxytricha trifallax
A outside AT, score 20, idQv20 (812 seq)
A outside AT, score 20, idQv20 + strong BH correction (176 seq)
ipdRatio out GATC before filtering BH vs after (qv20/idqv20)
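For readers unfamiliar with the BH correction mentioned above, here is a minimal sketch of the Benjamini-Hochberg step-up procedure. The alpha level is a placeholder, and this is a generic textbook version rather than the exact filtering applied to these data.

```python
def benjamini_hochberg(pvalues: list, alpha: float = 0.05) -> list:
    """Return one boolean per p-value: True if rejected at FDR level alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    k_max = 0
    for rank, i in enumerate(order, start=1):
        # Step-up rule: keep the largest rank whose p-value passes its threshold.
        if pvalues[i] <= rank / m * alpha:
            k_max = rank
    rejected = [False] * m
    for i in order[:k_max]:
        rejected[i] = True
    return rejected

if __name__ == "__main__":
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.5]))
```

Because the rule is step-up, a p-value that fails its own threshold is still rejected when a larger p-value further down the ranking passes.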