Bahin Mathieu
Bioinformatics
Co-encadrant
Eric Meyer - PI
Biology - Parameciology - Precisiology
Suzanne Marques (PhD student), Veronique Tanty (IE - PhD student),
Sophie Malinski (MCU),
Marc Désir (Intern)
Eric's Team:
Unicellular eukaryote with 3 nuclei:
DNA ratio: 1 MIC for 200 MAC
+ Difficult to purify the MIC DNA
Up to 0.3 mm!
The canonical excision is variable and sometimes excision errors happen
Arnaiz O et al. The Paramecium Germline Genome Provides a Niche for Intragenic Parasitic DNA -2012
Unlike in other ciliates, many IESs are located inside coding sequences in Paramecium
Suppression of TE/IES in the MAC: Avoids the negative effect of TE and IES
Requires a very precise excision
~100% PGM-dependent excision
PiggyMac
IES
10% of IESs are scanRNA-dependent.
30% of IESs are small-RNA-dependent (shown by DICER-like silencing)
One big question remains: How does the cell recognize the scanRNA-independent IESs? --> vital issue for the cell
2 independent hypotheses:
n6mA is one of the 3 most frequent known methylations in DNA
6mA likely to be abundant in Paramecium:
Suspected 2.5% in P. aurelia by Cummings et al. (1975)
Other DNA modifications are also candidates for our hypothesis
25x 250x 25x
RNA silencing
Candidates:
99% accuracy
Max
75% accuracy
Methylation analysis needs local coverage > 25X
Reads up to 80 kbp
Pooled analysis of different molecules (default - Aggregation)
Analyze every molecule independently from the others
(Needs shorter inserts - multiple passes)
Beaulaurier et al. 2015
Within the total DNA:
SMSN implemented on Github by J. Beaulaurier
But:
Using SMSN in our case is mandatory but also new, and requires developing our own tools
--> "true_smrt" Python package
PacBio sequencing
Wild type:
Silencing of methylase candidates:
Also: AggSN of PGM32
Many categories:
IES retained 1% of times in the MAC
=
2x more likely to come from the MAC than from the MIC when fished
(Reminder: MAC/MIC = 200:1)
To avoid this
I call the sequences that passed this filter "TRUE_MAC_IES"
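The 2:1 figure above follows from simple arithmetic, sketched here in Python. This is only an illustration: the function name is hypothetical, and it assumes that every MIC copy carries the IES while 1% of MAC copies retain it.

```python
def mac_to_mic_odds(mac_retention, mac_per_mic):
    """Odds that an IES-containing molecule comes from the MAC rather than the MIC.

    Assumption (illustrative): every MIC copy carries the IES, and a fraction
    `mac_retention` of the MAC copies still carries it.
    """
    mac_copies_with_ies = mac_retention * mac_per_mic  # MAC copies per single MIC copy
    mic_copies_with_ies = 1.0
    return mac_copies_with_ies / mic_copies_with_ies


odds = mac_to_mic_odds(mac_retention=0.01, mac_per_mic=200)
print(f"MAC:MIC odds = {odds:.1f}:1")
```

With 1% retention and a 200:1 MAC/MIC DNA ratio, a fished IES-containing molecule is about twice as likely to be a MAC molecule as a MIC one, which is why a retention filter is needed.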
Inputs:
Outputs:
+ Applying the retention filter reduces the number of "MAC_IES" by approximately 50%
Only works for >99% accuracy inserts
We consider it done ! :)
Main problems encountered:
After rigorous checking, I concluded that the capping as implemented was not a problem for our SMSN approach
Recently, I have coded ipdtools. It predicts:
Problems that we solved:
"smrtrue"
Private Github repository
For n6mA, PacBio produces:
The scores are PHRED-transformed p-values
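As a reminder of what a PHRED transformation means, here is a minimal sketch assuming the standard definition, score = -10 * log10(p). The function names are hypothetical and are not taken from the PacBio tools.

```python
import math


def phred_score(p_value):
    """Standard PHRED transform of a p-value: -10 * log10(p)."""
    return -10.0 * math.log10(p_value)


def p_value_from_phred(score):
    """Inverse transform: recover the p-value from a PHRED-scaled score."""
    return 10.0 ** (-score / 10.0)


for p in (0.1, 0.01, 0.001):
    print(f"p = {p} -> score = {phred_score(p):.0f}")
```

Under this definition a score of 20 corresponds to a p-value of 0.01 and a score of 30 to 0.001.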
Typical covscore plot
Modification score / coverage
Using a flat threshold on the modification score = Huge lack of power
We implemented two thresholds that are functions of the coverage:
In E. coli, from the literature:
1 to 2% DNA-n6mA
Almost 100% of the methylation is located in the GATC sites
Some methylation outside GATC is maintained by specific motif-driven enzymes
Almost all GATC are symmetrically methylated
We could verify it:
Analyzing E. coli data should reassure us and give us insight into our specificity and sensitivity
Identification assessment
Methylation outside GATC sites
Modification assessment: Score20
Methylation outside GATC sites
Modification assessment: Linear equation
What we should take care of when analyzing Paramecium's data
We measure methylation in E. coli
We have a not-so-bad test:
99% Sensitivity = P(D | M)
99% Specificity = P(ND | NM)
Our global fraction of n6mA among all adenines is around 1%
1.98%
~50%
Half of our detections will be false positive
P(A|B) != P(B|A) unless P(A) = P(B)
Ex: Here we measure 1.98% for a real proportion of 1%
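The numbers on this slide can be reproduced with a few lines of Python. This is a minimal sketch of the Bayes computation; the function name is hypothetical, and the inputs are the 99% sensitivity, 99% specificity and 1% true n6mA fraction quoted above.

```python
def detection_stats(sensitivity, specificity, prevalence):
    """Return (fraction of sites called methylated, fraction of calls that are false)."""
    true_pos = sensitivity * prevalence                    # P(D and M)
    false_pos = (1.0 - specificity) * (1.0 - prevalence)   # P(D and NM)
    detected = true_pos + false_pos                        # P(D)
    false_discovery = false_pos / detected                 # P(NM | D)
    return detected, false_discovery


detected, fdr = detection_stats(sensitivity=0.99, specificity=0.99, prevalence=0.01)
print(f"measured fraction: {detected:.2%}, false discoveries: {fdr:.0%}")
```

With these inputs the measured fraction is 1.98% and half of the detections are false positives, even though both sensitivity and specificity are 99%.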
Let's push it a little further...
In E. coli, the literature tells us that we can have:
What happens then ?
Preliminary data show that:
Which means:
For equal scores/p-values, we don't have the same probability of making a mistake when calling an adenine methylated, depending on whether or not it is located in an AT with one modification already clearly identified
Beaulaurier 2015/2018:
In an ideal world:
p-value
SMSN n6mA testing
Prior knowledge of FDR
Assessed with Se, Sp and priors from the literature
FDR control based on A priori knowledge
+
Likelihood ratio between different hypotheses
???
At last!
~95% of the methylation locates in AT dinucleotides in the MAC
slightly lower in the MIC (5 to 5 points less)
True in any condition
75% of the methylation in an AT dinucleotide is actually symmetrically modified, independently of being in the MAC or the MIC
Kept in MIC and MAC
All conditions
Kept in the MAC for all experimental conditions
Impossible to tell in the MIC (not enough sequences)
Logo from HTVEG MAC
Identical everywhere
Our tools can give relevant results
Interesting results:
Coming next: Compare the methylation of the scanRNA-independent IESs with that of the scanRNA-dependent IESs
Short-term
Mid-term
Later
Sorry for the headaches !! Thanks for your time :)
modelPrediction is the predicted IPD value by the model in a given context of nucleotides at this position
globalIPD is the mean of all the IPD values of the read.
localIPD represents all IPDs that have been mapped at a given position in the genome, including those from other sequences
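To make these three quantities concrete, here is a toy sketch in Python. All numbers and function names are hypothetical. The ratio shown (observed mean IPD divided by the model prediction) is the usual way an ipdRatio is formed, and the sketch does not reproduce the capping logic itself.

```python
from statistics import mean


def global_ipd(read_ipds):
    """globalIPD: mean of all IPD values of one read."""
    return mean(read_ipds)


def local_ipd_mean(ipds_at_position):
    """Mean of the localIPD values, i.e. all IPDs mapped at one genome position."""
    return mean(ipds_at_position)


def ipd_ratio(observed_ipd, model_prediction):
    """Observed IPD relative to the IPD predicted by the model for this context."""
    return observed_ipd / model_prediction


# Hypothetical toy values, in arbitrary units
read_ipds = [0.5, 0.7, 2.4, 0.4]     # IPDs along one read
ipds_at_position = [2.4, 2.0, 2.2]   # IPDs from several reads at one position
model_prediction = 0.55              # model value for this nucleotide context

print(f"globalIPD = {global_ipd(read_ipds):.2f}")
print(f"ipdRatio  = {ipd_ratio(local_ipd_mean(ipds_at_position), model_prediction):.2f}")
```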
Conclusion on the capping
IESs (1 & 2)
MAC (1 & 2)
Logo from HTVEG MAC
Identical everywhere
Laura Landweber 2020
Oxytricha trifallax
Modest variations in the silencing
Logo from HTVEG MAC
Identical everywhere
What interests us most
We didn't find any difference between experiments
The old MAC (maternal) drives everything during the formation of the new one:
= Non-Mendelian / Cytoplasmic heredity
A outAT score20 idQv20 (812 seq)
A outAT score20 idQv20 + Strong BH correction (176 seq)
ipdRatio out GATC before filtering BH vs after (qv20/idqv20)
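For reference, the BH (Benjamini-Hochberg) correction mentioned above can be sketched as follows. This is the textbook step-up procedure with a hypothetical function name and toy p-values. The exact settings behind the "strong" correction are not given here, so the alpha value is only an assumption.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices sorted by p-value
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Step-up rule: keep the largest rank whose p-value is under its threshold
        if p_values[idx] <= alpha * rank / m:
            max_rank = rank
    rejected = [False] * m
    for idx in order[:max_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values (for PHRED-like scores, p = 10 ** (-score / 10))
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
calls = benjamini_hochberg(p_values, alpha=0.05)
print(sum(calls), "of", len(calls), "sites kept after BH correction")
```

Lowering alpha makes the correction stricter, so fewer sequences survive the filter.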