Eric Meyer - Mathieu BAHIN
Bahin Mathieu
Bioinformatics
Eric Meyer
Biology - Parameciology
Suzanne Marques (PhD student), Veronique Tanty (IE - PhD student),
Sophie Malinski (MCU)
Eric's Team:
Unicellular eukaryote with 3 nuclei:
DNA ratio: 1 MIC for 200 MAC
+ Difficult to purify the MIC DNA
Up to 0.120 mm
IES
The canonical excision is variable and sometimes excision errors happen
Arnaiz O et al. (2012). The Paramecium Germline Genome Provides a Niche for Intragenic Parasitic DNA.
Unlike in other ciliates, many IESs are located inside coding sequences in Paramecium
Suppression of TE/IES in the MAC: Avoids the negative effect of TE and IES
Requires a very precise excision
~ 100% PGM-dependent excision
PiggyMac
IES
10% of IES are scanRNA-dependent.
30% of IES are small-RNA-dependent (shown by DICER-like silencing)
One big question remains: how does the cell recognize the scanRNA-independent IES? --> vital issue for the cell
2 independent hypotheses:
n6mA is one of the 3 most frequent known methylations in DNA
6mA likely to be abundant in Paramecium:
Suspected 2.5% in P. aurelia by Cummings et al. (1975)
25x 250x 25x
RNA silencing
Candidates:
99% accuracy
Max
75% accuracy
Methylation analysis needs local coverage > 25X
Reads up to 80 kbp
Pooled analysis of different molecules (default - Aggregation)
Analyze every molecule independently from the others
(Needs shorter inserts - multiple passes)
Beaulaurier et al. 2015
SMsn
1:200 (MIC/MAC)
Random sampling
PacBio SMRT
150K to 300K inserts of 350bp
80kb reads (> 100X)
Only a few remaining: ~ 10 to ~50 sequences
100% should carry a methylation pattern
PacBio sequencing
Wild type:
Silencing of methylase candidates:
Also: AggSN of PGM32
Deduced origin
MIC DNA
Alignment of consensus
IES retained 1% of the time in the MAC
=
2x more likely to come from the MAC than from the MIC when fished
(Reminder: MAC/MIC = 200:1)
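The arithmetic behind the "2x" figure can be written out. This is a minimal sketch, assuming reads are sampled in proportion to DNA amount and that every MIC copy carries the IES; the function and parameter names are illustrative and do not come from the pipeline.

```python
def ies_origin_odds(mac_to_mic_ratio=200.0, mac_retention=0.01, mic_retention=1.0):
    """Return the MAC:MIC odds for a fished sequence that carries a given IES.

    Assumption: the abundance of IES-carrying molecules from each nucleus is
    (relative DNA amount) x (fraction of copies that still carry the IES).
    """
    mac_weight = mac_to_mic_ratio * mac_retention  # 200 x 0.01 = 2
    mic_weight = 1.0 * mic_retention               # 1 x 1.00 = 1
    return mac_weight / mic_weight


def prob_mic_origin(mac_to_mic_ratio=200.0, mac_retention=0.01, mic_retention=1.0):
    """Probability that an IES-carrying sequence really comes from the MIC."""
    odds = ies_origin_odds(mac_to_mic_ratio, mac_retention, mic_retention)
    return 1.0 / (1.0 + odds)
```

With the slide's numbers the odds are 2:1 in favour of the MAC, so only about one IES-carrying sequence in three would be of MIC origin without a retention filter.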
To avoid this
I call the sequences that passed this filter "TRUE_MAC_IES"
Total "genome"
Total Paramecium
Total sequencing
Inputs:
Outputs:
Output example
+ Applying the retention filter reduces the number of "MAC_IES" by approximately 50%
% of Nuclear DNA without rDNA
% of total Paramecium DNA
For n6mA, PacBio produces:
The scores are PHRED-transformed p-values
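As a reminder of what the transformation means, here is a minimal sketch assuming the standard Phred definition Q = -10 * log10(p); the helper names are illustrative.

```python
import math


def phred_from_pvalue(p):
    """Convert a p-value into a Phred-scaled score (standard definition)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -10.0 * math.log10(p)


def pvalue_from_phred(score):
    """Convert a Phred-scaled score back into a p-value."""
    return 10.0 ** (-score / 10.0)
```

Under this definition, a score of 20 corresponds to a p-value of 0.01 and a score of 30 to 0.001.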
Typical covscore plot
Modification score / coverage
Using a flat threshold on the modification score = huge lack of power
We implemented two thresholds that are functions of the coverage:
We use our own pipeline for SMsn (in silico control):
Paramecium is fed with E. coli
GATC
CTAG
experiment
HT2          1
HTVEG      926
MAB        487
MT1A1B    1131
MT1A1B2    222
MT2        166
NM4        468
NM4910     138
NM910      357
HT2/6: Starved !
1.702% of A are in a GATC site
Nb GATC sites: 4672
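Both figures can be derived from a reference sequence along these lines. This is a minimal sketch, assuming adenines are counted on both strands (a T on the forward strand is an A on the reverse strand); whether the slide's percentage was computed this way is not stated, and the function name is illustrative.

```python
def gatc_stats(seq):
    """Return (number of GATC sites, fraction of adenines lying in a GATC site).

    Assumption: both strands are counted. GATC is palindromic, so each site
    contributes exactly one adenine per strand.
    """
    seq = seq.upper()
    n_sites = seq.count("GATC")                   # GATC cannot overlap itself
    n_adenines = seq.count("A") + seq.count("T")  # forward A + reverse-strand A
    n_a_in_gatc = 2 * n_sites
    fraction = n_a_in_gatc / n_adenines if n_adenines else 0.0
    return n_sites, fraction
```

For example, `gatc_stats("GATCAAT")` finds one site and five adenines over both strands, two of which sit in the GATC site.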
Number of consensus which mapped with >99% accuracy on a common strain reference genome
inside GATC
outside GATC
Only adenines with coverage>25X are considered
Coverage
Score
Identification assessment
Methylation outside GATC sites
Modification assessment: Score20
Methylation outside GATC sites
Modification assessment: Linear equation
FDR Lessons learned for later in Paramecium
We measure methylation in E. coli
We have a not-so-bad test:
99% Sensitivity = P(D | M)
99% Specificity = P(ND | NM)
Our global fraction of n6mA among all adenines is around 1%
1.98%
~50%
Half of our detections will be false positives
P(A|B) != P(B|A) unless P(A) = P(B)
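The 1.98% and ~50% figures follow directly from Bayes' theorem. This is a minimal sketch in plain Python; the function names are illustrative.

```python
def detection_rate(sensitivity, specificity, prevalence):
    """P(D): fraction of all adenines that the test calls modified."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos + false_pos


def false_discovery_rate(sensitivity, specificity, prevalence):
    """P(NM | D): fraction of detections that are not really modified."""
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return false_pos / detection_rate(sensitivity, specificity, prevalence)
```

With 99% sensitivity, 99% specificity and 1% of adenines modified, `detection_rate` gives 0.0198 and `false_discovery_rate` gives 0.5. Raising the prior (for example inside GATC sites in E. coli, where most adenines are expected to be methylated) makes the same test far more trustworthy.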
Let's push it a little further...
Bayes theorem
PyAgrum
Well-informed way
Blind way
?
Within a GATC site, how many are modified?
Ignoring the prior leads to bad conclusions
Preliminary data showed that:
Which means:
Integrating this prior is very important for us
Reminder:
In an ideal world:
p-value
SMSN n6mA testing
Prior knowledge of FDR
Assessed with Se, Sp and a prior from the literature
FDR control based on prior knowledge
+
Likelihood ratio between different hypotheses
Finally!
~95% of the methylation locates in AT dinucleotides in the MAC(*)
slightly lower in the MIC (5 to 5 points less)
True in any condition
75% of the methylation in an AT dinucleotide is actually symmetrically modified, regardless of whether it is in the MAC or the MIC(*)
Kept in MIC and MAC (All conditions)
(*) Linear equation / idQv20
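The two percentages above (share of methylation in AT dinucleotides, share of those that are symmetric) could be computed from a list of calls as follows. This is a minimal sketch, assuming calls are given as (0-based forward-strand position, strand) pairs; the data layout and function name are hypothetical, not the pipeline's.

```python
def at_methylation_summary(seq, methylated):
    """Return (fraction of calls in an AT dinucleotide, fraction of those symmetric).

    seq        : forward-strand sequence
    methylated : set of (position, strand) pairs, strand being "+" or "-"

    A "+" adenine at i is in 5'-AT-3' if seq[i:i+2] == "AT"; its partner is the
    "-" adenine at i + 1. A "-" adenine at i (a T on the forward strand) is in
    5'-AT-3' if seq[i-1:i+1] == "AT"; its partner is the "+" adenine at i - 1.
    """
    seq = seq.upper()
    in_at = 0
    symmetric = 0
    for pos, strand in methylated:
        if strand == "+":
            is_at = seq[pos:pos + 2] == "AT"
            partner = (pos + 1, "-")
        else:
            is_at = pos >= 1 and seq[pos - 1:pos + 1] == "AT"
            partner = (pos - 1, "+")
        if is_at:
            in_at += 1
            if partner in methylated:
                symmetric += 1
    total = len(methylated)
    frac_at = in_at / total if total else 0.0
    frac_symmetric = symmetric / in_at if in_at else 0.0
    return frac_at, frac_symmetric
```

Counting is done per methylated adenine, so a fully symmetric site contributes two calls to both the numerator and the denominator.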
Kept in the MAC for all experimental conditions
Impossible to tell in the MIC (not enough sequences)
In the MAC
42% GC
38% GC
Logo from HTVEG MAC
Identical everywhere
Short-term
Mid-term
Later
Sorry for the headaches! Thanks for your time :)
Thanks !
Output example
IESs (1 & 2)
MAC (1 & 2)
Logo from HTVEG MAC
Identical everywhere
Laura Landweber 2020
Oxytricha trifallax
Modest variations in the silencing
Logo from HTVEG MAC
Identical everywhere
What interests us most
We didn't find any difference between experiments
The old MAC (maternal) drives everything during the formation of the new one:
= Non-Mendelian / cytoplasmic heredity
A outAT score20 idQv20 (812 seq)
A outAT score20 idQv20 + Strong BH correction (176 seq)
ipdRatio out GATC before filtering BH vs after (qv20/idqv20)
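For reference, here is a minimal sketch of the step-up procedure, assuming "BH" stands for Benjamini-Hochberg. The alpha value is a placeholder, and the slides do not say what setting made the correction "strong".

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Standard Benjamini-Hochberg step-up: find the largest rank k such that
    p_(k) <= k / m * alpha, then reject the k smallest p-values.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * alpha:
            k_max = rank
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected
```

Since the scores are Phred-transformed p-values, they would first have to be converted back to p-values before being passed to such a function.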
After rigorous checking, I concluded that the capping as implemented is not a problem for our SMSN approach
modelPrediction is the IPD value predicted by the model for the nucleotide context at this position
globalIPD is the mean of all the IPD values of the read.
localIPD represents all IPDs that have been mapped at a given position in the genome, including those from other sequences
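To show how these quantities relate to an ipdRatio, here is a minimal sketch. It assumes the ratio is the read-normalised mean IPD at a position divided by modelPrediction, which is an assumption about how the pieces fit together and not necessarily what the pipeline computes. It does not reproduce the capping step, and the names are hypothetical.

```python
from statistics import mean


def ipd_ratio(ipds_at_position, global_ipd, model_prediction):
    """Ratio of the observed (read-normalised) IPD to the model's prediction.

    ipds_at_position : raw IPD values observed at one position (e.g. the
                       successive passes over a single molecule)
    global_ipd       : mean of all IPD values of the read (globalIPD)
    model_prediction : IPD predicted by the in silico model (modelPrediction)
    """
    if global_ipd <= 0 or model_prediction <= 0:
        raise ValueError("global_ipd and model_prediction must be positive")
    normalised = [ipd / global_ipd for ipd in ipds_at_position]
    return mean(normalised) / model_prediction
```

A ratio well above 1 means the polymerase paused longer than the model expects for an unmodified base in that context.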
Conclusion on the capping
Recently, I wrote ipdtools. It predicts:
Main problems encountered:
Problems that we solved:
"smrtrue"
Private GitHub repository
Within the total DNA:
SMSN implemented on GitHub by J. Beaulaurier
But:
Using SMSN in our case is mandatory but also new, and requires developing our own tools
--> "true_smrt" Python package