Eric Meyer - Mathieu BAHIN
Bahin Mathieu
Bioinformatics
Eric Meyer
Biology - Parameciology
Suzanne Marques (PhD student), Veronique Tanty (IE - PhD student),
Sophie Malinski (MCU)
Eric's Team:
Unicellular eukaryote with 3 nuclei:
DNA ratio: 1 MIC for 200 MAC
+ Difficult to purify the MIC DNA
Up to 0.120 mm
IES
Internal Eliminated Sequences
10% of IES are scanRNA-dependent.
30% of IES are small-RNA-dependent (shown by Dicer-like silencing)
One big question remains: How does the cell recognize the scanRNA-independent IES?
2 independent hypotheses:
n6mA is one of the 3 most frequent known methylations in DNA
6mA likely to be abundant in Paramecium:
Suspected 2.5% in P. aurelia by Cummings et al. (1975)
25x 250x 25x
RNA silencing
Candidates:
99% accuracy
Max
85% accuracy
Methylation analysis needs local coverage > 25X
Reads up to 80 kbp
Pooled analysis of different molecules (default - Aggregation)
Analyze every molecule independently from the others
(Needs shorter inserts - multiple passes)
Beaulaurier et al. 2015
Higher resolution
SMSN
1:200 (MIC/MAC)
Random sampling
PacBio SMRT
150K to 300K inserts of 350bp
80kb reads (> 100X)
Only a few remaining: ~ 10 to ~50 sequences
100% should carry a methylation pattern
PacBio sequencing
Wild type:
Silencing of methylase candidates:
Also: AggSN of PGM32
Deduced origin
MIC DNA
Alignment of consensus
"genome"
Total Paramecium
Total sequencing
Output example
+ Applying the retention filter reduces the number of "MAC_IES" by approximately 50%
% of total Paramecium DNA
(MIC, MAC, rDNA, mito)
Can we trust our technique?
We measure methylation in E. coli
We have a not-so-bad test:
99% Sensitivity = P(D | M)
99% Specificity = P(ND | NM)
Our global fraction of n6mA among all adenines is around 1%
1.98%
~50%
Half of our detections will be false positives
P(A|B) != P(B|A) unless P(A) = P(B)
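As a rough check of the figures above, the sketch below applies Bayes' rule to a test with 99% sensitivity and 99% specificity when about 1% of adenines are truly methylated. The function and variable names are illustrative assumptions, not part of our pipeline.

```python
def detection_rates(sensitivity, specificity, prevalence):
    """Return (fraction of sites detected, fraction of detections that are true)."""
    true_pos = sensitivity * prevalence                    # P(D | M) * P(M)
    false_pos = (1.0 - specificity) * (1.0 - prevalence)   # P(D | NM) * P(NM)
    detected = true_pos + false_pos                        # P(D)
    ppv = true_pos / detected                              # P(M | D)
    return detected, ppv


detected, ppv = detection_rates(sensitivity=0.99, specificity=0.99, prevalence=0.01)
print(f"Detected fraction: {detected:.2%}")
print(f"True detections:   {ppv:.0%}")
```

With these inputs the detected fraction is 1.98% and only about half of the detections are true, even though P(D | M) is 99%.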
To develop a medical test (e.g. cancer screening), we must already know who has cancer and who does not!
This is a prior
e.g.: from clinical knowledge, biological knowledge, biochemical knowledge, etc.
About Paramecium, preliminary data tend to show that:
About PacBio SMRT, best case scenario.
(Rough estimations,
Order of magnitude)
For n6mA, PacBio produces:
The scores are PHRED-transformed p-values
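For reference, the standard PHRED transform is Q = -10 * log10(p). The sketch below converts in both directions; the function names are illustrative assumptions and are not taken from any PacBio tool.

```python
import math


def phred_from_pvalue(p):
    """PHRED-scale a p-value: Q = -10 * log10(p)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -10.0 * math.log10(p)


def pvalue_from_phred(q):
    """Invert the PHRED transform: p = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)


for q in (10, 20, 30):
    print(f"score {q} <-> p-value {pvalue_from_phred(q):g}")
```

Under this definition a score of 20 corresponds to a p-value of 0.01, and each additional 10 points divides the p-value by 10.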
Typical covscore plot
Modification score / coverage
Using a flat threshold on the modification score = huge lack of power
We implemented two thresholds that are functions of the coverage:
Paramecium is fed with E. coli
GATC
CTAG
experiment   count
HT2          1
HTVEG        926
MAB          487
MT1A1B       1131
MT1A1B2      222
MT2          166
NM4          468
NM4910       138
NM910        357
HT2/6: Starved!
1.702% of A are in a GATC site
Nb GATC sites: 4672
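The two figures above can be obtained by simple counting. Here is a minimal sketch on a toy sequence, assuming that only the forward strand is scanned and that each GATC site contributes exactly one adenine on that strand; the function name and the toy sequence are illustrative and are not our data.

```python
def gatc_statistics(sequence):
    """Count GATC sites and the fraction of adenines that lie inside one.

    Only the given strand is considered (assumption). GATC cannot overlap
    with itself, so a plain non-overlapping count is exact.
    """
    seq = sequence.upper()
    n_sites = seq.count("GATC")
    n_adenines = seq.count("A")
    fraction = n_sites / n_adenines if n_adenines else 0.0
    return n_sites, fraction


n_sites, fraction = gatc_statistics("GATCAAGATCTTA")
print(f"Nb GATC sites: {n_sites}")
print(f"{fraction:.1%} of A are in a GATC site")
```

Because GATC is palindromic, the same site also carries one adenine on the reverse strand; a two-strand version would count adenines on both strands.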
Number of consensus sequences that mapped with >99% accuracy to a common strain reference genome
inside GATC
outside GATC
Only adenines with coverage > 25X are considered
Coverage
Score
Identification assessment
Methylation outside GATC sites
Modification assessment: Score20
Methylation outside GATC sites
Modification assessment: Linear equation
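To illustrate the difference between a flat score threshold and a coverage-dependent one, here is a minimal sketch. The linear form, the slope and the intercept are placeholders chosen for the example; they are NOT the coefficients used in this work. Only the coverage > 25X filter and the score of 20 come from the slides.

```python
MIN_COVERAGE = 25  # adenines with coverage <= 25X are not considered


def passes_flat_threshold(score, coverage, min_score=20.0):
    """Flat rule: the same score cutoff whatever the coverage."""
    return coverage > MIN_COVERAGE and score >= min_score


def passes_linear_threshold(score, coverage, slope=0.5, intercept=10.0):
    """Coverage-dependent rule: cutoff = intercept + slope * coverage.

    slope and intercept are hypothetical values for illustration only.
    """
    if coverage <= MIN_COVERAGE:
        return False
    return score >= intercept + slope * coverage


for score, coverage in [(40, 30), (40, 100), (100, 20)]:
    print(score, coverage,
          passes_flat_threshold(score, coverage),
          passes_linear_threshold(score, coverage))
```

With these placeholder values, a score of 40 passes at 30X (cutoff 25) but fails at 100X (cutoff 60), whereas the flat rule accepts both.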
Se and Sp remain more or less unknown
In an ideal world:
p-value
SMSN n6mA testing
Prior knowledge of FDR
Assessed with Se, Sp and a prior from the literature
FDR control based on prior knowledge
+
Likelihood ratio between different hypotheses
At last!
~95% of the methylation locates in AT dinucleotides in the MAC(*)
slightly lower in the MIC (5 to 5 points less)
True in any condition
75% of the methylation in an AT dinucleotide is actually symmetrically modified, independently of being in the MAC or the MIC(*)
Kept in MIC and MAC (All conditions)
(*) Linear equation / idQv20
Kept in the MAC for all experimental conditions
Impossible to tell in the MIC (not enough sequences)
In the MAC
42% GC
38% GC
Logo from HTVEG MAC
Identical everywhere
Many methodological advances:
On Paramecium:
Short-term
Medium-term
Later: Quit PacBio SMRT
Sorry for the headaches! Thanks for your time :)
Thanks!
Output example
IESs (1 & 2)
MAC (1 & 2)
Laura Landweber 2020
Oxytricha trifallax
Modest variations in the silencing
What interests us most
We didn't find any difference between experiments
The old MAC (maternal) drives everything during the formation of the new one:
= Non-Mendelian / cytoplasmic heredity
A outAT score20 idQv20 (812 seq)
A outAT score20 idQv20 + Strong BH correction (176 seq)
ipdRatio out GATC before filtering BH vs after (qv20/idqv20)
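For readers unfamiliar with the BH (Benjamini-Hochberg) correction mentioned above, here is a minimal sketch of the standard step-up procedure. The function name, the alpha level and the example p-values are illustrative assumptions; the exact settings behind the "strong" correction used here are not specified on the slides.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Standard step-up rule: find the largest rank k such that
    p_(k) <= (k / m) * alpha, then reject the k smallest p-values.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * alpha:
            cutoff_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected


pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print(benjamini_hochberg(pvals, alpha=0.05))
```

In this toy example only the two smallest p-values survive, although five of the eight are below 0.05 before correction.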
After rigorous checking, I concluded that the capping as implemented is not a problem for our SMSN approach
modelPrediction is the IPD value predicted by the model for the nucleotide context at this position
globalIPD is the mean of all the IPD values of the read.
localIPD represents all IPDs that have been mapped at a given position in the genome, including those from other sequences
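To make the last two definitions concrete, here is a minimal sketch on toy data. The data layout (each read as a list of (genome position, IPD) pairs), the read names and the function names are assumptions made for the example; they do not reflect the internal structures of any PacBio tool.

```python
from collections import defaultdict
from statistics import mean

# Toy reads: each read is a list of (genome_position, ipd) pairs (hypothetical values).
reads = {
    "read_a": [(10, 0.2), (11, 0.4)],
    "read_b": [(11, 0.6), (12, 1.0)],
}


def global_ipd(read):
    """globalIPD: mean of all the IPD values of one read."""
    return mean(ipd for _, ipd in read)


def local_ipds(all_reads):
    """localIPD: every IPD mapped at each genome position, across all reads."""
    by_position = defaultdict(list)
    for read in all_reads.values():
        for position, ipd in read:
            by_position[position].append(ipd)
    return dict(by_position)


print({name: global_ipd(read) for name, read in reads.items()})
print(local_ipds(reads))
```

Position 11 is covered by both toy reads, so its localIPD pools values from two different molecules, while each globalIPD only depends on its own read.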
Conclusion on the capping
Recently, I have coded ipdtools. It predicts:
Main problems encountered:
Problems that we solved:
"smrtrue"
Private GitHub repository
Within the total DNA:
SMSN implemented on GitHub by J. Beaulaurier
But:
Using SMSN is mandatory in our case, but it is also new and requires developing our own tools
--> "true_smrt" Python package