PhD Defense
DELEVOYE Guillaume
09/06/2022
Supervisor : Dr MEYER Eric
Jury members : Dr DUHARCOURT Sandra, Dr CHEN Chunlong, Dr DURET Laurent
Unicellular eukaryote (ciliate) with 3 nuclei:
DNA ratio: 1 MIC for 200 MAC
After sexual processes, a new MAC is formed, with major genome rearrangements
Results in a MAC DNA almost purely made of coding sequences
Coyne et al. 2012
49.260 unique sequences
O. Arnaiz et al. 2012
"Invade Bloom Abdicate Fade" model
Adapted from Glen Arthur Herrick 1997
Excision = 100% PiggyMac (Pgm)
O. Arnaiz et al. 2012
Not sufficient to distinguish IESs from the rest of the genome
S. E. Allen and M. Nowacki - 2017
If not in the maternal MAC : Recognized and excised
Inactivation of scnRNA and iesRNA pathways:
All IESs may be recognized through the small RNAs
... but is there a redundant system for the oldest/shortest ones ?
Research question ~ Self vs non-self recognition
Hypothesis : Role of DNA modifications
DNA modifications could play a role :
2.1-2.5% in the MAC and MIC of P. aurelia (Cummings et al. 1975)
... Could be N6-methyladenine (6mA)
?
Other:
If the pattern allows the recognition of IESs, it must allow :
A typical example : DNMT1-Like system
And many other possibilities...
Maintenance
through replication ?
Single-nucleotide precision OK
Maintenance through replication : OK if DNMT1-Like
Single-nucleotide precision ?
Long before I arrived in the lab !
Expr. Sexual events (Collaborators)
RNA methyltransferases ?
Our 6 proteins
WT Veg
Control
silencing
T=2h
T=6h
RNA interference
Candidate methylases
Reduction 6mA
Southwestern blot
up to 90%
Total DNA
1:200 MIC !!!
Protocol :
Autogamy
~ 85% accuracy
~ 100% accuracy
if many passes
10 – 15 kb
IPD
Nucleotide context (-3/+8nt)
Kinetic signatures
Depends on
DNA modifications
$$ipdRatio = \frac{MeanIPD_{experiment}}{MeanIPD_{unmethylated\ control}}$$
(Relevant if >25 measurements)
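The ratio above can be sketched in a few lines. This is a minimal illustration, not the PacBio implementation: the function name `ipd_ratio` and the choice to return `None` below the coverage cutoff are assumptions made for the example.

```python
from statistics import mean

MIN_MEASURES = 25  # cutoff quoted on the slide ("relevant if >25")


def ipd_ratio(sample_ipds, control_ipds, min_measures=MIN_MEASURES):
    """Mean IPD of the sample divided by mean IPD of the unmethylated control.

    Returns None when the sample has too few measurements to be relevant.
    (Hypothetical helper: name and None-convention are illustrative.)
    """
    if len(sample_ipds) <= min_measures:
        return None
    if not control_ipds:
        raise ValueError("control_ipds must not be empty")
    return mean(sample_ipds) / mean(control_ipds)


# A position whose IPDs are on average twice the control's
ratio = ipd_ratio([2.0] * 30, [1.0] * 30)
```

A ratio close to 1 means the polymerase behaved as on unmodified DNA; values well above 1 point to a kinetic signature.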
IPD
Two types of control :
e.g
~O(20) to ~O(100) molecules
[...]
Problem = Purification of MIC DNA
(min. 25X)
Transient ?
each molecule
~O(20) to ~O(100) measures
(MIC)
Workaround : work on total DNA instead
But purifying the MIC is problematic
(min. 25X)
Look at methylation (only) around (some) IESs
in-silico control required
NM4.bam
NM9_10.bam
MT1A_1B.bam
MT1A_1B_2.bam
MT2.bam
HTVEG.bam
MAB.bam
HT2.bam
HT6.bam
NM4_9_10.bam
Challenge = Sorting + No pipeline
~ 85% accuracy
~ 100% accuracy
10 – 15 kb
~350bp
Alignment
MAC
MAC+IES
MIC
IES
Other MIC
Other MIC
IES
MAC-Destined Sequences (MDS)
MAC
TA Junction
That is,
Orders of magnitude :
This is not much, but if we are right, 100% of the scnRNA-independent IESs could be methylated
P(R)
1 - P(R)
5 reads IES -
1 read IES +
2 reads IES +
$$IRS_L = \frac{2}{2+5} \approx 29\%$$
$$IRS_R = \frac{1}{1+5} \approx 17\%$$
The higher the IRS, the higher the retention.
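The boundary-wise score can be written as a small function. This is a sketch of the formula as it appears on the slide (IES+ reads over IES+ plus IES- reads); the function name `irs` is an assumption for illustration.

```python
def irs(ies_plus_reads, ies_minus_reads):
    """IES Retention Score at one boundary: IES+ / (IES+ + IES-)."""
    total = ies_plus_reads + ies_minus_reads
    if total == 0:
        raise ValueError("no read covers this boundary")
    return ies_plus_reads / total


# Read counts from the example: 5 IES- reads, 2 IES+ reads on one
# boundary and 1 IES+ read on the other
irs_left = irs(2, 5)
irs_right = irs(1, 5)
```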
Danger !
e.g
MIC = 4n, MAC = 800n, R = 1/200 = 0.005 , N = 100 NGS reads
$$\mathbb{E}(IRS) \approx 0.01 \quad (\approx 1\ IES^+\ read\ out\ of\ 100)$$
$$P(MIC|IES+) = 50\%$$
No !
Even a low IRS can be problematic for us !
When N is small (~100), small retention levels simply cannot be seen
Due to the MAC ploidy, even the slightest retention leads to $$P(MIC|IES^+) < P(MAC|IES^+)$$
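The ploidy argument can be checked numerically. The sketch below assumes that every MIC copy carries the IES, that a fraction `retention` of MAC copies retains it, and that reads are sampled uniformly from total DNA; the function names are illustrative.

```python
def p_mic_given_ies_plus(retention, mic_ploidy=4, mac_ploidy=800):
    """P(MIC | IES+): share of IES+ copies that sit in the MIC."""
    mic_copies = mic_ploidy              # every MIC copy carries the IES
    mac_copies = mac_ploidy * retention  # MAC copies that retained the IES
    return mic_copies / (mic_copies + mac_copies)


def ies_plus_fraction(retention, mic_ploidy=4, mac_ploidy=800):
    """Expected fraction of IES+ molecules in total DNA."""
    ies_plus = mic_ploidy + mac_ploidy * retention
    return ies_plus / (mic_ploidy + mac_ploidy)


def p_no_ies_plus_read(retention, n_reads=100):
    """Probability of sampling zero IES+ reads among n_reads."""
    return (1 - ies_plus_fraction(retention)) ** n_reads


# Values from the example: MIC = 4n, MAC = 800n, R = 0.005, N = 100
share_from_mic = p_mic_given_ies_plus(0.005)
chance_of_seeing_nothing = p_no_ies_plus_read(0.005, n_reads=100)
```

With these numbers half of the IES+ molecules come from the MAC, yet a 100-read sample has a sizeable chance of containing no IES+ read at all.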
Let's just keep all IESs with an IRS = 0 ?
??
$$P(MIC|IES+) \in [9.5\% - 93.7\%], \alpha = 5\%$$
Problem : The confidence intervals are very wide
For most IESs, we will simply not be able to tell whether an IES+ molecule comes from the MAC or the MIC
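To see why the intervals are so wide, one can propagate a binomial confidence interval on the retention to P(MIC|IES+). The sketch below uses a Wilson interval, which is an assumption chosen for the example and not necessarily the method behind the interval quoted above; it also treats the observed IES+ fraction as a direct estimate of the MAC retention R.

```python
import math


def wilson_interval(k, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for alpha=5%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def p_mic_interval(k, n, mic_ploidy=4, mac_ploidy=800, z=1.96):
    """Interval on P(MIC|IES+) obtained from the interval on retention."""
    r_low, r_high = wilson_interval(k, n, z)

    def to_p(r):
        return mic_ploidy / (mic_ploidy + mac_ploidy * r)

    # High retention means a low MIC share, hence the swapped order
    return to_p(r_high), to_p(r_low)


# No IES+ read among 100: retention may be 0, or a few percent
low, high = p_mic_interval(0, 100)
```

Even with zero IES+ reads observed, the upper bound on R is large enough compared with 4/800 to leave P(MIC|IES+) almost undetermined.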
Workaround : Pooling samples
Rare picture of Eric, doing some archeology to find more samples to pool and gain coverage (circa 2022, colorized)
If MAC ploidy = 800n, then without retention :
$$E(IRS) = \frac{4}{800+4} \approx 0.005$$
If retention :
$$E(IRS) \gg \frac{4}{800+4}$$
0.002-0.003
For our calculations, we used K = 1600n
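The no-retention formula can be inverted to go from an IRS floor to a MAC ploidy K. The sketch below assumes that the 0.002-0.003 range is the baseline IRS observed for IESs without retention, which is an interpretation of the slide and not something it states; the function names are illustrative.

```python
def expected_irs_no_retention(mac_ploidy, mic_ploidy=4):
    """E(IRS) when only the MIC contributes IES+ molecules."""
    return mic_ploidy / (mac_ploidy + mic_ploidy)


def mac_ploidy_from_irs(baseline_irs, mic_ploidy=4):
    """Invert E(IRS) = m / (K + m) to get K = m * (1 - IRS) / IRS."""
    if not 0 < baseline_irs < 1:
        raise ValueError("baseline_irs must be strictly between 0 and 1")
    return mic_ploidy * (1 - baseline_irs) / baseline_irs


# Ploidy implied by the middle of the 0.002-0.003 range
k_mid = mac_ploidy_from_irs(0.0025)
# Ploidies implied by both ends of the range
k_range = (mac_ploidy_from_irs(0.003), mac_ploidy_from_irs(0.002))
```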
When using :
On average, we will still have only very vague estimates of P(MIC|IES+) !
On average on all the IESs without retention :
$$R = 0 \implies P(MIC|IES^+) \in [33\% - 100\%]$$
CONCLUSION :
On a fake dataset (simulated retention)
Real retention R (simulated)
P(MIC|IES+)
Can we identify the MIC sequences ?
> We should expect much of our MIC data to be unusable
$$R > 0 \implies P(MIC|IES^+) \approx 0$$
(+ Found a new way to quantify IES retention)
(+ re-estimated the MAC ploidy)
Output scores :
"ipdRatio": 2, 1, 4...
"ModificationQv" : 0, 30, 25...
"identificationQv": 1, 0, 50...
When do we call a nucleotide methylated ?
E. coli is used to feed Paramecium (contaminant DNA)
> Can be used to benchmark our pipeline's ability to detect 6mA
Motif effect
Coverage
L. methylated
L. unmethylated
> 50
35-50
25-35
15-25
0-15
Coverage effect
Either a nucleotide is methylated, or it is not :
Our pragmatic solution : An arbitrary linear threshold
If we make the simplification that all GATC/EcoK sites are methylated and that 6mA is only present there :
$$Sensitivity = P(D|M)$$
$$Se = 92\%$$
But :
$$Specificity = P(\overline{D}|\overline{M})$$
$$Sp = 99.8\%$$
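Both quantities come from a confusion matrix on the E. coli benchmark. The counts below are hypothetical, chosen only so that the result matches the Se and Sp quoted above; they are not the real benchmark counts.

```python
def sensitivity(true_pos, false_neg):
    """Se = P(D | M): detected among truly methylated sites."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Sp = P(not D | not M): undetected among truly unmethylated sites."""
    return true_neg / (true_neg + false_pos)


# Hypothetical counts (illustration only)
se = sensitivity(true_pos=920, false_neg=80)
sp = specificity(true_neg=9980, false_pos=20)
```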
We can easily have more than 50% of so-called hemi-methylated sites that are actually not hemi-methylated
Quantifying hemi-methylation is tricky if $$Se < 100\%\ and\ Sp\ < 100\%$$
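A short calculation shows where the difficulty comes from. The sketch assumes that the two strands of a site are detected independently, each with the per-adenine Se and Sp; that independence is an assumption of the example, and the function name is illustrative.

```python
def p_called_hemi(methylated_strands, se, sp):
    """Probability that exactly one strand of a site is called methylated.

    methylated_strands: true number of methylated strands (0, 1 or 2).
    """
    if methylated_strands not in (0, 1, 2):
        raise ValueError("methylated_strands must be 0, 1 or 2")
    # Per-strand probability of a positive call
    strands = [True] * methylated_strands + [False] * (2 - methylated_strands)
    p1, p2 = (se if is_methylated else 1 - sp for is_methylated in strands)
    return p1 * (1 - p2) + (1 - p1) * p2


se, sp = 0.92, 0.998
# A fully methylated site looks hemi-methylated when one strand is missed
from_fully_methylated = p_called_hemi(2, se, sp)
# An unmethylated site looks hemi-methylated after one false positive
from_unmethylated = p_called_hemi(0, se, sp)
```

Since fully methylated and unmethylated sites can greatly outnumber truly hemi-methylated ones, these small per-site probabilities can dominate the apparent hemi-methylated fraction.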
PacBio sequencing was already known for its propensity to generate false positives for 4mC (K. O’Brown et al. 2014)
Qv30
Great to detect 6mA
Other ??
Potential problems when quantifying hemi-methylation
• Between 1.25% and 1.45% of 6mA in the MAC
• Between 97.39% and 100% of them are located in AT sites
Taking account of the uncertainty of Se and Sp :
Problem : Some results will vary greatly depending on Se and Sp !
| Se | Sp | Interpretation | Scenario Number |
|---|---|---|---|
| 100% | 100% | Perfect | 1 |
| 100% | 99.8% | Sometimes it invents | 2 |
| 92% | 100% | Sometimes it misses | 3 |
| 92% | 99.8% | A few confusions here and there | 4 |
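One standard way to turn an observed detection rate into a corrected rate under each scenario is the Rogan-Gladen estimator. It is shown here as an assumption, since the slides do not say which correction was used, and the observed fraction of 1.35% is a hypothetical input chosen for illustration.

```python
# (Se, Sp) for the four scenarios of the table
SCENARIOS = {
    1: (1.0, 1.0),
    2: (1.0, 0.998),
    3: (0.92, 1.0),
    4: (0.92, 0.998),
}


def corrected_fraction(observed, se, sp):
    """Rogan-Gladen estimate of the true positive fraction.

    observed = Se * true + (1 - Sp) * (1 - true), solved for `true`
    and clipped to [0, 1].
    """
    denom = se + sp - 1
    if denom <= 0:
        raise ValueError("Se + Sp must exceed 1")
    estimate = (observed + sp - 1) / denom
    return min(1.0, max(0.0, estimate))


observed = 0.0135  # hypothetical observed fraction of adenines called 6mA
by_scenario = {
    number: corrected_fraction(observed, se, sp)
    for number, (se, sp) in SCENARIOS.items()
}
```

Lower specificity pulls the estimate down (some calls were invented), while lower sensitivity pushes it up (some sites were missed).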
> Involved in the bulk of 6mA in the MAC
P. tetraurelia
Mitochondrial
Total DNA (ethidium bromide)
-80 to -90% NM proteins
-60 to -70% MT proteins
We expected -90%
We only get -50% ??
Rise in hemi-methylation, whose magnitude depends strongly on how well Se and Sp are estimated
Fraction of AT sites that are hemi-methylated
The capacity to make symmetrical methylation is never abolished completely
De novo methylation of unmethylated AT sites :
Unchanged in NM4.
Drops everywhere else
All weakly involved except NM4 (not involved)
All strongly involved
Never abolished
Function = Symmetry +++
Predicted FDR : 100%
But likely detection outside of AT sites too
never erased
AGAA and GAGG motifs are documented as methylated sites (6mA) in C. elegans too (Greer et al. 2015)
Number of molecules with at least one exploitable adenine
- several IESs
- variable MAC regions
- extremity outliers
Is our calculation valid for several IESs ?
Remaining with computable P(MIC|IES+)
The vast majority of IES+ molecules actually come... from the MAC !!!
Methodological developments :
On our MAC data :
On our MIC data:
We cannot exclude the hypothesis that all 6mA comes from the MAC
On these data :
For the future :
Team Meyer
Team Genovesio
SPIBENS
IT and admin staff
Jury: Chunlong, Sandra, Eric, Laurent
Thesis committee: Linda, Mireille, Chunlong
The Paramecium community
The helping hands behind the data
My coffee-break colleagues
You, the audience
Special mentions for direct contributions :
Eric Meyer
France Rose
Mathieu Bahin
O. Arnaiz
Leandro
MTA1 -- ortholog of 4-9-10
MTA9 -- Not catalytic in Tetrahymena
[..] --> MT1A1B2
No 6mA in the Tetrahymena MIC
If p is the number of positive detections among N tests:
p = FP + TP
So,
Which means
And:
Let FD1 and FD2 be, respectively:
Then:
With
We can also find the number of hemi-methylated sites detected as such, and the proportion of sites detected as hemi-methylated that really are hemi-methylated. This is possible because we now approximately know P(Z=0), P(Z=1) and P(Z=2), and P(D|Z) is easy to determine:
Then, P(Z|D) can be determined through Bayes' theorem using P(D|Z), P(Z) and P(D) (which are all known)
P(Z=1|D=1)
is our case of interest
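The Bayes step can be sketched as follows, reading Z as the true number of methylated strands at a site and D=1 as "exactly one strand detected". The prior below is hypothetical, and the likelihoods assume independent strands with Se = 92% and Sp = 99.8%; both are assumptions made for the example.

```python
def posterior(prior, likelihood):
    """Bayes' theorem: P(Z=z | D) for each z, given P(Z=z) and P(D | Z=z)."""
    joint = [p * l for p, l in zip(prior, likelihood)]
    evidence = sum(joint)  # P(D)
    if evidence == 0:
        raise ValueError("the observation has zero probability")
    return [j / evidence for j in joint]


# Hypothetical prior over Z = 0, 1, 2 methylated strands (illustration only)
prior = [0.97, 0.01, 0.02]

# P(D=1 | Z) with Se = 0.92 and Sp = 0.998, strands assumed independent:
#   Z=0: 2 * (1 - Sp) * Sp
#   Z=1: Se * Sp + (1 - Se) * (1 - Sp)
#   Z=2: 2 * Se * (1 - Se)
likelihood = [0.003992, 0.91832, 0.1472]

p_z_given_d1 = posterior(prior, likelihood)
p_truly_hemi = p_z_given_d1[1]  # P(Z=1 | D=1), the case of interest
```

Changing the prior or the assumed Se and Sp shifts P(Z=1|D=1) substantially, which is why the estimate of hemi-methylation is so sensitive to them.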
modelPrediction is the IPD value predicted by the model for the nucleotide context at this position.
globalIPD is the mean of all the IPD values of the read.
localIPD represents all IPDs that have been mapped at a given position in the genome, including those from other sequences.
Conclusion on the capping
Laura Landweber 2020
Oxytricha trifallax
A outAT score20 idQv20 (812 seq)
A outAT score20 idQv20 + Strong BH correction (176 seq)