Lab Meeting 11/05/2023
Philosophical Transactions, vol. 23, issue 288. 1703.
CNRS, Leica Microsystems, 2022
2022: PhD degree at ENS Paris (bioinformatics)
2018: MSc Bioinformatics - Paris Diderot
Code
Stats
Hematology (HSCT)
ML
Ciliates
NGS data analysis
Structural biology
MCMC
DNA methylation
"The white rats of ciliophora"
A. van Leeuwenhoek (1668)
"Animalcules"
Pasteur (1862)
Spontaneous generation
HS Jennings (~ 1900)
Paramecium as a model
T. Sonneborn (1937)
Non-mendelian inheritance
of sexual type
in Paramecium
Carol Greider &
Elizabeth Blackburn
Telomeres (1985) - Nobel prize
Meyer and Duharcourt (2014)
Sexual type is inherited via maternal RNAs, in Paramecium
> The genome-wide programmed rearrangements <
Unicellular eukaryote with 3 nuclei:
DNA ratio: 1 MIC for 200 MAC
The difference ?
MAC = Somatic genome
MIC = Full germline genome
Very rare picture of me, explaining my PhD during the lab meeting Pere Castor et Al. (Flammarion FR - Circa 2023 - Colourized)
Lessons learned from 3.5 Bn years of evolution
Barbara McClintock : The jumping genes
(Finnegan)
Copy/Paste
Cut/Paste
Both can be autonomous or dependent
... And many others
Up to 86%
He dead ?
2 hypotheses (neither proven):
1 - Very high multiplication rate + very high vulnerability to TEs
2 - Defense mechanism so great it's not affected
TEs were first considered as :
TEs are conserved not because they provide an additional fitness to their host, but despite the fact that they don't. This is the non-phenotypic selection.
Doolittle, Orgel, Crick and Sapienza (1980)
The same day in Nature journal
Wicker et al. 2007
Based on:
The cut-paste versus copy-paste comparison turned out to be less relevant over time
The paradigm is currently shifting from "Junk DNA" to "Major actors of evolution".
$$\sim2^N$$
> Any particularly vulnerable organism has been wiped out in the past
> Logical conclusion : All remaining life forms have some kind of resilience towards TEs.
> All virulent TEs have wiped out their host (and disappeared with them)
Many epigenetic regulations exist :
P. tetraurelia has an original way of dealing with TEs
Hints of a hidden past
The case of the "Sleeping beauty" transposon
After sexual processes, a new MAC is formed, with important genome re-arrangements
Results in a MAC DNA almost purely made of coding sequences
Coyne et al. 2012
49.260 unique sequences
O. Arnaiz et al 2012
"Invade Bloom Abdicate Fade" model
Adapted from Glen Arthur Herrick 1997
Excision = 100% PiggyMac (Pgm)
O. Arnaiz et al. 2012
Not sufficient to distinguish IESs from the rest of the genome
S. E. Allen and M. Nowacki - 2017
If not in the maternal MAC : Recognized and excised
We understand the recognition of ~30% of IESs (small ncRNA)
...How does the cell recognize the other 70% ?
But, almost nothing was known about DNA Methylation in both the MIC, and the MAC
Invade, Bloom, Abdicate, Fade
Inactivation of scnRNA and iesRNA pathways:
All IESs may be recognized through the small RNAs
... but is there a redundant system for the oldest/shortest ones ?
Problematic ~ Self VS non-self recognition
Hypothesis : Role of DNA modifications
DNA modifications could play a role :
2.1-2.5% in the MAC and MIC of P. aurelia (Cummings et al. 1975)
... Could be N6-methyladenine (6mA)
?
Other:
If the pattern allows the recognition of IESs, it must allow :
A typical example : DNMT1-Like system
And many other possibilities...
Maintenance
through replication ?
Single-nucleotide precision OK
Maintenance through replication : OK if DNMT1-Like
Single-nucleotide precision ?
Constant 6mA pattern in the MIC (only) ?
Other:
2.1-2.5% in the MAC and MIC of P. aurelia (Cummings et al. 1975)
Long before I arrived in the lab !
Expr. Sexual events (collaborators)
RNA methyltransferases ?
Our 6 proteins
Protocol :
Vegetative
cells
Control
silencing
RNA interference
Candidate methylases
Preliminary results : Reduction 6mA
Southwestern blot
up to 90%
Total DNA
1:200 MIC !!!
P. tetraurelia
Mitochondrial
Total BET
IPD
Nucleotide context (-3/+8nt)
Kinetic signatures
Depends on
DNA modifications
~ 85% accuracy
~ 100% accuracy
if many passes
10 – 15 kb
e.g
~O(20) to ~O(100) molecules
[...]
Problem = Purification of MIC DNA
(min. 25X)
each molecule
~O(20) to ~O(100) measures
(MIC)
Workaround : work on total DNA instead
But purifying the MIC is problematic
(min. 25X)
Look at methylation (only) around (some) IESs
in-silico control required
~ 85% accuracy
~ 100% accuracy
10 – 15 kb
~350bp
Alignment
MAC
MAC+IES
MIC
IES
Other MIC
Other MIC
IES
MAC-Destined Sequences (MDS)
MAC
TA Junction
Total DNA was sequenced instead
IES
Other MIC
Other MIC
IES+
MAC-Destined Sequences (MDS)
IES-
Total DNA -> 99.5% of sequencing data = waste
Reminder : 1:200 comes from the MIC
That is,
This is not much, but if we had been right, 100% of them would have to be methylated
P(R)
1 - P(R)
> Not all IES+ molecules come from the MIC
Danger ! -> We are interested only in the *MIC*
e.g : If retention = 1/200,
~4 IES+ reads come from the MAC
4 IES+ reads come from the MIC
P( MIC|IES+ ) = 50%
MIC IES+ reads
MAC IES+ reads
MAC IES- reads
MIC : 4n
MAC: 800n
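The P(MIC|IES+) = 50% example can be reproduced with a short sketch. It assumes that reads sample nuclear copies uniformly, that every MIC copy carries the IES, and that a fraction `retention` of the MAC copies also carries it. The function and parameter names are illustrative and not part of the original pipeline.

```python
def p_mic_given_ies_plus(retention, mic_ploidy=4, mac_ploidy=800):
    """Probability that an IES+ read comes from the MIC.

    Assumption: reads are drawn uniformly from all IES+ copies in the cell.
    """
    mic_copies = float(mic_ploidy)          # every MIC copy carries the IES
    mac_copies = mac_ploidy * retention     # only retained MAC copies carry it
    return mic_copies / (mic_copies + mac_copies)


if __name__ == "__main__":
    for r in (0.0, 1 / 200, 0.05, 0.5):
        print(f"retention={r:.3f}  P(MIC|IES+)={p_mic_given_ies_plus(r):.3f}")
```

Passing `mac_ploidy=1600` shows how quickly the posterior drops when the MAC ploidy estimate is doubled.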
5 reads IES -
1 read IES +
2 reads IES +
$$IRS_L = \frac{2}{2+5} \approx 29\%$$
$$IRS_R = \frac{1}{1+5} \approx 17\%$$
The higher the IRS, the higher the retention.
If MAC ploidy = 800n, then without retention :
$$E(IRS) = \frac{4}{800+4} \approx 0.005$$
If retention :
$$E(IRS) \gg \frac{4}{800+4}$$
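A minimal sketch of the retention score, assuming IRS is simply the fraction of IES+ reads among the reads covering one IES boundary, as in the left/right example. The helper names are hypothetical.

```python
def irs(ies_plus_reads, ies_minus_reads):
    """IES retention score at one boundary: IES+ / (IES+ + IES-)."""
    total = ies_plus_reads + ies_minus_reads
    if total == 0:
        raise ValueError("no reads cover this boundary")
    return ies_plus_reads / total


def expected_irs_without_retention(mic_ploidy=4, mac_ploidy=800):
    """Expected IRS on total DNA when only the MIC copies carry the IES."""
    return mic_ploidy / (mic_ploidy + mac_ploidy)


if __name__ == "__main__":
    print(f"IRS_L = {irs(2, 5):.3f}")
    print(f"IRS_R = {irs(1, 5):.3f}")
    print(f"E(IRS) without retention = {expected_irs_without_retention():.4f}")
```

An observed IRS well above the no-retention baseline then points to retention in the MAC.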
0.002-0.003
??
P(R) is the only
unknown variable
On a fake dataset (simulated retention)
Real retention R (simulated)
P(MIC|IES+)
Can we identify the MIC sequences ?
> We should expect lots of our MIC data to be impossible to use
$$R > 0 \implies P(MIC|IES^+) \approx 0$$
> With an updated MAC ploidy of 1600N...Most IES+ reads come from the MAC !
> Extremely problematic
for us
PB : When do we call a nucleotide methylated ?
E. coli is used to feed Paramecium (contaminants)
~100% of GATC sites = 6mA
Same for EcoK sites
>> Benchmark
Motif effect
Coverage
L. methylated
L. unmethylated
> 50
35-50
25-35
15-25
0-15
Coverage effect
Either a nucleotide is methylated, or it is not :
Our pragmatic solution : An arbitrary linear threshold
If we make the simplification that all GATC/EcoK sites are methylated and that 6mA is only present there :
$$Sensitivity = P(D|M)$$
$$Se = 92\%$$
But :
$$Specificity = P(\overline{D}|\overline{M})$$
$$Sp = 99.8\%$$
We can easily have more than 50% of so-called hemi-methylated sites that are actually not hemi-methylated
Quantifying hemi-methylation is tricky if $$Se < 100\%\ and\ Sp\ < 100\%$$
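To see why imperfect Se and Sp inflate apparent hemi-methylation, here is a hedged sketch. It assumes the two adenines of an AT site are called independently, and the prior fractions in `PRIORS` are purely hypothetical values chosen for illustration, not measurements from these data.

```python
def called_hemi_likelihoods(se, sp):
    """P(site is called hemi-methylated | true state), strands independent."""
    return {
        # both strands methylated, exactly one of the two 6mA is missed
        "symmetric": 2 * se * (1 - se),
        # one strand methylated: correct call, or miss plus false positive
        "hemi": se * sp + (1 - se) * (1 - sp),
        # no methylation, exactly one false positive
        "unmethylated": 2 * (1 - sp) * sp,
    }


def ppv_hemi(se, sp, priors):
    """P(truly hemi-methylated | called hemi-methylated) by Bayes' rule."""
    lik = called_hemi_likelihoods(se, sp)
    joint = {state: priors[state] * lik[state] for state in priors}
    return joint["hemi"] / sum(joint.values())


# Hypothetical prior fractions of AT sites in each true state (assumption).
PRIORS = {"symmetric": 0.10, "hemi": 0.01, "unmethylated": 0.89}

if __name__ == "__main__":
    print(f"P(hemi | called hemi) = {ppv_hemi(0.92, 0.998, PRIORS):.2f}")
```

With Se = 92%, a truly symmetric site is called hemi-methylated about 15% of the time, so symmetric sites alone can dominate the hemi-methylated calls whenever true hemi-methylation is rare.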
| Se | Sp | Interpretation | Scenario Number |
|---|---|---|---|
| 100% | 100% | Perfect | 1 |
| 100% | 99.8% | Sometimes it invents | 2 |
| 92% | 100% | Sometimes it misses | 3 |
| 92% | 99.8% | A few confusions here and there | 4 |
PacBio sequencing was already known for its propensity to generate false positives for 4mC (K. O’Brown et al. 2014)
Qv30
Depending on Se&Sp :
Rise in hemi-methylation, whose magnitude depends strongly on how well Se and Sp are estimated
Fraction of AT sites that are hemi-methylated
The capacity to make symmetrical methylation is never abolished completely
De novo methylation of unmethylated AT sites :
Unchanged NM4.
Drops everywhere else
Predicted FDR : 100%
But likely detection outside of AT sites too
never erased
AGAA and GAGG motifs
are documented as methylated sites (6mA) in C. elegans too (Greer et al. 2015)
Beh et al. 2019
The methylation pattern in the MAC is close to the one already observed in Oxytricha trifallax
All weakly implied
All strongly implied
Function = Symmetry +++
Details on demand...
Still:
But: This is actually a pure coincidence
P(hemi-methylated | called hemi-methylated) ~ 20% to 50% !
> Very hard to make qualitative analysis
Reminder : Se ~ 92% and Sp ~ 99.8%
There is not a single methylated molecule for which we have reasonable certainty that it comes from the MIC...
Number of molecules with at least one exploitable adenine
- several IESs
- variable MAC regions
- extremity outliers
Is our calculation valid for several IESs ?
Remaining with computable P(MIC|IES+)
Methodological developments :
On our MAC data :
On our MIC data:
We cannot exclude the hypothesis that all 6mA comes from the MAC
On these data :
For the future :
Interesting discussion :
Hematopoietic Stem Cell transplants (HSCT)
Preliminary results +/- encouraging
Still ongoing
Dire Straits <3
4X - Strategy games
Chess
Memes
Climbing
lolcats
who doesn't ?
Our supreme leader
Guido Van Rossum
What is he ? A man ? A God ?
We'll never know.
(when I win)
Playing music
Internet
Hiking
Thanks :)
Candidates identified:
2) PacBio sequencing with short inserts | 6mA ++
1) Grouped silencings by sequence homology
Inserts up to 350bp
Sequencing : unique molecule & both strands
Polymerase slowing ~ methylated adenine
IESs are sometimes retained in the MAC
What is P(MIC|IES+) ?
(skipped)
Details that I spare you (but don't hesitate to ask) :
No scientific method is perfectly reliable
In our case:
> Expect False Positives and False Negatives
We can use 4 extreme scenarios :
... And see what is consistent no matter the scenario we place ourselves into
If p number of positive detections among N tests:
p = FP + TP
So,
Which means
with $$\pi$$ = fraction of 6mA
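The intermediate equations did not survive on this slide. One standard way to go from p = FP + TP to the true fraction of 6mA, assuming TP = N·π·Se and FP = N·(1 − π)·(1 − Sp), is the Rogan-Gladen correction sketched below. This is an illustration under those assumptions, not necessarily the exact formula that was shown.

```python
def corrected_fraction(p, n, se, sp):
    """Estimate the true methylated fraction pi from p positive calls among n.

    Solves p/n = pi*se + (1 - pi)*(1 - sp) for pi (Rogan-Gladen estimator),
    then clips the result to [0, 1].
    """
    denom = se + sp - 1
    if denom <= 0:
        raise ValueError("the detector is no better than chance (Se + Sp <= 1)")
    apparent = p / n
    pi = (apparent - (1 - sp)) / denom
    return min(1.0, max(0.0, pi))


if __name__ == "__main__":
    # Hypothetical counts: 1118 positive calls among 100000 tested adenines.
    print(f"pi = {corrected_fraction(1118, 100_000, se=0.92, sp=0.998):.4f}")
```

Running the same counts through each (Se, Sp) scenario shows which conclusions hold regardless of the scenario.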
> 95% n6mA are located in AT sites
Let FD1 and FD2 be resp:
Then:
With
2) Transient in the new forming MAC
1) Constant pattern in the MIC
Transient ?
Our hypothesis:
Analysis showed that:
~2.5% of n6mA in Paramecium (MIC) : Cummings 1974
> Actually way lower (if any)
> Unexpectedly, NM and MT proteins play a completely different role in the MAC,
not the MIC
IGEM2014
MT and NM families :
In the MAC
Convertase activity DNMT1-like
But: Hemi-methylated AT sites kept after many mitoses
New hypothesis:
Symmetrical methylation of AT sites = Mitotic clock or maturity indicator for the cell to be allowed to enter meiosis
Red socks
Fun
Windows 10 update of May 2021
1) Paramecium tetraurelia
3) DNA methylation
4) PacBio sequencing
2) Transposable elements / IESs
5) hemi-methylation of
palindromic motifs
(AT sites)
Unicellular eukaryote with 3 nuclei:
DNA ratio: 1 MIC for 200 MAC
The difference ?
MAC genome compared to MIC =
In the MIC:
--> How ?
Present in the MIC
Absent in the MAC
= Excised generation after generation
2) Transient in the new forming MAC
1) Constant pattern in the MIC
Transient ?
Our hypothesis:
~2.5% of n6mA in Paramecium (MIC&MAC) : Cummings 1974
Candidates:
2) PacBio sequencing with short inserts | 6mA ++
1) Grouped silencings by sequence homology
Reads up to 80 kbp
1) Sequencing (unique molecule)
2) High fidelity consensus
3) DNA methylation analysis
2) Crosstalks in the new forming MAC
1) Constant pattern in the MIC
Transient ?
Our hypothesis:
Preliminary analysis shows that:
~2.5% of n6mA in Paramecium (MIC&MAC) : Cummings 1974
~95% of n6mA locates in AT dinucleotides in the MAC
75% of the methylation in an AT dinucleotide is actually symmetrically modified
Bulk of methylation:
Total = 1.2 to 1.5% of adenines
0.6% of the adenines outside AT sites:
Thanks :)
Guillaume DELEVOYE
3rd Year PhD student
Bioinformatics
Unicellular eukaryote with 3 nuclei:
DNA ratio: 1 MIC for 200 MAC
+ Difficult to purify the MIC DNA
45.000 Unique sequences
Only 30% of IESs are small-ncRNA dependent (shown by DICER-like2-3 silencing)
6mA likely to be abundant in Paramecium:
Suspected 2.5% in the MAC of P. aurelia by Cummings et al. (1975)
Also documented 6mA in the MIC
2) Crosstalks in the new forming MAC
1) Constant pattern in the MIC
Transient ?
Candidates:
2) PacBio sequencing with short inserts | 6mA ++
1) Grouped silencings by sequence homology
Reads up to 80 kbp
1) Sequencing (unique molecule)
2) High fidelity consensus
3) DNA methylation analysis
99% accuracy
Max
75% accuracy
Because our inserts are circular and short, we can make CCS of high accuracy despite a 15% error rate
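A rough sketch of why many passes over a short circular insert help. It uses a deliberately simplified model: each pass calls a base correctly with probability 0.85, passes are independent, and the consensus is a strict majority vote. The real CCS algorithm is more sophisticated, so this only illustrates the trend.

```python
from math import comb


def majority_vote_accuracy(n_passes, per_pass_accuracy=0.85):
    """P(strict majority of passes is correct) under a binomial model.

    Ties (possible when n_passes is even) are counted as errors.
    """
    a = per_pass_accuracy
    k_min = n_passes // 2 + 1  # smallest number of correct passes that wins
    return sum(
        comb(n_passes, k) * a**k * (1 - a) ** (n_passes - k)
        for k in range(k_min, n_passes + 1)
    )


if __name__ == "__main__":
    for n in (1, 3, 5, 11, 21):
        print(f"{n:2d} passes -> consensus accuracy {majority_vote_accuracy(n):.5f}")
```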
Deduced origin
MIC DNA
Alignment of consensus
Only a few remaining: ~ 10 to ~200 sequences
100% should carry a methylation pattern
"genome"
Total Paramecium
Total sequencing
Little bug to be corrected
30%
+ Applying the retention filter divides the number of "MAC_IES" approximately by 50%
H0: ipdRatio of unmethylated adenines
H1: ipdRatio of 6mA
S: Threshold on the pvalue
--> Specificity
--> Sensitivity
$$\alpha$$
$$1 - \beta $$
Adenine
6mA
ln(ipdRatio) ~ N(0,1)
log(ipdRatio)
A p-value is just the probability, under H0, of a log(ipdRatio) at least as extreme as the one observed (the tail of H0)
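As a sketch of this definition, the snippet below computes a one-sided p-value, taking the slide's idealized null ln(ipdRatio) ~ N(0,1) at face value and looking only at the upper tail (slower polymerase). The function name is hypothetical and this is not PacBio's actual test.

```python
import math


def upper_tail_pvalue(ipd_ratio):
    """P(Z >= ln(ipd_ratio)) for Z ~ N(0, 1), i.e. the upper tail under H0."""
    z = math.log(ipd_ratio)  # natural log, assumed standard normal under H0
    return 0.5 * math.erfc(z / math.sqrt(2))


if __name__ == "__main__":
    for ratio in (1.0, 2.0, 5.0, 10.0):
        print(f"ipdRatio={ratio:5.1f}  p={upper_tail_pvalue(ratio):.4g}")
```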
E.coli:
I investigated PacBio's output on its GATC & EcoK sites vs. other sites
Separability rises with coverage, which is expected
...but are not ideally distributed
Ideal pvalues
--> Allows magic !
PacBio's
$$\pi_{0}, \pi_{1}$$
All Adenines' pvalues [E.coli] coverage > 40X
PacBio's pvalues:
For n6mA, PacBio produces:
They are PHRED-transformed p-values of two different statistical tests, that rely on the mean of the IPDs
The scores (Qv) are PHRED-transformed p-values
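For reference, the PHRED transformation mentioned here is Q = −10·log10(p). A minimal sketch of the conversion in both directions (helper names are illustrative):

```python
import math


def phred(p_value):
    """PHRED-scale a p-value: Q = -10 * log10(p)."""
    if not 0 < p_value <= 1:
        raise ValueError("p-value must be in (0, 1]")
    return -10 * math.log10(p_value)


def from_phred(q):
    """Inverse transform: p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


if __name__ == "__main__":
    for p in (0.05, 0.01, 0.001):
        print(f"p={p:<6}  Qv={phred(p):.1f}")
```

Under this scale, Qv20 corresponds to p = 0.01 and Qv30 to p = 0.001.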
Typical covscore plot
Modification score / coverage
Using a flat threshold on the modification score = huge lack of power
From now on
"positive detection"
=
score > linear threshold
(only >25X considered)
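The calling rule can be sketched as follows. The `slope` and `intercept` defaults are placeholders, since the actual threshold line is not given on the slide; only the 25X coverage cut-off comes from the text.

```python
MIN_COVERAGE = 25  # from the slide: only positions above 25X are considered


def is_positive(score, coverage, slope=1.0, intercept=10.0):
    """Call a position with a coverage-dependent (linear) score threshold.

    Returns None when coverage is too low for the position to be considered.
    slope and intercept are hypothetical values for illustration only.
    """
    if coverage <= MIN_COVERAGE:
        return None
    return score > slope * coverage + intercept


if __name__ == "__main__":
    print(is_positive(score=100, coverage=30))  # above the line
    print(is_positive(score=35, coverage=30))   # below the line
    print(is_positive(score=100, coverage=20))  # not considered
```

Letting the threshold grow with coverage avoids the loss of power that a single flat cut-off on the modification score would cause.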
How good (or bad) is our method ?
$$Se = P(D^+|M) \approx 92\%$$
$$Sp = P(D^-|NM) \approx 99.8\%$$
Starting from sufficient coverages (~20X to ~30X), Se and Sp don't depend on the coverage anymore
If p number of positive detections among N tests:
p = FP + TP
So,
Which means
And:
~95% of the methylation locates in AT dinucleotides in the MAC
True in any condition
75% of the methylation in an AT dinucleotide is actually symmetrically modified, independently of being in the MAC or the MIC
Kept in MIC and MAC
All conditions
Present in all samples
Never erased
Conclusion: We are either in scenario 1 or 3
(Sp largely underestimated)
Was expected but confirmed
Let FD1 and FD2 be resp:
Then:
With
Pure MIC sequences: Cannot yet be trusted (error in the pipeline)
Coverage >= 40X made us lose too much material; we should restart with >= 20X
HTVEG
MT2
--> Some molecules carry all the detections, in sym-A*T
Very likely to be sequences coming from the MAC, in my opinion
At first look, seems like the same in all samples
Sorry for the headaches !! Thanks for your time :)
Ciliates are great model organisms
1862 - Pasteur: Refutation of the spontaneous generation theory with infusoria
1937 - Sonneborn: Non-mendelian inheritance of sexual type in Paramecium
Elizabeth Blackburn & Carol Greider: 1985 - Telomeres and telomerases in Tetrahymena (Nobel Prize 2009)
Eric Meyer, Sandra Duharcourt
(IBENS, I. Jacques Monod 2014)
Sexual type in Paramecium is transmitted by maternal RNAs, not by DNA
modelPrediction is the predicted IPD value by the model in a given context of nucleotides at this position
globalIPD is the mean of all the IPD values of the read.
localIPD represents all IPDs that have been mapped at a given position in the genome, including those from other sequences
Conclusion on the capping
Laura Landweber 2020
Oxytricha trifallax
A outAT score 20 isQv20 (812 seq)
A outAT score20 idQv20 + Strong BH correction (176 seq)
ipdRatio out GATC before filtering BH vs after (qv20/idqv20)