Dec. 15th 2025
Guillaume DELEVOYE
Postdoc researcher, Karlstad University, Public health department
guillaume.delevoye@kau.se
Not a course
No grade, no evaluation
Objectives :
Postdoc researcher, bioinformatician
Public health department, KAU
2,000 pairs : pregnant woman + child
Study the impact of endocrine disruptors on children's health
My focus/background/expertise :
A bioinformatician is a Jack-of-all-trades
... But must have some knowledge, in all of the above
Being curious and self-taught is key
Like public health
... Yet here I am.
Because, actually, it is fun !
John Tukey (1915-2000)
Other contributions :
Defense industry, telecommunications, sexual orientation research, ozone layer damage, TV polls, election analysis, education, printer market, pharmacy, ...
The better you are at statistics, the more fun you'll have in inter-disciplinary research
Every bioinformatician is largely self-taught :
But,
... And that's kind of OK actually
Some of my past projects
Developing a machine-learning web app to estimate donor compatibility in a bone marrow transplant
There is so much to learn everyday !
Ex : Negative ages ?
Here : Random forest
Example : doi: 10.1038/bmt.2016.162.
But ! It is impossible to start working on bone-marrow transplant without basic knowledge in :
A priori it's just statistics like any other
Looking for DNA-methylation markers of Endocrine Disruptors in children's blood
Outlier detection
Ex : Identifying intersex participant in SELMA data
Domain-knowledge is key
DBSCAN clustering, after 2D projection of all the chemical exposures (SELMA data)
Positions in the human genome where DNA methylation of blood cells is affected by phthalates
False Discovery Rate control ...
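As a minimal sketch of what False Discovery Rate control can look like, here is the Benjamini-Hochberg step-up procedure in plain Python. The function name `benjamini_hochberg`, the p-values and the 5% level are illustrative assumptions, not values from the study.

```python
def benjamini_hochberg(pvalues, fdr=0.05):
    """Return a list of booleans: True where the null hypothesis is rejected."""
    m = len(pvalues)
    # Indices of the tests, sorted by increasing p-value
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Largest rank k such that p_(k) <= (k / m) * fdr
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * fdr:
            largest_k = rank
    # Reject every hypothesis ranked at or below largest_k
    rejected = [False] * m
    for idx in order[:largest_k]:
        rejected[idx] = True
    return rejected


# Made-up p-values, for illustration only
example_pvalues = [0.001, 0.008, 0.039, 0.041, 0.042,
                   0.06, 0.074, 0.205, 0.212, 0.216]
example_rejected = benjamini_hochberg(example_pvalues, fdr=0.05)
print(example_rejected)
```

Note that the procedure is "step-up": a p-value above its own threshold is still rejected if a larger p-value further down the ranking passes its threshold.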
Deconvolution of Gaussian data
After deconvolution
Discordant diagnosis test among couples
... But the test is not perfect !
Se = Sp = 99%
Seemingly logical conclusion :
COVID-19 is not very transmissible ?!
Update quarantine guidelines ?!
Let $$FD_2$$ be the observed proportion of +/+ couples, and $$FD_1$$ the observed proportion of -/+ couples.
The unbiased fraction of truly +/+ couples, $$P(+/+)$$, is then :
But : Almost 100% of the couples detected D+/D- are actually COVID-/COVID- !
979,000 couples are D-/D-
1,000 couples are D+/D+
20,000 couples are D+/D-
Reason = Imperfect test (Se = 99%, Sp = 99%)
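The couple counts above can be "un-mixed" with a small linear system. This is only a sketch, under the assumption that the two partners' test errors are independent; the helper `solve_3x3` and all variable names are illustrative. Each column of `confusion` gives the probability of each observed result for couples with 0, 1 or 2 truly infected partners.

```python
def solve_3x3(matrix, rhs):
    """Solve a 3x3 linear system by Gauss-Jordan elimination with partial pivoting."""
    n = 3
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        # Bring the largest remaining entry of this column onto the diagonal
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Eliminate this column from every other row
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


se, sp = 0.99, 0.99
# Rows: observed D-/D-, D+/D-, D+/D+
# Columns: couples with 0, 1 or 2 truly infected partners
confusion = [
    [sp * sp,           (1 - se) * sp,                 (1 - se) ** 2],
    [2 * sp * (1 - sp), se * sp + (1 - se) * (1 - sp), 2 * se * (1 - se)],
    [(1 - sp) ** 2,     se * (1 - sp),                 se * se],
]

total = 979_000 + 20_000 + 1_000
observed = [979_000 / total, 20_000 / total, 1_000 / total]

# Estimated true fractions of couples with 0, 1 and 2 infected partners
true_fractions = solve_3x3(confusion, observed)

# Share of the observed D+/D- couples in which nobody is actually infected
discordant_false = confusion[1][0] * true_fractions[0] / observed[1]
print(true_fractions, discordant_false)
```

Under these assumptions, the estimated fraction of truly discordant couples is tiny, and roughly 99% of the observed D+/D- couples come from couples where neither partner is infected.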
Ex:
Detecting DNA methylation in palindromic motifs
Methylated adenines can also be called with Se = 99% and Sp = 99%, like our patients before
Since there are two adenines in the couple, same logic applies
A web app to study DNA methylation in ciliates
As part of my PhD
Each individual DNA molecule is immobilized and replicated by a DNA polymerase
Nucleotide context (-3/+8nt)
Kinetic signatures
Depends on
DNA modifications
Adenine
6mA
ln(ipdRatio) ~ N(0,1)
log(ipdRatio)
1. Normalized by speed in unmethylated DNA :
$$ipdRatio = \frac{MeanIPD_{experiment}}{MeanIPD_{unmethylated\ control}}$$
2. Then : Usual parametric test
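A minimal sketch of the two steps, assuming that ln(ipdRatio) follows N(0,1) on unmethylated adenines and that the test is one-sided (a slower polymerase gives a ratio above 1). The function name and the IPD values are made up for illustration.

```python
import math
from statistics import NormalDist


def ipd_ratio_pvalue(mean_ipd_experiment, mean_ipd_control):
    """One-sided p-value: is the polymerase slower than on unmethylated DNA?"""
    # Step 1: normalize by the speed observed on unmethylated DNA
    ipd_ratio = mean_ipd_experiment / mean_ipd_control
    # Step 2: parametric test, ln(ipdRatio) assumed ~ N(0, 1) under H0
    z = math.log(ipd_ratio)
    return 1.0 - NormalDist(0.0, 1.0).cdf(z)


# Hypothetical mean IPDs (arbitrary units)
p_same_speed = ipd_ratio_pvalue(1.0, 1.0)   # no slowing at all
p_slow = ipd_ratio_pvalue(6.0, 1.0)         # polymerase six times slower
print(p_same_speed, p_slow)
```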
Example of deliverable : Web app
Example of deliverable :
An analysis package coded in Python (several thousand lines)
Example of deliverable :
4 years of extremely hard work for a harshly non-significant p-value
p ~ 0.99
Learn
Teach
Understand
collaborators
Communicate
Lead meetings,
working groups,
workshops
Be curious
Sometimes be diplomatic
Understand training & cultural differences
Share data, methods, results
Sometimes stand your ground
You need to
Almost 100% of bioinformaticians have a previous training in a different field :
Biology, medicine, chemistry, physics, public health, statistics, programming, data science...
>You< could become a bioinformatician !
2022: PhD degree at ENS Paris (bioinformatics)
2018: MSc Bioinformatics - Paris Diderot
University will teach you only what is extremely basic.
Don't forget to also learn by :
One last word
Thanks :)
I never passed my statistics exams
Not even
Text
Code
Stats
Hematology (HSCT)
ML
Ciliates
NGS
Structural biology
Python C++ Bash R
1) Paramecium tetraurelia
2) Transposable elements / IESs
3) DNA methylation
4) PacBio sequencing
5) Hemi-methylation of palindromic motifs (AT sites)
Candidates identified:
1) Grouped silencings by sequence homology
2) PacBio sequencing with short inserts | 6mA ++
Inserts up to 350bp
Sequencing : unique molecule & both strands
Polymerase slowing ~ methylated adenine
IESs are sometimes retained in the MAC
What is P(MIC|IES+) ?
(skipped)
Details that I spare you (but don't hesitate to ask) :
No scientific method is perfectly reliable
In our case:
> Expect False Positives and False Negatives
We can use 4 extreme scenarios :
... And see what holds regardless of the scenario we assume
If p is the number of positive detections among N tests:
p = FP + TP
So,
Which means
with fraction of 6mA
$$\pi =$$
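One standard way to go from p = FP + TP to a corrected fraction, shown here as a sketch and not necessarily the exact formula of the original slide: in expectation TP = Se·π·N and FP = (1 - Sp)·(1 - π)·N, so π = (p/N + Sp - 1) / (Se + Sp - 1). The Se and Sp values below are the ones quoted elsewhere in these slides; the counts and the function name are made up.

```python
def corrected_fraction(positives, n_tests, se, sp):
    """Estimate the true fraction pi from p = TP + FP, given Se and Sp."""
    observed = positives / n_tests
    pi = (observed + sp - 1.0) / (se + sp - 1.0)
    # A fraction cannot leave [0, 1]; sampling noise can push the estimate outside
    return min(1.0, max(0.0, pi))


# Made-up counts: 13,000 positive detections among 1,000,000 tested adenines
pi_hat = corrected_fraction(13_000, 1_000_000, se=0.92, sp=0.998)
print(pi_hat)
```

Re-running the same function with pessimistic and optimistic values of Se and Sp is a simple way to see which conclusions survive in every scenario.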
> 95% n6mA are located in AT sites
Let FD1 and FD2 be resp:
Then:
With
1) Constant pattern in the MIC
2) Transient in the newly forming MAC
Transient ?
Our hypothesis:
Analysis showed that:
~2.5% of n6mA in Paramecium (MIC) : Cummings 1974
> Actually way lower (if any)
> Unexpectedly, NM and MT proteins play a whole other role, in the MAC and not in the MIC
IGEM2014
MT and NM families :
In the MAC
Convertase activity DNMT1-like
But: Hemi-methylated AT sites kept after many mitoses
New hypothesis:
Symmetrical methylation of AT sites = Mitotic clock or maturity indicator for the cell to be allowed to enter meiosis
Red socks
Fun
Windows 10 update of May 2021
Unicellular eukaryote with 3 nuclei:
DNA ratio: 1 MIC for 200 MAC
The difference ?
MAC genome compared to MIC =
In the MIC:
--> How ?
Present in the MIC
Absent in the MAC
= Excised generation after generation
1) Constant pattern in the MIC
2) Transient in the newly forming MAC
Transient ?
Our hypothesis:
~2.5% of n6mA in Paramecium (MIC&MAC) : Cummings 1974
Candidates:
1) Grouped silencings by sequence homology
2) PacBio sequencing with short inserts | 6mA ++
Reads up to 80 kbp
1) Sequencing (unique molecule)
2) High fidelity consensus
3) DNA methylation analysis
1) Constant pattern in the MIC
2) Crosstalk in the newly forming MAC
Transient ?
Our hypothesis:
Preliminary analysis shows that:
~2.5% of n6mA in Paramecium (MIC&MAC) : Cummings 1974
~95% of n6mA is located in AT dinucleotides in the MAC
75% of the methylation in AT dinucleotides is actually symmetrical
Bulk of methylation:
Total = 1.2 to 1.5% of adenines
0.6% of the adenines outside AT sites:
Thanks :)
Guillaume DELEVOYE
3rd Year PhD student
Bioinformatics
Unicellular eukaryote with 3 nuclei:
DNA ratio: 1 MIC for 200 MAC
+ Difficult to purify the MIC DNA
45,000 unique sequences
Only 30% of IESs are small-ncRNA dependent (shown by DICER-like2-3 silencing)
6mA likely to be abundant in Paramecium:
Suspected 2.5% in the MAC of P. aurelia by Cummings et al. (1975)
Also documented 6mA in the MIC
1) Constant pattern in the MIC
2) Crosstalk in the newly forming MAC
Transient ?
Candidates:
1) Grouped silencings by sequence homology
2) PacBio sequencing with short inserts | 6mA ++
Reads up to 80 kbp
1) Sequencing (unique molecule)
2) High fidelity consensus
3) DNA methylation analysis
99% accuracy
Max
75% accuracy
Because our inserts are circular and short, we can build high-accuracy CCS despite a 15% error rate
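Why several passes over a short circular insert can beat a 15% per-pass error rate can be illustrated with a toy majority-vote model. This is only a sketch: it assumes errors are independent between passes and treats each base call as simply right or wrong, which real consensus callers do not do. The function name and the numbers of passes are hypothetical.

```python
from math import comb


def majority_vote_accuracy(n_passes, per_pass_error=0.15):
    """Probability that a strict majority of the passes read the base correctly."""
    p_ok = 1.0 - per_pass_error
    needed = n_passes // 2 + 1
    # Binomial tail: at least `needed` correct calls out of n_passes
    return sum(
        comb(n_passes, k) * p_ok ** k * per_pass_error ** (n_passes - k)
        for k in range(needed, n_passes + 1)
    )


for n in (1, 5, 15):
    print(n, majority_vote_accuracy(n))
```

In this toy model the consensus accuracy climbs quickly with the number of passes, which is the intuition behind sequencing short inserts many times.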
Deduced origin
MIC DNA
Alignment of consensus
TA
TA
TA
Deduced origin
MIC DNA
Alignment of consensus
Only a few remaining: ~ 10 to ~200 sequences
100% should carry a methylation pattern
"genome"
Total Paramecium
Total sequencing
Little bug to be corrected
30%
+ Applying the retention filter reduces the number of "MAC_IES" by approximately 50%
H0: ipdRatio of unmethylated adenines
H1: ipdRatio of 6mA
S: Threshold on the p-value
--> Specificity
--> Sensitivity
$$\alpha$$
$$1 - \beta $$
Adenine
6mA
ln(ipdRatio) ~ N(0,1)
log(ipdRatio)
A p-value is just the probability, under H0, that log(ipdRatio) falls at least as far into the tail as the observed value
E.coli:
I investigated PacBio's output on its GATC & EcoK sites vs. other sites
Separability increases with coverage, which is expected
...but are not ideally distributed
Ideal p-values
--> Allows magic !
PacBio's
$$\pi_{0}, \pi_{1}$$
All adenines' p-values [E.coli] coverage > 40X
PacBio's p-values:
For n6mA, PacBio produces:
They are PHRED-transformed p-values of two different statistical tests that rely on the mean of the IPDs
The scores (Qv) are PHRED-transformed p-values
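For reference, the PHRED transform is Qv = -10 · log10(p), so a score of 20 corresponds to p = 0.01. A minimal sketch of the conversion in both directions (the function names are illustrative, not part of any PacBio tool):

```python
import math


def pvalue_to_qv(pvalue):
    """PHRED-transform a p-value: Qv = -10 * log10(p)."""
    return -10.0 * math.log10(pvalue)


def qv_to_pvalue(qv):
    """Inverse transform: p = 10 ** (-Qv / 10)."""
    return 10.0 ** (-qv / 10.0)


print(pvalue_to_qv(0.01), qv_to_pvalue(20))
```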
Typical covscore plot
Modification score / coverage
Using a flat threshold on the modification score = Huge lack of power
From now on, "positive detection" = score > linear threshold
(only >25X considered)
How good (or bad) is our method ?
$$Se = P(D^+|M) \approx 92\%$$
$$Sp = P(D^-|NM) \approx 99.8\%$$
Above a sufficient coverage (~20X to ~30X), Se and Sp no longer depend on the coverage
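To get a feel for what Se ≈ 92% and Sp ≈ 99.8% mean in practice, Bayes' rule gives the share of positive detections that are real. A sketch, assuming a true 6mA fraction of 1.2% of adenines (the low end of the bulk estimate quoted in these slides); the function name is illustrative, and the 99% specificity in the second call is a hypothetical comparison point.

```python
def positive_predictive_value(se, sp, prevalence):
    """P(truly methylated | detected), from Bayes' rule."""
    true_pos = se * prevalence
    false_pos = (1.0 - sp) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Values from the slides: Se ~ 92%, Sp ~ 99.8%; prevalence assumed at 1.2%
ppv = positive_predictive_value(se=0.92, sp=0.998, prevalence=0.012)
# Hypothetical comparison: same test with a specificity of "only" 99%
ppv_lower_sp = positive_predictive_value(se=0.92, sp=0.99, prevalence=0.012)
print(ppv, ppv_lower_sp)
```

Because unmethylated adenines vastly outnumber methylated ones, a small loss of specificity costs far more than a comparable loss of sensitivity.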
If p is the number of positive detections among N tests:
p = FP + TP
So,
Which means
And:
~95% of the methylation is located in AT dinucleotides in the MAC
True in any condition
75% of the methylation in AT dinucleotides is actually symmetrical, independently of being in the MAC or the MIC
Kept in MIC and MAC
All conditions
Present in all samples
Never erased
Conclusion: We are either in scenario 1 or 3
(Sp largely underestimated)
Was expected but confirmed
Let FD1 and FD2 be resp:
Then:
With
Pure MIC sequences: Cannot yet be trusted (error in the pipeline)
Coverage >= 40X made us lose too much material; we should restart with >= 20X
HTVEG
MT2
--> Some molecules carry all the detections, in sym-A*T
In my opinion, these are very likely sequences coming from the MAC
At first glance, it looks the same in all samples
Sorry for the headaches !! Thanks for your time :)
Ciliates are great model organisms
1862 - Pasteur: Refutation of the spontaneous generation theory with infusoria
1937 - Sonneborn: Non-Mendelian inheritance of mating type in Paramecium
1985 - Elizabeth Blackburn & Carol Greider: Telomeres and telomerase in Tetrahymena (Nobel Prize 2009)
Eric Meyer, Sandra Duharcourt
(IBENS, I. Jacques Monod 2014)
Mating type in Paramecium is transmitted by maternal RNAs, not by DNA
modelPrediction is the IPD value predicted by the model for the nucleotide context at this position
globalIPD is the mean of all the IPD values of the read.
localIPD represents all IPDs that have been mapped at a given position in the genome, including those from other sequences
Conclusion on the capping
Laura Landweber 2020
Oxytricha trifallax
A outAT score 20 isQv20 (812 seq)
A outAT score20 idQv20 + Strong BH correction (176 seq)
ipdRatio out GATC before filtering BH vs after (qv20/idqv20)