Statistics in action

Insights from a bioinformatician in a public health department

Dec. 15th 2025

Guillaume DELEVOYE

Postdoc researcher, Karlstad University, Public health department

guillaume.delevoye@kau.se

Today's talk

Not a course

No grade, no evaluation

What's research like ?
What do public health researchers do ?
Perspectives after the bachelor's degree ?
Cool stuff in statistics ?
Discuss together

Objectives :

About me

Postdoc researcher, bioinformatician

Public health department, KAU

2.000 pairs : pregnant woman + child

Study the impact of endocrine disruptors on children's health

My focus/background/expertise :

Epigenetics
DNA methylation data

What's a bioinformatician

A bioinformatician is a Jack-of-all-trades

Not a medical doctor
Not a geneticisit
Not a software developer
Not a statistician
Not a chemist, or physicist
Not an AI scientist
Not a web designer
Not an expert in IT and network infrastructure

... But must have some knowledge, in all of the above

How can you be knowledgeable on so many things ?

> You can't

Bioinformatics, like stastistics, is a very humbling field
You feel dumb, every day, several times a day
The imposteur syndrome is real !
You quickly learn to be brutally honest about your (relative) ignorance

Being curious and self-taught is key

Despite being very inter-disciplinary, bioinformatics is still a lot about statistics

Like public health

Learning statistics is not fun

Spending years on learning how to read is not fun
- But reading books is fun !
Spending years on learning a music instrument is not fun
- Playing with your friends is fun !
Learning statistics for years is hard, and not fun
- Making discoveries with statistics is fun !

Note

I used to hate statistics
Always failed statistics as a student
I finished my bachelor without validating my statistics module

... Yet here I am.

Because, actually, it is fun !

""The best thing about being a statistician is that you get to play in everyone's backyard""

John Tukey (1915-2000)

Fast-Fourrier Transform
Box plots
Tukey's test
ANOVA
"bit"
"Software"

Other contributions :

Defense industry, telecommunications, sexual orientation research, Ozone layer damages, TV polls, analysis of elections, Education, Printer market, Pharmacy, ...

The better you are at statistics, the more fun you'll have in inter-disciplinary research

There is always something you know that your colleagues don't !
- And vice-versa.
- Including other bioinformaticians !
- It's a reciprocally enriching exchange !

You will always pale in comparison to an expert

Every bioinformatician is vastly self-taught :

But,

... And that's kind of OK actually

Statistics in action

Some of my past projects

Example n°1

Developping a machine-learning web app to estimate the donor's compatibility in a blood marrow transplant

Very advanced medical topic
Can require very advanced genetic analysis
Advanced machine-learning methods

There is so much to learn everyday !

Ex : Negative ages ?

Example of deliverable : Predict mortality at 5 year using machine-learning

Here : Random forest

Example of deliverable : Interpretable decision trees

Often, bioinformaticians have other missions as well
- Develop a web app
- Deploy it / Host it
- Setup the DNS, HTTPS, etc
- Maintain the server online
- Handle its cybersecurity
- Making sure the encryption is correct
- Fix the bugs
- Update drivers correctly
- Write documentations for the medical staff
- etc.

The IT-side of the job

Example : doi: 10.1038/bmt.2016.162.

The domain-knowledge is key !

But ! It is impossible to start working on bone-marrow transplant without basic knowledge in :

Hematology
Immunology
Genetics
Oncology
...

Medical doctors often have very limited time, and their meeting are often interrupted by emergencies
- You must come prepared to the meeting !
- The more you know on the topic, the better the interaction with the medical staff

A priori it's just statistics like any other

Example n°2

Looking for DNA-methylation markers of Endocrine Disruptors in children's blood

Example of deliverable : Quality control, outlier identification

Outlier detection

Ex : Identifying intersex participant in SELMA data

Domain-knowledge is key

Example of deliverable : Visualizing exposure to multiple chemicals, in 2D

Example of deliverable :

DBScan clustering, after projection in 2D of all the chemical exposures (SELMA data)

Positions in the human genome where DNA methylation of blood cells is affected by phtalates

False Discovery Rate control ...

Deconvolution of gaussian data

After deconvolution

Example n°3

Discordant diagnosis test among couples

Problematic

Take a COVID-19 test
- Sensitivity: Se = 99%
- Specificity: Sp = 99%
Test 1.000.000 couples
- 979.000 couples are D-/D-
- 1.000 couples are D+/D+
- 20.000 couples are D+/D-

... But the test is not perfect !

Se = Sp = 99%

Seamingly logical conclusion :

COVID-19 is not very transmissible ?!

Update quarantine guidelines ?!

Let's do the maths ! (1/2)

$$P(+/+)$$

Let

$$FD_2$$

is the observed proportion of +/+

Where

$$FD_1$$

is the observed proportion of -/+

The unbiased fraction of truly +/+ couples is then :

But : Almost 100% of the couples detected D+/D- are actually COVID-/COVID- !

979.000 couples are D-/D-

1.000 couples are D+/D+

20.000 couples are D+/D-

Let's do the maths ! (2/2)

Reason = Imperfect test (Se = 99%, Sp = 99%)

Applications in other fields

Ex:

Detecting DNA methylation in palindromic motifs

Methylated adenines can also be called with Se = 99% and Sp = 99%, like our patients before

Since there are two adenines in the couple, same logic applies

Example n°4

A web app to study DNA methylation in ciliates

As part of my PhD

Each individual DNA molecule is immobilized and replicated by a DNA polymerase

Detecting DNA methylation

Nucleotide context (-3/+8nt)

Kinetic signatures

Depends on

DNA modifications

Detecting DNA methylation

Adenine

6mA

ln(ipdRatio) ~ N(0,1)

log(ipdRatio)

1. Normalized by speed in unmethylated DNA :

2. Then : Usual parametric test

$$ipdRatio= \frac{MeanIPD_{experience}}{unmethylated\ control}$$

Example of deliverable : Web app

Exemple of deliverable :

An analysis package coded in python (several thousands of liness)

Exemple of deliverable :

4 years of extremely hard work for a harshly non-significant p-value

We can never exclude the hypothesis that 6mA comes from the MAC

p ~ 0.99

Conclusion

In a nutshell :

Learn

Teach

Understand

collaborators

Communicate

Animate meeting,

work groups,

workshops

Be curious

Sometimes be diplomat

Understand training & cultural differences

Share data, methods, results

Sometimes stand your ground

Bioinformaticians and statisticians often work in very interactive groups
Social skills are extremely valued

You need to

How does one become a bioinformatician ?

> You don't

Almost 100% of bioinformaticians have a previous training in a different field :

Biology, medicine, chemistry, physics, public health, statistics, programmer, data scientist...

>You< could become a bioinformatician !

Example of background :

2017: Pharmacy (Lille)

2022: PhD degree at ENS Paris (bioinformatics)

2018: MSc Bioinformatics - Paris Diderot

2023 : Karlstad University, Public Health

About competition

Academic and medical research can be ridiculously competitive
Pressure to publish is somewhat lower for bioinformaticians
- You're much needed anyway
- Your skills are rare / niche
- Permanent positions are atteignable
- This tends to change

You can really play in many people's backyard !

University will teach you only what is extremely basic.
Don't forget to learn also by :

Reading
Watching videos
Making mistakes
Thinking about it in the shower
Talking to people
Inventing new weird methods
Having fun

One last word

Thanks :)

Important note

I never passed my statistics exams

Not even

Part II / Statistics in action

Text

Current position

What

Code

Stats

Hematology (HSCT)

Ciliates

NGS

Structural biology

Python
C++
Bash
R

1) Paramecium tetraurelia

3) DNA methylation

4) PacBio sequencing

2) Transposable elements / IESs

5) hemi-methylation of

palindromic motifs

(AT sites)

Methylase candidates in Paramecium tetraurelia

Candidates identified:

NM4, NM9, NM10
Another family: MT1a, MT1b, MT2

2) PacBio sequencing with short inserts | 6mA ++

1) Grouped silencings by sequence homology

PacBio SMRT (short inserts)

Inserts up to 350bp

Sequencing : unique molecule & both strands

Polymerase slowing ~ methylated adenine

The random sampling strategy

Purify the MIC: Impossible
Sequencing whole MIC from whole DNA : Impossible

Solution: Random sampling
- 1 among 200 molecules
- Not all with cary an IES at all
- ~20 to 100 IES only per experiment

Problem n°2

IESs are sometimes retained in the MAC

What is P(MIC|IES+) ?

Other analysis challenges

(skipped)

Details that I spare you (but don't hesitate to ask) :

Random sampling of the MIC
Sorting MIC and MAC
Sorting IES specific and MIC non-IES sequences
Filter IES retention
Estimate Se and Sp
Retro-ingineering PacBio's formats
Recoding parts of the softwares for single molecules
etc.

Detecting n6mA

Imperfect detections

No scientific method is perfectly reliable

In our case:

Sensitivity = P(D | M) > 92%
Specificity = P(ND | NM) > 99.8%

> Expect False Positives and False Negatives

Imperfect detections (2/2)

Sensitivity = P(D | M) > 92%
Specificity = P(ND | NM) > 99.8%

We can use 4 extreme scenarii :

... And see what is consistent no matter the scenario we place ourselves into

Finding unbiased estimators for and FDR

If p number of positive detections among N tests:

p = FP + TP

$$\pi$$

So,

Which means

with fraction of 6mA

$$\pi =$$

Debiased levels of m6A in the MAC

> 95% n6mA are located in AT sites

DNA n6-mA around/in the IESs

Results

Methodological development to correct hemi-methylation detection

Let FD1 and FD2 be resp:

Fraction of AT sites detected hemi-methylated
Fraction of AT sites detected symmetrically methylated

PZ0, PZ1, PZ2: unbiased estimators of non, hémi, symetrically methylated AT sites

Then:

With

Debiased Hemi-methylation in MAC

2) Transcient in the new forming MAC

1) Constant pattern in the MIC

Transcient ?

Our hypothesis:

Analysis showed that:

~2.5% of n6mA in Paramecium (MIC) : Cummings 1974

?

> Actually way lower (if any)

> Unexpectedly, NM and MT proteins play a whole another role in the MAC,

not the MIC

Conclusion

IGEM2014

MT and NM families :

In the MAC

Convertase activity DNMT1-like

But: Hemi-methylated AT sites kept after many mitosis

New hypothesis:

Symmetrical methylation of AT sites = Mitotic clock or maturity indicator for the cell to be allowed to go enter meiosis

A bit more about me...

I don't like

Red socks

Fun

Windows 10 update of May 2021

Some keywords for my PhD

1) Paramecium tetraurelia

3) DNA methylation

4) PacBio sequencing

2) Transposable elements / IESs

5) hemi-methylation of

palindromic motifs

(AT sites)

P. tetraurelia

Unicellular eucaryote with 3 nuclei:

1xMAC nucleus (up to 800n)
2xMIC nuclei (2n)

DNA ratio: 1 MIC for 200 MAC

The difference ?

MAC genome compared to MIC =

Amplified+++
Free from TEs
Transcriptionnally active
Not transmitted to progeny

Genome invaders in Paramecium tetraurelia

In the MIC:

45.000 unique sequences
- Small (<27bp), Non-coding
- Lots present in the CDS
- Remnants of TE ?
- 100% TA-bounded

--> How ?

Present in the MIC

Absent in the MAC

= Excised generation after generation

2) Transcient in the new forming MAC

1) Constant pattern in the MIC

Transcient ?

Recognition of IESs: Our hypothesis

Our hypothesis:

~2.5% of n6mA in Paramecium (MIC&MAC) : Cummings 1974

Methylase candidates

Candidates:

NM4, NM9, NM10
Another family: MT1a, MT1b, MT2

2) PacBio sequencing with short inserts | 6mA ++

1) Grouped silencings by sequence homology

PacBio SMRT principle

Reads up to 80 kbp

1) Sequencing (unique molecule)

2) High fidelity consensus

3) DNA methylation analysis

What are the results ?

2) Crosstalks in the new forming MAC

1) Constant pattern in the MIC

Transcient ?

Our hypothesis:

Prelimary analysis shows that:

~2.5% of n6mA in Paramecium (MIC&MAC) : Cummings 1974

Details invegetative MAC

~95% of n6mA locates in AT dinucleotides in the MAC
75% of the methylation in an AT dinucleotide is actually symetrically modified

Bulk of methylation:

Total = 1.2 to 1.5% of adenines

0.6% of the adenines outside AT sites:

Now a bit more details...

Imperfect detections

No scientific method is perfectly reliable

We don't care about Se and Sp
We care about the fact that eventual mis-estimations of them doesn't really change anything

In our case:

Sensitivity = P(D | M) > 92%
Specificity = P(ND | NM) > 99.8%

Finding unbiased estimators for and FDR

If p number of positive detections among N tests:

p = FP + TP

$$\pi$$

So,

Which means

with fraction of 6mA

$$\pi =$$

Debiased levels of m6A in the MAC_TA inserts

Methodological development to correct hemi-methylation detection

Let FD1 and FD2 be resp:

Fraction of AT sites detected hemi-methylated
Fraction of AT sites detected symmetrically methylated

PZ0, PZ1, PZ2: unbiased estimators of non, hémi, symetrically methylated AT sites

Then:

With

Debiased Hemi-methylation in MAC_TA

Conclusion

In the MIC:
- Probably no m6A after all
- Hard to imagine a role in TE/IES excision
In the MAC
- Our 6 genes are implied in the bulk of the MAC m6A
  - Sym-methylated -> Hemi-methylated
- MT and NM families: Functional analogs of DNMT1 ?
  - Not really: Not kept after replication
- Lots of questions raised by the MAC methylation:
  - TSS ?
  - IES excision Junctions ?
  - Nucleosome positionning ?
One important phenotype:
- NM4-9-10 somehow makes the cell unable to go into autogamy

Thanks :)

Role of DNA-6mA in Paramecium tetraurelia

Guillaume DELEVOYE

3rd Year PhD student

Bioinformatics

P. tetraurelia: Genomic architecture

Unicellular eucaryote with 3 nuclei:

2xMIC nuclei (2n)
- Germline nucleus
- Contains: TE & IES
  - No transcription outside meiosis
1xMAC nucleus (up to 800n)
- Somatic nucleus
- Amplified and "fixed" version of the MIC
- Free from TE and IES
- Transcriptionnally active

DNA ratio: 1 MIC for 200 MAC

+ DIfficult to purify the MIC DNA

Profiling the IESs

Non-coding
Excised after sexual processes
Remnant of TE ?
Most recent IES are very short (< 27bp), and are the majority of IESs

45.000 Unique sequences

How are IESs recognized ?

(1/2)

100% TA-Bounded
Weak consensus TAYAG
- Degenerated TC1-Mariner TE insertion site
- Not sufficient
Periodic size distribution

30% of IESs only are small-ncRNA dependant (shown by DICER-like2-3 silencing)

What about the majority remaining ?

How are IES recognized ? (2/2)

The sc-RNA pathway

The DNA methylation hypothesis

6mA likely to be abundant in Paramecium:

Suspected 2.5% in the MAC of P. aurelia by Cummings et Al (1975)
- Also documented 6mA in the MIC
Detected methylation by SMRT in the MAC in our lab (unpublished data)
Detection by SMRT in the MAC by Sandra Duharcourt's lab (unpublished)
Detection in Oxytrichia by L. Landweber et Al (2019)
no 5mC, 4mC in the MAC a priori

2) Crosstalks in the new forming MAC

1) Constant pattern in the MIC

Transcient ?

Methylase candidates

Candidates:

NM4, NM9, NM10
Another family: MT1a, MT1b, MT2

2) PacBio sequencing with short inserts | 6mA ++

1) Grouped silencings by sequence homology

PacBio

TL;DR overview

PacBio SMRT principle

Reads up to 80 kbp

1) Sequencing (unique molecule)

2) High fidelity consensus

3) DNA methylation analysis

PacBio SMRT principle (2)

Strategy 1: Long inserts (long reads)
- Ideal for assembly of long repeated sequences
- Poor resolution for DNA methylation analysis
Strategy 2 : Short inserts (long reads)
- Much higher resolution for DNA methylation analysis

Step 1: Consensus

99% accuracy

Max

75% accuracy

Because our inserts are circular and shorts, we can make CCS of high accuracy despite a 15% error-rate

Step 2: Sorting

Deduced origin

MIC DNA

Alignment of consensus

Step 3 : DNA methylation analysis

Single-molecule
Single-nucleotide resolution
Independant yet pairable analysis on both strands

Some more details:

the random sampling approach

Deduced origin

MIC DNA

Alignment of consensus

Only a few remaining: ~ 10 to ~200 sequences

100% should carry a methylation pattern

1 out of 200 comes from the MIC
1/6 of MIC inserts will carry an IES
For 50% of IESs, we cannot be sure whether it come from the MIC or from the MAC
30% of the remnants are scanRNA dependant

Final categories

MDS
- = "MAC" for 99% of sequences
MAC_IES
- "true" MAC_IES (never seen retained)
- other MAC_IES (sometimes retained)
MIC
- Other MIC specific (TE, repeated sequences)
MAC (TA Junction)
- Overlap a TA junction of excision
rDNA
mtDNA
Other:
- Contaminants
- Alternative excision boundaries ("LOWID")
- low identity consensus ("Trash")

"genome"

Total Paramecium

Total sequencing

Little bug to be corrected

30%

MDS and MAC_TA_Junction represent a vast majority

MIC specific and IES are as rare as expected

+ Applying the retention filter divides the number of "MAC_IES" approximately by 50%

Detecting m6A in details

Detecting m6A optimally

H0: ipdRatio of umethylated-Adenines

H1: ipdRatio of 6mA

S: Threshold on the pvalue

--> Specificity

--> Sensitivity

$$\alpha$$

$$1 - \beta $$

Adenine

6mA

ln(ipdRatio) ~ N(0,1)

log(ipdRatio)

A pvalue is just the probability that log(ipdRatio) in the tail of H0

Using E.coli as ground truth

E.coli:

Feeds paramecium: contaminants+++
Nearly 100% symetrically-methylated with m6A
- GATC
- EcoK
- Few others
  - Depends a lot of the strain
Outside of GATC and EcoK: very low levels of m6A

I investigated the PacBio's output on it's GATC & EcoK VS other sites

How log(ipdRatios) look in E.coli

Separability raises with coverage, which is expected

PacBio pvalues do have some biological meaning

PacBio pvalues do have a meaning...

...but are not ideally distributed

Ideal pvalues

--> Allows magic !

PacBio's

Allow estimation of
Allows optimal, adaptative FDR control

$$\pi_{0}, \pi_{1}$$

Just don't
Obvious point n°1:
- log(ipdRatio < 1) -> -inf

Normality assumptions under H0 are broken under high coverages

The higher the coverage, the worse (here, >40x)
Will cause bell-shaped pvalues under null
- Hidden phenomena on the previous curve

Normality assumption matters

All Adenines' pvalues [E.coli] coverage > 40X

Long story short

PacBio's pvalues:

Are biologically relevant
- We can build a reliable ad-hoc system with them
Are produced by a linear combinations of > 150 different coverages
- Which forbids the usual statistical treatments
Are somehow broken on a coverage-dependant manner
- Which forbids a simple fix for point 2
- We can't use the classical statistical treatments directly on pvalues

Other PBio's scores for m6A

For n6mA, PacBio produces:

A modification score --> Slowing of the polymerase
- pvalue against H0 only (the one we presented earlier)
An identification score --> Kinetic signature of a modification
- loglikelihood between H0 and H1 (H1 = Other modifications or secondary peaks)

They are PHRED-transformed p-values of two different statistical tests, that rely on the mean of the IPDs

PHRED scores

The scores (Qv) are PHRED-transformed p-values

Typical covscore plot

Modification score / coverage

Using flat threshold on modification score = Hudge lack of power

Solution : A coverage-dependant threshold on the scores

From now on

"positive detection"

score > linear thershold

(only >25X considered)

Benchmarking

How good (or bad) is our method ?

$$Se = P(D^+|M) ~ 92\%$$

$$Sp = P(D^-|NM) ~ 99.8\%$$

Starting from sufficient coverages (~20X to ~30X), Se and Sp don't depend on the coverage anymore

The 4 scenarii for Se and Sp

We don't care about Se and Sp
We care about the fact that eventual mis-estimations of them doesn't really change anything

Finding unbiased estimators for and FDR

If p number of positive detections among N tests:

p = FP + TP

$$\pi$$

So,

Which means

And:

What it gives in Paramecium

Debiased levels of m6A in the MAC_TA inserts

Details in MAC HTVEG

~95% of the methylation locates in AT dinucleotides in the MAC
- True in any condition
75% of the methylation in an AT dinucleotide is actually symetrically modified, independantly from being in the MAC or the MIC

Kept in MIC and MAC

All conditions

MAC_TA outside of AT sites

Detections outside AT sites are likely to be at least partly true positives

Present in all samples

Never erased

Conclusion: We are either in scenario 1 or 3

(Sp largely underestimated)

Was expected but confirmed

Methodological development to correct hemi-methylation detection (1/2)

Let FD1 and FD2 be resp:

Fraction of AT sites detected hemi-methylated
Fraction of AT sites detected symmetrically methylated

PZ0, PZ1, PZ2: unbiased estimators of non, hémi, symetrically methylated AT sites

Then:

With

What it gives in Paramecium

Debiased Hemi-methylation in MAC_TA

n6mA in the MIC (preliminary)

Pure MIC sequences: Cannot yet be trusted (error in the pipeline)

Coverage >= 40X made us lose too many materials, should restart with >=20X

n6mA in the MAC_IES rarely retained ("TRUE MAC IES")

HTVEG

MT2

--> Some molecules carry all the detections, in sym-A*T

Very likely to be sequences comming from the MAC to my opinion

At first look, seems like the same in all samples

Conclusion

Lots of sweat spent on methods : Now we start the cool things
In the MAC
- Quantification in the MAC is well-characterized in all the samples
- Our 6 genes are implied in the bulk of the MAC m6A
- MT and NM families: Functional analogs of DNMT1 ?
- Lots of questions raised by the MAC methylation: TSS, IES Junctions, nucleosome positionning...
In the MAC IES
- Very preliminary analysis between TRUE mac IES and other MAC IES tends to show that all detections could come from accidental retention in the MAC, and the MIC could actually carry 0% m6A or very lower levels than 1%. Really to premature to be sure
- HT2 and HT6 are still potentially full of surprises
TE in the MIC: No idea yet
It's just a question of time now ! Which we asked (prolongation LABEX memolife)

Sorry for the headaches !! Thanks for your time :)

Ciliates are great model organisms

1862 - Pasteur: Refutation of the spontaneous generation theory with infusoria

1937 - Sonneborn: Non-mendelian inheritance of sexual type in paramecium

Elizabeth blackburn & Carol Greider: 1985 - Telomeres and telomerases in Tetrahymena (Nobel Prize 2009)

Eric Meyer, Sandra Duharcourt

(IBENS, I. Jacques Monod 2014)

Sexual type in paramecium are transmitted by maternal RNAs, not by DNA

2.1 Retroingineering

The capping of IPDs

modelPrediction is the predicted IPD value by the model in a given context of nucleotides at this position
globalIPD is the mean of all the IPD values of the read.
localIPD represents all IPDs that have been mapped at a given position in the genome, including those from other sequences

Conclusion on the capping

Isn't coded as advertized by PacBio
The way it's implemented for AggSN is problematic and doesn't really make sense
Paradoxally, it should be more relevant for our approach than for the default one
We expect no methylation to be undetected due to the capping