Single-molecule DNA methylation analysis by SMRT sequencing of short inserts

DELEVOYE Guillaume - 1st-year PhD student

Supervisors: Eric Meyer, Mathieu Bahin

ANR Meeting 08/04/19

 

Modified bases: A large landscape of possibilities

Thymine = 5-methyluracil?

<-- 3 known frequent methylations in DNA that we can detect with PacBio

25X

25X

250X

PacBio sequencing

99% accuracy

Max

75% accuracy

Unusual Slowing: Modification Score (Qv)

Kinetic signature: Identification score (IdQv)

A trained model (ML) detects suspicious slowdowns of the polymerase (as a function of the -3/+8 nt context) --> IPDs are captured

~ PHRED scores

Theory: Experimental strategy

Classical PacBio Approach: Higher coverage by overlapping the holes

Our approach: shorter inserts, true single-hole analysis, many more passes

We ~always have either 0X or >>> 25X

Majority of MDS come from the MAC

Step 1: Sorting the sequences

Some technical issues occurred along the way

Murphy's Law statistically strikes often when you try 500,000 times

Create and align the consensus

0 - Quality filter (Z-score)

 

1 - Create the consensus (=CCS)

 

2 - Map the CCS on MAC / MIC / MAC+IES (BLASR)

--> Only the best alignment reported (forced)

 

3 - Filter at >99% identity on at least one genome

 

4 - Compare the mapping

Reminder: CCS are expected to be somewhat around 99% accuracy
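
Steps 3 and 4 can be sketched as follows. This is a minimal illustration, not the pipeline actually used: the function name `classify_ccs`, the input layout (one best-alignment identity per reference) and the handling of ties are assumptions; only the >99% identity cut-off comes from the slide.

```python
# Hypothetical sketch of steps 3-4: keep a CCS only if it reaches >99% identity
# on at least one reference, then compare its best alignments across references.
IDENTITY_THRESHOLD = 0.99  # from the slide; everything else is illustrative


def classify_ccs(identities):
    """Classify one CCS from its best-alignment identities.

    identities: dict mapping a reference name ('MAC', 'MIC', 'MAC+IES')
    to the identity (between 0 and 1) of the single best alignment on it.
    """
    # Step 3: references on which the CCS passes the identity filter
    passing = {ref: ident for ref, ident in identities.items()
               if ident > IDENTITY_THRESHOLD}
    if not passing:
        return "discarded"
    # Step 4: compare the mappings and keep the best-scoring reference(s)
    best = max(passing.values())
    winners = sorted(ref for ref, ident in passing.items() if ident == best)
    if len(winners) == 1:
        return winners[0]
    # Assumption: ties are reported rather than resolved
    return "ambiguous:" + "/".join(winners)


print(classify_ccs({"MAC": 0.998, "MIC": 0.91, "MAC+IES": 0.95}))   # MAC
print(classify_ccs({"MAC": 0.97, "MIC": 0.98, "MAC+IES": 0.985}))   # discarded
print(classify_ccs({"MAC": 0.995, "MIC": 0.995, "MAC+IES": 0.90}))  # ambiguous:MAC/MIC
```

Because CCS are only expected to be around 99% accurate, a filter this close to the read accuracy will discard some genuine reads; the threshold is kept as a named constant so it can be tuned.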

Differential mapping

Sequences that come from MIC

N changes

"=" becomes "I"

Differential mapping

Sequences that come from MAC
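
The comparison of alignments can be illustrated on CIGAR strings: a read that carries an IES aligns with matches ("=") on MAC+IES, while the same bases appear as an insertion ("I") on the MAC reference. A minimal sketch, assuming extended CIGAR strings are available for each alignment; the helper names and the example CIGARs are invented for illustration.

```python
import re
from collections import Counter

# One CIGAR element = a length followed by an operation letter
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def cigar_counts(cigar):
    """Return the total number of bases per CIGAR operation."""
    counts = Counter()
    for length, op in CIGAR_RE.findall(cigar):
        counts[op] += int(length)
    return counts


def extra_insertions_on_mac(cigar_mac, cigar_mac_ies):
    """Bases that are insertions on MAC but not on MAC+IES."""
    return cigar_counts(cigar_mac)["I"] - cigar_counts(cigar_mac_ies)["I"]


# Made-up example: a 300 nt read containing a 28 nt IES
mac = "120=28I152="   # the IES shows up as an insertion on MAC
mac_ies = "300="      # the same read matches end to end on MAC+IES

print(extra_insertions_on_mac(mac, mac_ies))  # 28
```

A read coming from the MAC would show no such difference between the two references.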

Step 2: Removing the pauses

A lazy polymerase

The polymerase makes random pauses that are not linked to DNA modification

What's in PacBio's statistical black box

How PacBio handles it

--->> Values are "capped":

cappingValue = max(99th percentile of the chunk, 4 * model, 75th local percentile)

  • We need the mean
    • p-value, PHRED... rely on the mean
    • Very sensitive to outliers
    • Pauses must be removed!
  • Some nucleotide contexts are naturally slower than others
    • Context is important
    • Pauses should represent far less than 1% of all IPD values
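
The capping rule can be sketched in a few lines. This is a sketch under stated assumptions, not PacBio's code: the function names are invented, "model" is taken to be the model-predicted IPD for the sequence context, and values above the cap are clipped to the cap (discarding them instead would be a one-line change).

```python
def percentile(values, q):
    """Linear-interpolation percentile (q between 0 and 100) of a non-empty list."""
    data = sorted(values)
    pos = (len(data) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(data) - 1)
    return data[lo] + (data[hi] - data[lo]) * (pos - lo)


def capping_value(chunk_ipds, model_ipd, local_ipds):
    """max(99th percentile of the chunk, 4 * model, 75th local percentile)."""
    return max(percentile(chunk_ipds, 99),
               4.0 * model_ipd,
               percentile(local_ipds, 75))


def capped_mean(local_ipds, cap):
    """Mean IPD at one position after clipping values above the cap (assumption)."""
    capped = [min(x, cap) for x in local_ipds]
    return sum(capped) / len(capped)


# Made-up numbers: one position with a single long pause (12.0)
chunk_ipds = [0.1 * i for i in range(1, 31)]
local_ipds = [0.4, 0.5, 0.6, 0.5, 0.45, 0.55, 12.0]
model_ipd = 0.5

cap = capping_value(chunk_ipds, model_ipd, local_ipds)
print("cap:", round(cap, 3))
print("raw mean:", round(sum(local_ipds) / len(local_ipds), 3))
print("capped mean:", round(capped_mean(local_ipds, cap), 3))
```

A single pause more than doubles the raw mean in this toy example, which is why a statistic built on the mean needs the capping step.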

For every position in the reference:

Should the capping be the same when using a single-hole approach?

cappingValue = max(99th percentile of the chunk, 4 * model, 75th local percentile)

It doesn't change much

But strictly speaking, it is not the same

Step 3

Analyse the methylation

Which score thresholds should we use?

The score threshold

The modification/identification scores are presented by PacBio as PHRED scores

In our experiments, we would have ~30% modified bases if this were true

--> Modification scores are NOT PHRED scores (at least in our case)
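
As a reminder of what a PHRED interpretation would imply: a score Q corresponds to an error probability p = 10^(-Q/10). The sketch below only shows this standard conversion and the false-call count it predicts at a given threshold; the function names and the example numbers are illustrative and are not taken from the experiments.

```python
import math


def phred_to_pvalue(q):
    """Error probability implied by a PHRED score Q."""
    return 10 ** (-q / 10.0)


def pvalue_to_phred(p):
    """PHRED score corresponding to an error probability p."""
    return -10.0 * math.log10(p)


def expected_false_calls(n_unmodified_bases, q_threshold):
    """Upper bound on the expected number of unmodified bases scoring
    at or above the threshold, if scores really behaved as PHRED scores."""
    return n_unmodified_bases * phred_to_pvalue(q_threshold)


print(phred_to_pvalue(20))                   # about 0.01
print(pvalue_to_phred(0.001))                # about 30
print(expected_false_calls(1_000_000, 20))   # about 10,000
```

Comparing the fraction of bases observed above a threshold with what this conversion allows is one way to check whether the scores are calibrated as PHRED scores.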

Modification scores overall

 

Score distributions: The case of the adenines

 

Can we apply the GMM to other samples?
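
For readers unfamiliar with the approach, here is a minimal sketch of fitting a two-component Gaussian mixture (GMM) to a score distribution with EM, using only the standard library. Everything in it is an assumption made for illustration: the number of components, the initialisation, the function names and the synthetic scores. It is not the model fitted on the real data.

```python
import math
import random


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_gmm2(scores, n_iter=200):
    """Fit a two-component 1D Gaussian mixture by EM.

    Returns (weights, means, sds) with components sorted by increasing mean.
    """
    data = sorted(scores)
    n = len(data)
    half = n // 2
    # Crude initialisation: lower and upper halves of the sorted scores
    means = [sum(data[:half]) / half, sum(data[half:]) / (n - half)]
    overall_mean = sum(data) / n
    overall_sd = math.sqrt(sum((x - overall_mean) ** 2 for x in data) / n)
    sds = [overall_sd, overall_sd]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each score
        resp = []
        for x in data:
            p = [weights[k] * normal_pdf(x, means[k], sds[k]) for k in (0, 1)]
            total = (p[0] + p[1]) or 1e-300
            resp.append((p[0] / total, p[1] / total))
        # M-step: update weights, means and standard deviations
        for k in (0, 1):
            nk = sum(r[k] for r in resp) or 1e-300
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            sds[k] = max(math.sqrt(var), 1e-3)  # guard against collapse
    order = sorted((0, 1), key=lambda k: means[k])
    return ([weights[k] for k in order],
            [means[k] for k in order],
            [sds[k] for k in order])


def posterior_high(x, weights, means, sds):
    """Posterior probability that score x belongs to the high-mean component."""
    p = [weights[k] * normal_pdf(x, means[k], sds[k]) for k in (0, 1)]
    total = (p[0] + p[1]) or 1e-300
    return p[1] / total


# Synthetic scores: 90% low-score component, 10% high-score component
random.seed(0)
scores = ([random.gauss(5, 3) for _ in range(900)] +
          [random.gauss(40, 8) for _ in range(100)])

weights, means, sds = fit_gmm2(scores)
print("weights:", [round(w, 2) for w in weights])
print("means:", [round(m, 1) for m in means])
print("P(high | score=40):", round(posterior_high(40, weights, means, sds), 3))
```

Applying a model fitted on one sample to another amounts to reusing `weights`, `means` and `sds` in `posterior_high`; whether that is valid depends on the score distributions being comparable between samples.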

Score distributions: The case of cytosines

Score distributions: The case of G and T

Available Data

 

HT2

HT6

HTVEG

 

MAB

MT1A-1B

MT1A-1B-2

MT2

NM9_10

NM4_9_10

WT

 

Silenced

Sorting stats

Using only the GMM

Adenines

All A in GMM

  • Same MIC/MAC in AT
  • No difference between experiments
  • ~95% methylated symmetrically

 

All A in GMM, outside the AT

HTVEG

~ Same for every silencing experiment

 

The lack of sequences (~50 vs. ~2,000) does not really allow a comparison between MIC and MAC

Using IdQv20 + GMM

Adenines

GMM + IdQv20

 

  • Raises the % located in AT
  • Methylated fraction goes down to 0.9-1%
  • NM4_9_10 goes down to 50% in the MAC
  • Logos don't change significantly

Cytosines

Qv20/IdQv20

Logo from HTVEG MAC

Identical everywhere

Conclusion

Number of MAC_IES sequences: will it be enough?

Capping ok

In silico control --> Experimental

Coverage threshold for single-hole analysis?

The threshold is still more or less an open question

 

m6A decreases in some silencing experiments, but which one disappears?

2.5% m6A in MAC <-> Bad calibration? Missing MDS? Lack of sensitivity?

NM4_9_10 --> 10%? Recalibration by optimization?

 

m4C signal: HT2 = HT6 < HTVEG

 

Sorting stats 2

 

(ANR) Single-molecule DNA methylation analysis by SMRT sequencing of short inserts

By biocompibens
