End-to-End Deep Learning Model for Base Calling of MinION Nanopore Reads

Neven Miculinić

Associate Prof. Mile Šikić, PhD

 

The University of Zagreb,

Faculty of Electrical Engineering and Computing

MinION

Raw signal

Base pairs

Solution?

  • Rule engine
  • HMM
  • Deep learning

Existing Solution

  • (defunct) Metrichor
  • Albacore
  • Guppy
  • Chiron
  • ...

MinCall

  • End-to-end
  • GPU accelerated
  • Deep learning model
  • Uses well-known CNNs with CTC loss and beam search
  • Adds an autoencoder loss to speed up training

CTC loss

P(\pi | X) = \prod_{t=1}^{m} s_t(\pi_t)
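As a small illustration of the path probability above, the sketch below multiplies the per-time-step softmax outputs s_t along one alignment path. The symbol order (A, C, G, T, blank) mirrors the BasePair enum used for the dataset; the toy probabilities and the name `path_probability` are made up for this example.

```python
import math

# Toy softmax outputs s_t for m = 3 time steps over (A, C, G, T, blank).
# The numbers are illustrative, not taken from a trained model.
ALPHABET = "ACGT-"  # '-' stands for the CTC blank
S = [
    [0.6, 0.1, 0.1, 0.1, 0.1],
    [0.5, 0.1, 0.1, 0.1, 0.2],
    [0.1, 0.7, 0.1, 0.05, 0.05],
]

def path_probability(path, s=S):
    """P(pi | X) = prod_t s_t(pi_t) for one alignment path pi."""
    assert len(path) == len(s), "a path has exactly one symbol per time step"
    return math.prod(s[t][ALPHABET.index(ch)] for t, ch in enumerate(path))

# Path A, blank, C: 0.6 * 0.2 * 0.7
print(path_probability("A-C"))
```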

Decoding

P(Y | X) = \sum_{\pi \in \mathrm{decode}^{-1}(Y)} P(\pi | X)
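To make the sum over paths concrete, here is a brute-force sketch that enumerates every path of length m, keeps those that decode to Y, and adds up their probabilities. It assumes the usual CTC decode rule (collapse repeated symbols, then drop blanks); the toy probabilities and function names are invented for this example. Practical implementations compute the same quantity with dynamic programming rather than enumeration.

```python
import itertools
import math

ALPHABET = "ACGT-"  # '-' stands for the CTC blank
S = [  # toy softmax outputs, one row per time step
    [0.6, 0.1, 0.1, 0.1, 0.1],
    [0.5, 0.1, 0.1, 0.1, 0.2],
    [0.1, 0.7, 0.1, 0.05, 0.05],
]

def decode(path):
    """Collapse repeated symbols, then drop blanks."""
    collapsed = [ch for i, ch in enumerate(path) if i == 0 or ch != path[i - 1]]
    return "".join(ch for ch in collapsed if ch != "-")

def sequence_probability(y, s=S):
    """P(Y | X): sum of P(pi | X) over all paths pi with decode(pi) == Y."""
    total = 0.0
    for path in itertools.product(range(len(ALPHABET)), repeat=len(s)):
        symbols = "".join(ALPHABET[k] for k in path)
        if decode(symbols) == y:
            total += math.prod(s[t][k] for t, k in enumerate(path))
    return total

# Five paths decode to "AC": AAC, ACC, A-C, -AC, AC-
print(sequence_probability("AC"))
```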

Greedy search
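Greedy search picks the most probable symbol at every time step and then collapses the resulting path. A minimal sketch, assuming the symbol order (A, C, G, T, blank) and toy probabilities; `greedy_decode` is a name chosen for this example.

```python
ALPHABET = "ACGT-"  # '-' stands for the CTC blank
S = [  # toy softmax outputs, one row per time step
    [0.6, 0.1, 0.1, 0.1, 0.1],
    [0.5, 0.1, 0.1, 0.1, 0.2],
    [0.1, 0.7, 0.1, 0.05, 0.05],
]

def greedy_decode(s):
    """Take the argmax per time step, collapse repeats, drop blanks."""
    best = [max(range(len(row)), key=lambda k: row[k]) for row in s]
    out = []
    prev = None
    for k in best:
        # Emit a symbol only when it differs from the previous step
        # and is not the blank.
        if k != prev and ALPHABET[k] != "-":
            out.append(ALPHABET[k])
        prev = k
    return "".join(out)

print(greedy_decode(S))  # argmax path is A, A, C
```

Greedy search returns the output of the single best path, which is not always the most probable output sequence, because several paths can decode to the same sequence.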

Beam search
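Beam search keeps several candidate output prefixes per time step and merges the probability of all paths that decode to the same prefix. The sketch below is a generic CTC prefix beam search in plain Python, not MinCall's actual decoder; the beam width, symbol order, and toy probabilities are assumptions made for the example.

```python
from collections import defaultdict

def beam_search(s, beam_width=3, alphabet="ACGT-"):
    """Return (best sequence, its probability within the beam)."""
    blank = alphabet.index("-")
    # prefix -> (prob. of paths ending in blank, prob. of paths ending in non-blank)
    beams = {"": (1.0, 0.0)}
    for row in s:
        nxt = defaultdict(lambda: [0.0, 0.0])
        for prefix, (p_b, p_nb) in beams.items():
            for k, p in enumerate(row):
                if k == blank:
                    # A blank never changes the decoded prefix.
                    nxt[prefix][0] += (p_b + p_nb) * p
                    continue
                ch = alphabet[k]
                if prefix and prefix[-1] == ch:
                    # Repeating the last symbol collapses into the same prefix...
                    nxt[prefix][1] += p_nb * p
                    # ...unless a blank separated the two occurrences.
                    nxt[prefix + ch][1] += p_b * p
                else:
                    nxt[prefix + ch][1] += (p_b + p_nb) * p
        ranked = sorted(nxt.items(), key=lambda kv: sum(kv[1]), reverse=True)
        beams = {prefix: tuple(probs) for prefix, probs in ranked[:beam_width]}
    best, probs = max(beams.items(), key=lambda kv: sum(kv[1]))
    return best, sum(probs)

# Two time steps where the blank is the most probable symbol at each step.
row = [0.4, 0.0, 0.0, 0.0, 0.6]
print(beam_search([row, row]))
```

In this toy input a per-step argmax would pick the blank twice and emit an empty sequence (probability 0.36), while the paths AA, A-, and -A together give "A" a probability of 0.64, which is what the beam search finds.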

Training detail

  • Dataset
  • Architecture
  • Results

Dataset

Training dataset:

Jared Simpson's R9.4 E. coli

Test dataset:

Ryan Wick's R9.4 Klebsiella pneumoniae

Preparation

  • Basecalled with Metrichor (positional data)
  • Aligned with GraphMap
  • Corrected
  • Transformed to protobuf

Preparation

syntax = "proto3";

package dataset;

enum BasePair {
    A = 0;
    C = 1;
    G = 2;
    T = 3;
    BLANK = 4;
}

enum Cigar {
    MATCH = 0;
    MISMATCH = 1;
    INSERTION = 2; // Insertion, soft clip, hard clip
    DELETION = 3;  // Deletion, N, P
}

message DataPoint {
    message BPConfidenceInterval {
        uint64 lower = 1;
        uint64 upper = 2;
        BasePair pair = 3;
    }
    repeated float signal = 1;
    repeated BasePair basecalled = 2; // What we basecalled
    repeated BPConfidenceInterval labels = 3; // labels describe corrected basecalled signal for training
}
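The comments on the Cigar enum suggest that SAM CIGAR operations are folded into four categories. The sketch below shows one way such a mapping could look in Python. The function name `fold_cigar_op` and the handling of the ambiguous SAM `M` operation (which covers both matches and mismatches) are assumptions for illustration, not taken from the MinCall code.

```python
from enum import IntEnum

class Cigar(IntEnum):
    """Mirror of the Cigar enum in the proto definition."""
    MATCH = 0
    MISMATCH = 1
    INSERTION = 2
    DELETION = 3

# SAM CIGAR operation characters folded as the proto comments describe.
_FOLDED = {
    "=": Cigar.MATCH,
    "X": Cigar.MISMATCH,
    "I": Cigar.INSERTION,  # insertion
    "S": Cigar.INSERTION,  # soft clip
    "H": Cigar.INSERTION,  # hard clip
    "D": Cigar.DELETION,   # deletion
    "N": Cigar.DELETION,   # skipped reference region
    "P": Cigar.DELETION,   # padding
}

def fold_cigar_op(op, bases_equal=None):
    """Map one SAM CIGAR operation character to a Cigar category.

    'M' does not say whether the aligned bases agree, so the caller must
    supply that comparison (assumed behaviour, hypothetical helper).
    """
    if op == "M":
        if bases_equal is None:
            raise ValueError("'M' needs a base comparison to tell match from mismatch")
        return Cigar.MATCH if bases_equal else Cigar.MISMATCH
    return _FOLDED[op]

print([fold_cigar_op(op).name for op in "=XISD"])
```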

Architecture

Read Results

Speed

Consensus Results

basecaller         identity rate (%)
minion_b0          99.9671
minion_b50         99.9604
chiron_v0.3        99.9957
albacore_v2.2.7    99.9904
guppy_v0.5.1       99.9907

Questions?

minion

By Neven Miculinić
