Piotr Grzesik
dr hab. inż. Dariusz Mrozek, prof. PŚ
Silesian University of Technology
Metagenomics - study of genetic material from environmental samples. It is used to sequence DNA (or RNA) based on the prepared sample, identify microorganisms, detect potential mutations or identify previously unknown species.
Nanopore sequencing - developed by Oxford Nanopore Technologies, it is a DNA sequencing process that works by monitoring changes in an electrical current caused by a DNA strand passing through a nanopore. The resulting signal is decoded into specific DNA or RNA sequences. This decoding process is called basecalling.
MinION Nanopore - portable sequencing device, released by Oxford Nanopore Technologies in 2014. It is the first device that enables portable sequencing at an affordable price ($1000). It is powered via USB and weighs under 100 g, which makes it possible to use it as a field device.
Edge computing is a computing paradigm that brings data processing and storage closer to the place where they are needed. It reduces the volume of data that needs to be sent over the Internet, improves reaction time to changes in the state of the system, and improves resilience and helps prevent data loss where the Internet connection is unreliable or unavailable most of the time.
The considered analysis workflow consists of three separate steps - sequencing, basecalling and classification. In the first step, the MinION device sequences DNA, outputting electrical current measurements in the form of FAST5 files. In the second step, a basecaller interprets the electrical current measurements and outputs genomic sequences. In the last step, the genomic sequences are classified and assigned taxonomic labels.
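The three steps above can be sketched as a simple pipeline. This is only an illustrative outline: `run_workflow`, `toy_basecall` and `toy_classify` are hypothetical names, and the toy stages are stand-ins for a real basecaller and classifier, not their actual interfaces.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical types for illustration only.
Signal = List[float]                    # raw current measurements for one read
Basecaller = Callable[[Signal], str]    # signal -> genomic sequence
Classifier = Callable[[str], str]       # genomic sequence -> taxonomic label


def run_workflow(
    signals: Iterable[Signal],
    basecall: Basecaller,
    classify: Classifier,
) -> List[Tuple[str, str]]:
    """Run basecalling, then classification, on each sequenced read."""
    results = []
    for signal in signals:
        sequence = basecall(signal)      # step 2: basecalling
        label = classify(sequence)       # step 3: classification
        results.append((sequence, label))
    return results


def toy_basecall(signal: Signal) -> str:
    """Toy stand-in: maps each measurement to a base. Not a real basecaller."""
    return "".join("ACGT"[int(value) % 4] for value in signal)


def toy_classify(sequence: str) -> str:
    """Toy stand-in: labels a sequence by comparing base counts."""
    return "label-A" if sequence.count("A") >= sequence.count("C") else "label-C"


if __name__ == "__main__":
    reads = [[0, 1, 2, 3], [1, 1, 0]]    # step 1 output (made-up signals)
    print(run_workflow(reads, toy_basecall, toy_classify))
```

Passing the stages in as callables keeps the outline independent of which basecaller or classifier is used.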
Simulated MinION Nanopore that outputs reads in the form of FAST5 files
Edge device with an SSD that runs basecalling and classification analysis
Cloud-based service, long-term storage of sequenced data
Power supply with INA219 + Raspberry Pi 4 for current measurement
FAST5 files with data from sequencing runs containing material of Escherichia coli and Klebsiella pneumoniae
Measurement of samples processed per second by each basecaller in different power modes
Both Guppy and Bonito were tested with and without GPU acceleration enabled
During each run, current was measured with the INA219 + Raspberry Pi 4 circuit
Classification with Kraken2 was also run in different power modes
Available power modes in Jetson Xavier NX: 15W 6 CORE, 15W 4 CORE, 15W 2 CORE, 10W 2 CORE, 10W 4 CORE
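The two reported metrics, samples per second and average power, could be derived from a run roughly as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, and the readings are assumed to be (bus voltage in V, current in mA) pairs logged from the INA219; how the original measurements were logged is not specified here.

```python
from statistics import mean
from typing import Iterable, Tuple


def samples_per_second(total_samples: int, elapsed_s: float) -> float:
    """Throughput: signal samples processed divided by wall-clock time."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return total_samples / elapsed_s


def average_power_w(readings: Iterable[Tuple[float, float]]) -> float:
    """Average power in watts from (bus_voltage_v, current_ma) readings.

    Assumes the readings are evenly spaced in time, so a plain mean of the
    instantaneous power values approximates the average over the run.
    """
    return mean(voltage_v * (current_ma / 1000.0) for voltage_v, current_ma in readings)


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    print(samples_per_second(1_000_000, 4.0))
    print(average_power_w([(5.0, 1000.0), (5.0, 3000.0)]))
```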
Samples per second processed by Guppy with Fast model
Samples per second processed by Guppy with HAC model
Average power for Guppy with Fast model
Average power for Guppy with HAC model
Guppy with Fast model is able to support real-time basecalling for up to 3 MinION devices at the same time
Jetson Xavier NX is a suitable device for running such experiments in places with limited network connectivity and limited power supply
Using 10W power mode on Jetson Xavier NX is more efficient energy-wise than 15W power mode
CPU basecalling is not feasible on the Jetson Xavier NX (or on similar devices)
At present, Guppy appears to be the only basecaller suitable for use at the edge
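One way to compare power modes energy-wise is to divide throughput by average power, which gives samples processed per joule. The sketch below assumes this metric; the function names, mode names and numbers are placeholders, not the measured results.

```python
from typing import Dict, Tuple


def samples_per_joule(samples_per_s: float, avg_power_w: float) -> float:
    """Energy efficiency: (samples/s) / (J/s) = samples per joule."""
    if avg_power_w <= 0:
        raise ValueError("avg_power_w must be positive")
    return samples_per_s / avg_power_w


def most_efficient(modes: Dict[str, Tuple[float, float]]) -> str:
    """Return the mode name with the highest samples-per-joule value.

    `modes` maps a mode name to (samples_per_s, avg_power_w).
    """
    return max(modes, key=lambda name: samples_per_joule(*modes[name]))


if __name__ == "__main__":
    # Placeholder values, for illustration only.
    example = {"mode-a": (100.0, 10.0), "mode-b": (120.0, 15.0)}
    print(most_efficient(example))
```

A mode with lower raw throughput can still come out ahead on this metric if its power draw drops by a larger factor.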
Optimize alternative basecallers for edge use
Work on an alternative approach in which, instead of basecalling followed by classification, raw signals are mapped against expected microorganisms
Results published in: Grzesik, Piotr, and Dariusz Mrozek. "Metagenomic Analysis at the Edge with Jetson Xavier NX." International Conference on Computational Science. Springer, Cham, 2021.
Serverless computing is a computing paradigm that takes advantage of simple, stateless functions (also called Functions-as-a-Service) that offer low maintenance overhead and fault tolerance, support massive parallelism, allocate resources on demand, and can quickly scale both up and down. An additional benefit of this paradigm is that users pay only for actual invocations of functions and not for idle time.
Serverless computing is also gaining popularity in the bioinformatics literature:
(Added at the end of 2020)
In the proposed workflow, the first step is uploading FAST5 files from the MinION Nanopore to an S3 bucket. In the next step, processing is triggered manually: a first Lambda function splits the FAST5 files into batches and schedules the execution of multiple Lambda functions, which run the basecalling operation and save the results to an S3 bucket as well.
FAST5 files with data from sequencing runs containing material of Escherichia coli and Klebsiella pneumoniae
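The splitting and scheduling step could look roughly like the sketch below. It is not the actual function: `split_into_batches`, `schedule_basecalling`, the `fast5_keys` payload field and the batch size are all hypothetical, and the `invoke` callable stands in for whatever mechanism starts a basecalling Lambda function.

```python
import json
from typing import Callable, Iterator, List, Sequence


def split_into_batches(keys: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `batch_size` FAST5 object keys."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(keys), batch_size):
        yield list(keys[start:start + batch_size])


def schedule_basecalling(
    keys: Sequence[str],
    batch_size: int,
    invoke: Callable[[str], None],
) -> int:
    """Send one JSON payload per batch through `invoke`; return the batch count.

    `invoke` is a hypothetical hook that would start one basecalling function
    per payload; it is injected so the sketch stays free of cloud dependencies.
    """
    scheduled = 0
    for batch in split_into_batches(keys, batch_size):
        invoke(json.dumps({"fast5_keys": batch}))  # payload shape is an assumption
        scheduled += 1
    return scheduled


if __name__ == "__main__":
    # Made-up object keys; `print` stands in for the real invocation.
    schedule_basecalling(["run/a.fast5", "run/b.fast5", "run/c.fast5"], 2, print)
```

Because each batch is independent, the basecalling functions can run in parallel, which is what the serverless setup relies on.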
Measurement of samples processed per second, and of samples per second per MB of memory, for each basecaller and model
Both Guppy and Bonito were tested
Experiments were run for 256, 512, 1024, 2048, 4096, 6144, 8192 and 10240 (maximum) MB of RAM available to a single Lambda function
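The second metric normalizes throughput by the memory configured for the function, which makes configurations of different sizes comparable. A minimal sketch, assuming that definition; the function name and the throughput value in the example are placeholders, not measured results.

```python
from typing import List

# Memory configurations listed above, in MB.
MEMORY_SIZES_MB: List[int] = [256, 512, 1024, 2048, 4096, 6144, 8192, 10240]


def samples_per_second_per_mb(samples_per_s: float, memory_mb: int) -> float:
    """Throughput normalized by the memory available to a single function."""
    if memory_mb <= 0:
        raise ValueError("memory_mb must be positive")
    return samples_per_s / memory_mb


if __name__ == "__main__":
    # Placeholder throughput, for illustration only.
    for memory_mb in MEMORY_SIZES_MB:
        print(memory_mb, samples_per_second_per_mb(2048.0, memory_mb))
```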
Samples per second processed by Guppy with Fast model
Samples per second per MB of memory for Guppy with Fast model
Samples per second processed by Guppy with HAC model
Samples per second per MB of memory for Guppy with HAC model