BIOSC 1540: L03A (Genome assembly)

Computational Biology

(BIOSC 1540)

Jan 21, 2025

Lecture 03A

Genome assembly

Foundations

Announcements

CByte 01 is live and will expire on Feb 1
CByte 02 will be released Friday (Jan 24) and expire on Feb 7

Quiz 01 is next week (Jan 28) and will cover lectures 02A to 03B

Assignments

Assignment P01B is due Friday

Quizzes

CBytes

ATP until the next reward: 1,903

Next reward: Checkpoint Submission Feedback

After today, you should have a better understanding of

Where genome assembly fits in the genomics pipeline

High-throughput sequencing produces short DNA fragments called reads

Modern sequencing technologies generate millions to billions of short reads (i.e., DNA fragments) from a DNA sample

Reads are typically 100–300 base pairs long for short-read technologies and up to tens of kilobases for long-read technologies

DNA sample of unknown sequence

Reads

Sequencing

Reads are assembled into contigs by identifying overlaps between them

Reads overlap where they represent the same genomic region

Assembled DNA sequence (i.e., contig)

Reads

Assembly

Overlap information is used to merge reads into contiguous DNA sequences called contigs

Assuming perfect sequencing and assembly, the resulting contig will match our original DNA sample
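The overlap-and-merge idea above can be sketched in a few lines. This is a minimal greedy illustration that assumes error-free reads from a single strand. The names `overlap_length` and `greedy_assemble` and the toy reads are hypothetical, and real assemblers use graph-based methods rather than this brute-force search.

```python
def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    longest = min(len(left), len(right))
    for size in range(longest, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads: list[str], min_overlap: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences that share the largest overlap."""
    contigs = list(reads)
    while len(contigs) > 1:
        best_size, best_i, best_j = 0, -1, -1
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i == j:
                    continue
                size = overlap_length(left, right, min_overlap)
                if size > best_size:
                    best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # no overlaps left; the remaining contigs stay separate
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Toy reads (hypothetical) sampled every 4 bases from a 20-base sequence.
reads = ["AGCTGATC", "GATCCGAT", "CGATTTAG", "TTAGCCGA"]
print(greedy_assemble(reads))
```

With perfect reads like these, the single contig returned matches the sequence the reads were drawn from.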

Genome assembly bridges sequencing data and biological insights

Assembled genomes are essential for identifying genes and understanding regulatory elements

Provides a foundation for downstream analyses, including functional and structural genomics

After today, you should have a better understanding of

Types of genome assembly

Reference-based

Reference-based assembly aligns reads to a known genome

Sequencing reads are matched to a reference genome to determine their correct positions

Alignment relies on identifying overlaps and shared sequences between the reads and the reference

RefSeq provides high-quality reference genomes, transcriptomes, and proteins

Mapping reveals variations like SNPs and small insertions or deletions

Variations occur when a read differs from the reference genome

Single-nucleotide polymorphisms (SNPs) are single base changes between the read and the reference

Indels are small insertions or deletions that alter the alignment pattern
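As a rough illustration of how SNPs and indels show up once a read is aligned, the sketch below walks through a gapped alignment column by column. It assumes the alignment has already been computed and that gaps are written as "-". The function name `describe_differences` and the example sequences are hypothetical.

```python
def describe_differences(aligned_ref: str, aligned_read: str) -> list[str]:
    """Report SNPs and single-base indels from two gapped, equal-length strings."""
    if len(aligned_ref) != len(aligned_read):
        raise ValueError("aligned sequences must have the same length")
    differences = []
    ref_pos = 0  # 1-based position in the ungapped reference
    for ref_base, read_base in zip(aligned_ref, aligned_read):
        if ref_base != "-":
            ref_pos += 1
        if ref_base == read_base:
            continue
        if ref_base == "-":
            differences.append(f"insertion of {read_base} after reference position {ref_pos}")
        elif read_base == "-":
            differences.append(f"deletion of {ref_base} at reference position {ref_pos}")
        else:
            differences.append(f"SNP at reference position {ref_pos}: {ref_base}>{read_base}")
    return differences


# Hypothetical alignment: one SNP, one insertion, one deletion.
for line in describe_differences("AGCTGA-TCCGAT", "AGTTGAGTCC-AT"):
    print(line)
```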

Reference-based assembly is ideal for organisms with well-annotated genomes

Works effectively when a complete, accurate reference genome is available.
It is commonly used for model organisms with high-quality reference genomes, such as humans, mice, or fruit flies, and is not recommended for novel organisms.

Example: GRCh38.p14 for Humans

Reduces time and cost for studies focused on variant detection or evolutionary comparisons.

Regions absent in the reference genome result in gaps in the assembly

Reads corresponding to regions missing in the reference cannot be mapped, leaving unassembled gaps.

Reads

Missing gap

in reference

Inaccurate assembly

These gaps can affect downstream analysis, especially for novel genes or functional elements.

Gaps can occur due to incomplete reference sequences or highly divergent regions in the sample genome.

Structural variations can be overlooked or incorrectly assembled

Variations like insertions, deletions, inversions, or translocations may not align correctly to the reference.

Failure to account for structural variations can skew results and mask important genomic differences.

Assemblers may interpret these variations as mismatches or sequencing errors.

After today, you should have a better understanding of

Types of genome assembly

De novo

De novo assembly reconstructs genomes without a reference

It does not rely on pre-existing data, allowing for unbiased genome reconstruction. Essential for novel organisms or those with no reference genome.

Instead of mapping to a reference, reads are assembled by finding overlaps between reads and merging them

De novo assembly captures the full genome, including novel regions

Unbiased assembly enables the discovery of unique and divergent sequences.

Resolves structural variations that reference-based methods might miss.

Ideal for exploring non-model organisms and highly variable regions.

De novo assembly faces computational and biological challenges

High computational requirements due to complex algorithms

Most assemblers use graph-based methods (more on this in the next lecture).

Struggles with repeats, sequencing errors, and low-coverage regions (more on this later).

Reference-Based vs. De Novo Genome Assembly

Researchers are analyzing the genome of a newly discovered bacterial strain suspected to carry antibiotic-resistance genes. They have access to a draft reference genome from a closely related strain, but it is incomplete and poorly annotated. Their main goal is identifying novel resistance genes while ensuring assembly accuracy and minimizing computational costs. Which approach would you recommend for assembling the genome, and why?

A. Use reference-based assembly to ensure computational efficiency and focus on conserved regions.

B. Use de novo assembly to avoid reference bias and discover novel resistance genes.

C. Use hybrid assembly, starting with reference-based assembly and refining with de novo assembly for poorly aligned regions.

D. BLAST reads that fail to align to the reference genome but avoid de novo assembly to reduce computational cost.

After today, you should have a better understanding of

Challenges in genome assembly

Genome assembly faces biological and technical challenges

Biological factors: Repetitive sequences, structural variations, and genome size.

These challenges complicate the process of accurately reconstructing a genome.

Technical issues: Sequencing errors, low coverage, and short read lengths.

Overcoming these challenges requires balancing biological and technical factors

Advances in sequencing technology (e.g., long-read sequencing)

Careful experimental design (e.g., choosing read length and depth)
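One way to reason about read length and depth when planning an experiment is the usual back-of-the-envelope estimate: average depth is roughly the total number of sequenced bases divided by the genome size. The sketch below assumes that definition. The function names and the numbers are hypothetical.

```python
import math


def expected_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average depth = total sequenced bases / genome size."""
    return num_reads * read_length / genome_size


def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Smallest number of reads that reaches the target average depth."""
    return math.ceil(target_coverage * genome_size / read_length)


# Hypothetical 5 Mb bacterial genome sequenced with 150 bp reads.
print(expected_coverage(num_reads=1_000_000, read_length=150, genome_size=5_000_000))
print(reads_needed(target_coverage=100, read_length=150, genome_size=5_000_000))
```

This is only an average; actual depth varies along the genome, which is why low-coverage regions still occur.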

After today, you should have a better understanding of

Repetitive DNA (i.e., repeats)

Challenges in genome assembly

Repeats are a widespread feature of many genomes

Repeats are sequences of DNA that occur multiple times in the genome

AGCTGATC

TTAGCCGA

CGAT CGAT CGAT CGAT

Note: Repeats are especially abundant in eukaryotic genomes, comprising up to 50% of human DNA.

Common types of repeats

Tandem repeats: Consecutive copies of the same sequence.

Interspersed repeats: Similar sequences scattered throughout the genome

AGCTGATC

TTAGCCGA

CGAT CGAT

Repeats create ambiguity in placing reads during assembly

How will the assembler know the difference between these two options? Maybe it has high coverage instead of more repeats?

AGCTGATC

TTAGCCGA

CGAT CGAT CGAT CGAT CGAT CGAT CGAT CGAT CGAT CGAT

Reads

Option 1

AGCTGATC

TTAGCCGA

CGAT CGAT CGAT CGAT CGAT

Option 2

Reads from repeats may align to multiple locations, making it unclear where they belong.

Which repeat did a read come from? Who knows ...

AGCTGATC

TTAGCCGA

CGAT CGAT CGAT CGAT CGAT
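The ambiguity can be made concrete by counting how many places a read matches exactly. In this sketch the toy genome is built from the sequences in the figure above, and `find_all` is a hypothetical helper. Real read mappers also tolerate mismatches, which this does not.

```python
def find_all(genome: str, read: str) -> list[int]:
    """Return every 0-based start position where `read` matches `genome` exactly."""
    positions = []
    start = genome.find(read)
    while start != -1:
        positions.append(start)
        start = genome.find(read, start + 1)  # allow overlapping matches
    return positions


# Unique flank + five tandem copies of CGAT + unique flank.
genome = "AGCTGATC" + "CGAT" * 5 + "TTAGCCGA"

print(find_all(genome, "GATCCGAT"))  # read that reaches into a unique flank
print(find_all(genome, "CGATCGAT"))  # read that lies entirely inside the repeat
```

The read that touches a unique flank has one placement, while the read from inside the repeat has several equally good ones.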

Repeats can lead to fragmented assemblies or misassembled contigs

Fragmentation: Assemblers may break contigs at repetitive regions, resulting in gaps.

AGCTGATC

TTAGCCGA

CGAT CGAT CGAT CGAT CGAT CGAT CGAT CGAT CGAT CGAT

versus

Collapsing repeats: Similar repeats may be merged into a single copy, leading to incorrect assemblies.

AGCTGATC

TTAGCCGA

CGAT CGAT CGAT CGAT CGAT

Read length affects the ability to resolve repeats during assembly

Short reads: Often shorter than repeat regions, making it difficult to span and resolve repeats.

AGCTGATC

TTAGCCGA

CGAT CGAT CGAT CGAT CGAT CGAT CGAT CGAT CGAT CGAT

Long reads: Can span entire repetitive regions, reducing ambiguity and improving assembly accuracy.

Paired-end reads span repetitive regions, providing distance information

Paired-end reads are sequenced from both ends of a DNA fragment, with a known distance between the reads

AGCTGATC

TTAGCCGA

CGAT CGAT CGAT CGAT CGAT CGAT CGAT CGAT CGAT CGAT

If I have two reads at the ends of a repeat, and I know the distance between the reads, I know the length of the repeat

Forward read

Reverse read

Known read gap

(This is why having paired-end reads that do not overlap is helpful.)
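The arithmetic behind this can be sketched as follows, assuming the fragment length is known exactly (in practice it is a distribution, not a single value). Both function names and all numbers are hypothetical.

```python
def inner_distance(fragment_length: int, read_length: int) -> int:
    """Bases between the forward and reverse reads that were not sequenced."""
    return fragment_length - 2 * read_length


def repeat_length(fragment_length: int, left_flank: int, right_flank: int) -> int:
    """Repeat length when the fragment covers `left_flank` unique bases before
    the repeat and `right_flank` unique bases after it."""
    return fragment_length - left_flank - right_flank


# Hypothetical 500 bp fragment sequenced with 2 x 150 bp reads.
print(inner_distance(fragment_length=500, read_length=150))
# If each read sits entirely in a unique flank that ends at the repeat boundary:
print(repeat_length(fragment_length=500, left_flank=150, right_flank=150))
```

A positive inner distance means the two reads do not overlap, so the pair reaches across sequence that neither read covers on its own.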

After today, you should have a better understanding of

Sequencing errors

Challenges in genome assembly

Sequencing errors disrupt overlaps, complicating assembly

Sequencing errors interfere with overlaps by creating mismatches between reads

Assemblers must distinguish true overlaps from errors, which dramatically increases computational complexity.

Assemblers use error correction and redundancy to handle sequencing errors

Redundant data (high coverage) helps correct errors by identifying the most likely base (i.e., consensus)

TACGATCGGATTACGCGTAGGCTAGCTTACGGACTCGATGTACGATCGGATTACGCGTAGG

Real sequencing errors

can be fixed in high-coverage areas

Real SNPs

can be confidently detected when all reads have the same base
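A minimal sketch of consensus calling by majority vote is shown below. It assumes the reads are already aligned to the same region and have equal length. The reads, the reference, and the function name `consensus` are hypothetical, and real error correction also weighs base quality scores.

```python
from collections import Counter


def consensus(aligned_reads: list[str]) -> str:
    """Pick the most common base in every column of the stacked reads."""
    columns = zip(*aligned_reads)
    return "".join(Counter(column).most_common(1)[0][0] for column in columns)


reference = "TACGATCGGT"
reads = [
    "TACGATCGGA",
    "TACGATCGGA",
    "TACCATCGGA",  # lone C at index 3: likely a sequencing error
    "TACGATCGGA",
    "TACGATCGGA",
]

called = consensus(reads)
# Positions where the consensus of all reads disagrees with the reference.
snp_positions = [i for i, (ref, base) in enumerate(zip(reference, called)) if ref != base]
print(called)
print(snp_positions)
```

The lone mismatch is outvoted by the other reads, while the last position differs from the reference in every read and is reported as a candidate SNP.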

After today, you should have a better understanding of

Outputs of genome assembly tools

Contigs are continuous sequences assembled from overlapping reads

Contigs are the first level of assembly, where reads are merged based on overlaps. In other words, they represent reconstructed DNA without gaps (i.e., continuous).

What do contigs indicate?

Longer contigs suggest better assembly quality.
Fragmented contigs indicate challenges such as repeats or low coverage.

Contig FASTA files store the reconstructed DNA sequences

What is a contig FASTA file?

Contains the sequence of each contig in FASTA format.
Used for downstream analysis like annotation and comparison.

>NODE_1_length_251580_cov_96.965763
GCCTTTTTCATATTCTTGAAACATATATAGCAGTACATCTATGTCTACTTTAGGTTTTAT
TGACATAAATAAAGCTCCCTTCAAAGTTTTCATTTTTTCAATGTCTACTTTGAAGGGAGC
ATTTCACTGAACTTTGTTCAGGCTCTTTTTAAATGTATATCAGGCATGGCGGCGACTTGA
TAGTGAAAGTCCATATATGCTTTGTAGTCAAAACTGCTAGCGGATATTGTTATCTTAACA
...

Header format:

NODE_1 is the number of the contig

length_251580 is the sequence length

cov_96.965763 is the k-mer coverage of the largest k used in assembly (will be discussed on Thursday)
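Headers in this format are easy to split into their parts. The sketch below assumes every header follows the `NODE_<number>_length_<length>_cov_<coverage>` pattern shown above. The function name `parse_contig_header` is hypothetical.

```python
def parse_contig_header(header: str) -> dict:
    """Split a header like '>NODE_1_length_251580_cov_96.965763' into fields."""
    fields = header.lstrip(">").strip().split("_")
    # Expected layout: ["NODE", "1", "length", "251580", "cov", "96.965763"]
    if len(fields) != 6 or fields[0] != "NODE":
        raise ValueError(f"unexpected header format: {header!r}")
    return {
        "node": int(fields[1]),
        "length": int(fields[3]),
        "coverage": float(fields[5]),
    }


print(parse_contig_header(">NODE_1_length_251580_cov_96.965763"))
```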

Scaffolds use paired-end reads to bridge gaps between contigs

Scaffolds are higher-order assemblies formed by ordering and orienting contigs

What do scaffolds indicate?

Larger scaffolds suggest fewer gaps and better assembly resolution.
Remaining gaps in scaffolds are represented as "N" regions.

Paired-end reads provide distance and orientation information to connect contigs.

Scaffold FASTA files combine contigs into longer sequences

What is a scaffold FASTA file? It includes contigs linked by paired-end reads, with "N"s filling the gaps between them

Provides a higher-level view of genome assembly, bridging contigs to form scaffolds.

>NODE_1_length_335019_cov_108.862920
TTATATTGGCAGTAGTTGACTGAACGAAAATGCGCTTGTAACAAGCTTTTTTCAATTCTA
GTCAACCTTGCCGGGGTGGGACGACGAAATAAATTTTGCGAAAATATCATTTCTGTCCCA
CTCCCTAATTTAAACATTTTAAAATATACCAATTACTTTCATCCAAAGTGATCCTAAACC
AATCCAGATAATAAAGTAGACGAAACCTAATATTAAGTTCATTGTCCACCAACGTTTTTG
...
CATTTAAAATTTCTTGTGACATAGCATTCACCTCCTTTTAGAGCCACTTATTATTTATAA
TAATTAGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTGGCTCTTATGCA
GTTGGAGCGAAGATCCAACTGTAAACCATAGTGTACTTATTATTTATAATAATTAGTGGC
...
TTACTTTGAAATACTTTAAAAAAATAAGACACTTTCGTA
>NODE_2_length_262462_cov_97.035104

>NODE_1_length_335019_cov_108.862920
TTATATTGGCAGTAGTTGACTGAACGAAAATGCGCTTGTAACAAGCTTTTTTCAATTCTA
GTCAACCTTGCCGGGGTGGGACGACGAAATAAATTTTGCGAAAATATCATTTCTGTCCCA
CTCCCTAATTTAAACATTTTAAAATATACCAATTACTTTCATCCAAAGTGATCCTAAACC
AATCCAGATAATAAAGTAGACGAAACCTAATATTAAGTTCATTGTCCACCAACGTTTTTG
...

>NODE_5_length_181792_cov_108.741524
TGGCTCTTATGCAGTTGGAGCGAAGATCCAACTGTAAACCATAGTGTACTTATTATTTAT
AATAATTAGTGGCTCTTATGCAGTTGGAGCGAAGATCCAACTGTAAACCATAGTGTACTT
ATTATTTATAATAATTAGTGGCTCTTATGCAGTTGGAGCGAAGATCCAACTGTAAACCAT
AGTGTACTTATTATTTGTAATAATATTGTAGAGTCTGAGACATAAATCAATGTTCAATGC
...

Contigs

Scaffold

We almost always use this file for downstream processes.
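To see the relationship between a scaffold and its contigs, a scaffold sequence can be cut back into pieces wherever a run of "N"s appears. This is a small sketch with a hypothetical toy scaffold and a hypothetical function name, `split_scaffold`.

```python
import re


def split_scaffold(scaffold: str, min_gap: int = 1) -> tuple[list[str], list[int]]:
    """Return the contig pieces and the lengths of the N gaps between them."""
    gap_pattern = re.compile(r"N{%d,}" % min_gap)
    contigs = [piece for piece in gap_pattern.split(scaffold) if piece]
    gap_lengths = [len(match.group()) for match in gap_pattern.finditer(scaffold)]
    return contigs, gap_lengths


# Toy scaffold (hypothetical): two short contigs joined by a 10-base gap.
scaffold = "TAATTAG" + "N" * 10 + "TGGCTCTT"
contigs, gaps = split_scaffold(scaffold)
print(contigs)
print(gaps)
```

The gap length reflects the estimated distance between the contigs, not known sequence.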

After today, you should have a better understanding of

Assessing assembly quality

N50 is the length of the shortest contig among the largest contigs that together cover 50% of the assembly

Sort contigs by length in descending order.
Add lengths sequentially until 50% of the total assembly length is covered.
The length of the last contig added is the N50.

Genome size (e.g., length of E. coli genome)

Higher N50 values indicate more contiguous assemblies

Largest contigs that make up the first 50%

Remaining contigs

N50 = 8

L50 is the number of contigs required to cover 50% of the assembly length

Lower L50 values indicate fewer, larger contigs, which is better for assembly quality

L50 = 4

For L50, count the number of contigs used in the N50 calculation

Genome size (e.g., length of E. coli genome)

Largest contigs that make up the first 50%

Remaining contigs

Total assembly length approximates the genome size; deviations could indicate missing data
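The three metrics above can be computed together from a list of contig lengths. The sketch below follows the sort, accumulate, and stop-at-50% procedure described for N50. The contig lengths are hypothetical and are not taken from the figure, and `assembly_stats` is a hypothetical name.

```python
def assembly_stats(contig_lengths: list[int]) -> dict:
    """Compute total length, N50, and L50 from contig lengths."""
    if not contig_lengths:
        raise ValueError("need at least one contig")
    ordered = sorted(contig_lengths, reverse=True)  # longest first
    total = sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= total / 2:
            # `length` is the last contig added (N50);
            # `count` is how many contigs were needed (L50).
            return {"total_length": total, "N50": length, "L50": count}


# Hypothetical contig lengths in arbitrary units.
lengths = [12, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 1]
print(assembly_stats(lengths))
```

Here the total is 68, and the four largest contigs (12 + 10 + 9 + 8 = 39) are the first to reach half of it, so N50 is 8 and L50 is 4.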

Bandage helps interpret assembly graphs and identify unresolved regions

We can visualize contigs and how they connect with a Bandage graph

Each colored line is a contig/scaffold

Highly branched assemblies are not ideal

Here is an example of a real, highly branched assembly.

Islands are sequences that we cannot merge into our assembly above (often sequencing errors)

Let's get practical with SPAdes

SPAdes is a popular prokaryote genome assembler

Based on De Bruijn graphs with numerous improvements

Builds multisized graphs with different k's

Large k: leads to fragmented graphs but helps reduce repeat collapsing

Small k: gives collapsed, tangled graphs but works well for low-coverage regions

By using multiple graphs, SPAdes can better handle variable coverage
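The effect of k on graph tangling can be illustrated with a toy De Bruijn graph, where nodes are (k-1)-mers and each k-mer adds an edge. This is a simplified sketch and not how SPAdes is implemented. The function names and the toy sequence are hypothetical.

```python
from collections import defaultdict


def de_bruijn_graph(reads: list[str], k: int) -> dict[str, set[str]]:
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some k-mer."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


def count_branching_nodes(graph: dict[str, set[str]]) -> int:
    """Nodes with more than one outgoing edge, i.e., ambiguous paths."""
    return sum(1 for successors in graph.values() if len(successors) > 1)


# Toy sequence with a short tandem repeat, treated as a single error-free read.
reads = ["AGCTGATC" + "CGAT" * 2 + "TTAGCCGA"]
for k in (4, 8):
    graph = de_bruijn_graph(reads, k)
    print(k, count_branching_nodes(graph))
```

With the small k, short repeated words create branch points. With the larger k, every node has a single successor in this toy case, although a larger k also needs more coverage for adjacent k-mers to be observed.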

Graph simplification and correction

Potential bulge

Removal of a bulge will quickly deteriorate the graph and lose read information

If P needs to be removed, we "project" the information (e.g., coverage) onto Q

P's edges are then removed in the process

Potential tips

Removes P (shortest) and projects information onto Q

Assemblers provide contigs and scaffolds

We can visualize this using an assembly graph from a tool called Bandage

Contigs

Scaffolds

Each island contains one or more contigs

Each solid line is called a "node" (Why? I have no idea.) and represents a contig

Each connection suggests how these contigs connect to form a scaffold