aalexmmaldonado
Computational Biology
(BIOSC 1540)
Sep 10, 2024
Lecture 05:
Gene annotation
1. Explain the graph traversal and contig extraction process in genome assemblers.
2. Understand key output files and quality metrics of genome assembly results.
3. Define gene annotation and describe its key components.
4. Outline the main computational methods used in gene prediction and annotation.
5. Analyze and interpret basic gene annotation data and outputs.
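[Figure: De Bruijn graph built from the k-mers CG, GT, TA, AA, AT]
Results in: CGTAAAT
[Figure: De Bruijn graph built from the k-mers AAT, ATG, GGC, GCG, TGG, CGT, GTA, TAA, AAA, AAG, CGA, GAA, AGG; numeric labels in the figure: 2, 2, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1]
Results in: AATGGCGTAAAGGCGAA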
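The walk through the first graph can be sketched in code. This is a minimal illustration, not how any particular assembler does it: it assumes 3-mers as edges between 2-mer nodes, and the function names (kmers, de_bruijn_graph, eulerian_walk, spell) are made up for this example. The walk uses Hierholzer's algorithm to visit every edge once.

```python
from collections import defaultdict


def kmers(seq, k):
    """All overlapping k-mers of seq, in order (duplicates kept)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def de_bruijn_graph(kmer_list):
    """Each k-mer becomes an edge from its prefix to its suffix."""
    graph = defaultdict(list)
    for kmer in kmer_list:
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


def eulerian_walk(graph):
    """Hierholzer's algorithm: a walk that uses every edge exactly once.

    Assumes such a walk exists (true for k-mers taken from one sequence).
    """
    in_deg = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            in_deg[target] += 1
    # Start where outdegree exceeds indegree by one, else at any node.
    start = next(
        (n for n in graph if len(graph[n]) - in_deg[n] == 1),
        next(iter(graph)),
    )
    remaining = {node: list(targets) for node, targets in graph.items()}
    stack, walk = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())
        else:
            walk.append(stack.pop())
    walk.reverse()
    return walk


def spell(walk):
    """Glue the nodes of a walk back into one sequence."""
    return walk[0] + "".join(node[-1] for node in walk[1:])


graph = de_bruijn_graph(kmers("CGTAAAT", 3))
print(spell(eulerian_walk(graph)))  # CGTAAAT
```

For the second graph, several walks use every edge once, so the sketch may spell a different sequence with the same k-mers. That ambiguity is one reason real graphs are harder.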
Graphs in practice are not this easy
General overview
Multiple approaches are used, and the choice comes down to personal preference
High coverage: Suggests that the node is likely a true sequence rather than an error
Hubs: Nodes whose indegree or outdegree != 1
Hubs are shown as filled-in nodes
Hub
Not a hub
How do you choose a walk?
What factors would you look for?
Talk to your neighbors
Long paths are desired but not always reliable due to potential repeats
High, consistent read coverage
Unique, non-branching paths
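One common way to turn "unique, non-branching paths" into contigs is to start at each hub and follow edges until the next hub. The sketch below assumes that approach and ignores coverage. The names build_graph and extract_contigs are hypothetical.

```python
from collections import defaultdict


def build_graph(kmer_list):
    """Each k-mer is an edge from its (k-1)-mer prefix to its suffix."""
    graph = defaultdict(list)
    for kmer in kmer_list:
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


def extract_contigs(graph):
    """Spell every maximal non-branching path that starts at a hub.

    Isolated cycles (no hub at all) are skipped in this sketch.
    """
    in_deg = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            in_deg[target] += 1

    def is_hub(node):
        # Not a hub only when indegree == outdegree == 1.
        return not (in_deg[node] == 1 and len(graph.get(node, [])) == 1)

    contigs = []
    for node in graph:
        if not is_hub(node):
            continue
        for nxt in graph[node]:
            path = [node, nxt]
            # Keep walking while the path cannot branch.
            while not is_hub(path[-1]):
                path.append(graph[path[-1]][0])
            contigs.append(path[0] + "".join(n[-1] for n in path[1:]))
    return sorted(contigs)


graph = build_graph(["CGT", "GTA", "TAA", "AAA", "AAT"])
print(extract_contigs(graph))  # ['AAA', 'AAT', 'CGTAA']
```

Here AA is a hub (indegree 2, outdegree 2), so the walk stops there, and the repeat splits the sequence into three contigs.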
SPAdes is a popular prokaryote genome assembler
Based on De Bruijn graphs with numerous improvements
Build Hamming graphs for k-mers
Undirected edges connect k-mers within a Hamming distance of n nucleotide differences
Identify strong k-mers based on clustering (i.e., high similarity)
Estimate read error based on base qualities
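To make the Hamming graph idea concrete, here is a small sketch. It is not SPAdes code: the function names and the max_distance threshold are assumptions for illustration, and the all-pairs comparison only works for small inputs.

```python
from itertools import combinations


def hamming_distance(a, b):
    """Number of positions where two equal-length k-mers differ."""
    if len(a) != len(b):
        raise ValueError("k-mers must have the same length")
    return sum(x != y for x, y in zip(a, b))


def hamming_graph(kmer_list, max_distance=1):
    """Undirected graph: k-mers are linked when they differ by at most
    max_distance nucleotides (an assumed threshold)."""
    unique = sorted(set(kmer_list))
    edges = {kmer: set() for kmer in unique}
    for a, b in combinations(unique, 2):
        if hamming_distance(a, b) <= max_distance:
            edges[a].add(b)
            edges[b].add(a)
    return edges


graph = hamming_graph(["ACGT", "ACGA", "TTTT"])
print(graph["ACGT"])  # {'ACGA'}
print(graph["TTTT"])  # set()
```

K-mers that end up in the same connected cluster are candidates for being one true k-mer plus its sequencing errors.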
Large k: Leads to fragmented graphs and helps reduce repeat collapsing
Small k: Collapsed, tangled graphs, but great for low-coverage regions
By using multiple graphs, SPAdes can better handle variable coverage
Potential bulge
Removal of a bulge will quickly deteriorate the graph and lose read information
If P needs to be removed, we "project" the information (e.g., coverage) onto Q
P's edges are then removed in the process
Potential tips
Removes P (shortest) and projects information onto Q
Read 1 (forward) and Read 2 (reverse) are stored in FASTQ
If our insert (i.e., DNA sample) is longer than the two reads combined, then we don't sequence the inner distance between them
Should we minimize this inner distance?
False
Read 2
ATATATATATATATATAT
Read 1
ATATATATATATATATATAT
Gap
ATATATATATATATATATATATATATATATATATAT
Suppose I have an "AT" repeat for both Read 1 and 2
The assembler has to figure out whether these reads overlap or are separated and, if separated, by how far
Having a gap tells me they don't overlap, but how long is the gap?
Knowing length of Read 1, Read 2, and total insert length allows me to calculate gap length
Assembly algorithms (e.g., SPAdes) can estimate this and refine their results
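The gap calculation is simple arithmetic. A minimal sketch, with a hypothetical function name and made-up lengths:

```python
def inner_distance(insert_length, read1_length, read2_length):
    """Unsequenced gap between paired reads.

    A negative value means the two reads overlap by that many bases.
    """
    return insert_length - read1_length - read2_length


# Illustrative numbers only: a 500 bp insert with two 150 bp reads.
print(inner_distance(500, 150, 150))  # 200
# A 250 bp insert with the same reads: the reads overlap by 50 bp.
print(inner_distance(250, 150, 150))  # -50
```

In practice the insert length is not known exactly for each fragment, which is why assemblers estimate it rather than compute it once.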
We can visualize this using an assembly graph from a tool called Bandage
Contigs
Scaffolds
Each island contains one or more contigs
Each solid line is called a "node" (Why? I have no idea.) and represents a contig
Each connection suggests how these contigs connect to form a scaffold
Structural annotation identifies critical genetic elements such as genes, promoters, and regulatory elements
Functional annotation predicts the function of genetic elements
Prokaryotes
Probabilistic models to identify open reading frames
Example: Prokka
Eukaryotes
Introns and alternative splicing complicate annotation
Accuracy demands supporting evidence like mRNA sequencing
Example: AUGUSTUS
We will focus on prokaryotes because eukaryotes are way more complicated
(I will use different notation than the paper.)
Seek the standard start codons: ATG, GTG or TTG
Seek stop codons based on the translation table
Score potential ORFs
Ribosomal binding site motif score
Start type score
Upstream score
Coding score
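Before any scoring happens, candidate ORFs have to be enumerated. The sketch below shows only that first step, on the forward strand, and is not the actual implementation of any gene finder. It assumes the stop codons TAA, TAG, and TGA (as in the standard bacterial translation table). The name find_orfs is made up.

```python
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumed translation table


def find_orfs(seq, min_length=0):
    """Return (start, end) pairs for every start codon followed by an
    in-frame stop codon. Forward strand only; end is exclusive and
    includes the stop codon."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        starts = []  # candidate start positions since the last stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in STOP_CODONS:
                for start in starts:
                    if i + 3 - start >= min_length:
                        orfs.append((start, i + 3))
                starts = []
            elif codon in START_CODONS:
                starts.append(i)
    return sorted(orfs)


print(find_orfs("ATGCATGCTTAG"))  # [(0, 12)]
```

Every candidate start is kept, so one stop codon can produce several overlapping ORFs. Choosing among them is what the scores above are for.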
Took training data from 12 annotated genomes
Computed frequency of RBS motif bin in
Search for an RBS motif upstream of the start codon; choose whichever has the lowest bin number
Start
Spacer
RBS
Took training data from 12 annotated genomes
Computed frequency of start codon in
Start
Stop
By analyzing base frequency in specific upstream regions (-44 to -15 and -2 to -1), their annotation results improved
Essentially looking for promoters
Example hexamers called "words": ATGGCC, CAGCTG, GGGCCC, ACTAGT
Computed frequency of nucleotide hexamers called "words" in
Compute the probability of observing word within genes
Compute the probability of observing word within the whole genome
Word coding score
It can be thought of as
"How often does this word appear in genes?"
Gene coding score
Sum hexamer word score and shift over one codon at a time
ATGCATGCTTAG (hexamer windows 1, 2, 3)
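The sliding-window sum can be sketched as follows. The word score here is assumed to be the log ratio of the two probabilities (within genes over whole genome); the frequency values are made up purely for illustration, and all function names are hypothetical.

```python
import math


def hexamer_windows(orf):
    """Hexamers of an ORF, shifting one codon (3 nt) at a time."""
    return [orf[i:i + 6] for i in range(0, len(orf) - 5, 3)]


def word_score(word, gene_freq, genome_freq, floor=1e-6):
    """Assumed form: log of (frequency in genes / frequency in genome).

    Positive means the word is seen more often in genes. Unseen words
    fall back to a small floor value to avoid log(0).
    """
    return math.log(gene_freq.get(word, floor) / genome_freq.get(word, floor))


def gene_coding_score(orf, gene_freq, genome_freq):
    """Sum of word scores over all hexamer windows of the ORF."""
    return sum(
        word_score(word, gene_freq, genome_freq)
        for word in hexamer_windows(orf)
    )


# Made-up frequencies, not real training data.
gene_freq = {"ATGCAT": 0.2, "CATGCT": 0.1, "GCTTAG": 0.1}
genome_freq = {"ATGCAT": 0.1, "CATGCT": 0.1, "GCTTAG": 0.2}

print(hexamer_windows("ATGCATGCTTAG"))  # ['ATGCAT', 'CATGCT', 'GCTTAG']
print(gene_coding_score("ATGCATGCTTAG", gene_freq, genome_freq))
```

With these toy numbers the first word counts in favor of coding, the second is neutral, and the third counts against, so the total is about zero.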
Potential protein
Non-coding
Similarity search will be our topic for Thursday
>ECNNONJI_02637 Dihydrofolate reductase
MTLSILVAHDLQRVIGFENQLPWHLPNDLKHVKKLSTGHTLVMGRKTFESIGKPLPNRRN
VVLTSDTSFNVEGVDVIHSIEDIYQLPGHVFIFGGQTLFEEMIDKVDDMYITVIEGKFRG
DTFFPPYTFEDWEVASSVEGKLDEKNTIPHTFLHLIRKK
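An annotated protein record like this one is plain FASTA: a ">" header line with an ID and a product name, followed by wrapped sequence lines. A minimal parser sketch (the function name parse_fasta and the dictionary keys are choices made for this example):

```python
def parse_fasta(text):
    """Parse FASTA text into a list of records.

    Each record is a dict with 'id', 'description', and 'sequence'.
    """
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header: first token is the ID, the rest is the product.
            record_id, _, description = line[1:].partition(" ")
            records.append(
                {"id": record_id, "description": description, "sequence": ""}
            )
        elif records:
            # Sequence lines are wrapped; join them back together.
            records[-1]["sequence"] += line
    return records


example = """>ECNNONJI_02637 Dihydrofolate reductase
MTLSILVAHDLQRVIGFENQLPWHLPNDLKHVKKLSTGHTLVMGRKTFESIGKPLPNRRN
VVLTSDTSFNVEGVDVIHSIEDIYQLPGHVFIFGGQTLFEEMIDKVDDMYITVIEGKFRG
DTFFPPYTFEDWEVASSVEGKLDEKNTIPHTFLHLIRKK
"""

record = parse_fasta(example)[0]
print(record["id"])           # ECNNONJI_02637
print(record["description"])  # Dihydrofolate reductase
print(len(record["sequence"]))
```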
Today
Lecture 05:
Gene annotation
Thursday
Lecture 06:
Sequence alignment