Computational Biology
(BIOSC 1540)
Sep 12, 2024
Lecture 06:
Sequence alignment
1. Define sequence alignment and explain its importance in bioinformatics.
2. Describe the basic principles of scoring systems in sequence alignment.
3. Explain the principles and steps of global alignment using the Needleman-Wunsch algorithm.
4. Describe the concept and procedure of local alignment using the Smith-Waterman algorithm.
5. Introduce the concept of multiple sequence alignment (MSA), including its importance and challenges.
We are all familiar with the central dogma and how sequences play a large role
Plays a crucial role in embryonic development, particularly in determining the body plan and specifying the anterior-posterior axis
How do we know it's conserved?
Infrequent changes (i.e., high similarity) suggest an evolutionarily conserved sequence
MSA is the process of aligning three or more biological sequences simultaneously
Importance of scoring in alignment selection
Match: Identical characters in aligned positions
ATGCC
|||||
ATGCC
Mismatch: Different characters in aligned positions
ATGCC
|| ||
ATACC
Gap: Dash (-) inserted to improve alignment
ATGCC
|| ||
AT-CC
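The three column types above can be told apart mechanically. The sketch below is only an illustration: the function name `classify_columns` is hypothetical, and it assumes both strings are already aligned to the same length with `-` marking gaps.

```python
def classify_columns(top: str, bottom: str) -> list[str]:
    """Label each aligned column as 'match', 'mismatch', or 'gap'."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    labels = []
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            labels.append("gap")       # a dash in either sequence
        elif a == b:
            labels.append("match")     # identical characters
        else:
            labels.append("mismatch")  # different characters
    return labels


print(classify_columns("ATGCC", "ATACC"))
print(classify_columns("ATGCC", "AT-CC"))
```

Only the third column differs between the two calls: it is a mismatch in the first and a gap in the second.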
Linear gap penalty: Fixed cost for each gap
Affine gap penalty: Different costs for opening and extending gaps
ATGCCCTGGCAT
|||       ||
ATG-------AT
Gap score = Number of gaps x Score = 7 x -2 = -14
ATGCCCTGGCAT
|||       ||
ATG-------AT
Gap score = First gap + Additional gaps x Score = -4 + 6 x -1 = -10
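The two gap calculations can be checked with a few lines of Python. This is a minimal sketch: the function and parameter names (`linear_gap_score`, `affine_gap_score`, `gap_open`, `gap_extend`) are illustrative, and the penalty values are the ones used on the slides (-2 per gap for linear; -4 for the first gap and -1 for each additional gap for affine).

```python
def linear_gap_score(length: int, per_gap: int) -> int:
    """Linear penalty: every gap position costs the same."""
    return length * per_gap


def affine_gap_score(length: int, gap_open: int, gap_extend: int) -> int:
    """Affine penalty: one cost to open the gap, a smaller cost to extend it."""
    if length == 0:
        return 0
    return gap_open + (length - 1) * gap_extend


# The 7-position gap in ATG-------AT
print(linear_gap_score(7, -2))      # -14
print(affine_gap_score(7, -4, -1))  # -10
```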
Implications of gap penalty types
ATGCCCTGGCAT
|||       ||
ATG-------AT
Linear penalties: -14
vs
Affine penalties: -10
Biological rationale:
Sophisticated scoring approaches
Global alignment aligns sequences from start to end
Let's align two sequences:
ATTAC
AATTC
First, enter zero in our first coordinate (0, 0)
We need to fill in each cell by moving from other cells starting from (0, 0)
Each move "uses" a nucleotide from a row, column, or both
Moving right or down uses a gap and you add the penalty to the previous score
[Figure: scoring matrix for ATTAC and AATTC showing example moves from (0, 0) with running scores 0, -1, -2, -3, -4 and the partial alignment they build]
(Disclaimer: these values are not correct for the final matrix.)
Scoring scheme: match +1, mismatch -1, gap -1
D | 0 | 1 (A) | 2 (A) | 3 (T) | 4 (T) | 5 (C) |
---|---|---|---|---|---|---|
0 | 0 | -1 | -2 | -3 | -4 | -5 |
1 (A) | -1 | | | | | |
2 (T) | -2 | | | | | |
3 (T) | -3 | | | | | |
4 (A) | -4 | | | | | |
5 (C) | -5 | | | | | |
The last cell in our scoring matrix represents our final score of this alignment
-----
AATTC
Alignment score: -5
ATTAC
-----
Alignment score: -5
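The first row and first column can be filled before anything else, because every cell there is reached by gap moves only. The sketch below assumes a gap penalty of -1, puts ATTAC on the rows and AATTC on the columns, and leaves unfilled cells as `None`. The name `init_matrix` is hypothetical.

```python
def init_matrix(row_seq: str, col_seq: str, gap: int = -1):
    """Create the scoring matrix with only the first row and column filled."""
    n, m = len(row_seq), len(col_seq)
    D = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i * gap  # i characters of row_seq aligned against gaps
    for j in range(m + 1):
        D[0][j] = j * gap  # j characters of col_seq aligned against gaps
    return D


D = init_matrix("ATTAC", "AATTC")
print(D[0])                    # first row
print([row[0] for row in D])   # first column
```

Both the first row and the first column end in -5, the score of aligning a whole five-letter sequence against gaps.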
Diagonal moves make a pair
If match: +1 (A paired with A: 0 + 1 = 1)
If mismatch: -1 (T paired with A: -1 + -1 = -2)
To fill in other cells, we need to find the best move (highest score) from earlier, adjacent cells
Let's figure out this score: the cell that pairs the first A of ATTAC with the first A of AATTC
Option 1: A paired with A, Match (+1): 0 + 1 = 1
Option 2: A aligned with a gap, Gap (-1): -1 + -1 = -2
Option 3: A aligned with a gap, Gap (-1): -1 + -1 = -2
The highest score is 1
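The three options for a cell reduce to taking a maximum. Below is a small sketch of that rule, assuming the slide scoring scheme (match +1, mismatch -1, gap -1). The name `cell_score` and its argument names are illustrative: `diag`, `up`, and `left` are the scores already stored in the three earlier, adjacent cells.

```python
def cell_score(diag: int, up: int, left: int, a: str, b: str,
               match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Best score for a cell, given its three earlier neighbours."""
    pair = match if a == b else mismatch
    return max(
        diag + pair,  # diagonal move: pair a with b
        up + gap,     # gap move from the cell above
        left + gap,   # gap move from the cell to the left
    )


# First cell: A against A, with neighbours 0 (diagonal), -1 and -1
print(cell_score(0, -1, -1, "A", "A"))  # 1
```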
Next cell: T (from ATTAC) against A (from AATTC)
Option 1: T paired with A, Mismatch (-1): -1 + -1 = -2
Option 2: A aligned with a gap, Gap (-1): -2 + -1 = -3
Option 3: T aligned with a gap, Gap (-1): 1 + -1 = 0
The highest score is 0
Repeat until we fill the matrix
D | 0 | 1 (A) | 2 (A) | 3 (T) | 4 (T) | 5 (C) |
---|---|---|---|---|---|---|
0 | 0 | -1 | -2 | -3 | -4 | -5 |
1 (A) | -1 | 1 | 0 | -1 | -2 | -3 |
2 (T) | -2 | 0 | 0 | 1 | 0 | -1 |
3 (T) | -3 | -1 | -1 | 1 | 2 | 1 |
4 (A) | -4 | -2 | 0 | 0 | 1 | 1 |
5 (C) | -5 | -3 | -1 | -1 | 0 | 2 |
The last number represents the best possible alignment score
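Filling the whole matrix is the same cell rule applied row by row. The sketch below is one straightforward way to write it, assuming the slide scoring scheme (match +1, mismatch -1, gap -1) with ATTAC on the rows and AATTC on the columns. The name `nw_matrix` is hypothetical.

```python
def nw_matrix(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Fill the Needleman-Wunsch scoring matrix for sequences a (rows) and b (columns)."""
    n, m = len(a), len(b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * gap
    for j in range(1, m + 1):
        D[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            D[i][j] = max(D[i - 1][j - 1] + pair,  # diagonal: pair the two characters
                          D[i - 1][j] + gap,       # gap move from above
                          D[i][j - 1] + gap)       # gap move from the left
    return D


D = nw_matrix("ATTAC", "AATTC")
for row in D:
    print(row)
print("Best score:", D[-1][-1])  # 2
```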
We get the alignment by tracing back our moves to (0, 0) from our best score
Starting from the bottom right, what is the last move we made to get this score?
C (from AATTC) aligned with a gap: 0 + -1 != 2
C (from ATTAC) aligned with a gap: 1 + -1 != 2
C paired with C: 1 + 1 = 2
This is the last part of our alignment
C
C
Repeat for the next one
T (from AATTC) aligned with a gap: 0 + -1 != 1
A (from ATTAC) aligned with a gap: 2 + -1 = 1
A paired with T: 1 + -1 != 1
This is the second to last part of our alignment
AC
-C
A-TTAC
AATT-C
Alignment score: 2
-ATTAC
AATT-C
Alignment score: 2
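The traceback can be sketched in code as well. The version below refills the matrix and then walks back from the last cell to (0, 0), asking at each step which move reproduces the stored score. It assumes the slide scoring scheme (match +1, mismatch -1, gap -1). The tie-breaking order (diagonal first, then the gap move from above, then from the left) is an arbitrary choice for this sketch, so it reports only one of the equally good alignments. The name `needleman_wunsch` is illustrative.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    n, m = len(a), len(b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * gap
    for j in range(1, m + 1):
        D[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            D[i][j] = max(D[i - 1][j - 1] + pair, D[i - 1][j] + gap, D[i][j - 1] + gap)

    # Trace back from the last cell, collecting columns in reverse order.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            pair = match if a[i - 1] == b[j - 1] else mismatch
            if D[i][j] == D[i - 1][j - 1] + pair:  # diagonal move made this score
                top.append(a[i - 1])
                bottom.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and D[i][j] == D[i - 1][j] + gap:  # character of a against a gap
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:                                       # character of b against a gap
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return D[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = needleman_wunsch("ATTAC", "AATTC")
print(top)
print(bottom)
print("Alignment score:", score)
```

With this tie-breaking order the sketch returns -ATTAC over AATT-C. Preferring gap moves at the final step would give A-TTAC over AATT-C instead, with the same score.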
Advantages
Limitations
Focuses on finding regions of high similarity within sequences
Key characteristics:
We have a few algorithm changes
Zero is the lowest score (i.e., if negative, make it zero)
Scoring scheme: match +1, mismatch -1, gap -1
D | 0 | 1 (A) | 2 (A) | 3 (T) | 4 (T) | 5 (C) |
---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 (A) | 0 | 1 | 1 | 0 | 0 | 0 |
2 (T) | 0 | 0 | 0 | 2 | 1 | 0 |
3 (T) | 0 | 0 | 0 | 1 | 3 | 2 |
4 (A) | 0 | 1 | 1 | 0 | 2 | 2 |
5 (C) | 0 | 0 | 0 | 0 | 1 | 3 |
Start alignment at highest cell
Stop aligning when you encounter a zero
ATT
ATT
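The local-alignment changes (scores floored at zero, traceback from the highest cell, stop at zero) can be sketched as follows. This assumes the same scoring scheme as the global example (match +1, mismatch -1, gap -1). When several cells share the highest score, this sketch keeps the first one found in row order, which is an arbitrary choice. The name `smith_waterman` is illustrative.

```python
def smith_waterman(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Return (score, aligned_a, aligned_b) for one best local alignment."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]  # first row and column stay at zero
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            # Zero is the lowest score: negative values are replaced by zero.
            H[i][j] = max(0, H[i - 1][j - 1] + pair, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:  # strict comparison keeps the first highest cell
                best, best_pos = H[i][j], (i, j)

    # Start at the highest cell and stop when a zero is reached.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        pair = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + pair:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("ATTAC", "AATTC"))  # (3, 'ATT', 'ATT')
```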
Matrix initialization:
Traceback:
Scoring system:
Can identify functional regions
Key characteristics:
Definition of MSA: Arranges three or more biological sequences (DNA, RNA, or protein) to identify regions of similarity
Aims to infer structural, functional, or evolutionary relationships among the sequences
Today: Lecture 06: Sequence alignment
Tuesday: Lecture 07: Transcriptomics