aalexmmaldonado
Computational Biology
(BIOSC 1540)
Feb 6, 2025
Lecture 05B
Sequence Alignment
Methodology
Assignments
Quizzes
CBytes
ATP until the next reward: 1,653
Sequence alignment finds the best way to arrange two sequences to maximize similarity.
The challenge: Finding the best alignment among exponentially many possibilities.
Query 1 ATGACTTTATCCATTCTAGTTGCACATGACTTGCAACGAGTAATTGGTTTTGAAAATCAA 60
Sbjct 2555705 .....A.....A..AA.T..C..T..C..TAAA...A....C.....G.ACC........ 2555646
Query 61 TTACCTTGGCATCTACCAAATGATTTGAAGCATGTTAAAAAATTATCAACTGGTCATACT 120
Sbjct 2555645 ...........CT.............A......A.....C..C.GA.C.....GA....A 2555586
Query 121 TTAGTAATGGGTCGTAAGACATTTGAATCGATTGGTAAACCACTACCGAATCGTCGAAAT 180
Sbjct 2555585 C.T.......CA..G..A..T...A.T..T..A..G..G...T.G..A...A.A..T..C 2555526
Query 181 GTTGTACTTACTTC---AGATACAAGTTTCAACGTAGAGGGCGTTGATGTAATTCATTCT 237
Sbjct 2555525 ..C.....C...AACCA..C.T..--.....C.A.GA....-..A.....T..AA.C... 2555469
Query 238 ATTGAAGATATTTATCAACTACCGGGCCATGTTTTTATATTTGGAGGGCAAACATTATTT 297
Sbjct 2555468 C....T..A...A.AG.GT..T.T..T....................A.....G....AC 2555409
Query 298 GAAGAAATGATTGATAAAGTGGACGACATGTATATTACTGTTATTGAAGGTAAATTTCGT 357
Sbjct 2555408 ....C.........CC.G..A..T..T........C..A..A..A..T..A..G....AA 2555349
Query 358 GGTGATACGTTCTTTCCACCTTATACATTTGAAGACTGGGAAGTTGCCTCTTCAGTTGAA 417
Sbjct 2555348 ..A..C..A...........A..C.....C...A..........C.AA........A... 2555289
Query 418 GGTAAACTAGATGAGAAAAATACAATTCCACATACCTTTCTACATTTAATTCGTAAAAAA 477
Sbjct 2555288 ...C..........A........T..A..G.....A..CT........G.G....G.... 2555229
The number of possible alignments is exponential for two sequences of length m and n
Suppose we want to identify the optimal alignment for these two sequences
ATGTC
ATGC
Some possible alignments:
ATGTC
ATGC-

ATGTC
AT-GC

ATGTC
-ATGC

A-TGTC
ATGC--

ATGTC
A-TGC
We could take the brute force approach of trying every single possible alignment and computing the score
Two sequences of length 100 have
possible alignments
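To get a feel for how fast the number of alignments grows, here is a minimal sketch that counts them under one assumed convention: every column is either a pair, a gap in the first sequence, or a gap in the second. The function name `count_alignments` is hypothetical, and other counting conventions give different (but still rapidly growing) numbers.

```python
def count_alignments(m: int, n: int) -> int:
    """Count gapped alignments of two sequences of lengths m and n.

    Assumed convention: each column is a pair, a gap in the first sequence,
    or a gap in the second sequence (gap-only columns are not allowed).
    """
    # table[i][j] = number of alignments of the first i and first j characters.
    # Aligning anything against an empty prefix can only be done one way.
    table = [[1] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # The last column is a gap in one sequence, a gap in the other, or a pair.
            table[i][j] = table[i - 1][j] + table[i][j - 1] + table[i - 1][j - 1]
    return table[m][n]


for length in (1, 2, 3, 10, 100):
    print(length, count_alignments(length, length))
```

Even at length 10 the count is already in the millions, which is why brute force is not practical.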
For our previous sequences, this is the optimal alignment
ATGTC
ATG-C
How can we know it's optimal if I don't try every possible combination?
ATGTC
-ATGC
Why should we even compute this alignment score? We know it would be very low.
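As a rough illustration of what "computing the score" means, the sketch below scores two of the candidate alignments column by column. It assumes the scoring scheme used in the step-by-step example (match +1, mismatch -1, gap -2); the function name `alignment_score` is made up for this example.

```python
def alignment_score(top: str, bottom: str, match: int = 1,
                    mismatch: int = -1, gap: int = -2) -> int:
    """Sum the column scores of an already-gapped alignment."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score += gap        # one side is a gap
        elif a == b:
            score += match      # identical characters
        else:
            score += mismatch   # different characters
    return score


print(alignment_score("ATGTC", "ATG-C"))  # 2
print(alignment_score("ATGTC", "-ATGC"))  # -4
```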
Instead of trying all alignments, dynamic programming (DP) builds an optimal alignment step by step from the start.
We can assert that the optimal final solution is the one where the first step (match of A) is optimal, then the second step (match of T) is optimal, etc.
Guarantees the best alignment without exhaustive searching.
Alignments are built incrementally from smaller subproblems
Sequences:
ATGTC
ATGC
Scoring scheme: Match: +1, Mismatch: -1, Gap: -2
0. Start with an "empty" alignment and define scoring scheme
1. Which move would give me the highest score for the first position? Insert a gap, or make an alignment match? (An alignment match would be a sequence match or mismatch.)
Let's check the first characters: A and A
This alignment match would be a sequence match; thus, our optimal alignment should start with this
2. Which move would give me the highest score for the second position? Insert a gap, or make an alignment match?
Current optimal alignment (score: 1):
A
A
Sequences:
TGTC
TGC
Let's check the characters: T and T
Scoring scheme: Match: +1, Mismatch: -1, Gap: -2
Alignment match would be best
Next optimal alignment:
AT
AT
3. Which move would give me the highest score for the third position? Insert a gap, or make an alignment match?
Current optimal alignment (score: 2):
AT
AT
Sequences:
GTC
GC
Let's check the characters: G and G
Scoring scheme: Match: +1, Mismatch: -1, Gap: -2
Alignment match would be best
Next optimal alignment:
ATG
ATG
4. Which move would give me the highest score for the fourth position? Insert a gap, or make an alignment match?
Current optimal alignment (score: 3):
ATG
ATG
Sequences:
TC
C
Let's check the characters: T and C
Scoring scheme: Match: +1, Mismatch: -1, Gap: -2
At first glance, it would seem that a sequence mismatch would be best because it would only decrease our score by one instead of two
Next optimal alignment?
ATGT
ATGC
However, we would need to consider what happens later
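To see why the locally cheaper mismatch is not the better choice here, the following sketch compares both moves while accounting for the best score still achievable on the remaining characters. It assumes the same scheme (match +1, mismatch -1, gap -2); `best_score` is a hypothetical helper written for this illustration, not part of the lecture.

```python
from functools import lru_cache

MATCH, MISMATCH, GAP = 1, -1, -2


@lru_cache(maxsize=None)
def best_score(x: str, y: str) -> int:
    """Best achievable score for aligning the remaining sequences x and y."""
    if not x or not y:
        # Only gaps can be used once one sequence is exhausted.
        return GAP * (len(x) + len(y))
    pair = (MATCH if x[0] == y[0] else MISMATCH) + best_score(x[1:], y[1:])
    gap_in_y = GAP + best_score(x[1:], y)   # x[0] is paired with a gap
    gap_in_x = GAP + best_score(x, y[1:])   # y[0] is paired with a gap
    return max(pair, gap_in_y, gap_in_x)


score_so_far = 3  # ATG aligned with ATG

# Option A: pair T with C now (mismatch), leaving "C" against nothing.
mismatch_total = score_so_far + MISMATCH + best_score("C", "")
# Option B: pair T with a gap now, leaving "C" against "C".
gap_total = score_so_far + GAP + best_score("C", "C")

print(mismatch_total)  # 0
print(gap_total)       # 2
```

The gap costs more at this position but leaves a match for the final column, so it wins overall.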
Scoring matrix
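| D | 0 | 1 (A) | 2 (A) | 3 (T) | 4 (T) | 5 (C) |
|---|---|---|---|---|---|---|
| 0 | 0 | -1 | -2 | -3 | -4 | -5 |
| 1 (A) | -1 | 1 | 0 | -1 | -2 | -3 |
| 2 (T) | -2 | 0 | 0 | 1 | 0 | -1 |
| 3 (T) | -3 | -1 | -1 | 1 | 2 | 1 |
| 4 (A) | -4 | -2 | 0 | 0 | 1 | 1 |
| 5 (C) | -5 | -3 | -1 | -1 | 0 | 2 |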
Each cell (i, j) depends on:
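One common way to write this dependency is the recurrence below. The symbols are assumptions introduced here for illustration: $D_{i,j}$ is the matrix entry, $x_i$ and $y_j$ are the characters labelling row $i$ and column $j$, $s(x_i, y_j)$ is the match or mismatch score, and $g$ is the (negative) gap penalty.

```latex
D_{0,0} = 0, \qquad D_{i,0} = i \, g, \qquad D_{0,j} = j \, g

D_{i,j} = \max
\begin{cases}
D_{i-1,j-1} + s(x_i, y_j) & \text{(diagonal: pair } x_i \text{ with } y_j\text{)} \\
D_{i-1,j} + g             & \text{(down: pair } x_i \text{ with a gap)} \\
D_{i,j-1} + g             & \text{(right: pair } y_j \text{ with a gap)}
\end{cases}
```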
Let's align two sequences:
ATTAC
AATTC
First, enter zero in our first coordinate (0, 0)
We need to fill in each cell by moving from other cells starting from (0, 0)
Each move "uses" a nucleotide from a row, column, or both
Moving right or down uses a gap and you add the penalty to the previous score
Scoring scheme: Match: +1, Mismatch: -1, Gap: -1
[Figure: scoring matrix D with (0, 0) = 0 and a few example gap moves filled in (-1, -2, -3, -4), each shown next to the partial alignment it produces. (Disclaimer: these values are not correct for the final matrix.)]
| D | 0 | 1 (A) | 2 (A) | 3 (T) | 4 (T) | 5 (C) |
|---|---|---|---|---|---|---|
| 0 | 0 | -1 | -2 | -3 | -4 | -5 |
| 1 (A) | -1 | | | | | |
| 2 (T) | -2 | | | | | |
| 3 (T) | -3 | | | | | |
| 4 (A) | -4 | | | | | |
| 5 (C) | -5 | | | | | |

The last cell in our scoring matrix represents our final score of this alignment
-----
AATTC
Alignment score: -5
ATTAC
-----
Alignment score: -5
Diagonal moves make a pair
If match: +1
If mismatch: -1
[Figure: scoring matrix D with the first row and column filled in; a diagonal move pairing A with A gives 1, and a diagonal move pairing T with A gives -2.]
To fill in other cells, we need to find the best move (highest score) from earlier, adjacent cells
Let's figure out this score (the cell where the first A of each sequence meets)
Option 1: A paired with A, match (+1): 0 + 1 = 1
Option 2: A paired with a gap, gap (-1): -1 + -1 = -2
Option 3: a gap paired with A, gap (-1): -1 + -1 = -2
Best score: 1
Next cell:
Option 1: T paired with A, mismatch (-1): -1 + -1 = -2
Option 2: A paired with a gap, gap (-1): -2 + -1 = -3
Option 3: a gap paired with T, gap (-1): 1 + -1 = 0
Best score: 0
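The three options for a single cell can be sketched as a small function. This is an illustrative helper (the name `cell_options` and its argument names are assumptions), using match +1, mismatch -1, gap -1 as in the worked options above.

```python
def cell_options(diagonal: int, above: int, left: int, a: str, b: str,
                 match: int = 1, mismatch: int = -1, gap: int = -1) -> dict:
    """Return the candidate scores for one cell from its three neighbors."""
    pair = match if a == b else mismatch
    return {
        "pair": diagonal + pair,      # diagonal move: a is paired with b
        "gap_from_above": above + gap,  # vertical move: a gap is inserted
        "gap_from_left": left + gap,    # horizontal move: a gap is inserted
    }


# First worked cell: neighbors are 0 (diagonal), -1 and -1; characters A and A.
first = cell_options(0, -1, -1, "A", "A")
print(first, max(first.values()))

# Second worked cell: neighbors are -1 (diagonal), 1 and -2; characters T and A.
second = cell_options(-1, 1, -2, "T", "A")
print(second, max(second.values()))
```

The cell keeps only the maximum of the three candidates: 1 for the first cell and 0 for the second.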
Repeat until we fill the matrix
| D | 0 | 1 (A) | 2 (A) | 3 (T) | 4 (T) | 5 (C) |
|---|---|---|---|---|---|---|
| 0 | 0 | -1 | -2 | -3 | -4 | -5 |
| 1 (A) | -1 | 1 | 0 | -1 | -2 | -3 |
| 2 (T) | -2 | 0 | 0 | 1 | 0 | -1 |
| 3 (T) | -3 | -1 | -1 | 1 | 2 | 1 |
| 4 (A) | -4 | -2 | 0 | 0 | 1 | 1 |
| 5 (C) | -5 | -3 | -1 | -1 | 0 | 2 |

The last number represents the best possible alignment score
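Filling the whole matrix can be sketched as two nested loops over the cells. This is a minimal sketch assuming match +1, mismatch -1, and gap -1, with ATTAC labelling the rows and AATTC the columns; the function name `fill_global_matrix` is hypothetical.

```python
def fill_global_matrix(rows: str, cols: str, match: int = 1,
                       mismatch: int = -1, gap: int = -1) -> list:
    """Fill a global-alignment scoring matrix and return it as a list of rows."""
    n, m = len(rows), len(cols)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    # First column and first row: alignments made only of gaps.
    for i in range(1, n + 1):
        D[i][0] = i * gap
    for j in range(1, m + 1):
        D[0][j] = j * gap
    # Every other cell takes the best of its three neighboring moves.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if rows[i - 1] == cols[j - 1] else mismatch
            D[i][j] = max(D[i - 1][j - 1] + pair,  # diagonal
                          D[i - 1][j] + gap,       # from above
                          D[i][j - 1] + gap)       # from the left
    return D


D = fill_global_matrix("ATTAC", "AATTC")
for row in D:
    print(row)
print("Best score:", D[-1][-1])  # Best score: 2
```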
We get the alignment by tracing back our moves to (0, 0) from our best score
Starting from the bottom right, what is the last move we made to get this score?
Moving right (a gap paired with C): 0 + -1 != 2
Moving down (C paired with a gap): 1 + -1 != 2
Moving diagonally (C paired with C): 1 + 1 = 2
This is the last part of our alignment
C
C
Repeat for the next one
This is the second to last part of our alignment
Moving right (a gap paired with T): 0 + -1 != 1
Moving down (A paired with a gap): 2 + -1 = 1
Moving diagonally (A paired with T): 1 + -1 != 1
Alignment so far:
AC
-C
Tracing back to (0, 0) gives two alignments, each with a score of 2:
A-TTAC
AATT-C

-ATTAC
AATT-C
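The traceback can be sketched in code as well. The version below is one possible implementation, assuming match +1, mismatch -1, gap -1; it prefers the diagonal move when several moves explain a cell, so it returns only one of the equally good alignments. The name `global_align` is hypothetical.

```python
def global_align(x: str, y: str, match: int = 1,
                 mismatch: int = -1, gap: int = -1):
    """Fill the scoring matrix, then trace back one optimal global alignment."""
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * gap
    for j in range(1, m + 1):
        D[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if x[i - 1] == y[j - 1] else mismatch
            D[i][j] = max(D[i - 1][j - 1] + pair,
                          D[i - 1][j] + gap,
                          D[i][j - 1] + gap)

    # Walk back from the last cell to (0, 0), asking which move produced each score.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            pair = match if x[i - 1] == y[j - 1] else mismatch
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + pair:
            top.append(x[i - 1]); bottom.append(y[j - 1])  # diagonal: a pair
            i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + gap:
            top.append(x[i - 1]); bottom.append("-")       # gap in y
            i -= 1
        else:
            top.append("-"); bottom.append(y[j - 1])       # gap in x
            j -= 1
    # The alignment was collected back to front, so reverse it.
    return D[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = global_align("ATTAC", "AATTC")
print(score)   # 2
print(top)     # -ATTAC
print(bottom)  # AATT-C
```

Changing the order in which moves are checked would recover the other alignment with the same score.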
Global alignment aligns sequences from start to end
Guarantees finding the optimal global alignment between two sequences
Key characteristics:
Local alignment focuses on finding regions of high similarity within sequences
Key characteristics:
We have a few algorithm changes
Zero is the lowest score (i.e., if negative, make it zero)
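Scoring scheme: Match: +1, Mismatch: -1, Gap: -1
| D | 0 | 1 (A) | 2 (A) | 3 (T) | 4 (T) | 5 (C) |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 (A) | 0 | 1 | 1 | 0 | 0 | 0 |
| 2 (T) | 0 | 0 | 0 | 2 | 1 | 0 |
| 3 (T) | 0 | 0 | 0 | 1 | 3 | 2 |
| 4 (A) | 0 | 1 | 1 | 0 | 2 | 2 |
| 5 (C) | 0 | 0 | 0 | 0 | 1 | 3 |

Start alignment at highest cell
Stop aligning when you encounter a zero
Local alignments starting from the two cells with the highest score (3):
ATT
ATT

ATTAC
ATT-C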
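The two changes for local alignment (scores never drop below zero, and the traceback starts at the highest cell and stops at a zero) can be sketched as follows. This is a minimal illustration assuming match +1, mismatch -1, gap -1; the name `local_align` is hypothetical, and when several cells share the highest score it keeps the first one found.

```python
def local_align(x: str, y: str, match: int = 1,
                mismatch: int = -1, gap: int = -1):
    """Return (best score, aligned x segment, aligned y segment)."""
    n, m = len(x), len(y)
    H = [[0] * (m + 1) for _ in range(n + 1)]  # first row and column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if x[i - 1] == y[j - 1] else mismatch
            # Same three moves as before, but negative scores are reset to zero.
            H[i][j] = max(0,
                          H[i - 1][j - 1] + pair,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest cell until a zero is reached.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        pair = match if x[i - 1] == y[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + pair:
            top.append(x[i - 1]); bottom.append(y[j - 1])
            i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(x[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(y[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = local_align("ATTAC", "AATTC")
print(score)   # 3
print(top)     # ATT
print(bottom)  # ATT
```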
Matrix initialization:
Traceback:
Scoring system:
Today: Lecture 05B: Sequence alignment - Methodology
Tuesday: Lecture 06A: Read Mapping - Foundations