

Computational Biology
(BIOSC 1540)
Feb 6, 2025
Lecture 05B
Sequence Alignment
Methodology
Announcements
Assignments
Quizzes
CBytes
ATP until the next reward: 1,653
After today, you should have a better understanding of
Dynamic programming basics in bioinformatics
Biological data analysis often requires comparing sequences, which can be computationally expensive
- Biological sequences (DNA, RNA, proteins) are long.
- Pairwise comparisons scale exponentially—naïve approaches take too long.
- We need efficient algorithms to compare sequences optimally.
Sequence alignment finds the best way to arrange two sequences to maximize similarity.
The challenge: Finding the best alignment among exponentially many possibilities.
Query 1 ATGACTTTATCCATTCTAGTTGCACATGACTTGCAACGAGTAATTGGTTTTGAAAATCAA 60
Sbjct 2555705 .....A.....A..AA.T..C..T..C..TAAA...A....C.....G.ACC........ 2555646
Query 61 TTACCTTGGCATCTACCAAATGATTTGAAGCATGTTAAAAAATTATCAACTGGTCATACT 120
Sbjct 2555645 ...........CT.............A......A.....C..C.GA.C.....GA....A 2555586
Query 121 TTAGTAATGGGTCGTAAGACATTTGAATCGATTGGTAAACCACTACCGAATCGTCGAAAT 180
Sbjct 2555585 C.T.......CA..G..A..T...A.T..T..A..G..G...T.G..A...A.A..T..C 2555526
Query 181 GTTGTACTTACTTC---AGATACAAGTTTCAACGTAGAGGGCGTTGATGTAATTCATTCT 237
Sbjct 2555525 ..C.....C...AACCA..C.T..--.....C.A.GA....-..A.....T..AA.C... 2555469
Query 238 ATTGAAGATATTTATCAACTACCGGGCCATGTTTTTATATTTGGAGGGCAAACATTATTT 297
Sbjct 2555468 C....T..A...A.AG.GT..T.T..T....................A.....G....AC 2555409
Query 298 GAAGAAATGATTGATAAAGTGGACGACATGTATATTACTGTTATTGAAGGTAAATTTCGT 357
Sbjct 2555408 ....C.........CC.G..A..T..T........C..A..A..A..T..A..G....AA 2555349
Query 358 GGTGATACGTTCTTTCCACCTTATACATTTGAAGACTGGGAAGTTGCCTCTTCAGTTGAA 417
Sbjct 2555348 ..A..C..A...........A..C.....C...A..........C.AA........A... 2555289
Query 418 GGTAAACTAGATGAGAAAAATACAATTCCACATACCTTTCTACATTTAATTCGTAAAAAA 477
Sbjct 2555288 ...C..........A........T..A..G.....A..CT........G.G....G.... 2555229Comparing all possible sequence alignments grows exponentially with sequence length
The number of possible alignments is exponential for two sequences of length m and n
Suppose we want to identify the optimal alignment for these two sequences
ATGTC
ATGC-
ATGTC
AT-GC
ATGTC
-ATGC
A-TGTC
ATGC--
ATGTC
A-TGCATGTCATGCWe could take the brute force approach of trying every single possible alignment and computing the score
Two seqeunces of length 100 have
possible alignments
Dynamic programming (DP) finds the best alignment by breaking it into smaller subproblems
ATGTC
ATG-CFor our previous sequences, this is the optimal alignment
ATGTC
-ATGC
How can we know it's optimal if I don't try every possible combination?
Why should we even compute this alignment score? We know it would be very low.
Instead of trying all alignments, DP builds an optimal alignment step by step from the start.
We can assert that the optimal final solution is the one where the first step (match of A) is optimal, then the second step (match of T) is optimal, etc.
Guarantees the best alignment without exhaustive searching.
Dynamic programming constructs an optimal alignment by systematically building the alignment
Alignments are built incrementally from smaller subproblems
Insert a gap?
ATGC
ATGTC
0. Start with an "empty" alignment and define scoring scheme
1. Which move would give me the highest score for the first position?
or
Alignment match
(This would be a sequence match or mismatch.)
Let's check the first characters:
A
Match: +1
Mismatch: -1
Gap: -2
A
and
This alignment match would be a sequence match; thus, our optimal alignment should start with this
Insert a gap?
A
A
2. Which move would give me the highest score for the second position?
or
Alignment match
Let's check the characters:
T
T
and
Sequences:
TGTC
TGC
Current optimal alignment:
A
A
Match: +1
Mismatch: -1
Gap: -2
Alignment match would be best
Next optimal alignment:
Score:
1
A
A
T
T
Insert a gap?
AT
AT
3. Which move would give me the highest score for the third position?
or
Alignment match
Let's check the characters:
G
G
and
Sequences:
GTC
GC
Current optimal alignment:
Match: +1
Mismatch: -1
Gap: -2
Alignment match would be best
Next optimal alignment:
A
A
T
T
Score:
2
A
A
T
T
G
G
Insert a gap?
ATG
ATG
4. Which move would give me the highest score for the fourth position?
or
Alignment match
Let's check the characters:
T
C
and
Sequences:
TC
C
Current optimal alignment:
Match: +1
Mismatch: -1
Gap: -2
At first glance, it would seem that a sequence mismatch would be best because it would only decrease our score by one instead of two
A
A
T
T
G
G
Next optimal alignment?
Score:
3
However, we would need to consider what happens later
A
A
T
T
G
G
T
C
After today, you should have a better understanding of
Dynamic programming basics in bioinformatics
Scoring matrix
Dynamic programming constructs an optimal alignment by systematically filling a scoring matrix
- The matrix ensures that each alignment decision is optimal up to that point.
- The value in each cell reflects the best possible score for aligning the prefixes of two sequences up to that point.
- Computed using previous scores from adjacent cells (left, top, diagonal).
- The final score in the matrix represents the best possible alignment score for the entire sequences.
| D | ||||||
|---|---|---|---|---|---|---|
1
0
2
3
4
5
0
1
2
3
4
5
-1
-2
A
T
T
A
C
A
A
T
T
C
0
-3
-4
-5
-1
-2
-3
-4
-5
1
0
-1
-2
-3
0
0
1
0
-1
-1
-1
1
2
1
-2
0
0
1
1
-3
-1
-1
0
2
Each cell in the scoring matrix represents the best alignment score up to that position
- The value in each cell reflects the best possible score for aligning the prefixes of two sequences up to that point.
- Computed using previous scores from adjacent cells (left, top, diagonal).
- The final score in the matrix represents the best possible alignment score for the entire sequences.
The value of each cell is determined by considering three possible alignment decisions
-
Each cell (i, j) depends on:
-
Diagonal (Match/Mismatch) → Extending the alignment by pairing two bases.
- Score = previous diagonal cell + match/mismatch score.
-
Left (Insertion in sequence A) → Introducing a gap in sequence A.
- Score = previous left cell + gap penalty.
-
Top (Insertion in sequence B) → Introducing a gap in sequence B.
- Score = previous top cell + gap penalty.
Dynamic programming ensures an optimal alignment by always building on previous optimal solutions
- The best sub-alignment at each step is based on previous decisions.
- This follows the principle of optimal substructure (i.e., optimal solutions are built from smaller optimal subproblems).
- Unlike brute force, we never re-calculate the same alignment twice—each result is stored and reused.
The optimal alignment is found by tracing back from the highest-scoring cell
- Start from the bottom-right (global alignment) or highest score (local alignment).
- Move diagonal, left, or up based on the path that contributed to the best score.
- Continue until reaching the top-left corner (global) or first zero (local).
After today, you should have a better understanding of
Needleman‐Wunsch algorithm (global alignment)
Needleman-Wunsch
Let's align two sequences:
ATTAC
AATTC
| D | ||||||
|---|---|---|---|---|---|---|
First, enter zero in our first coordinate (0, 0)
We need to fill in each cell by moving from other cells starting from (0, 0)
Each move "uses" a nucleotide from a row, column, or both
1
0
2
3
4
5
0
1
2
3
4
5
-1
Alignment
-2
-3
A
T
T
A
C
A
A
T
T
C
0
A
-
T
-
-
A
T
-
-4
(Disclaimer: these values are not correct for the final matrix.)
Moving right or down uses a gap and you add the penalty to previous score
Scoring scheme
- Match: +1
- Mismatch: -1
- Gap: -1
Needleman-Wunsch
Let's align two sequences:
ATTAC
AATTC
| D | ||||||
|---|---|---|---|---|---|---|
Scoring scheme
- Match: +1
- Mismatch: -1
- Gap: -1
1
0
2
3
4
5
0
1
2
3
4
5
-1
-2
A
T
T
A
C
A
A
T
T
C
0
The last cell in our scoring matrix represents our final score of this alignment
-3
-4
-5
-1
-2
-3
-4
-5
-
A
-
A
-
T
-
T
-
C
Alignment score: -5
A
-
T
-
T
-
A
-
C
-
Alignment score: -5
Needleman-Wunsch
Let's align two sequences:
ATTAC
AATTC
| D | ||||||
|---|---|---|---|---|---|---|
Scoring scheme
- Match: +1
- Mismatch: -1
- Gap: -1
1
0
2
3
4
5
0
1
2
3
4
5
-1
-2
A
T
T
A
C
A
A
T
T
C
0
Diagonal moves make a pair
-3
-4
-5
-1
-2
-3
-4
-5
1
-2
A
A
If match: +1
If mismatch: -1
T
A
Needleman-Wunsch
Let's align two sequences:
ATTAC
AATTC
| D | ||||||
|---|---|---|---|---|---|---|
1
0
2
3
4
5
0
1
2
3
4
5
-1
-2
A
T
T
A
C
A
A
T
T
C
0
To fill in other cells, we need to find the best move (highest score) from
-3
-4
-5
-1
-2
-3
-4
-5
earlier, adjacent cells
Let's figure out
this score
Option 1
A
A
Match (+1)
0 + 1 = 1
Option 2
A
Gap (-1)
-1 + -1 = -2
Option 3
A
Gap (-1)
-1 + -1 = -2
-
-
Scoring scheme
- Match: +1
- Mismatch: -1
- Gap: -1
1
Needleman-Wunsch
Let's align two sequences:
ATTAC
AATTC
| D | ||||||
|---|---|---|---|---|---|---|
1
0
2
3
4
5
0
1
2
3
4
5
-1
-2
A
T
T
A
C
A
A
T
T
C
0
-3
-4
-5
-1
-2
-3
-4
-5
1
0
Scoring scheme
- Match: +1
- Mismatch: -1
- Gap: -1
Option 1
T
A
Mismatch (-1)
-1 + -1 = -2
Option 2
A
Gap (-1)
-2 + -1 = -3
-
Option 3
Gap (-1)
1 + -1 = 0
-
T
Needleman-Wunsch
Let's align two sequences:
ATTAC
AATTC
| D | ||||||
|---|---|---|---|---|---|---|
1
0
2
3
4
5
0
1
2
3
4
5
-1
-2
A
T
T
A
C
A
A
T
T
C
0
-3
-4
-5
-1
-2
-3
-4
-5
1
0
-1
-2
-3
0
0
1
0
-1
-1
-1
1
2
1
-2
0
0
1
1
-3
-1
-1
0
2
Repeat until we fill the matrix
The last number represents the best possible alignment score
Scoring scheme
- Match: +1
- Mismatch: -1
- Gap: -1
Needleman-Wunsch
We get the alignment by tracing back our moves to (0, 0) from our best score
Starting from the bottom left, what is the last move we made to get this score?
This is the last part of our alignment
| D | ||||||
|---|---|---|---|---|---|---|
1
0
2
3
4
5
0
1
2
3
4
5
-1
-2
A
T
T
A
C
A
A
T
T
C
0
-3
-4
-5
-1
-2
-3
-4
-5
1
0
-1
-2
-3
0
0
1
0
-1
-1
-1
1
2
1
-2
0
0
1
1
-3
-1
-1
0
2
-
C
0 + -1 != 2
C
-
1 + -1 != 2
C
C
1 + 1 = 2
Needleman-Wunsch
Repeat for the next one
This is the second to last part of our alignment
| D | ||||||
|---|---|---|---|---|---|---|
1
0
2
3
4
5
0
1
2
3
4
5
-1
-2
A
T
T
A
C
A
A
T
T
C
0
-3
-4
-5
-1
-2
-3
-4
-5
1
0
-1
-2
-3
0
0
1
0
-1
-1
-1
1
2
1
-2
0
0
1
1
-3
-1
-1
0
2
-
T
0 + -1 != 1
C
C
A
-
2 + -1 = 1
C
C
A
T
1 + -1 != 1
C
C
There can be multiple optimal alingments
| D | ||||||
|---|---|---|---|---|---|---|
1
0
2
3
4
5
0
1
2
3
4
5
-1
-2
A
T
T
A
C
A
A
T
T
C
0
-3
-4
-5
-1
-2
-3
-4
-5
1
0
-1
-2
-3
0
0
1
0
-1
-1
-1
1
2
1
-2
0
0
1
1
-3
-1
-1
0
2
A
A
-
A
T
T
T
T
A
-
C
C
-
A
A
A
T
T
T
T
A
-
C
C
Global alignment compares sequences in their entirety
Global alignment aligns sequences from start to end
Guarantees finding the optimal global alignment between two sequences
Key characteristics:
- Attempts to align every residue in both sequences
- Introduces gaps as necessary to maintain end-to-end alignment
- Optimizes the overall alignment score for the entire sequences
After today, you should have a better understanding of
Smith‐Waterman algorithm (local alignment)
Local alignment identifies best matching subsequences
Focuses on finding regions of high similarity within sequences
- Does not require aligning entire sequences end-to-end
- Allows for identification of conserved regions or domains
Key characteristics:
- Aligns subsections of sequences
- Ignores poorly matching regions
- Can find multiple areas of similarity in a single comparison
Smith-Waterman
We have a few algorithm changes
Zero is the lowest score (i.e., if negative, make it zero)
Scoring scheme
- Match: +1
- Mismatch: -1
- Gap: -1
| D | ||||||
|---|---|---|---|---|---|---|
1
0
2
3
4
5
0
1
2
3
4
5
0
0
A
T
T
A
C
A
A
T
T
C
0
0
0
0
0
0
0
0
0
1
1
0
0
0
0
2
1
0
0
0
1
3
2
0
1
0
2
2
1
0
0
1
3
0
Start alignment at highest cell
Stop aligning when you encounter a zero
A
T
T
A
T
T
A
T
T
A
T
T
A
C
-
C
Smith-Waterman differs from Needleman-Wunsch in key aspects
Matrix initialization:
- Needleman-Wunsch: The first row and column are filled with gap penalties
- Smith-Waterman: First row and column filled with zeros
Traceback:
- Needleman-Wunsch: Starts from the bottom-right cell
- Smith-Waterman: Starts from each highest-scoring cell in the matrix
Scoring system:
- Needleman-Wunsch: Allows negative scores
- Smith-Waterman: Negative scores are set to zero
After today, you should have a better understanding of
Python
Before the next class, you should
Lecture 06A:
Read Mapping -
Foundations
Lecture 05B:
Sequence alignment -
Methodology
Today
Tuesday
- Start P01D (due Feb 10)
BIOSC 1540: L05B (Alignment)
By aalexmmaldonado
BIOSC 1540: L05B (Alignment)
- 170