Computational Biology

(BIOSC 1540)

Feb 6, 2025

Lecture 05B

Sequence Alignment

Methodology

Announcements

Assignments

  • Assignment P01D is due Monday (Feb 10)
  • Assignment P01E will be released on Saturday (Feb 8)?

Quizzes

CBytes

ATP until the next reward:  1,653

After today, you should have a better understanding of

Dynamic programming basics in bioinformatics

Biological data analysis often requires comparing sequences, which can be computationally expensive

  • Biological sequences (DNA, RNA, proteins) are long.
  • Pairwise comparisons scale exponentially—naïve approaches take too long.
  • We need efficient algorithms to compare sequences optimally.

Sequence alignment finds the best way to arrange two sequences to maximize similarity.

The challenge: Finding the best alignment among exponentially many possibilities.

Query  1        ATGACTTTATCCATTCTAGTTGCACATGACTTGCAACGAGTAATTGGTTTTGAAAATCAA  60
Sbjct  2555705  .....A.....A..AA.T..C..T..C..TAAA...A....C.....G.ACC........  2555646

Query  61       TTACCTTGGCATCTACCAAATGATTTGAAGCATGTTAAAAAATTATCAACTGGTCATACT  120
Sbjct  2555645  ...........CT.............A......A.....C..C.GA.C.....GA....A  2555586

Query  121      TTAGTAATGGGTCGTAAGACATTTGAATCGATTGGTAAACCACTACCGAATCGTCGAAAT  180
Sbjct  2555585  C.T.......CA..G..A..T...A.T..T..A..G..G...T.G..A...A.A..T..C  2555526

Query  181      GTTGTACTTACTTC---AGATACAAGTTTCAACGTAGAGGGCGTTGATGTAATTCATTCT  237
Sbjct  2555525  ..C.....C...AACCA..C.T..--.....C.A.GA....-..A.....T..AA.C...  2555469

Query  238      ATTGAAGATATTTATCAACTACCGGGCCATGTTTTTATATTTGGAGGGCAAACATTATTT  297
Sbjct  2555468  C....T..A...A.AG.GT..T.T..T....................A.....G....AC  2555409

Query  298      GAAGAAATGATTGATAAAGTGGACGACATGTATATTACTGTTATTGAAGGTAAATTTCGT  357
Sbjct  2555408  ....C.........CC.G..A..T..T........C..A..A..A..T..A..G....AA  2555349

Query  358      GGTGATACGTTCTTTCCACCTTATACATTTGAAGACTGGGAAGTTGCCTCTTCAGTTGAA  417
Sbjct  2555348  ..A..C..A...........A..C.....C...A..........C.AA........A...  2555289

Query  418      GGTAAACTAGATGAGAAAAATACAATTCCACATACCTTTCTACATTTAATTCGTAAAAAA  477
Sbjct  2555288  ...C..........A........T..A..G.....A..CT........G.G....G....  2555229

Comparing all possible sequence alignments grows exponentially with sequence length

The number of possible alignments is exponential for two sequences of length m and n

Suppose we want to identify the optimal alignment for these two sequences 

ATGTC
ATGC-
ATGTC
AT-GC
ATGTC
-ATGC
A-TGTC
ATGC--
ATGTC
A-TGC
ATGTC
ATGC
N(m,n) \approx \sum_{d=0}^{\min(m,n)} \frac{(m+n-d)!}{\,d!(m-d)!(n-d)!}

We could take the brute force approach of trying every single possible alignment and computing the score

Two seqeunces of length 100 have 

2.05 \times 10^{75}

possible alignments

Dynamic programming (DP) finds the best alignment by breaking it into smaller subproblems

ATGTC
ATG-C

For our previous sequences, this is the optimal alignment

ATGTC
-ATGC

How can we know it's optimal if I don't try every possible combination?

Why should we even compute this alignment score? We know it would be very low.

Instead of trying all alignments, DP builds an optimal alignment step by step from the start.

We can assert that the optimal final solution is the one where the first step (match of A) is optimal, then the second step (match of T) is optimal, etc.

Guarantees the best alignment without exhaustive searching.

Dynamic programming constructs an optimal alignment by systematically building the alignment

Alignments are built incrementally from smaller subproblems

Insert a gap?

ATGC

ATGTC

0. Start with an "empty" alignment and define scoring scheme

1. Which move would give me the highest score for the first position?

or

Alignment match

(This would be a sequence match or mismatch.)

Let's check the first characters: 

A

Match: +1

Mismatch: -1

Gap: -2

A

and

This alignment match would be a sequence match; thus, our optimal alignment should start with this

Insert a gap?

A

A

2. Which move would give me the highest score for the second position?

or

Alignment match

Let's check the characters: 

T

T

and

Sequences:

TGTC

TGC

Current optimal alignment:

A

A

Match: +1

Mismatch: -1

Gap: -2

Alignment match would be best

Next optimal alignment:

Score: 

1

A

A

T

T

Insert a gap?

AT

AT

3. Which move would give me the highest score for the third position?

or

Alignment match

Let's check the characters: 

G

G

and

Sequences:

GTC

GC

Current optimal alignment:

Match: +1

Mismatch: -1

Gap: -2

Alignment match would be best

Next optimal alignment:

A

A

T

T

Score: 

2

A

A

T

T

G

G

Insert a gap?

ATG

ATG

4. Which move would give me the highest score for the fourth position?

or

Alignment match

Let's check the characters: 

T

C

and

Sequences:

TC

C

Current optimal alignment:

Match: +1

Mismatch: -1

Gap: -2

At first glance, it would seem that a sequence mismatch would be best because it would only decrease our score by one instead of two

A

A

T

T

G

G

Next optimal alignment?

Score: 

3

However, we would need to consider what happens later

A

A

T

T

G

G

T

C

After today, you should have a better understanding of

Dynamic programming basics in bioinformatics

Scoring matrix

Dynamic programming constructs an optimal alignment by systematically filling a scoring matrix

  • The matrix ensures that each alignment decision is optimal up to that point.
  • The value in each cell reflects the best possible score for aligning the prefixes of two sequences up to that point.
  • Computed using previous scores from adjacent cells (left, top, diagonal).
  • The final score in the matrix represents the best possible alignment score for the entire sequences.
D

1

0

2

3

4

5

0

1

2

3

4

5

-1

-2

A

T

T

A

C

A

A

T

T

C

0

-3

-4

-5

-1

-2

-3

-4

-5

1

0

-1

-2

-3

0

0

1

0

-1

-1

-1

1

2

1

-2

0

0

1

1

-3

-1

-1

0

2

Each cell in the scoring matrix represents the best alignment score up to that position

  • The value in each cell reflects the best possible score for aligning the prefixes of two sequences up to that point.
  • Computed using previous scores from adjacent cells (left, top, diagonal).
  • The final score in the matrix represents the best possible alignment score for the entire sequences.

The value of each cell is determined by considering three possible alignment decisions

  • Each cell (i, j) depends on:

  • Diagonal (Match/Mismatch) → Extending the alignment by pairing two bases.
    • Score = previous diagonal cell + match/mismatch score.
  • Left (Insertion in sequence A) → Introducing a gap in sequence A.
    • Score = previous left cell + gap penalty.
  • Top (Insertion in sequence B) → Introducing a gap in sequence B.
    • Score = previous top cell + gap penalty.

Dynamic programming ensures an optimal alignment by always building on previous optimal solutions

  • The best sub-alignment at each step is based on previous decisions.
  • This follows the principle of optimal substructure (i.e., optimal solutions are built from smaller optimal subproblems).
  • Unlike brute force, we never re-calculate the same alignment twice—each result is stored and reused.

The optimal alignment is found by tracing back from the highest-scoring cell

  • Start from the bottom-right (global alignment) or highest score (local alignment).
  • Move diagonal, left, or up based on the path that contributed to the best score.
  • Continue until reaching the top-left corner (global) or first zero (local).

After today, you should have a better understanding of

Needleman‐Wunsch algorithm (global alignment)

Needleman-Wunsch

Let's align two sequences:

ATTAC

AATTC

D

First, enter zero in our first coordinate (0, 0)

We need to fill in each cell by moving from other cells starting from (0, 0)

Each move "uses" a nucleotide from a row, column, or both

1

0

2

3

4

5

0

1

2

3

4

5

-1

Alignment

-2

-3

A

T

T

A

C

A

A

T

T

C

0

A

-

T

-

-

A

T

-

-4

(Disclaimer: these values are not correct for the final matrix.)

Moving right or down uses a gap and you add the penalty to previous score

Scoring scheme

  • Match: +1
  • Mismatch: -1
  • Gap: -1

Needleman-Wunsch

Let's align two sequences:

ATTAC

AATTC

D

Scoring scheme

  • Match: +1
  • Mismatch: -1
  • Gap: -1

1

0

2

3

4

5

0

1

2

3

4

5

-1

-2

A

T

T

A

C

A

A

T

T

C

0

The last cell in our scoring matrix represents our final score of this alignment

-3

-4

-5

-1

-2

-3

-4

-5

-

A

-

A

-

T

-

T

-

C

Alignment score: -5

A

-

T

-

T

-

A

-

C

-

Alignment score: -5

Needleman-Wunsch

Let's align two sequences:

ATTAC

AATTC

D

Scoring scheme

  • Match: +1
  • Mismatch: -1
  • Gap: -1

1

0

2

3

4

5

0

1

2

3

4

5

-1

-2

A

T

T

A

C

A

A

T

T

C

0

Diagonal moves make a pair

-3

-4

-5

-1

-2

-3

-4

-5

1

-2

A

A

If match: +1

If mismatch: -1

T

A

Needleman-Wunsch

Let's align two sequences:

ATTAC

AATTC

D

1

0

2

3

4

5

0

1

2

3

4

5

-1

-2

A

T

T

A

C

A

A

T

T

C

0

To fill in other cells, we need to find the best move (highest score) from

-3

-4

-5

-1

-2

-3

-4

-5

earlier, adjacent cells

Let's figure out

this score

Option 1

A

A

Match (+1)

0 + 1 = 1

Option 2

A

Gap (-1)

-1 + -1 = -2

Option 3

A

Gap (-1)

-1 + -1 = -2

-

-

Scoring scheme

  • Match: +1
  • Mismatch: -1
  • Gap: -1

1

Needleman-Wunsch

Let's align two sequences:

ATTAC

AATTC

D

1

0

2

3

4

5

0

1

2

3

4

5

-1

-2

A

T

T

A

C

A

A

T

T

C

0

-3

-4

-5

-1

-2

-3

-4

-5

1

0

Scoring scheme

  • Match: +1
  • Mismatch: -1
  • Gap: -1

Option 1

T

A

Mismatch (-1)

-1 + -1 = -2

Option 2

A

Gap (-1)

-2 + -1 = -3

-

Option 3

Gap (-1)

1 + -1 = 0

-

T

Needleman-Wunsch

Let's align two sequences:

ATTAC

AATTC

D

1

0

2

3

4

5

0

1

2

3

4

5

-1

-2

A

T

T

A

C

A

A

T

T

C

0

-3

-4

-5

-1

-2

-3

-4

-5

1

0

-1

-2

-3

0

0

1

0

-1

-1

-1

1

2

1

-2

0

0

1

1

-3

-1

-1

0

2

Repeat until we fill the matrix

The last number represents the best possible alignment score

Scoring scheme

  • Match: +1
  • Mismatch: -1
  • Gap: -1

Needleman-Wunsch

We get the alignment by tracing back our moves to (0, 0) from our best score

Starting from the bottom left, what is the last move we made to get this score?

This is the last part of our alignment

D

1

0

2

3

4

5

0

1

2

3

4

5

-1

-2

A

T

T

A

C

A

A

T

T

C

0

-3

-4

-5

-1

-2

-3

-4

-5

1

0

-1

-2

-3

0

0

1

0

-1

-1

-1

1

2

1

-2

0

0

1

1

-3

-1

-1

0

2

-

C

0 + -1 != 2

C

-

1 + -1 != 2

C

C

1 + 1 = 2

Needleman-Wunsch

Repeat for the next one

This is the second to last part of our alignment

D

1

0

2

3

4

5

0

1

2

3

4

5

-1

-2

A

T

T

A

C

A

A

T

T

C

0

-3

-4

-5

-1

-2

-3

-4

-5

1

0

-1

-2

-3

0

0

1

0

-1

-1

-1

1

2

1

-2

0

0

1

1

-3

-1

-1

0

2

-

T

0 + -1 != 1

C

C

A

-

2 + -1 = 1

C

C

A

T

1 + -1 != 1

C

C

There can be multiple optimal alingments

D

1

0

2

3

4

5

0

1

2

3

4

5

-1

-2

A

T

T

A

C

A

A

T

T

C

0

-3

-4

-5

-1

-2

-3

-4

-5

1

0

-1

-2

-3

0

0

1

0

-1

-1

-1

1

2

1

-2

0

0

1

1

-3

-1

-1

0

2

A

A

-

A

T

T

T

T

A

-

C

C

-

A

A

A

T

T

T

T

A

-

C

C

Global alignment compares sequences in their entirety

Global alignment aligns sequences from start to end

Guarantees finding the optimal global alignment between two sequences

Key characteristics:

  1. Attempts to align every residue in both sequences
  2. Introduces gaps as necessary to maintain end-to-end alignment
  3. Optimizes the overall alignment score for the entire sequences

After today, you should have a better understanding of

Smith‐Waterman algorithm (local alignment)

Local alignment identifies best matching subsequences

Focuses on finding regions of high similarity within sequences

  • Does not require aligning entire sequences end-to-end
  • Allows for identification of conserved regions or domains

Key characteristics:

  • Aligns subsections of sequences
  • Ignores poorly matching regions
  • Can find multiple areas of similarity in a single comparison

Smith-Waterman

We have a few algorithm changes

Zero is the lowest score (i.e., if negative, make it zero)

Scoring scheme

  • Match: +1
  • Mismatch: -1
  • Gap: -1
D

1

0

2

3

4

5

0

1

2

3

4

5

0

0

A

T

T

A

C

A

A

T

T

C

0

0

0

0

0

0

0

0

0

1

1

0

0

0

0

2

1

0

0

0

1

3

2

0

1

0

2

2

1

0

0

1

3

0

Start alignment at highest cell

Stop aligning when you encounter a zero

A

T

T

A

T

T

A

T

T

A

T

T

A

C

-

C

Smith-Waterman differs from Needleman-Wunsch in key aspects

Matrix initialization:

  • Needleman-Wunsch: The first row and column are filled with gap penalties
  • Smith-Waterman: First row and column filled with zeros

Traceback:

  • Needleman-Wunsch: Starts from the bottom-right cell
  • Smith-Waterman: Starts from each highest-scoring cell in the matrix

Scoring system:

  • Needleman-Wunsch: Allows negative scores
  • Smith-Waterman: Negative scores are set to zero

After today, you should have a better understanding of

Python

Before the next class, you should

Lecture 06A:

Read Mapping -
Foundations

Lecture 05B:

Sequence alignment -
Methodology

Today

Tuesday

  • Start P01D (due Feb 10)

BIOSC 1540: L05B (Alignment)

By aalexmmaldonado

BIOSC 1540: L05B (Alignment)

  • 170