Computational Biology

(BIOSC 1540)

Sep 12, 2024

Lecture 06:
Sequence alignment

Announcements

  • A02 is due tonight at 11:59 pm
  • A03 will be posted tomorrow
  • My goal is to have all grades done by Sunday!

After today, you should be able to

1.  Define sequence alignment and explain its importance in bioinformatics.
2.  Describe the basic principles of scoring systems in sequence alignment.
3.  Explain the principles and steps of global alignment using the Needleman-Wunsch algorithm.
4.  Describe the concept and procedure of local alignment using the Smith-Waterman algorithm.
5.  Introduce the concept of multiple sequence alignment (MSA), including its importance and challenges.

Biological sequences reveal evolutionary relationships

We are all familiar with the central dogma and how sequences play a large role

HOX genes: A highly conserved gene

Plays a crucial role in embryonic development, particularly in determining the body plan and specifying the anterior-posterior axis

How do we know it's conserved?

By aligning sequences, we can interpret conservation

Infrequent changes (i.e., high similarity) suggest an evolutionarily conserved sequence

Pairwise alignment reveals relationships between biological sequences

Multiple Sequence Alignment (MSA) extends pairwise comparisons

MSA is the process of aligning three or more biological sequences simultaneously

  • Identifies conserved regions across multiple species
  • Reveals patterns not visible in pairwise comparisons

Aligning sequences can provide more insight than just evolution

Aligning sequences can provide more insight than just conservation

  • Functional annotation
  • RNA and protein structure
  • Disease-associated mutations
  • Vaccine design

After today, you should be able to

1.  Define sequence alignment and explain its importance in bioinformatics.
2.  Describe the basic principles of scoring systems in sequence alignment.
3.  Explain the principles and steps of global alignment using the Needleman-Wunsch algorithm.
4.  Describe the concept and procedure of local alignment using the Smith-Waterman algorithm.
5.  Introduce the concept of multiple sequence alignment (MSA), including its importance and challenges.

Alignment scores guide the selection of meaningful alignments

  • Objectivity: Provides a quantitative measure for comparison
  • Optimization: Allows algorithms to find the best alignment
  • Significance: Helps distinguish real homology from random similarity

Importance of scoring in alignment selection

Alignment elements reflect evolutionary events in sequences

Match: Identical characters in aligned positions

  • Represents conserved regions or no change
  • Example score: +1

Gap: Dash (-) inserted to improve alignment

  • Represents insertions or deletions (indels)
  • Example score: -2
ATGCC
|||||
ATGCC
ATGCC
|| ||
ATACC

Mismatch: Different characters in aligned positions

  • Indicates substitutions or mutations
  • Example score: -1
ATGCC
|| ||
AT-CC

Gap penalties significantly impact alignment outcomes

Linear gap penalty: Fixed cost for each gap

  • Example: -2 for each gap, regardless of length

Affine gap penalty: Different costs for opening and extending gaps

  • Example: Gap open = -4, Gap extend = -1
ATGCCCTGGCAT
|||       ||
ATG-------AT

Number of gaps

Score

Gap score

7

-2

-14

x

=

ATGCCCTGGCAT
|||       ||
ATG-------AT

First gap

-4

Additional gaps

6

+

Score

Gap score

-1

-10

=

x

Gap penalty choices reflect biological assumptions

Affine penalties:

  1. Better handling of long indels
  2. More biologically realistic
ATGCCCTGGCAT
|||       ||
ATG-------AT

Implications of gap penalty types

Linear penalties:

  • Simpler to implement
  • May over-penalize long gaps

Biological rationale:

  • Single mutation event often causes multi-base indel
  • Affine penalties better model this biological reality

-14

-10

vs

Advanced scoring methods enhance alignment accuracy

Sophisticated scoring approaches

  1. Position-specific gap penalties:
    • Reduce penalties in variable regions
    • Increase penalties in conserved regions
  2. Residue-specific gap penalties:
    • Adjust penalties based on amino acid properties
  3. Terminal gap penalties:
    • Often reduced to allow end gaps in local alignments

Protein alignments require sophisticated scoring systems

  • Proteins have 20 amino acids (vs. 4 nucleotides in DNA/RNA)
  • Simple match/mismatch scoring is insufficient because:
    1. Some amino acid substitutions are more likely than others
    2. Chemically similar amino acids often substitute without affecting function
    3. Evolutionary relationships between amino acids are complex

Substitution matrices quantify amino acid replacement probabilities

  • The probability that amino acid i mutates into amino acid j for all pairs of amino acids
  • Constructed by assembling a large and diverse sample of verified amino acid alignments
  • Reflect the true probabilities of mutations occurring through a period of evolution
  • Examples: PAM and BLOSUM

After today, you should be able to

1.  Define sequence alignment and explain its importance in bioinformatics.
2.  Describe the basic principles of scoring systems in sequence alignment.

3.  Explain the principles and steps of global alignment using the Needleman-Wunsch algorithm.
4.  Describe the concept and procedure of local alignment using the Smith-Waterman algorithm.
5.  Introduce the concept of multiple sequence alignment (MSA), including its importance and challenges.

Global alignment compares sequences in their entirety

Global alignment aligns sequences from start to end

  • Key characteristics:
    1. Attempts to align every residue in both sequences
    2. Introduces gaps as necessary to maintain end-to-end alignment
    3. Optimizes the overall alignment score for the entire sequences
  • Guarantees finding the optimal global alignment between two sequences
  • Basic principle: Build a matrix of alignment scores, then trace back to find the best alignment

Needleman-Wunsch

Let's align two sequences:

ATTAC

AATTC

D

First, enter zero in our first coordinate (0, 0)

We need to fill in each cell by moving from other cells starting from (0, 0)

Each move "uses" a nucleotide from a row, column, or both

1

0

2

3

4

5

0

1

2

3

4

5

-1

Alignment

-2

-3

A

T

T

A

C

A

A

T

T

C

0

A

-

T

-

-

A

T

-

-4

(Disclaimer: these values are not correct for the final matrix.)

Moving right or down uses a gap and you add the penalty to previous score

Scoring scheme

  • Match: +1
  • Mismatch: -1
  • Gap: -1

Needleman-Wunsch

Let's align two sequences:

ATTAC

AATTC

D

Scoring scheme

  • Match: +1
  • Mismatch: -1
  • Gap: -1

1

0

2

3

4

5

0

1

2

3

4

5

-1

-2

A

T

T

A

C

A

A

T

T

C

0

The last cell in our scoring matrix represents our final score of this alignment

-3

-4

-5

-1

-2

-3

-4

-5

-

A

-

A

-

T

-

T

-

C

Alignment score: -5

A

-

T

-

T

-

A

-

C

-

Alignment score: -5

Needleman-Wunsch

Let's align two sequences:

ATTAC

AATTC

D

Scoring scheme

  • Match: +1
  • Mismatch: -1
  • Gap: -1

1

0

2

3

4

5

0

1

2

3

4

5

-1

-2

A

T

T

A

C

A

A

T

T

C

0

Diagonal moves make a pair

-3

-4

-5

-1

-2

-3

-4

-5

1

-2

A

A

If match: +1

If mismatch: -1

T

A

Needleman-Wunsch

Let's align two sequences:

ATTAC

AATTC

D

1

0

2

3

4

5

0

1

2

3

4

5

-1

-2

A

T

T

A

C

A

A

T

T

C

0

To fill in other cells, we need to find the best move (highest score) from

-3

-4

-5

-1

-2

-3

-4

-5

earlier, adjacent cells

Let's figure out

this score

Option 1

A

A

Match (+1)

0 + 1 = 1

Option 2

A

Gap (-1)

-1 + -1 = -2

Option 3

A

Gap (-1)

-1 + -1 = -2

-

-

Scoring scheme

  • Match: +1
  • Mismatch: -1
  • Gap: -1

1

Needleman-Wunsch

Let's align two sequences:

ATTAC

AATTC

D

1

0

2

3

4

5

0

1

2

3

4

5

-1

-2

A

T

T

A

C

A

A

T

T

C

0

-3

-4

-5

-1

-2

-3

-4

-5

1

0

Scoring scheme

  • Match: +1
  • Mismatch: -1
  • Gap: -1

Option 1

T

A

Mismatch (-1)

-1 + -1 = -2

Option 2

A

Gap (-1)

-2 + -1 = -3

-

Option 3

Gap (-1)

1 + -1 = 0

-

T

Needleman-Wunsch

Let's align two sequences:

ATTAC

AATTC

D

1

0

2

3

4

5

0

1

2

3

4

5

-1

-2

A

T

T

A

C

A

A

T

T

C

0

-3

-4

-5

-1

-2

-3

-4

-5

1

0

-1

-2

-3

0

0

1

0

-1

-1

-1

1

2

1

-2

0

0

1

1

-3

-1

-1

0

2

Repeat until we fill the matrix

The last number represents the best possible alignment score

Scoring scheme

  • Match: +1
  • Mismatch: -1
  • Gap: -1

Needleman-Wunsch

We get the alignment by tracing back our moves to (0, 0) from our best score

Starting from the bottom left, what is the last move we made to get this score?

This is the last part of our alignment

D

1

0

2

3

4

5

0

1

2

3

4

5

-1

-2

A

T

T

A

C

A

A

T

T

C

0

-3

-4

-5

-1

-2

-3

-4

-5

1

0

-1

-2

-3

0

0

1

0

-1

-1

-1

1

2

1

-2

0

0

1

1

-3

-1

-1

0

2

-

C

0 + -1 != 2

C

-

1 + -1 != 2

C

C

1 + 1 = 2

Needleman-Wunsch

Repeat for the next one

This is the second to last part of our alignment

D

1

0

2

3

4

5

0

1

2

3

4

5

-1

-2

A

T

T

A

C

A

A

T

T

C

0

-3

-4

-5

-1

-2

-3

-4

-5

1

0

-1

-2

-3

0

0

1

0

-1

-1

-1

1

2

1

-2

0

0

1

1

-3

-1

-1

0

2

-

T

0 + -1 != 1

C

C

A

-

2 + -1 = 1

C

C

A

T

1 + -1 != 1

C

C

There can be multiple optimal alingments

D

1

0

2

3

4

5

0

1

2

3

4

5

-1

-2

A

T

T

A

C

A

A

T

T

C

0

-3

-4

-5

-1

-2

-3

-4

-5

1

0

-1

-2

-3

0

0

1

0

-1

-1

-1

1

2

1

-2

0

0

1

1

-3

-1

-1

0

2

A

A

-

A

T

T

T

T

A

-

C

C

-

A

A

A

T

T

T

T

A

-

C

C

Global alignment is not always useful

Advantages

  • Provides a complete picture of sequence similarity
  • Ideal for detecting overall conservation patterns
  • Useful for phylogenetic analysis of related sequences

Limitations

  • May force alignment of unrelated regions in divergent sequences
  • Less effective for sequences of very different lengths
  • Can be computationally intensive for long sequences

1.  Define sequence alignment and explain its importance in bioinformatics.
2.  Describe the basic principles of scoring systems in sequence alignment.
3.  Explain the principles and steps of global alignment using the Needleman-Wunsch algorithm.

4.  Describe the concept and procedure of local alignment using the Smith-Waterman algorithm.
5.  Introduce the concept of multiple sequence alignment (MSA), including its importance and challenges.

After today, you should be able to

Local alignment identifies best matching subsequences

Focuses on finding regions of high similarity within sequences

  • Does not require aligning entire sequences end-to-end
  • Allows for identification of conserved regions or domains

Key characteristics:

  • Aligns subsections of sequences
  • Ignores poorly matching regions
  • Can find multiple areas of similarity in a single comparison

Smith-Waterman

We have a few algorithm changes

Zero is the lowest score (i.e., if negative, make it zero)

Scoring scheme

  • Match: +1
  • Mismatch: -1
  • Gap: -1
D

1

0

2

3

4

5

0

1

2

3

4

5

0

0

A

T

T

A

C

A

A

T

T

C

0

0

0

0

0

0

0

0

0

1

1

0

0

0

0

2

1

0

0

0

1

3

2

0

1

0

2

2

1

0

0

1

3

0

Start alignment at highest cell

Stop aligning when you encounter a zero

A

T

T

A

T

T

A

T

T

A

T

T

T

T

T

T

Smith-Waterman differs from Needleman-Wunsch in key aspects

Matrix initialization:

  • Needleman-Wunsch: The first row and column are filled with gap penalties
  • Smith-Waterman: First row and column filled with zeros

Traceback:

  • Needleman-Wunsch: Starts from the bottom-right cell
  • Smith-Waterman: Starts from highest scoring cell in the matrix

Scoring system:

  • Needleman-Wunsch: Allows negative scores
  • Smith-Waterman: Negative scores are set to zero

Protein motif identification exemplifies local alignment utility

Can identify functional regions

  • Protein domains: Functional or structural units within proteins
  • Active sites: Regions directly involved in protein function
  • Binding motifs: Short sequences that interact with other molecules
  • Signal sequences: Regions that direct protein localization
  • Post-translational modification sites: Areas subject to chemical modifications

After today, you should be able to

1.  Define sequence alignment and explain its importance in bioinformatics.
2.  Describe the basic principles of scoring systems in sequence alignment.
3.  Explain the principles and steps of global alignment using the Needleman-Wunsch algorithm.
4.  Describe the concept and procedure of local alignment using the Smith-Waterman algorithm.

5.  Introduce the concept of multiple sequence alignment, including its importance and challenges.

Multiple Sequence Alignment compares three or more sequences simultaneously

Key characteristics:

  • Aligns multiple sequences in a single analysis
  • Introduces gaps to maximize alignment of similar characters
  • Preserves the order of characters in each sequence

Definition of MSA: Arranges three or more biological sequences (DNA, RNA, or protein) to identify regions of similarity

Aims to infer structural, functional, or evolutionary relationships among the sequences

Popular MSA tools include Clustal Omega, MAFFT, and MUSCLE

Before the next class, you should

  • Start A03, which is due next Thursday at 11:59 pm.

Lecture 07:
Transcriptomics

Lecture 06:
Sequence alignment

Today

Tuesday