COMP333

Algorithm Theory and Design

 

 

Daniel Sutantyo

Department of Computing

Macquarie University

Lecture slides adapted from lectures given by Frank Cassez, Mark Dras, and Bernard Mans

Summary

  • Algorithm complexity (running time, recursion tree)
  • Algorithm correctness (induction, loop invariants)
  • Problem solving methods:
    • exhaustive search
    • dynamic programming
    • greedy method
    • divide-and-conquer
    • algorithms involving strings
    • probabilistic method
    • algorithms involving graphs

String Algorithms

  • Longest common subsequence
  • Edit distance

Longest Common Subsequence

definition

  • Given two sequences
            \(X = \{x_1,x_2,\dots,x_m\}\) and \(Y= \{y_1,y_2,\dots,y_n\}\),
    find the longest common subsequence of \(X\) and \(Y\)
    • e.g. \(X = \{ k,i,t,t,e,n \}\)
             \(Y = \{ s,i,t,t,i,n,g \} \)
             the sequence \(\{ i,t,t,n \}\) is a solution
  • What is the brute-force solution?
  • Example:
    • \(X = \{ k,i,t,t,e,n \}\)
    • \(Y = \{ s,i,t,t,i,n,g \} \)
  • Brute force solution:
    • Generate all subsequences of \(X\) 
    • Generate all subsequences of \(Y\)
    • Compare each subsequence of \(X\) with every subsequence of \(Y\), and keep the longest match
    • \(O(2^n)\) comparisons, where \(n\) is the total length of the sequences \(X\) and \(Y\)
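
The brute-force idea can be sketched in Java as follows. This is a slight variant of the steps listed above, not the lecture's own code: it enumerates the subsequences of \(X\) only (one per bitmask) and tests each against \(Y\) directly, which is still exponential. The class and method names are made up for illustration, and the sketch assumes \(X\) has fewer than 31 characters.

```java
final class LcsBruteForce {
    // Returns true if s can be obtained from t by deleting zero or more characters.
    static boolean isSubsequence(String s, String t) {
        int i = 0;
        for (int j = 0; j < t.length() && i < s.length(); j++) {
            if (s.charAt(i) == t.charAt(j)) i++;
        }
        return i == s.length();
    }

    // Tries every subsequence of x (one per bitmask) and keeps the longest
    // one that is also a subsequence of y. Assumes x.length() < 31.
    static String longestCommonSubsequence(String x, String y) {
        String best = "";
        for (int mask = 0; mask < (1 << x.length()); mask++) {
            StringBuilder candidate = new StringBuilder();
            for (int k = 0; k < x.length(); k++) {
                if ((mask & (1 << k)) != 0) candidate.append(x.charAt(k));
            }
            String c = candidate.toString();
            if (c.length() > best.length() && isSubsequence(c, y)) best = c;
        }
        return best;
    }

    public static void main(String[] args) {
        // kitten and sitting share the subsequence i, t, t, n
        System.out.println(longestCommonSubsequence("kitten", "sitting")); // prints ittn
    }
}
```

Even for these short strings the loop visits all \(2^6 = 64\) subsequences of kitten, which is why the following slides look for structure to exploit.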

Longest Common Subsequence

brute force

  • Example:
    • \(X = \{ k,i,t,t,e,n \}\)
    • \(Y = \{ s,i,t,t,i,n,g \} \)
  • Can you improve the brute force approach?
    • do you need to compare all these?
      • \(\{k,i,t,t\}\) with \(\{s,i,t,t\}\)
      • \(\{k,i,t,t\}\) with \(\{s,i,t,t,i\}\)
      • \(\{k,i,t,t\}\) with \(\{s,i,t,t,i,n\}\)
      • \(\{k,i,t,t\}\) with \(\{s,i,t,t,i,n,g\}\)

Longest Common Subsequence

brute force

  • Example:
    • \(X = \{ k,i,t,t,e,n \}\)
    • \(Y = \{ s,i,t,t,i,n,g \} \)
  • Let's get some intuition:
    • \(\{k,i,t,t\}\) with \(\{s,i,t,t\}\)
    • \(\{k,i,t,t\}\) with \(\{s,i,t,t,i\}\)
    • \(\{k,i,t,t\}\) with \(\{s,i,t,t,i,n\}\)
    • \(\{k,i,t,t\}\) with \(\{s,i,t,t,i,n,g\}\)
  • Intuition 1: Why do we keep on comparing \(k\)? Can we drop it?

Longest Common Subsequence

brute force

  • Example:
    • \(X = \{ k,i,t,t,e,n \}\)
    • \(Y = \{ s,i,t,t,i,n,g \} \)
  • Let's get some intuition:
    • \(\{i,t,t\}\) with \(\{s,i,t,t\}\)
    • \(\{i,t,t\}\) with \(\{s,i,t,t,i\}\)
    • \(\{i,t,t\}\) with \(\{s,i,t,t,i,n\}\)
    • \(\{i,t,t\}\) with \(\{s,i,t,t,i,n,g\}\)
  • Intuition 2: Why do we keep on comparing \(s\)? Can we drop it?

Longest Common Subsequence

brute force

  • Example:
    • \(X = \{ k,i,t,t,e,n \}\)
    • \(Y = \{ s,i,t,t,i,n,g \} \)
  • Let's get some intuition:
    • \(\{i,t,t\}\) with \(\{i,t,t\}\)
    • \(\{i,t,t\}\) with \(\{i,t,t,i\}\)
    • \(\{i,t,t\}\) with \(\{i,t,t,i,n\}\)
    • \(\{i,t,t\}\) with \(\{i,t,t,i,n,g\}\)
  • Intuition 3: Why do we keep on comparing \(i\)? Can we drop it?
  • Example:
    • \(X = \{ k,i,t,t,e,n \}\)
    • \(Y = \{ s,i,t,t,i,n,g \} \)
  • Is there some sort of structure that we can exploit?
  • Optimal substructure:
    • does the longest common subsequence problem have an optimal substructure?

Longest Common Subsequence

brute force

  • \(X = \{ x_1, x_2, x_3, \dots, x_m \} \)
  • \(Y = \{ y_1, y_2, y_3, \dots, y_n \} \)
  • Let \(Z = \{ z_1, z_2, z_3, \dots, z_k \} \) be the LCSS of \(X\) and \(Y\)
  • If \(x_1 = y_1\), then Z should contain \(x_1 = y_1\), i.e. \(z_1 = x_1 = y_1\), and \(Z_2 = \{z_2,z_3,\dots,z_k\}\) is the LCSS of \(X_2\) and \(Y_2\)

Case A:

Case B:

Longest Common Subsequence

optimal substructure

  • If \(x_1 \ne y_1\), then Z is either
    • the LCSS of \(X_2 =\{x_2,x_3,\dots,x_m\}\), and \(Y=\{y_1,y_2,\dots,y_n\}\), or
    • the LCSS of \(X = \{x_1,x_2,\dots,x_m\}\) and \(Y_2 =\{y_2,y_3,\dots,y_n\}\)

Longest Common Subsequence

optimal substructure

Case A:

  • If \(x_1 \ne y_1\), then \(Z\) is either
    • the LCSS of \(X_2 =\{x_2,x_3,\dots,x_m\}\), and \(Y=\{y_1,y_2,\dots,y_n\}\), or
    • the LCSS of \(X = \{x_1,x_2,\dots,x_m\}\) and \(Y_2 =\{y_2,y_3,\dots,y_n\}\)
  • Example: kitten and sitting (\(k \ne s\)) give the two subproblems
    • itten and sitting
    • kitten and itting

Longest Common Subsequence

optimal substructure

Proof (by contradiction):

  • \(Z\) is the LCSS of \(X\) and \(Y.\)
  • Suppose \(Z\) is a common subsequence of \(X_2\) and \(Y\) but is NOT their LCSS. Then \(X_2\) and \(Y\) have a longer common subsequence than \(Z\), say \(Z^*\).
  • But \(Z^*\) is also a common subsequence of \(X\) and \(Y\), and it is longer than \(Z\), a contradiction!
  • Proof is symmetrical for the case \(X\) and \(Y_2\)
  • If \(x_1 \ne y_1\), then Z is either
    • the LCSS of \(X_2 =\{x_2,x_3,\dots,x_m\}\), and \(Y=\{y_1,y_2,\dots,y_n\}\), or
    • the LCSS of \(X = \{x_1,x_2,\dots,x_m\}\) and \(Y_2 =\{y_2,y_3,\dots,y_n\}\)

Case A:

Longest Common Subsequence

optimal substructure

Case B:

  • If \(x_1 = y_1\), then \(Z\) should contain \(x_1 = y_1\), i.e. \(z_1 = x_1 = y_1\), and \(Z_2 = \{z_2,z_3,\dots,z_k\}\) is the LCSS of \(X_2\) and \(Y_2\)
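  • Example: itten and itting start with the same character, so of the candidate subproblems
    • tten and itting
    • itten and tting
    • tten and tting
  • only the last one (tten and tting) needs to be solved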

Longest Common Subsequence

optimal substructure

Case B:

Proof (by contradiction):

  • if \(Z\) does not contain \(x_1\), then we can always prepend \(x_1\) to it, making a longer common subsequence
  • if \(Z_2\) is not the LCSS of \(X_2\) and \(Y_2\), then there is another common subsequence \(Z^*\) of \(X_2\) and \(Y_2\) that is longer. If we prepend \(x_1\) to \(Z^*\), we get a common subsequence of \(X\) and \(Y\) that is longer than \(Z\), a contradiction!
  • If \(x_1 = y_1\), then \(Z\) should contain \(x_1 = y_1\), i.e. \(z_1 = x_1 = y_1\), and \(Z_2 = \{z_2,z_3,\dots,z_k\}\) is the LCSS of \(X_2\) and \(Y_2\)

Longest Common Subsequence

recursive relation

[Figure: the problem \((X, Y)\) and its possible subproblems]

  • if \(x_1 = y_1\): continue with \((X_2, Y_2)\)
  • if \(x_1 \ne y_1\): continue with \((X_2, Y)\) and with \((X, Y_2)\)

Longest Common Subsequence

recursive relation

\[ \text{LCSS}(X,Y) = \begin{cases} 0 & \text{if \(X\) or \(Y\) is empty}\\1 + \text{LCSS}(X_2,Y_2) & \text{if \(x_1 = y_1\)}\\\max\left(\text{LCSS}(X_2,Y),\text{LCSS}(X,Y_2)\right) & \text{if \(x_1 \ne y_1\)}\end{cases} \]
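
A direct transcription of this recursive relation into Java might look like the sketch below. The class name is hypothetical, and `substring(1)` plays the role of dropping the first character (\(X_2\) and \(Y_2\)). There is no caching here, so the running time is exponential.

```java
final class LcsRecursive {
    // Length of the longest common subsequence, straight from the recurrence.
    static int lcss(String x, String y) {
        // Base case: an empty string has no common subsequence with anything.
        if (x.isEmpty() || y.isEmpty()) return 0;
        // First characters match: count one and recurse on both tails.
        if (x.charAt(0) == y.charAt(0)) {
            return 1 + lcss(x.substring(1), y.substring(1));
        }
        // Otherwise drop the first character of x or of y and keep the better result.
        return Math.max(lcss(x.substring(1), y), lcss(x, y.substring(1)));
    }

    public static void main(String[] args) {
        System.out.println(lcss("kitten", "sitting")); // prints 4
    }
}
```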


Longest Common Subsequence

overlapping subproblems

[Figure: recursion tree rooted at \((X, Y)\), with children \((X_2, Y)\) and \((X, Y_2)\), where \(X_k\) and \(Y_k\) denote the suffixes starting at \(x_k\) and \(y_k\). Further down the tree \((X_2, Y_2)\) appears twice and \((X_3, Y_2)\) appears three times, so the same subproblems are solved repeatedly]

Longest Common Subsequence

overlapping subproblems

[Figure: the same recursion with repeated subproblems merged. Each of \((X_2, Y)\), \((X, Y_2)\), \((X_3, Y)\), \((X_2, Y_2)\), \((X, Y_3)\), and \((X_3, Y_2)\) now appears only once]

Longest Common Subsequence

overlapping subproblems

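[Figure: the merged subproblem graph for kitten and sitting, level by level]

  • (kitten, sitting)
  • (itten, sitting), (kitten, itting)
  • (tten, sitting), (itten, itting), (kitten, tting)
  • (itten, tting), (tten, itting)
  • (tten, tting)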

Longest Common Subsequence

top-down solution

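// Memo table: lcss[i][j] caches the answer for suffixes of length i+1 and j+1.
// It must be allocated as int[m][n] and filled with -1 before the first call.
static int[][] lcss;

public static int lcss_top_down(String x, String y) {
  // base case: an empty string has no common subsequence with anything
  if (x.length() == 0 || y.length() == 0)
    return 0;
  int i = x.length()-1, j = y.length()-1;
  // this subproblem has already been solved
  if (lcss[i][j] != -1)
    return lcss[i][j];
  // first characters match: count one and recurse on both tails
  else if (x.charAt(0) == y.charAt(0))
    return lcss[i][j] = 1 + lcss_top_down(x.substring(1),y.substring(1));
  // otherwise drop the first character of x or of y, and keep the better result
  else
    return lcss[i][j] = Math.max(lcss_top_down(x.substring(1),y), lcss_top_down(x,y.substring(1)));
}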
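
For experimenting, here is a self-contained variant of the top-down idea. It is only a sketch under a few assumptions: the names `LcsTopDown`, `solve`, and `calls` are invented, indices are passed instead of substrings, and a call counter is added purely to make the effect of memoisation visible.

```java
import java.util.Arrays;

final class LcsTopDown {
    static int[][] memo; // memo[i][j] caches the answer for x.substring(i) and y.substring(j)
    static long calls;   // number of calls to solve, for illustration only

    static int solve(String x, String y, int i, int j) {
        calls++;
        if (i == x.length() || j == y.length()) return 0; // one suffix is empty
        if (memo[i][j] != -1) return memo[i][j];          // already solved
        if (x.charAt(i) == y.charAt(j)) {
            return memo[i][j] = 1 + solve(x, y, i + 1, j + 1);
        }
        return memo[i][j] = Math.max(solve(x, y, i + 1, j), solve(x, y, i, j + 1));
    }

    static int lcss(String x, String y) {
        memo = new int[x.length()][y.length()];
        for (int[] row : memo) Arrays.fill(row, -1); // -1 marks "not computed yet"
        calls = 0;
        return solve(x, y, 0, 0);
    }

    public static void main(String[] args) {
        System.out.println(lcss("kitten", "sitting")); // prints 4
        System.out.println("calls: " + calls);
    }
}
```

Each table cell is filled at most once and makes at most two further calls, so the number of calls is bounded by \(2mn + 1\) rather than growing exponentially.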

Longest Common Subsequence

bottom-up solution

[Figure: grid of all subproblems, one cell per pair of suffixes]

(kitten, sitting)  (kitten, itting)  (kitten, tting)  (kitten, ting)  (kitten, ing)  ...
(itten, sitting)   (itten, itting)   (itten, tting)   (itten, ting)   (itten, ing)   ...
(tten, sitting)    (tten, itting)    (tten, tting)    (tten, ting)    (tten, ing)    ...
(ten, sitting)     (ten, itting)     (ten, tting)     (ten, ting)     (ten, ing)     ...
...

Longest Common Subsequence

bottom-up solution

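(a dot marks a cell that has not been computed yet)

        sitting itting  tting   ting    ing     ng      g       ""
kitten  .       .       .       .       .       .       0       0
itten   .       .       .       .       .       .       0       0
tten    .       .       .       .       .       .       0       0
ten     .       .       .       .       .       .       0       0
en      .       .       .       .       .       .       0       0
n       1       1       1       1       1       1       0       0
""      0       0       0       0       0       0       0       0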

Longest Common Subsequence

bottom-up solution

(a dot marks a cell that has not been computed yet)

        sitting itting  tting   ting    ing     ng      g       ""
kitten  .       .       .       .       2       1       0       0
itten   .       .       .       .       2       1       0       0
tten    .       .       .       .       1       1       0       0
ten     2       2       2       2       1       1       0       0
en      1       1       1       1       1       1       0       0
n       1       1       1       1       1       1       0       0
""      0       0       0       0       0       0       0       0

Longest Common Subsequence

bottom-up solution

        sitting itting  tting   ting    ing     ng      g       ""
kitten  4       4       3       2       2       1       0       0
itten   4       4       3       2       2       1       0       0
tten    3       3       3       2       1       1       0       0
ten     2       2       2       2       1       1       0       0
en      1       1       1       1       1       1       0       0
n       1       1       1       1       1       1       0       0
""      0       0       0       0       0       0       0       0

Longest Common Subsequence

bottom-up solution

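public static int lcss_bottom_up(String a, String b) {
  int m = a.length(), n = b.length();
  // lcss[i][j] = LCSS length of a.substring(i) and b.substring(j)
  int[][] lcss = new int[m+1][n+1];
  // base cases: an empty suffix has LCSS length 0
  for (int i = 0; i < n+1; i++) lcss[m][i] = 0;
  for (int i = 0; i < m+1; i++) lcss[i][n] = 0;

  for (int i = m-1; i > -1; i--){
    for (int j = n-1; j > -1; j--){
      // if the characters match, then go diagonally
      if(a.charAt(i) == b.charAt(j))
        lcss[i][j] = 1 + lcss[i+1][j+1];
      // else, compare the cell to the right with the cell below,
      // and pick the larger one
      else
        lcss[i][j] = Integer.max(lcss[i][j+1], lcss[i+1][j]);
    }
  }
  return lcss[0][0];
}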
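
The table stores only lengths. If the subsequence itself is wanted, one common approach (not shown in the slides, so treat it as an illustrative sketch) is to fill the same table and then walk it from the top-left cell, taking a character on every match. The class name `LcsTraceback` is hypothetical.

```java
final class LcsTraceback {
    static String lcs(String a, String b) {
        int m = a.length(), n = b.length();
        // lcss[i][j] = LCSS length of a.substring(i) and b.substring(j);
        // Java zero-initialises the array, which covers the base cases.
        int[][] lcss = new int[m + 1][n + 1];
        for (int i = m - 1; i >= 0; i--) {
            for (int j = n - 1; j >= 0; j--) {
                if (a.charAt(i) == b.charAt(j)) lcss[i][j] = 1 + lcss[i + 1][j + 1];
                else lcss[i][j] = Math.max(lcss[i][j + 1], lcss[i + 1][j]);
            }
        }
        // Walk from the top-left cell towards the bottom-right corner.
        StringBuilder out = new StringBuilder();
        int i = 0, j = 0;
        while (i < m && j < n) {
            if (a.charAt(i) == b.charAt(j)) {          // match: take it, move diagonally
                out.append(a.charAt(i));
                i++;
                j++;
            } else if (lcss[i + 1][j] >= lcss[i][j + 1]) {
                i++;                                   // the cell below is at least as good
            } else {
                j++;                                   // the cell to the right is better
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(lcs("kitten", "sitting")); // prints ittn
    }
}
```

When several longest common subsequences exist, the tie-breaking rule in the `else if` decides which one is returned.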

Edit Distance

  • Problem:
    • Given strings \(X = (x_1,x_2,\dots,x_m)\) and \(Y = (y_1,y_2,\dots,y_n)\), find the minimum number of transformations required to convert \(X\) to \(Y\)
    • The most common transformations are:
      • inserting a single character
      • deleting a single character
      • substituting a single character

Edit Distance

  • The most common transformations are:
    • inserting a single character
      • lack \(\rightarrow\) slack
    • deleting a single character
      • slack \(\rightarrow\) sack
    • substituting a single character
      • lack \(\rightarrow\) sack
  • First suggested by Levenshtein (1966)
    • Levenshtein distance: each operation has unit cost
    • what is the edit distance between lack and sack?

Edit Distance

  • Note that CLRS (Problem 15-5, page 406) is a bit more general: it adds three more operations:
    • twiddle: copy two characters, but switch the order
      • e.g. blame \(\rightarrow\) balme
    • kill: stop processing the first string
      • e.g. internationalisation \(\rightarrow\) international
    • copy: copy a character
      • e.g. happy \(\rightarrow\) happy
      • in CLRS this is 5 copy operations

Edit Distance

how does the algorithm work

  • We should be able to convert kitten to sitting in 3 transformations, so the edit distance is 3:
    • substitution:
      • kitten \(\rightarrow\) sitten
    • substitution:
      • sitten \(\rightarrow\) sittin
    • insertion:
      • sittin \(\rightarrow\) sitting
  • How did you get this?

Edit Distance

  • There are several metrics for the edit distance problem
    • Damerau-Levenshtein distance
      • CA \(\rightarrow\) ABC, distance is two
    • Hamming distance
      • strings of equal length only
    • Jaro-Winkler distance
  • Let us stick with the basic Levenshtein distance version,
    • i.e. insert, replace, delete

Edit Distance

how does the algorithm work

kitten
sitting

Edit Distance

how does the algorithm work

  • Think about the algorithm for LCSS
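[Figure: the LCSS subproblem graph for kitten and sitting, level by level]

  • (kitten, sitting)
  • (itten, sitting), (kitten, itting)
  • (tten, sitting), (itten, itting), (kitten, tting)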

Edit Distance

how does the algorithm work?

  • In LCSS, how do you decide if you should cut off k or s?
    • You don't, you try both!
  • So with edit distance, how do you decide if you should insert, replace, or delete?
    • You don't, you try all three!

Edit Distance

how does the algorithm work?

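[Figure: trying all three operations on (kitten, sitting)]

  • insert: (skitten, sitting), then drop the matching s to get (kitten, itting)
  • replace: (sitten, sitting), then drop the matching s to get (itten, itting)
  • delete: (itten, sitting), which branches again into (sitten, sitting), (stten, sitting), and (tten, sitting)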

Edit Distance

how does the algorithm work?

[Figure: the full recursion tree for converting lack to sack. The root (lack, sack) branches into (slack, sack) for insert, (sack, sack) for replace, and (ack, sack) for delete. Every node branches three ways again, and pairs such as (ack, ack) and (ck, ck) appear several times in the tree]
Edit Distance

how does the algorithm work?

  • For each problem \(\text{EditDistance}(X,Y)\), you have three subproblems:
    • insert: \(\text{EditDistance}(y_1+X,Y)\)
    • replace: \(\text{EditDistance}(y_1+X_2,Y)\)
    • delete: \(\text{EditDistance}(X_2,Y)\)
  • where \(X_2 = \{x_2,x_3,\dots,x_m\}\)
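
A minimal recursive sketch of these three choices in Java is given below. One simplification is assumed: after an insert or a replace the leading characters match and are dropped immediately, so insert recurses on \((X, Y_2)\) and replace on \((X_2, Y_2)\). The class name is hypothetical, and without caching the running time is exponential.

```java
final class EditDistanceRecursive {
    static int editDistance(String x, String y) {
        if (x.isEmpty()) return y.length(); // insert every remaining character of y
        if (y.isEmpty()) return x.length(); // delete every remaining character of x
        // Leading characters already match: no operation needed for them.
        if (x.charAt(0) == y.charAt(0)) {
            return editDistance(x.substring(1), y.substring(1));
        }
        int insert  = editDistance(x, y.substring(1));              // put y_1 in front of x
        int delete  = editDistance(x.substring(1), y);              // remove x_1
        int replace = editDistance(x.substring(1), y.substring(1)); // turn x_1 into y_1
        return 1 + Math.min(insert, Math.min(delete, replace));
    }

    public static void main(String[] args) {
        System.out.println(editDistance("lack", "sack"));      // prints 1
        System.out.println(editDistance("kitten", "sitting")); // prints 3
    }
}
```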

Edit Distance

brute-force complexity

[Figure: recursion tree. The root is a problem of size \(n\) with three children of size \(n-1\); the levels are annotated \(1\), \(3n-3\), \(9n-18\), and so on]

  • Number of nodes: \(1+3+9+\cdots+3^n = (3^{n+1}-1)/2\)
  • Complexity is exponential: \(O(3^n)\)
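
The closed form for the sum \(1+3+9+\cdots+3^n\) follows from the usual geometric-series argument. A short derivation, assuming the recursion tree has \(3^i\) nodes at depth \(i\) for \(i = 0, 1, \dots, n\):

\[ S = 1 + 3 + 9 + \cdots + 3^n, \qquad 3S = 3 + 9 + \cdots + 3^n + 3^{n+1} \]

\[ 3S - S = 3^{n+1} - 1 \quad\Longrightarrow\quad S = \frac{3^{n+1}-1}{2} \]

Since \(S < \frac{3}{2}\cdot 3^n\), the number of nodes is \(O(3^n)\).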

Edit Distance

bottom-up approach

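[Figure: the states reached from (lack, sack) by repeated deletion, namely (ack, sack), (ck, sack), (k, sack), and ("", sack), shown next to the states (slack, sack), ???, and (sack, sack)]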

Edit Distance

bottom-up approach

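  • After you do an insertion, the leading characters of the two strings match, so the next step is guaranteed to be the removal of those leading characters, e.g. (slack, sack) becomes (lack, ack)
  • No need to store the intermediate state (slack, sack)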

Edit Distance

bottom-up approach

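[Figure: the states along two edges of the table. One edge holds (lack, sack), (ack, sack), (ck, sack), (k, sack), and ("", sack); the other holds (lack, ack), (lack, ck), (lack, k), and (lack, ""); the state (ack, ack) sits between them]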

Edit Distance

bottom-up approach

  • After you do a substitution, the leading characters also match, so the next step is again guaranteed to be the removal of the leading characters, e.g. (sack, sack) becomes (ack, ack)
  • No need to store this state either
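
Because the intermediate states never need to be stored, every state that remains is a pair of suffixes, and a table with one entry per pair is enough. The following memoised sketch illustrates this. The names `EditDistanceTopDown` and `solve` are invented for the example, and indices `i` and `j` stand for the suffixes `x.substring(i)` and `y.substring(j)`.

```java
import java.util.Arrays;

final class EditDistanceTopDown {
    static int[][] memo; // memo[i][j] caches the distance from x.substring(i) to y.substring(j)

    static int solve(String x, String y, int i, int j) {
        if (i == x.length()) return y.length() - j; // insert the rest of y
        if (j == y.length()) return x.length() - i; // delete the rest of x
        if (memo[i][j] != -1) return memo[i][j];    // already solved
        int best;
        if (x.charAt(i) == y.charAt(j)) {
            best = solve(x, y, i + 1, j + 1);        // characters match, nothing to pay
        } else {
            int insert  = solve(x, y, i, j + 1);     // insert y_j, then drop the matching pair
            int delete  = solve(x, y, i + 1, j);     // delete x_i
            int replace = solve(x, y, i + 1, j + 1); // replace x_i, then drop the matching pair
            best = 1 + Math.min(insert, Math.min(delete, replace));
        }
        return memo[i][j] = best;
    }

    static int editDistance(String x, String y) {
        memo = new int[x.length()][y.length()];
        for (int[] row : memo) Arrays.fill(row, -1); // -1 marks "not computed yet"
        return solve(x, y, 0, 0);
    }

    public static void main(String[] args) {
        System.out.println(editDistance("lack", "sack"));      // prints 1
        System.out.println(editDistance("kitten", "sitting")); // prints 3
    }
}
```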

Edit Distance

bottom-up approach

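[Figure: grid of all states, one cell per pair of suffixes]

(lack, sack)  (ack, sack)  (ck, sack)  (k, sack)  ("", sack)
(lack, ack)   (ack, ack)   (ck, ack)   (k, ack)   ("", ack)
(lack, ck)    (ack, ck)    (ck, ck)    (k, ck)    ("", ck)
(lack, k)     (ack, k)     (ck, k)     (k, k)     ("", k)
(lack, "")    (ack, "")    (ck, "")    (k, "")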

Edit Distance

bottom-up approach

        lack    ack     ck      k       ""
sack    1       1       2       3       4
ack     1       0       1       2       3
ck      2       1       0       1       2
k       3       2       1       0       1
""      4       3       2       1       0

Edit Distance

bottom-up approach

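[Figure: grid of all states for kitten and sitting, one cell per pair of suffixes]

(kitten, sitting)  (itten, sitting)  (tten, sitting)  (ten, sitting)  ...
(kitten, itting)   (itten, itting)   (tten, itting)   (ten, itting)   ...
(kitten, tting)    (itten, tting)    (tten, tting)    (ten, tting)    ...
(kitten, ting)     (itten, ting)     (tten, ting)     (ten, ting)     ...
...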

Edit Distance

bottom-up approach

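(a dot marks a cell that has not been computed yet)

        kitten  itten   tten    ten     en      n       ""
sitting .       .       .       .       .       .       7
itting  .       .       .       .       .       .       6
tting   .       .       .       .       .       .       5
ting    .       .       .       .       .       .       4
ing     .       .       .       .       .       .       3
ng      .       .       .       .       .       .       2
g       .       .       .       .       .       1       1
""      6       5       4       3       2       1       0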

Edit Distance

bottom-up approach

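(a dot marks a cell that has not been computed yet)

        kitten  itten   tten    ten     en      n       ""
sitting .       .       .       .       .       6       7
itting  .       .       .       .       .       5       6
tting   .       .       .       .       .       4       5
ting    .       .       .       .       .       3       4
ing     .       .       .       .       .       2       3
ng      .       .       .       .       2       1       2
g       6       5       4       3       2       1       1
""      6       5       4       3       2       1       0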

Edit Distance

bottom-up approach

(a dot marks a cell that has not been computed yet)

        kitten  itten   tten    ten     en      n       ""
sitting .       .       .       .       6       6       7
itting  .       .       .       .       5       5       6
tting   .       .       .       .       4       4       5
ting    .       .       .       .       3       3       4
ing     .       .       .       3       2       2       3
ng      6       5       4       3       2       1       2
g       6       5       4       3       2       1       1
""      6       5       4       3       2       1       0

Edit Distance

bottom-up approach

        kitten  itten   tten    ten     en      n       ""
sitting 3       3       4       5       6       6       7
itting  3       2       3       4       5       5       6
tting   4       3       2       3       4       4       5
ting    5       4       3       2       3       3       4
ing     5       4       4       3       2       2       3
ng      6       5       4       3       2       1       2
g       6       5       4       3       2       1       1
""      6       5       4       3       2       1       0
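
A table like this can be filled by a pair of nested loops, in the same style as the bottom-up LCSS code. The sketch below is one possible way to do it, not code from the lecture. The names are hypothetical, and `d[i][j]` holds the distance from `x.substring(i)` to `y.substring(j)`, so its rows correspond to suffixes of the first string (the transpose of the table shown above).

```java
final class EditDistanceBottomUp {
    // Returns the full table; d[0][0] is the edit distance between x and y.
    static int[][] table(String x, String y) {
        int m = x.length(), n = y.length();
        int[][] d = new int[m + 1][n + 1];
        // Base cases: against an empty suffix, insert or delete everything left.
        for (int j = 0; j <= n; j++) d[m][j] = n - j;
        for (int i = 0; i <= m; i++) d[i][n] = m - i;

        for (int i = m - 1; i >= 0; i--) {
            for (int j = n - 1; j >= 0; j--) {
                if (x.charAt(i) == y.charAt(j)) {
                    d[i][j] = d[i + 1][j + 1];           // characters match: go diagonally
                } else {
                    int replace = d[i + 1][j + 1];
                    int delete  = d[i + 1][j];
                    int insert  = d[i][j + 1];
                    d[i][j] = 1 + Math.min(replace, Math.min(delete, insert));
                }
            }
        }
        return d;
    }

    static int editDistance(String x, String y) {
        return table(x, y)[0][0];
    }

    public static void main(String[] args) {
        System.out.println(editDistance("kitten", "sitting")); // prints 3
        System.out.println(editDistance("lack", "sack"));      // prints 1
    }
}
```

Both loops run over every pair of suffixes once, so the running time and the space are \(O(mn)\).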

A Couple More Algorithms

 

  • Huffman encoding
  • String comparison (KMP algorithm)

COMP333 Algorithm Theory and Design - W7 2019 - Strings Algorithms

By Daniel Sutantyo

Lecture notes for Week 7 of COMP333, 2019, Macquarie University
