COMP333
Algorithm Theory and Design
Daniel Sutantyo
Department of Computing
Macquarie University
Lecture slides adapted from lectures given by Frank Cassez, Mark Dras, and Bernard Mans
Summary
- Algorithm complexity (running time, recursion tree)
- Algorithm correctness (induction, loop invariants)
- Problem solving methods:
- exhaustive search
- dynamic programming
- greedy method
- divide-and-conquer
- algorithms involving strings
- probabilistic method
- algorithms involving graphs
String Algorithms
- Longest common subsequence
- Edit distance
Longest Common Subsequence
definition
- Given two sequences \(X = \{x_1,x_2,\dots,x_m\}\) and \(Y= \{y_1,y_2,\dots,y_n\}\), find the longest common subsequence of \(X\) and \(Y\)
- e.g. \(X = \{ k,i,t,t,e,n \}\) and \(Y = \{ s,i,t,t,i,n,g \} \): the sequence \(\{ i,t,t,n \}\) is a solution
- What is the brute-force solution?
- Example:
- \(X = \{ k,i,t,t,e,n \}\)
- \(Y = \{ s,i,t,t,i,n,g \} \)
- Brute force solution:
- Generate all subsequences of \(X\)
- Generate all subsequences of \(Y\)
- Compare each subsequence of \(X\) with every subsequence of \(Y\), and keep the longest one that matches
- There are \(2^m \cdot 2^n\) pairs to compare, so this is \(O(2^{m+n})\): exponential in the total length of the sequences \(X\) and \(Y\)
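A minimal sketch of the brute-force idea, under one stated change: rather than also generating every subsequence of \(Y\), it checks each subsequence of \(X\) against \(Y\) with a linear scan. The class and method names are made up for illustration, and the approach is only usable for very short strings.

```java
class LcsBruteForce {
    // Returns true if s can be obtained from t by deleting zero or more characters.
    static boolean isSubsequence(String s, String t) {
        int i = 0;
        for (int j = 0; j < t.length() && i < s.length(); j++)
            if (s.charAt(i) == t.charAt(j)) i++;
        return i == s.length();
    }

    // Tries every subsequence of x (one per bitmask) and keeps the longest
    // one that is also a subsequence of y. Assumes x.length() < 31.
    static String longestCommonSubsequence(String x, String y) {
        String best = "";
        for (int mask = 0; mask < (1 << x.length()); mask++) {
            StringBuilder candidate = new StringBuilder();
            for (int i = 0; i < x.length(); i++)
                if ((mask & (1 << i)) != 0) candidate.append(x.charAt(i));
            if (candidate.length() > best.length()
                    && isSubsequence(candidate.toString(), y))
                best = candidate.toString();
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(longestCommonSubsequence("kitten", "sitting")); // ittn
    }
}
```

Even with the cheaper check, the outer loop still runs \(2^m\) times, which is why the following slides look for structure to exploit.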
Longest Common Subsequence
brute force
- Example:
- \(X = \{ k,i,t,t,e,n \}\)
- \(Y = \{ s,i,t,t,i,n,g \} \)
- Can you improve the brute force approach?
- do you need to compare all these?
- \(\{k,i,t,t\}\) with \(\{s,i,t,t\}\)
- \(\{k,i,t,t\}\) with \(\{s,i,t,t,i\}\)
- \(\{k,i,t,t\}\) with \(\{s,i,t,t,i,n\}\)
- \(\{k,i,t,t\}\) with \(\{s,i,t,t,i,n,g\}\)
Longest Common Subsequence
brute force
- Example:
- \(X = \{ k,i,t,t,e,n \}\)
- \(Y = \{ s,i,t,t,i,n,g \} \)
- Let's get some intuition:
- \(\{k,i,t,t\}\) with \(\{s,i,t,t\}\)
- \(\{k,i,t,t\}\) with \(\{s,i,t,t,i\}\)
- \(\{k,i,t,t\}\) with \(\{s,i,t,t,i,n\}\)
- \(\{k,i,t,t\}\) with \(\{s,i,t,t,i,n,g\}\)
- Intuition 1: Why do we keep on comparing \(k\)? Can we drop it?
Longest Common Subsequence
brute force
- Example:
- \(X = \{ k,i,t,t,e,n \}\)
- \(Y = \{ s,i,t,t,i,n,g \} \)
- Let's get some intuition:
- \(\{i,t,t\}\) with \(\{s,i,t,t\}\)
- \(\{i,t,t\}\) with \(\{s,i,t,t,i\}\)
- \(\{i,t,t\}\) with \(\{s,i,t,t,i,n\}\)
- \(\{i,t,t\}\) with \(\{s,i,t,t,i,n,g\}\)
- Intuition 2: Why do we keep on comparing \(s\)? Can we drop it?
Longest Common Subsequence
brute force
- Example:
- \(X = \{ k,i,t,t,e,n \}\)
- \(Y = \{ s,i,t,t,i,n,g \} \)
- Let's get some intuition:
- \(\{i,t,t\}\) with \(\{i,t,t\}\)
- \(\{i,t,t\}\) with \(\{i,t,t,i\}\)
- \(\{i,t,t\}\) with \(\{i,t,t,i,n\}\)
- \(\{i,t,t\}\) with \(\{i,t,t,i,n,g\}\)
- Intuition 3: Why do we keep on comparing \(i\)? Can we drop it?
- Example:
- \(X = \{ k,i,t,t,e,n \}\)
- \(Y = \{ s,i,t,t,i,n,g \} \)
- Is there some sort of structure that we can exploit?
- Optimal substructure:
- does the longest common subsequence problem have an optimal substructure?
Longest Common Subsequence
brute force
- \(X = \{ x_1, x_2, x_3, \dots, x_m \} \)
- \(Y = \{ y_1, y_2, y_3, \dots, y_n \} \)
- Let \(Z = \{ z_1, z_2, z_3, \dots, z_k \} \) be the LCSS of \(X\) and \(Y\)
Case B:
- If \(x_1 = y_1\), then \(Z\) should contain \(x_1 = y_1\), i.e. \(z_1 = x_1 = y_1\), and \(Z_2 = \{z_2,z_3,\dots,z_k\}\) is the LCSS of \(X_2 = \{x_2,x_3,\dots,x_m\}\) and \(Y_2 = \{y_2,y_3,\dots,y_n\}\)
Longest Common Subsequence
optimal substructure
Case A:
- If \(x_1 \ne y_1\), then \(Z\) is either
- the LCSS of \(X_2 =\{x_2,x_3,\dots,x_m\}\) and \(Y=\{y_1,y_2,\dots,y_n\}\), or
- the LCSS of \(X = \{x_1,x_2,\dots,x_m\}\) and \(Y_2 =\{y_2,y_3,\dots,y_n\}\)
Longest Common Subsequence
optimal substructure
Case A:
- If \(x_1 \ne y_1\), then \(Z\) is either
- the LCSS of \(X_2 =\{x_2,x_3,\dots,x_m\}\) and \(Y=\{y_1,y_2,\dots,y_n\}\), or
- the LCSS of \(X = \{x_1,x_2,\dots,x_m\}\) and \(Y_2 =\{y_2,y_3,\dots,y_n\}\)
[Diagram: (kitten, sitting) branches into the two subproblems (itten, sitting) and (kitten, itting).]
Longest Common Subsequence
optimal substructure
Case A:
- If \(x_1 \ne y_1\), then \(Z\) is either
- the LCSS of \(X_2 =\{x_2,x_3,\dots,x_m\}\) and \(Y=\{y_1,y_2,\dots,y_n\}\), or
- the LCSS of \(X = \{x_1,x_2,\dots,x_m\}\) and \(Y_2 =\{y_2,y_3,\dots,y_n\}\)
Proof (by contradiction):
- \(Z\) is the LCSS of \(X\) and \(Y\). Since \(x_1 \ne y_1\), \(Z\) cannot use both \(x_1\) and \(y_1\), so it is a common subsequence of \(X_2\) and \(Y\), or of \(X\) and \(Y_2\).
- Suppose \(Z\) is a common subsequence of \(X_2\) and \(Y\) but is NOT their LCSS. Then they have a longer common subsequence than \(Z\), say \(Z^*\).
- But \(Z^*\) is also a common subsequence of \(X\) and \(Y\), and it is longer than \(Z\), a contradiction!
- The proof is symmetrical for the case \(X\) and \(Y_2\)
Longest Common Subsequence
optimal substructure
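Case B:
- If \(x_1 = y_1\), then \(Z\) should contain \(x_1 = y_1\), i.e. \(z_1 = x_1 = y_1\), and \(Z_2 = \{z_2,z_3,\dots,z_k\}\) is the LCSS of \(X_2\) and \(Y_2\)
[Diagram: (itten, itting), where both strings start with i, reduces to (tten, tting); the other two subproblems shown are (tten, itting) and (itten, tting).]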
Longest Common Subsequence
optimal substructure
Case B:
- If \(x_1 = y_1\), then \(Z\) should contain \(x_1 = y_1\), i.e. \(z_1 = x_1 = y_1\), and \(Z_2 = \{z_2,z_3,\dots,z_k\}\) is the LCSS of \(X_2\) and \(Y_2\)
Proof (by contradiction):
- if \(Z\) does not start with \(x_1\), then we can always prepend \(x_1\) to it, making a longer common subsequence
- if \(Z_2\) is not the LCSS of \(X_2\) and \(Y_2\), then there is another common subsequence \(Z^*\) of \(X_2\) and \(Y_2\) that is longer than \(Z_2\). If we prepend \(x_1\) to \(Z^*\), the result is a common subsequence of \(X\) and \(Y\) that is longer than \(Z\), a contradiction!
Longest Common Subsequence
recursive relation
[Diagram: the problem \((x_1 x_2 \dots x_m,\ y_1 y_2 \dots y_n)\) leads to \((x_2 \dots x_m,\ y_2 \dots y_n)\) if \(x_1 = y_1\), and to both \((x_2 \dots x_m,\ y_1 \dots y_n)\) and \((x_1 \dots x_m,\ y_2 \dots y_n)\) if \(x_1 \ne y_1\).]
Longest Common Subsequence
recursive relation
\[ \text{LCSS}(X,Y) = \begin{cases} 0 & \text{if \(X\) or \(Y\) is empty}\\1 + \text{LCSS}(X_2,Y_2) & \text{if \(x_1 = y_1\)}\\\max\left(\text{LCSS}(X_2,Y),\text{LCSS}(X,Y_2)\right) & \text{if \(x_1 \ne y_1\)}\end{cases} \]
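The recursive relation can be transcribed almost line for line. The sketch below is one possible reading, not the lecture's own code: it assumes the suffixes \(X_2\) and \(Y_2\) are represented by start indices `i` and `j` rather than by new strings, and the class name is hypothetical.

```java
class LcsRecursive {
    // lcss(x, y, i, j) is the LCSS length of the suffixes x[i..] and y[j..].
    // Moving from X to X_2 corresponds to i + 1, and from Y to Y_2 to j + 1.
    static int lcss(String x, String y, int i, int j) {
        if (i == x.length() || j == y.length())
            return 0;                                    // X or Y is empty
        if (x.charAt(i) == y.charAt(j))
            return 1 + lcss(x, y, i + 1, j + 1);         // x_1 = y_1
        return Math.max(lcss(x, y, i + 1, j),            // drop x_1
                        lcss(x, y, i, j + 1));           // drop y_1
    }

    public static void main(String[] args) {
        System.out.println(lcss("kitten", "sitting", 0, 0)); // 4
    }
}
```

Without any caching this still takes exponential time in the worst case, because the same pair `(i, j)` is reached along many different paths.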
Longest Common Subsequence
overlapping subproblems
[Diagram: recursion tree. \((x_1 \dots x_m,\ y_1 \dots y_n)\) branches to \((x_2 \dots x_m,\ y_1 \dots y_n)\) and \((x_1 \dots x_m,\ y_2 \dots y_n)\); the next level contains \((x_3 \dots x_m,\ y_1 \dots y_n)\), \((x_1 \dots x_m,\ y_3 \dots y_n)\) and several copies of \((x_2 \dots x_m,\ y_2 \dots y_n)\), and \((x_3 \dots x_m,\ y_2 \dots y_n)\) appears repeatedly below that.]
Longest Common Subsequence
overlapping subproblems
[Diagram: the same recursion with repeated subproblems drawn only once. \((x_2 \dots x_m,\ y_2 \dots y_n)\) and \((x_3 \dots x_m,\ y_2 \dots y_n)\) are each a single node shared by several parents.]
Longest Common Subsequence
overlapping subproblems
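[Diagram: recursion tree for (kitten, sitting). Its children are (itten, sitting) and (kitten, itting); below them are (tten, sitting), (itten, itting), (kitten, tting), (itten, tting), (tten, itting) and (tten, tting). The subproblem (itten, itting) is reachable from both children.]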
Longest Common Subsequence
top-down solution
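// memo table: lcss[i][j] caches the answer for the suffix of x of length i+1
// and the suffix of y of length j+1; every entry must start out as -1
static int[][] lcss;

public static int lcss_top_down(String x, String y) {
    if (x.length() == 0 || y.length() == 0)
        return 0;
    int i = x.length() - 1, j = y.length() - 1;
    // reuse the stored answer if this pair of suffixes was solved before
    if (lcss[i][j] != -1)
        return lcss[i][j];
    if (x.charAt(0) == y.charAt(0))
        return lcss[i][j] = 1 + lcss_top_down(x.substring(1), y.substring(1));
    return lcss[i][j] = Math.max(lcss_top_down(x.substring(1), y), lcss_top_down(x, y.substring(1)));
}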
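The top-down method relies on a memo table that has been filled with -1 before the first call. As a hedged, self-contained variant, the sketch below sets the table up itself and uses start indices instead of `substring`; the names `LcsTopDown`, `memo` and `solve` are assumptions for this example only.

```java
import java.util.Arrays;

class LcsTopDown {
    // memo[i][j] = LCSS length of x[i..] and y[j..], or -1 if not yet computed
    static int[][] memo;

    static int lcss(String x, String y) {
        memo = new int[x.length()][y.length()];
        for (int[] row : memo) Arrays.fill(row, -1);
        return solve(x, y, 0, 0);
    }

    static int solve(String x, String y, int i, int j) {
        if (i == x.length() || j == y.length()) return 0;
        if (memo[i][j] != -1) return memo[i][j];   // already solved
        if (x.charAt(i) == y.charAt(j))
            return memo[i][j] = 1 + solve(x, y, i + 1, j + 1);
        return memo[i][j] = Math.max(solve(x, y, i + 1, j), solve(x, y, i, j + 1));
    }

    public static void main(String[] args) {
        System.out.println(lcss("kitten", "sitting")); // 4
    }
}
```

Each of the \(m \times n\) table entries is computed at most once, so the running time drops to \(O(mn)\).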
Longest Common Subsequence
bottom-up solution
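[Diagram: the recursion tree for (kitten, sitting): (itten, sitting) and (kitten, itting) on the first level, then (tten, sitting), (itten, itting), (kitten, tting), (itten, tting), (tten, itting) and (tten, tting).]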
Longest Common Subsequence
bottom-up solution
[Diagram: the subproblems arranged in a grid, pairing each suffix of kitten (kitten, itten, tten, ten, ...) with each suffix of sitting (sitting, itting, tting, ting, ing, ...).]
Longest Common Subsequence
bottom-up solution
        | sitting | itting | tting | ting | ing | ng | g | (empty) |
kitten  |         |        |       |      |     |    | 0 | 0       |
itten   |         |        |       |      |     |    | 0 | 0       |
tten    |         |        |       |      |     |    | 0 | 0       |
ten     |         |        |       |      |     |    | 0 | 0       |
en      |         |        |       |      |     |    | 0 | 0       |
n       | 1       | 1      | 1     | 1    | 1   | 1  | 0 | 0       |
(empty) | 0       | 0      | 0     | 0    | 0   | 0  | 0 | 0       |
Longest Common Subsequence
bottom-up solution
        | sitting | itting | tting | ting | ing | ng | g | (empty) |
kitten  |         |        |       |      | 2   | 1  | 0 | 0       |
itten   |         |        |       |      | 2   | 1  | 0 | 0       |
tten    |         |        |       |      | 1   | 1  | 0 | 0       |
ten     | 2       | 2      | 2     | 2    | 1   | 1  | 0 | 0       |
en      | 1       | 1      | 1     | 1    | 1   | 1  | 0 | 0       |
n       | 1       | 1      | 1     | 1    | 1   | 1  | 0 | 0       |
(empty) | 0       | 0      | 0     | 0    | 0   | 0  | 0 | 0       |
Longest Common Subsequence
bottom-up solution
        | sitting | itting | tting | ting | ing | ng | g | (empty) |
kitten  | 4       | 4      | 3     | 2    | 2   | 1  | 0 | 0       |
itten   | 4       | 4      | 3     | 2    | 2   | 1  | 0 | 0       |
tten    | 3       | 3      | 3     | 2    | 1   | 1  | 0 | 0       |
ten     | 2       | 2      | 2     | 2    | 1   | 1  | 0 | 0       |
en      | 1       | 1      | 1     | 1    | 1   | 1  | 0 | 0       |
n       | 1       | 1      | 1     | 1    | 1   | 1  | 0 | 0       |
(empty) | 0       | 0      | 0     | 0    | 0   | 0  | 0 | 0       |
Longest Common Subsequence
bottom-up solution
// a and b are the input strings, m = a.length(), n = b.length(), and lcss is an
// int[m+1][n+1] table: lcss[i][j] is the LCSS of a.substring(i) and b.substring(j)
for (int j = 0; j < n+1; j++) lcss[m][j] = 0;
for (int i = 0; i < m+1; i++) lcss[i][n] = 0;
for (int i = m-1; i > -1; i--) {
    for (int j = n-1; j > -1; j--) {
        // if the characters match, then go diagonally
        if (a.charAt(i) == b.charAt(j))
            lcss[i][j] = 1 + lcss[i+1][j+1];
        // else, compare the cell to your right and the cell below you,
        // and pick the larger one
        else
            lcss[i][j] = Integer.max(lcss[i][j+1], lcss[i+1][j]);
    }
}
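The table holds only lengths, but the subsequence itself can be read back by walking the table from the top-left cell. The sketch below is one way to do that, assuming the same suffix-indexed table as the loop above; the walk-back step and the class name are additions for illustration, not part of the lecture code.

```java
class LcsBottomUp {
    static String lcs(String a, String b) {
        int m = a.length(), n = b.length();
        // Java zero-initialises arrays, so row m and column n are already 0.
        int[][] lcss = new int[m + 1][n + 1];
        for (int i = m - 1; i >= 0; i--)
            for (int j = n - 1; j >= 0; j--)
                lcss[i][j] = a.charAt(i) == b.charAt(j)
                        ? 1 + lcss[i + 1][j + 1]
                        : Math.max(lcss[i][j + 1], lcss[i + 1][j]);

        // Walk from (0, 0): take a character on a match, otherwise
        // step towards the neighbour holding the larger value.
        StringBuilder out = new StringBuilder();
        int i = 0, j = 0;
        while (i < m && j < n) {
            if (a.charAt(i) == b.charAt(j)) {
                out.append(a.charAt(i));
                i++;
                j++;
            } else if (lcss[i + 1][j] >= lcss[i][j + 1]) {
                i++;
            } else {
                j++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(lcs("kitten", "sitting")); // ittn
    }
}
```

When both neighbours hold the same value, either direction leads to a longest common subsequence; this sketch arbitrarily prefers moving down.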
Edit Distance
- Problem:
- Given strings \(X = (x_1,x_2,\dots,x_m)\) and \(Y = (y_1,y_2,\dots,y_n)\), find the minimum number of transformations required to convert \(X\) to \(Y\)
- The most common transformations are:
- inserting a single character
- deleting a single character
- substituting a single character
Edit Distance
- The most common transformations are:
- inserting a single character
- lack \(\rightarrow\) slack
- deleting a single character
- slack \(\rightarrow\) sack
- substituting a single character
- lack \(\rightarrow\) sack
- First suggested by Levenshtein (1966)
- Levenshtein distance: each operation has a cost of 1
- what is the edit distance between lack and sack?
Edit Distance
- Note that CLRS (Exercise 15-5, page 406) is a bit more rigorous; it adds three more operations:
- twiddle: copy two characters, but switch the order
- e.g. blame \(\rightarrow\) balme
- kill: stop processing the first string
- e.g. internationalisation \(\rightarrow\) international
- copy: copy a character
- e.g. happy \(\rightarrow\) happy
- in CLRS this is 5 copy operations
Edit Distance
how does the algorithm work
- We should be able to convert kitten to sitting in 3 transformations, so the edit distance is 3:
- substitution: kitten \(\rightarrow\) sitten
- substitution: sitten \(\rightarrow\) sittin
- insertion: sittin \(\rightarrow\) sitting
- How did you get this?
Edit Distance
- There are several metrics for the edit distance problem
- Damerau-Levenshtein distance
- CA \(\rightarrow\) ABC, distance is two
- Hamming distance
- strings of equal length only
- Jaro-Winkler distance
- Let us stick with the basic Levenshtein distance version,
- i.e. insert, replace, delete
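For contrast with the Levenshtein version, the Hamming distance mentioned above is simple enough to write in a few lines. This is only an illustrative sketch with a made-up class name: it allows substitutions only, which is why it needs strings of equal length.

```java
class Hamming {
    // Number of positions at which two equal-length strings differ.
    static int distance(String a, String b) {
        if (a.length() != b.length())
            throw new IllegalArgumentException("strings must have equal length");
        int d = 0;
        for (int i = 0; i < a.length(); i++)
            if (a.charAt(i) != b.charAt(i)) d++;
        return d;
    }

    public static void main(String[] args) {
        System.out.println(distance("lack", "sack")); // 1
    }
}
```

It cannot compare kitten with sitting at all, since the lengths differ; insertions and deletions are what the Levenshtein version adds.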
Edit Distance
how does the algorithm work
kitten
sitting
(this is the example in https://en.wikipedia.org/wiki/Edit_distance)
Edit Distance
how does the algorithm work
- Think about the algorithm for LCSS
[Diagram: the LCSS recursion tree for (kitten, sitting): children (itten, sitting) and (kitten, itting), then (tten, sitting), (itten, itting) and (kitten, tting).]
Edit Distance
how does the algorithm work?
- In LCSS, how do you decide if you should cut off k or s?
- You don't, you try both!
- So with edit distance, how do you decide if you should insert, replace, or delete?
- You don't, you try all three!
Edit Distance
how does the algorithm work?
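[Diagram: from (kitten, sitting), insertion gives (skitten, sitting), replacement gives (sitten, sitting) and deletion gives (itten, sitting). Once the leading characters agree they are dropped, giving (kitten, itting) and (itten, itting). The node (itten, sitting) in turn branches into (sitten, sitting), (stten, sitting) and (tten, sitting).]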
Edit Distance
how does the algorithm work?
[Diagram: the same three-way recursion tree for (lack, sack). The first level is (slack, sack) by insertion, (sack, sack) by replacement and (ack, sack) by deletion. Further down, subproblems such as (ack, ack) and (ck, ck) appear many times.]
Edit Distance
how does the algorithm work?
- For each problem \(\text{EditDistance}(X,Y)\), you have three subproblems:
- insert: \(\text{EditDistance}(y_1+X,Y)\)
- replace: \(\text{EditDistance}(y_1+X_2,Y)\)
- delete: \(\text{EditDistance}(X_2,Y)\)
- where \(X_2 = \{x_2,x_3,\dots,x_m\}\)
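The three subproblems translate into a short recursive method. This is a sketch under two assumptions that go slightly beyond the slide: after an insertion or a replacement the now-matching leading characters are dropped straight away, and when \(x_1 = y_1\) already holds the method simply moves on at no cost. The class name is hypothetical.

```java
class EditDistanceNaive {
    static int editDistance(String x, String y) {
        if (x.isEmpty()) return y.length();   // insert everything left in y
        if (y.isEmpty()) return x.length();   // delete everything left in x
        if (x.charAt(0) == y.charAt(0))       // leading characters agree: no cost
            return editDistance(x.substring(1), y.substring(1));

        // insert y_1 in front of X, then drop the matching pair: (X, Y_2)
        int insert = 1 + editDistance(x, y.substring(1));
        // delete x_1: (X_2, Y)
        int delete = 1 + editDistance(x.substring(1), y);
        // replace x_1 by y_1, then drop the matching pair: (X_2, Y_2)
        int replace = 1 + editDistance(x.substring(1), y.substring(1));

        return Math.min(insert, Math.min(delete, replace));
    }

    public static void main(String[] args) {
        System.out.println(editDistance("lack", "sack"));      // 1
        System.out.println(editDistance("kitten", "sitting")); // 3
    }
}
```

Nothing is cached here, so the method revisits the same pairs of suffixes over and over; it is only practical for short strings.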
Edit Distance
brute-force complexity
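[Diagram: recursion tree. The root is a subproblem of size \(n\) with three children of size \(n-1\); level \(k\) has \(3^k\) subproblems, with total sizes \(3n-3\) and \(9n-18\) on levels 1 and 2.]
- \(n(1+3+9+\cdots+3^n) = n(3^{n+1}-1)/2\)
- The complexity is exponential, on the order of \(3^n\)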
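The closed form used on this slide is the standard geometric series. As a quick check, here it is worked for \(n = 4\), the length of lack:

\[ \sum_{k=0}^{n} 3^k = \frac{3^{n+1}-1}{3-1} = \frac{3^{n+1}-1}{2}, \qquad \text{e.g. for } n = 4:\quad 1+3+9+27+81 = 121 = \frac{3^5-1}{2}. \]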
Edit Distance
bottom-up approach
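[Diagram: the subproblems (lack, sack), (ack, sack), (ck, sack), (k, sack) and (empty, sack) laid out in a row, together with the insertion state (slack, sack) and the substitution state (sack, sack); a "???" marks that these two have no obvious place in the row.]
- After you do an insertion, the next step is guaranteed to be the removal of the leading characters, e.g. (slack, sack) becomes (lack, ack)
- No need to store this state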
Edit Distance
bottom-up approach
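[Diagram: the row (lack, sack), (ack, sack), (ck, sack), (k, sack), (empty, sack) and the column (lack, ack), (lack, ck), (lack, k), (lack, empty), plus the diagonal subproblem (ack, ack).]
- After you do a substitution, the next step is also guaranteed to be the removal of the leading characters, e.g. (sack, sack) becomes (ack, ack)
- No need to store this state either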
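Once the intermediate states are skipped, every subproblem is a pair of suffixes, and the three choices can be written as a recurrence. The form below is a reconstruction from the slides rather than a formula they state; \(Y_2 = \{y_2,y_3,\dots,y_n\}\) is defined by analogy with \(X_2\), and the zero-cost case for \(x_1 = y_1\) is an assumption.

\[ \text{EditDistance}(X,Y) = \begin{cases} |Y| & \text{if \(X\) is empty}\\ |X| & \text{if \(Y\) is empty}\\ \text{EditDistance}(X_2,Y_2) & \text{if \(x_1 = y_1\)}\\ 1 + \min\left(\text{EditDistance}(X,Y_2),\ \text{EditDistance}(X_2,Y),\ \text{EditDistance}(X_2,Y_2)\right) & \text{if \(x_1 \ne y_1\)}\end{cases} \]

In the last case the three terms correspond to insert, delete and replace, in that order.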
Edit Distance
bottom-up approach
[Diagram: all 25 subproblems arranged in a grid, pairing each suffix of lack (lack, ack, ck, k, empty) with each suffix of sack (sack, ack, ck, k, empty).]
Edit Distance
bottom-up approach
        | lack | ack | ck | k | (empty) |
sack    | 1    | 1   | 2  | 3 | 4       |
ack     | 1    | 0   | 1  | 2 | 3       |
ck      | 2    | 1   | 0  | 1 | 2       |
k       | 3    | 2   | 1  | 0 | 1       |
(empty) | 4    | 3   | 2  | 1 | 0       |
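A table like this one can be filled from the bottom-right corner, in the same style as the bottom-up LCSS loop. The following is a sketch, not the lecture's implementation: `d[i][j]` is assumed to hold the distance between the suffix of `x` starting at `i` and the suffix of `y` starting at `j`, so with `x = "lack"` and `y = "sack"` the index `i` selects a column of the table above and `j` a row.

```java
class EditDistanceBottomUp {
    static int[][] table(String x, String y) {
        int m = x.length(), n = y.length();
        int[][] d = new int[m + 1][n + 1];
        for (int i = 0; i <= m; i++) d[i][n] = m - i; // y used up: delete the rest of x
        for (int j = 0; j <= n; j++) d[m][j] = n - j; // x used up: insert the rest of y
        for (int i = m - 1; i >= 0; i--) {
            for (int j = n - 1; j >= 0; j--) {
                if (x.charAt(i) == y.charAt(j))
                    d[i][j] = d[i + 1][j + 1];              // match: go diagonally for free
                else
                    d[i][j] = 1 + Math.min(d[i + 1][j + 1], // replace
                                  Math.min(d[i + 1][j],     // delete
                                           d[i][j + 1]));   // insert
            }
        }
        return d;
    }

    static int editDistance(String x, String y) {
        return table(x, y)[0][0];
    }

    public static void main(String[] args) {
        System.out.println(editDistance("lack", "sack")); // 1
    }
}
```

The table has \((m+1)(n+1)\) cells and each is filled in constant time, so the running time is \(O(mn)\).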
Edit Distance
bottom-up approach
[Diagram: the subproblems arranged in a grid, pairing each suffix of kitten (kitten, itten, tten, ten, ...) with each suffix of sitting (sitting, itting, tting, ting, ...).]
Edit Distance
bottom-up approach
        | kitten | itten | tten | ten | en | n | (empty) |
sitting |        |       |      |     |    |   | 7       |
itting  |        |       |      |     |    |   | 6       |
tting   |        |       |      |     |    |   | 5       |
ting    |        |       |      |     |    |   | 4       |
ing     |        |       |      |     |    |   | 3       |
ng      |        |       |      |     |    |   | 2       |
g       |        |       |      |     |    | 1 | 1       |
(empty) | 6      | 5     | 4    | 3   | 2  | 1 | 0       |
Edit Distance
bottom-up approach
        | kitten | itten | tten | ten | en | n | (empty) |
sitting |        |       |      |     |    | 6 | 7       |
itting  |        |       |      |     |    | 5 | 6       |
tting   |        |       |      |     |    | 4 | 5       |
ting    |        |       |      |     |    | 3 | 4       |
ing     |        |       |      |     |    | 2 | 3       |
ng      |        |       |      |     | 2  | 1 | 2       |
g       | 6      | 5     | 4    | 3   | 2  | 1 | 1       |
(empty) | 6      | 5     | 4    | 3   | 2  | 1 | 0       |
Edit Distance
bottom-up approach
        | kitten | itten | tten | ten | en | n | (empty) |
sitting |        |       |      |     | 6  | 6 | 7       |
itting  |        |       |      |     | 5  | 5 | 6       |
tting   |        |       |      |     | 4  | 4 | 5       |
ting    |        |       |      |     | 3  | 3 | 4       |
ing     |        |       |      | 3   | 2  | 2 | 3       |
ng      | 6      | 5     | 4    | 3   | 2  | 1 | 2       |
g       | 6      | 5     | 4    | 3   | 2  | 1 | 1       |
(empty) | 6      | 5     | 4    | 3   | 2  | 1 | 0       |
Edit Distance
bottom-up approach
        | kitten | itten | tten | ten | en | n | (empty) |
sitting | 3      | 3     | 4    | 5   | 6  | 6 | 7       |
itting  | 3      | 2     | 3    | 4   | 5  | 5 | 6       |
tting   | 4      | 3     | 2    | 3   | 4  | 4 | 5       |
ting    | 5      | 4     | 3    | 2   | 3  | 3 | 4       |
ing     | 5      | 4     | 4    | 3   | 2  | 2 | 3       |
ng      | 6      | 5     | 4    | 3   | 2  | 1 | 2       |
g       | 6      | 5     | 4    | 3   | 2  | 1 | 1       |
(empty) | 6      | 5     | 4    | 3   | 2  | 1 | 0       |
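The completed table also answers the earlier question of how the three transformations were found: starting from the top-left cell, follow whichever neighbour explains the current value. The sketch below is one possible walk, assuming a suffix-indexed table `d[i][j]` for `x.substring(i)` and `y.substring(j)`; the class name, the output format and the tie-breaking order (substitute, then delete, then insert) are choices made for this example.

```java
import java.util.ArrayList;
import java.util.List;

class EditScript {
    static List<String> script(String x, String y) {
        int m = x.length(), n = y.length();
        int[][] d = new int[m + 1][n + 1];
        for (int i = 0; i <= m; i++) d[i][n] = m - i;
        for (int j = 0; j <= n; j++) d[m][j] = n - j;
        for (int i = m - 1; i >= 0; i--)
            for (int j = n - 1; j >= 0; j--)
                d[i][j] = x.charAt(i) == y.charAt(j)
                        ? d[i + 1][j + 1]
                        : 1 + Math.min(d[i + 1][j + 1], Math.min(d[i + 1][j], d[i][j + 1]));

        // Walk from (0, 0) to (m, n), recording each operation that costs 1.
        List<String> ops = new ArrayList<>();
        int i = 0, j = 0;
        while (i < m || j < n) {
            if (i < m && j < n && x.charAt(i) == y.charAt(j)) {
                i++; j++;                                        // match, no cost
            } else if (i < m && j < n && d[i][j] == 1 + d[i + 1][j + 1]) {
                ops.add("substitute " + x.charAt(i) + " -> " + y.charAt(j));
                i++; j++;
            } else if (i < m && d[i][j] == 1 + d[i + 1][j]) {
                ops.add("delete " + x.charAt(i));
                i++;
            } else {
                ops.add("insert " + y.charAt(j));
                j++;
            }
        }
        return ops;
    }

    public static void main(String[] args) {
        // prints [substitute k -> s, substitute e -> i, insert g]
        System.out.println(script("kitten", "sitting"));
    }
}
```

Other optimal scripts may exist for a given pair of strings; a different tie-breaking order would simply pick a different one of the same length.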
A Couple More Algorithms
- Huffman encoding
- String comparison (KMP algorithm)
COMP333 Algorithm Theory and Design - W7 2019 - Strings Algorithms
By Daniel Sutantyo
Lecture notes for Week 7 of COMP333, 2019, Macquarie University