String Matching

Definitions

i 0 1 2 3 4 5 6 7
S a l p h a b e t

S_{i}: the i-th character of string S (0-indexed)
Ex: S_{2} = p

S_{i...j}: the substring of S formed by the characters at positions [i, j)
Ex: S_{2...5} = pha

L_{S}: the length of string S

Problem

Given two strings A, B, determine whether A has a substring A_{i...j} = B.

The simplest method is to try every alignment, shifting B along A by one position each time. Its complexity is O(L_{A}L_{B}).
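The brute-force method can be sketched as follows. This is a minimal illustration only; the function name `naiveFind` and the choice to return the first matching index (or -1) are assumptions, not part of the slides.

```cpp
#include <string>
using namespace std;

// Hypothetical helper: returns the first index i with A_{i...i+L_B} = B, or -1 if there is none.
int naiveFind(const string& a, const string& b){
	int n = a.size(), m = b.size();
	for(int i = 0; i + m <= n; i++){          // every alignment of b inside a
		int j = 0;
		while(j < m && a[i+j] == b[j]) j++;   // compare left to right
		if(j == m) return i;                  // all m characters matched
	}
	return -1;
}
```

Each of the roughly L_{A} alignments costs up to L_{B} comparisons, which is where the O(L_{A}L_{B}) bound comes from.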

Knuth-Morris-Pratt Algorithm

(KMP)

Failure Function

F_{B}(j) = \begin{cases} -1, & \text{if } j = 0 \\ \max\{p : B_{0...p} = B_{j-p...j} \text{ and } 0 \leq p < j\}, & \text{otherwise} \end{cases}
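To make the definition concrete, here is a brute-force sketch that evaluates it literally, trying the largest p first. It is far slower than the real computation and is only meant as a cross-check; the name `failureByDefinition` is an assumption for illustration.

```cpp
#include <string>
#include <vector>
using namespace std;

// Hypothetical helper: F[j] for j = 0..L_B, computed straight from the definition.
vector<int> failureByDefinition(const string& b){
	int m = b.size();
	vector<int> F(m+1, -1);                  // F[0] = -1 by definition
	for(int j = 1; j <= m; j++){
		for(int p = j-1; p >= 0; p--){       // largest candidate first
			// B_{0...p} is the first p characters; B_{j-p...j} is the p characters ending before j
			if(b.compare(0, p, b, j-p, p) == 0){ F[j] = p; break; }
		}
	}
	return F;
}
```

For instance, `failureByDefinition("ABAB")` yields -1, 0, 0, 1, 2: the prefix "ABAB" has "AB" as its longest proper prefix that is also a suffix.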

Computation

F(0)=-1,\ F(1)=0

Given the values of F(k) (\forall k < i), how do we find F(i)?

let j = F(i-1)
if B_{i-1} = B_{j}
    B_{0...j} = B_{i-1-j...i-1}  (by definition)
    B_{0...j+1} = B_{i-1-j...i} = B_{i-(1+j)...i}
    \therefore F(i) = j+1
else
    let j = F(j) and check again
    until j = -1, in which case F(i) = 0

Matching

i 0 1 2 3 4 5 6 7 8
A A A B A A A A A C
j 0 1 2 3 4 5 6
B A A B A A A
F(j) -1 0 1 0 1 2 2
if j = L_{B} \Rightarrow success!
else if j = -1 or A_{i} = B_{j}
    i++, j++
else
    j = F(j) and check again

Example

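#include <string>
#include <vector>
using namespace std;

string a, b;      // a: text, b: pattern
vector<int> F;    // failure function of b, indices 0..b.size()
int cnt;          // number of occurrences of b in a

void KMP(){
	cnt = 0;
	int n = a.size(), m = b.size();
	// failure function
	F.assign(m+2, 0);
	F[0] = -1, F[1] = 0;
	for(int i = 2; i <= m; i++){
		int j = F[i-1];
		while(j != -1 && b[i-1] != b[j]) j = F[j];
		F[i] = j+1;   // j == -1 gives 0
	}
	// string matching
	for(int i = 0, j = 0; ; ){
		if(j == m) cnt++, j = F[j];                  // full match ending at i
		else if(i == n) break;                       // text exhausted
		else if(j == -1 || a[i] == b[j]) i++, j++;   // test j first so b[-1] is never read
		else j = F[j];
	}
}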

Gusfield's Algorithm

(Z Algorithm)

Z Function

Z_{A}(i) = \begin{cases} 0, & \text{if } i = 0 \\ \max\{p : A_{0...p} = A_{i...i+p}\}, & \text{otherwise} \end{cases}
j 0 1 2 3 4 5 6 7 8
A A B A B A B A A B
Z(j) 0 0 5 0 3 0 1 2 0
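As a reference point for the definition, the Z function can be evaluated directly by extending the common prefix at every position. This is a quadratic sketch for checking small cases by hand, not the linear algorithm; the name `zByDefinition` is an assumption.

```cpp
#include <string>
#include <vector>
using namespace std;

// Hypothetical helper: Z[i] = length of the longest common prefix of s and s[i..], with Z[0] = 0.
vector<int> zByDefinition(const string& s){
	int n = s.size();
	vector<int> Z(n, 0);                          // Z[0] = 0 by definition
	for(int i = 1; i < n; i++){
		int p = 0;
		while(i+p < n && s[p] == s[i+p]) p++;     // extend while the characters agree
		Z[i] = p;
	}
	return Z;
}
```

For instance, `zByDefinition("AABAAB")` yields 0, 1, 0, 3, 1, 0.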

Let C = B+\phi+A, where \phi is a character that appears in neither A nor B.

The original problem becomes: does there exist k with 0 \leq k < L_{A} such that Z_{C}(L_{B}+1+k) = L_{B}?

Given the values of Z(k) (\forall k < i), how do we find Z(i)?

Let l be the index with l < i for which l+Z(l) is largest, and let i' = i-l.

1. Z(l)+l \leq i: compare character by character starting from i, then set l = i
2. l+Z(l) > i+Z(i'): Z(i) = Z(i')
3. l+Z(l) = i+Z(i'): start comparing from i+Z(i'), then set l = i
4. l+Z(l) < i+Z(i'): Z(i) = Z(l)-i'

Example

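#include <string>
#include <vector>
using namespace std;

string a, b;      // a: text, b: pattern
vector<int> Z;    // Z function of c = b + "!" + a
int cnt;          // number of occurrences of b in a

void gusfield(){
	int L_b = b.size();          // measure the pattern inside the function, before concatenating
	string c = b + "!" + a;      // assumes '!' occurs in neither a nor b
	int n = c.size();
	Z.assign(n, 0);
	cnt = 0;
	int l = 0, k;
	for(int i = 1; i < n; i++){
		k = i-l;
		if(l+Z[l] <= i){                 // case 1: i is outside the known match
			int j = 0;
			while(i+j < n && c[j] == c[i+j]) j++;
			Z[i] = j, l = i;
		}
		else if(l+Z[l] > i+Z[k])         // case 2: the copy fits strictly inside
			Z[i] = Z[k];
		else if(l+Z[l] == i+Z[k]){       // case 3: the copy reaches the boundary, keep comparing
			int j = Z[k];
			while(i+j < n && c[j] == c[i+j]) j++;
			Z[i] = j, l = i;
		}
		else                             // case 4: the copy overshoots the boundary
			Z[i] = Z[l]-k;

		if(Z[i] == L_b) cnt++;
	}
}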

Hash

Hashing is a way to "classify" strings: there is usually a hash function h(A), meaning that string A belongs to class h(A).

\bigcirc\ A=B \Rightarrow h(A)=h(B)  (always holds)
?\ h(A)=h(B) \Rightarrow A=B  (not guaranteed)

Collision: h(A)=h(B) and A \ne B

Hash Function

h(A)=\left(\sum\limits_{i=0}^{L_{A}-1}A_{i}\ p^{L_{A}-i-1}\right)\ mod\ M

where p, M are two distinct primes

Ex: A="jizz",\ M=29,\ p=31

h(A)=\left(\sum\limits_{i=0}^{L_{A}-1}A_{i}\ 31^{L_{A}-i-1}\right)\ mod\ 29
=(j\times31^{3}+i\times31^{2}+z\times31^{1}+z) \ mod\ 29
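The formula can be evaluated left to right with Horner's rule, as in the sketch below. The name `polyHash` is hypothetical, and treating each character as its char code (ASCII here) is an assumption; the slides do not say how characters map to numbers.

```cpp
#include <string>
using namespace std;

// Hypothetical helper: h(A) = (sum of A_i * p^(L_A-i-1)) mod M, via Horner's rule.
// Assumes characters are interpreted as their char codes and that M*p fits in long long.
long long polyHash(const string& s, long long p, long long M){
	long long h = 0;
	for(unsigned char c : s)
		h = (h * p + c) % M;   // multiply the running value by p, then add the next character
	return h;
}
```

Under the ASCII assumption, `polyHash("jizz", 31, 29)` gives 10, and so does `polyHash("a", 31, 29)`, since 97 mod 29 = 10. With a modulus as small as 29, collisions like this are unavoidable, so a practical M is usually much larger.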

TIOJ

  • 1306
  • 1321

By hfy880916