String

Outline

Hash
Z-Algorithm
KMP
Manacher
Trie
AC Automaton
Suffix Array
Main-Lorentz

Hash

What is hash?

For a hash function \(f\),

\(x=y \Rightarrow f(x)=f(y) \)

\(x \neq y \Rightarrow f(x) \neq f(y) \) (very high prob.)

For two strings \(s,t\), if we want to know whether \(s\) and \(t\) are the same, we can hash them and check whether \(f(s)=f(t)\).

Rabin-Karp

Given a string \(s_0...s_{n-1}\), define \(a[i]=s_i \cdot p^i\); the hash of \(s\) is \( \sum_{i} a[i] \bmod M \) for a fixed base \(p\) and modulus \(M\).
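A minimal sketch of the definition above, assuming the constants \(p=127\) and \(M=998244353\). The names `PrefixHash`, `raw` and `same` are hypothetical helpers introduced only for illustration. Because the prefix sums of \(a[i]\) give each substring hash multiplied by a power of \(p\), two substrings are compared after scaling both sides to the same power.

```cpp
#include <bits/stdc++.h>
using namespace std;

const long long P = 127, M = 998244353; // assumed base and modulus

struct PrefixHash {
  vector<long long> pre, pw; // pre[i] = a[0]+...+a[i-1] mod M, pw[i] = P^i mod M
  PrefixHash(const string& s) : pre(s.size() + 1, 0), pw(s.size() + 1, 1) {
    for (size_t i = 0; i < s.size(); i++) {
      pw[i + 1] = pw[i] * P % M;
      pre[i + 1] = (pre[i] + (unsigned char)s[i] * pw[i]) % M; // a[i] = s_i * p^i
    }
  }
  // Sum of a[l..r-1]; this equals hash(s[l..r)) * p^l (mod M).
  long long raw(int l, int r) const { return (pre[r] - pre[l] + M) % M; }
  // Compare s[l1..l1+len) with s[l2..l2+len): scale both sides to p^(l1+l2).
  bool same(int l1, int l2, int len) const {
    return raw(l1, l1 + len) * pw[l2] % M == raw(l2, l2 + len) * pw[l1] % M;
  }
};
```

Equal hashes only imply equal substrings with high probability, so a collision is still possible in principle.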

Problem

TIOJ 1306

Given a string \(s\), answer \(q\) queries:

given a string \(t\), print the number of occurrences of \(t\) in \(s\)

\(|s|, |t| \leq 10000\)

\(q \leq 50000 \)

\( \sum |t| \leq 350000\)

Solution

Use prefix sums of the hash; then, for every starting position in \(s\), we can compare the substring of length \(|t|\) against \(t\) in \(O(1)\), which is \(O(|s|)\) per query.

Solution

#include <bits/stdc++.h>
#define IO ios::sync_with_stdio(0);cin.tie(0);cout.tie(0);
#define int long long
using namespace std;

const int p=127,M=998244353; // hash base and modulus

// pref[i]: hash of T[0..i], po[i]: p^i mod M
int pref[10005],po[10005];

signed main(){
	IO
	po[0]=1;
	for(int i=1;i<10005;i++) po[i]=po[i-1]*p%M;
	int tc;cin >> tc;
	while(tc--){
		string T;cin >> T;
		int n=T.length();
		pref[0]=T[0];
		for(int i=1;i<n;i++) pref[i]=(pref[i-1]*p+T[i])%M;
		int q;cin >> q;
		while(q--){
			int has=0,cnt=0; // has: hash of the pattern P
			string P;cin >> P;
			int m=P.length();
			for(int i=0;i<m;i++) has=(has*p+P[i])%M;
			// i is the last index of the window T[i-m+1..i]
			for(int i=m-1;i<n;i++){
				int cur=pref[i];
				// remove the contribution of the prefix T[0..i-m]
				if(i>=m) cur=((cur-pref[i-m]*po[m])%M+M)%M;
				if(cur==has) cnt++;
			}
			cout << cnt << '\n';
		}
	}
	return 0;
}

Z-algorithm

Given a string \(s_0...s_{n-1}\), define an array \(z\):

\(z[i]=\) the biggest \(k\) that satisfies

\(s_0...s_{k-1}=s_is_{i+1}...s_{i+k-1}\)

(\(k=0\) if \(s_0 \neq s_i\))

Calculate \(z\)

Say we know \(z[0] \sim z[i-1]\).

First, we try to find the lower bound of \(z[i]\)

let \(l= \argmax_{0 \leq j \leq i-1} (j+z[j]-1), r=l+z[l]-1\)

\(\Rightarrow s_0...s_{r-l}=s_l...s_r\).

if \(i \leq r\), we know that \(s_{i-l}...s_{r-l}=s_i...s_r\),

\( \Rightarrow z[i]\) is at least \( \min(z[i-l],r-i+1)\)

Calculate \(z\)

Then, we can repeatedly check if \(s[z[i]]=s[i+z[i]]\),

and update \(z[i]\).

Finally, we can update \(l,r\) if \(i+z[i]-1 > r\).

Notice that \(r\) never decreases, and every successful comparison in the loop increases \(r\) by one at \(O(1)\) cost, so the algorithm is \(O(n)\) amortized.

Implementation

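vector<int> z_algo(string &s){
	int n=s.size();
	vector<int> z(n,0); // z[0] is left as 0
	// [l,r]: the match s[l..r]=s[0..r-l] with the largest r found so far
	for(int i=1,l=0,r=0;i<n;i++){
		if(i<=r) z[i]=min(z[i-l],r-i+1); // lower bound from the mirrored position i-l
		while(i+z[i]<n&&s[z[i]]==s[i+z[i]]) z[i]++; // extend one character at a time
		if(i+z[i]-1>r) l=i,r=i+z[i]-1;
	}
	return z;
}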

Problem

CSES Finding Borders

Given a string \(s\), count the strings \(t\) that satisfy:

\(t\) is both a prefix and a suffix of \(s\) (a border of \(s\)).

Solution

Count the indices \(i \geq 1\) with \(i+z[i]=n\), i.e. the match starting at \(i\) reaches the end of \(s\).
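As a rough sketch of this solution: with a 0-indexed \(z\) array and \(z[0]=0\), the suffix starting at \(i \geq 1\) is a border exactly when \(i+z[i]=n\), and its length is \(n-i\). The helper names `z_array` and `border_lengths` are made up for this example, and returning the lengths (rather than only their count) is an assumption about what the judge expects.

```cpp
#include <bits/stdc++.h>
using namespace std;

// Z-array with z[0] = 0, same convention as the implementation above.
vector<int> z_array(const string& s) {
  int n = s.size();
  vector<int> z(n, 0);
  for (int i = 1, l = 0, r = 0; i < n; i++) {
    if (i <= r) z[i] = min(z[i - l], r - i + 1);
    while (i + z[i] < n && s[z[i]] == s[i + z[i]]) z[i]++;
    if (i + z[i] - 1 > r) l = i, r = i + z[i] - 1;
  }
  return z;
}

// Lengths of all proper borders of s, in increasing order.
vector<int> border_lengths(const string& s) {
  int n = s.size();
  vector<int> z = z_array(s), res;
  for (int i = n - 1; i >= 1; i--)          // larger i means a shorter border
    if (i + z[i] == n) res.push_back(n - i); // the match at i reaches the end of s
  return res;
}
```

For example, `border_lengths("abcababcab")` yields the lengths 2 and 5, for the borders `ab` and `abcab`.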

KMP

Given a string \(s_0...s_{n-1}\), define failure function \(p\):

\(p[i]=\) the biggest \(k<i+1\) that satisfies

\(s_0...s_{k-1}=s_{i-k+1}...s_i\)

Build

Say we know \(p[0] \sim p[i-1]\).

let \(j=p[i-1]\), if \(s_j=s_i \Rightarrow p[i]=j+1\)

otherwise, while \(j \neq 0\), we keep setting \(j=p[j-1]\) and check the condition again; if it never holds, \(p[i]=0\).

Build

Notice that \(j\) increases by at most one per \(i\), i.e. at most \(n\) times in total, and every step \(j=p[j-1]\) decreases it, so the algorithm is \(O(n)\) amortized.

Implementation

vector<int> kmp(string &s){
	int n=s.size();
	vector<int> pi(n,0);
	for(int i=1;i<n;i++){
		int j=pi[i-1];
		while(j>0&&s[i]!=s[j]) j=pi[j-1];
		if(s[i]==s[j]) j++;
		pi[i]=j;
	}
	return pi;
}

Problem

TIOJ 1306

Given a string \(s\), answer \(q\) queries:

given a string \(t\), print the number of occurrences of \(t\) in \(s\)

\(|s|, |t| \leq 10000\)

\(q \leq 50000 \)

\( \sum |t| \leq 350000\)

Solution

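For every \(t\), calculate its failure function \(p\).

Maintain \(r\), the number of characters of \(t\) currently matched, so that \(t_0...t_{r-1}\) is a suffix of \(s_0...s_i\).

While \(r>0\) and \(s_{i+1} \neq t_r\), set \(r=p[r-1]\); then, if \(s_{i+1}=t_r\), increase \(r\) by one. Whenever \(r=|t|\), an occurrence ends at \(s_{i+1}\): count it and set \(r=p[r-1]\) to keep matching.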
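The matching loop could look roughly like the sketch below. Here `r` is the number of characters of \(t\) matched so far, so on a mismatch it falls back to `pi[r-1]`. The function names `failure` and `count_occurrences` are illustrative, and returning 0 for an empty pattern is an assumption.

```cpp
#include <bits/stdc++.h>
using namespace std;

// Failure function of t: pi[i] = length of the longest proper border of t[0..i].
vector<int> failure(const string& t) {
  int n = t.size();
  vector<int> pi(n, 0);
  for (int i = 1; i < n; i++) {
    int j = pi[i - 1];
    while (j > 0 && t[i] != t[j]) j = pi[j - 1];
    if (t[i] == t[j]) j++;
    pi[i] = j;
  }
  return pi;
}

// Number of (possibly overlapping) occurrences of t in s.
int count_occurrences(const string& s, const string& t) {
  if (t.empty()) return 0; // assumption: empty patterns are not counted
  vector<int> pi = failure(t);
  int m = t.size(), r = 0, cnt = 0; // r = number of characters of t matched so far
  for (char c : s) {
    while (r > 0 && c != t[r]) r = pi[r - 1]; // fall back to a shorter border
    if (c == t[r]) r++;
    if (r == m) {      // a full occurrence ends at this character
      cnt++;
      r = pi[r - 1];   // keep matching, overlaps allowed
    }
  }
  return cnt;
}
```

Each query then costs \(O(|s|+|t|)\), since `r` grows by at most one per character of \(s\).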

Why failure function?

The name comes from the fact that when a match fails, we can jump straight to the longest prefix that can still match, without rescanning \(s\).

Manacher's Algorithm

Problem

CSES - Longest Palindrome

Given a string \(s\), find the longest palindromic substring.

\(|s| \leq 10^6 \)

First of all

Palindromes include two kinds:

1. odd length, center is a position in the string

2. even length, center is a position between two characters

Hard to deal with...

Insert a separator that does not occur in the string (e.g. '*') between every two characters and at both ends of the string. Every palindrome of length len then becomes a palindrome of odd length (2*len+1)!

Construct

For a string \(s\) (after inserting '*'), define an array \(p\):

\(p[i]=\) the biggest \(k\) such that \(s_{i-k+1}...s_{i+k-1}\) is a palindrome, i.e. \(s_{i-j}=s_{i+j}\) for all \(0 \leq j < k\)

Then, how can we construct the array?

Calculate \(p\)

Say we have \(p[0] \sim p[i-1] \).

Let \(x= \argmax_{0 \leq j \leq i-1} j+p[j]-1\),

since \(s_{x-p[x]+1}...s_{x+p[x]-1}\) is a palindrome centered at \(x\), position \(i\) mirrors position \(2x-i\), so for \(i \leq x+p[x]-1\):

\( \Rightarrow p[i] \geq \min(p[2x-i], p[x]-(i-x))\)

Same idea as the Z-algorithm!

Implementation

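vector<int> manacher(string &ss){
  // s: ss with the separator '.' between characters and at both ends
  string s;
  s.resize(ss.size()*2+1,'.');
  for(int i=0;i<ss.size();i++){
    s[i*2+1]=ss[i];
  }
  int m=s.size();
  vector<int> p(m,1);
  // [2l-r,r]: the palindrome centered at l that reaches furthest to the right
  for(int i=0,l=0,r=0;i<m;i++){
    // mirror only when i lies inside that palindrome, otherwise 2l-i may be out of range
    if(i<=r) p[i]=min(p[l*2-i],r-i+1);
    while(0<=i-p[i]&&i+p[i]<m&&s[i-p[i]]==s[i+p[i]]) p[i]++;
    if(i+p[i]-1>r) l=i,r=i+p[i]-1;
  }
  // p[i]-1 is the length of the palindrome of ss centered at position i of s
  return p;
}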
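To connect the array \(p\) back to the original problem, here is a hedged sketch that recovers the longest palindromic substring itself. It assumes the input never contains the separator '.', and the name `longest_palindrome` is invented for this example. A center \(i\) of the padded string with radius \(p[i]\) corresponds to a palindrome of length \(p[i]-1\) starting at index \((i-p[i]+1)/2\) of the original string.

```cpp
#include <bits/stdc++.h>
using namespace std;

// Longest palindromic substring of ss (the leftmost one if there are ties).
string longest_palindrome(const string& ss) {
  int m = ss.size() * 2 + 1;
  string s(m, '.');                       // padded string: .a.b.c.
  for (int i = 0; i < (int)ss.size(); i++) s[i * 2 + 1] = ss[i];
  vector<int> p(m, 1);                    // p[i]: palindrome radius around i in s
  for (int i = 0, l = 0, r = 0; i < m; i++) {
    if (i <= r) p[i] = min(p[l * 2 - i], r - i + 1);
    while (0 <= i - p[i] && i + p[i] < m && s[i - p[i]] == s[i + p[i]]) p[i]++;
    if (i + p[i] - 1 > r) l = i, r = i + p[i] - 1;
  }
  int best = 0;
  for (int i = 1; i < m; i++)
    if (p[i] > p[best]) best = i;         // strict '>' keeps the leftmost maximum
  // The palindrome spans s[best-p[best]+1 .. best+p[best]-1], which starts on a separator.
  return ss.substr((best - p[best] + 1) / 2, p[best] - 1);
}
```

For instance, `longest_palindrome("aybabtu")` gives `bab`.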

Trie

Implementation

// untested sketch; N is an assumed upper bound on the number of trie nodes
const int N=5e5+5;
int ch[N][26]{0},cnt[N]{0},ptr=0;
void insert(string &s){
  int cur=0;
  for(int i=0;i<s.length();i++){
    if(!ch[cur][s[i]-'a']) ch[cur][s[i]-'a']=++ptr;
    cur=ch[cur][s[i]-'a'];
  }
  cnt[cur]++;
}

so ez la
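The snippet above only inserts, so here is a small self-contained sketch that also looks strings up. It uses vectors instead of fixed arrays so that no bound `N` is needed; the struct name `Trie` and the method `count` are assumptions made for this example, and only lowercase letters are supported.

```cpp
#include <bits/stdc++.h>
using namespace std;

struct Trie {
  vector<array<int, 26>> ch; // ch[u][c]: child of u by letter c, 0 = no child
  vector<int> cnt;           // cnt[u]: number of inserted strings ending at u
  Trie() : ch(1), cnt(1, 0) { ch[0].fill(0); }

  void insert(const string& s) {
    int cur = 0;
    for (char c : s) {
      int x = c - 'a';
      if (!ch[cur][x]) {
        int id = ch.size();  // index of the node about to be created
        ch.emplace_back();
        ch.back().fill(0);
        cnt.push_back(0);
        ch[cur][x] = id;
      }
      cur = ch[cur][x];
    }
    cnt[cur]++;
  }

  // How many times s was inserted.
  int count(const string& s) const {
    int cur = 0;
    for (char c : s) {
      cur = ch[cur][c - 'a'];
      if (!cur) return 0;    // fell off the trie
    }
    return cnt[cur];
  }
};
```

Both operations take time proportional to the length of the string.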

Problem

2021 Taipei City Contest (北市賽) pB

I forgot the problem statement lol

Problem

Given an array \(a\), find the pair \((i,j)\) where \(a_i \oplus a_j\) is the biggest among all pairs.

\(n \leq 10^5, a_i \leq 10^9 \)
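One common way to approach this with a trie (not necessarily the intended solution here) is to store every number as a 30-bit binary string and, for each \(a_j\), walk down greedily, preferring the opposite bit at every level. The sketch below returns only the maximum XOR value; tracking the pair \((i,j)\) would need an index stored at the leaves. The function name `max_xor_pair` is hypothetical.

```cpp
#include <bits/stdc++.h>
using namespace std;

// Maximum a[i] xor a[j] over all pairs i < j (0 if there are fewer than two elements).
long long max_xor_pair(const vector<int>& a) {
  const int B = 30;                               // a_i <= 1e9 < 2^30
  vector<array<int, 2>> ch(1, array<int, 2>{0, 0}); // binary trie, 0 = no child
  long long best = 0;
  for (size_t idx = 0; idx < a.size(); idx++) {
    int x = a[idx];
    if (idx > 0) {                                // query against earlier numbers
      int cur = 0;
      long long got = 0;
      for (int b = B - 1; b >= 0; b--) {
        int bit = (x >> b) & 1;
        if (ch[cur][bit ^ 1]) {                   // opposite bit available: take it
          got |= 1LL << b;
          cur = ch[cur][bit ^ 1];
        } else {
          cur = ch[cur][bit];                     // exists, all stored paths have depth B
        }
      }
      best = max(best, got);
    }
    int cur = 0;                                  // insert x
    for (int b = B - 1; b >= 0; b--) {
      int bit = (x >> b) & 1;
      if (!ch[cur][bit]) {
        int id = ch.size();
        ch.push_back(array<int, 2>{0, 0});
        ch[cur][bit] = id;
      }
      cur = ch[cur][bit];
    }
  }
  return best;
}
```

This runs in \(O(30n)\) time, compared with \(O(n^2)\) for checking every pair.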

AC Automaton

Problem

CSES - Finding Patterns

CSES - Counting Patterns

(The stronger version of TIOJ 1306)

\(|s| \leq 10^5\)

\( \sum |t| \leq 5*10^5\)

Can't get AC with hash, Z, or KMP...

Aho-Corasick Algorithm

Trie with fail link!

A fail link from \(u\) to \(v\): \(v\) represents the longest proper suffix of the string of \(u\) that exists in the trie.

Remember what the failure function is?

Demo

Build

Tree edge: just a trie

Fail link: a simple bfs would work!

Implementation

const int N=5e5+5;
int ch[N][26]{0},fail[N]{0},ptr=0;
 
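void insert(string &s,int ind){
  int cur=0;
  for(int i=0;i<s.size();i++){
    if(!ch[cur][s[i]-'a']) ch[cur][s[i]-'a']=++ptr;
    cur=ch[cur][s[i]-'a'];
  }
  // cur is the vertex where pattern number ind ends; store it (e.g. in an array indexed by ind) to answer queries
}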
 
void build(){
  queue<int> q;
  for(int i=0;i<26;i++) if(ch[0][i]) q.push(ch[0][i]);
  while(!q.empty()){
    int cur=q.front();q.pop();
    for(int i=0;i<26;i++){
      if(!ch[cur][i]) ch[cur][i]=ch[fail[cur]][i];
      else{
        q.push(ch[cur][i]);
        int tem=fail[cur];
        while(tem&&!ch[tem][i]) tem=fail[tem];
        fail[ch[cur][i]]=ch[tem][i];
      }
    }
  }
}

So how to solve the problem?

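Matching: walk along the tree edges with the characters of \(s\); if there is no such edge, take fail links until there is one (the completed transitions in build() already do this).

Maintain a visit count on each vertex. Visiting \(u\) also means that every vertex on \(u\)'s fail-link chain matches, so afterwards a DFS on the tree formed by the fail links is required to sum the counts; the answer for a pattern is the total at its end vertex.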
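Putting the pieces together for Counting Patterns, a self-contained sketch might look like this. It assumes lowercase, non-empty patterns, and it replaces the DFS with a pass over the BFS order in reverse, which visits deeper vertices before their fail targets and has the same effect. The name `count_patterns` and the arrays `endNode`, `order` and `visits` are illustrative.

```cpp
#include <bits/stdc++.h>
using namespace std;

// Number of occurrences of each pattern in s.
vector<long long> count_patterns(const string& s, const vector<string>& pats) {
  vector<array<int, 26>> ch(1);
  ch[0].fill(0);
  vector<int> fail(1, 0), endNode(pats.size());
  for (size_t k = 0; k < pats.size(); k++) {     // build the trie
    int cur = 0;
    for (char c : pats[k]) {
      int x = c - 'a';
      if (!ch[cur][x]) {
        int id = ch.size();
        ch.emplace_back();
        ch.back().fill(0);
        fail.push_back(0);
        ch[cur][x] = id;
      }
      cur = ch[cur][x];
    }
    endNode[k] = cur;
  }
  vector<int> order;                             // vertices in BFS order
  queue<int> q;
  for (int i = 0; i < 26; i++) if (ch[0][i]) q.push(ch[0][i]);
  while (!q.empty()) {
    int cur = q.front(); q.pop();
    order.push_back(cur);
    for (int i = 0; i < 26; i++) {
      int nxt = ch[cur][i];
      if (!nxt) ch[cur][i] = ch[fail[cur]][i];   // complete the missing transition
      else { fail[nxt] = ch[fail[cur]][i]; q.push(nxt); }
    }
  }
  vector<long long> visits(ch.size(), 0);
  int cur = 0;
  for (char c : s) { cur = ch[cur][c - 'a']; visits[cur]++; }
  // Push counts along fail links, deepest vertices first.
  for (int k = (int)order.size() - 1; k >= 0; k--)
    visits[fail[order[k]]] += visits[order[k]];
  vector<long long> res(pats.size());
  for (size_t k = 0; k < pats.size(); k++) res[k] = visits[endNode[k]];
  return res;
}
```

With \(s\) = `aybabtu` and patterns `bab`, `abc`, `a`, this gives the counts 1, 0 and 2.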

Suffix Array

If I have seen further it is by standing on the shoulders of giants.

My implementation

const int N=2e5+5;
string s;
int n,p[N],pn[N],c0[N],c1[N],*c,*cn,cnt[N]{0};

void SA(string t){
	s=t+'$';//cyclic or not; kept in the global s so that LCP() sees the same string
	n=s.length();
	c=c0;cn=c1;
	//length=1
	for(int i=0;i<n;i++) cnt[s[i]]++;
	for(int i=1;i<256;i++) cnt[i]+=cnt[i-1];//256: sigma size
	for(int i=0;i<n;i++){
		p[--cnt[s[i]]]=i;
	}
	int cl=0;
	c[p[0]]=cl;
	for(int i=1;i<n;i++){
		if(s[p[i]]!=s[p[i-1]]) cl++;
		c[p[i]]=cl;
	}
	for(int k=1;k<n;k*=2){
		//sorting mp(c[i-k],c[i]), c[i] already sorted in p[]
		for(int i=0;i<=max(256LL,cl);i++) cnt[i]=0;//256
		for(int i=0;i<n;i++){
			pn[i]=p[i]-k;
			if(pn[i]<0) pn[i]+=n;
		}
		for(int i=0;i<n;i++){
			cnt[c[pn[i]]]++;
		}
		for(int i=1;i<=cl;i++) cnt[i]+=cnt[i-1];
		for(int i=n-1;i>=0;i--){
			p[--cnt[c[pn[i]]]]=pn[i];
		}
		cl=0;
		cn[p[0]]=cl;
		for(int i=1;i<n;i++){
			auto prev=make_pair(c[p[i-1]],c[(p[i-1]+k)%n]);
			auto cur=make_pair(c[p[i]],c[(p[i]+k)%n]);
			if(prev!=cur) cl++;
			cn[p[i]]=cl;
		}
		swap(c,cn);
	}
	//making all rank different
	//for(int i=0;i<n;i++) c[p[i]]=i;
}
//p: starting indices after sort
//c: rank of indices, may be same if '$' not added

int lcp[N][20],po[20];

void LCP(){
	po[0]=1;
	for(int i=1;i<20;i++) po[i]=po[i-1]*2;
	int k=0;
	for(int i=0;i<n;i++){
		if(c[i]==n-1){
			k=0;
			continue;
		}
		int j=p[c[i]+1];
		while(i+k<n&&j+k<n&&s[i+k]==s[j+k]) k++;
		lcp[c[i]][0]=k;
		if(k) k--;
	}
	for(int j=1;j<20;j++){
		for(int i=0;i<n-1;i++){
			if(i+po[j-1]<n-1) lcp[i][j]=min(lcp[i][j-1],lcp[i+po[j-1]][j-1]);
		}
	}
}
//lcp[i][0]: longest common prefix of s.substr(p[i]), s.substr(p[i+1])
//lcp: a sparse table

//LCP of the suffixes starting at i and j; requires i!=j since __lg(0) is undefined
int qry(int i,int j){
	i=c[i],j=c[j];
	if(i>j) swap(i,j);
	int lg=__lg(j-i);
	return min(lcp[i][lg],lcp[j-po[lg]][lg]);
}

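I learned it from here

By the way, a reminder to everyone: make sure to grab every partial point you can in the Taipei City Contest!

(I think someone said last year that Sam would become teaching material for the CKHS (建中) training, and here it is)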

Problems

I haven't looked at these either

TIOJ 1927

ARC 151E

ABC 268 Ex

CF 1562 E

CF 1721 E

CF 985 F

CF 1366 G

CF 1363 F

CF 1313 E

Main-Lorentz

Problem

Finding repetitions - Algorithms for Competitive Programming (cp-algorithms.com)

Just learned this algorithm this Tuesday, very cool though.