Sequential Data: Markov Models

ML in Feedback Sys #3

Fall 2025, Prof Sarah Dean

Please open a large, searchable text document on your laptop or phone

Shannon's Language Generation

described in "A Mathematical Theory of Communication" (Shannon, 1948)

and in The Hedgehog Review article Language Machinery

exercise inspired by Jordan Ellenberg

Bigram generation

"What we do"

  • Given: sequences \(\{\{x_{k,i}\}_{k=1}^{K_i}\}_{i=1}^n\) of symbols \(x\in\mathcal X\) in a finite alphabet
  • Sample \(\hat x_0\) at random from data
  • For \(t=1,2,...\)
    • Collect instances in data where \(x_{k,i} = \hat x_{t-1}\) $$\mathcal S = \{x_{k+1,i} | x_{k,i} = \hat x_{t-1}\}$$
    • Sample \(\hat x_t\) (uniformly) at random from \(\mathcal S\)

(for now, assume \(\mathcal S\) is never empty)
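The steps above can be sketched in a few lines of Python. This is a minimal sketch, not code from the lecture: `sequences` (a list of symbol lists) and the function name are hypothetical stand-ins for the training data.

```python
import random

def bigram_generate(sequences, num_steps, seed=None):
    """Generate a sequence by sampling successors observed in the data."""
    rng = random.Random(seed)
    # Sample the initial symbol uniformly at random from the data
    all_symbols = [x for seq in sequences for x in seq]
    current = rng.choice(all_symbols)
    generated = [current]
    for _ in range(num_steps):
        # S: all symbols that immediately follow the current symbol in the data
        successors = [seq[k + 1] for seq in sequences
                      for k in range(len(seq) - 1) if seq[k] == current]
        current = rng.choice(successors)  # assumes S is never empty
        generated.append(current)
    return generated
```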

Bigram prediction

"What we do"

  • Given: sequences \(\{\{x_{k,i}\}_{k=1}^{K_i}\}_{i=1}^n\) of symbols \(x\in\mathcal X\) in a finite alphabet
  • Given new partial sequence \( x_{0:t}\)
  • Collect instances in data where \(x_{k,i} =  x_{t}\) $$\mathcal S = \{x_{k+1,i} | x_{k,i} =  x_{t}\}$$
  • Predict the mode of \(\mathcal S\)
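The prediction rule admits a similarly short sketch (again with hypothetical names; `sequences` stands in for the training data):

```python
from collections import Counter

def bigram_predict(sequences, x_t):
    """Predict the most frequent successor of x_t in the data (mode of S)."""
    successors = [seq[k + 1] for seq in sequences
                  for k in range(len(seq) - 1) if seq[k] == x_t]
    return Counter(successors).most_common(1)[0][0]
```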

Estimating Markov transitions

"Why we do it"

  • One-hot embedding of categorical data $$e_x = \begin{bmatrix} 0& \dots & 0 & 1 & 0 & \dots & 0\end{bmatrix}^\top \in \mathbb R^{|\mathcal X|} \\\qquad \uparrow \text{index }x$$
  • The linear least squares empirical risk minimization problem $$\min_{\Theta\in \mathbb R^{|\mathcal X|\times |\mathcal X|}}\sum_{i=1}^n \sum_{k=1}^{K_i-1} \|\Theta^\top e_{x_{k,i}} - e_{x_{k+1,i}}\|_2^2$$
  • Fact 1: Predicting \(\hat x_{t+1}=\arg\max_x (\hat \Theta^\top e_{x_t})_x\) is equivalent to bigram prediction
  • Fact 2: Generating \(\hat x_{t+1}\sim \hat \Theta^\top e_{\hat x_t}\) is equivalent to bigram generation
  • Fact 3: We can understand \(\Theta\) as the transition matrix for a finite state Markov chain

Empirical counts

$$\min_{\Theta\in \mathbb R^{|\mathcal X|\times |\mathcal X|}} \sum_{i=1}^n \sum_{k=1}^{K_i-1} \|\Theta^\top e_{x_{k,i}} - e_{x_{k+1,i}}\|_2^2$$

  • From last week we know that as long as the data is sufficiently rich $$ \hat\Theta = \Big(\sum_{i=1}^n \sum_{k=1}^{K_i-1} e_{x_{k,i}} e_{x_{k,i}}^\top \Big)^{-1} \sum_{i=1}^n \sum_{k=1}^{K_i-1} e_{x_{k,i}} e_{x_{k+1,i}}^\top $$
  • Let \(n_x \) be the number of times unigram \(x\) appears in the data (in a non-final position) and \(n_{x,x'} \) be the number of times the bigram \(x, x'\) appears in the data $$ n_x = \sum_{i=1}^n \sum_{k=1}^{K_i-1} \mathbf 1 \{x_{k,i}=x\},\quad n_{x,x'} =\sum_{i=1}^n \sum_{k=1}^{K_i-1} \mathbf 1 \{x_{k,i}=x\}\mathbf 1 \{x_{k+1,i}=x'\}  $$
  • The solution is $$ \hat\Theta = \begin{bmatrix} n_1 \\ & \ddots \\ && n_{|\mathcal X|} \end{bmatrix} ^{-1} \begin{bmatrix} n_{1,1} & \dots & n_{1,|\mathcal X|} \\ \vdots & \ddots & \vdots \\ n_{|\mathcal X|,1} & \dots & n_{|\mathcal X|,|\mathcal X|} \end{bmatrix} = \begin{bmatrix} \frac{n_{1,1}}{n_1} & \dots & \frac{n_{1,|\mathcal X|}}{n_1} \\ \vdots & \ddots & \vdots \\ \frac{ n_{|\mathcal X|,1} }{ n_{|\mathcal X|}} & \dots & \frac{ n_{|\mathcal X|,|\mathcal X|} }{ n_{|\mathcal X|}}\end{bmatrix} $$
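The count-based solution can be checked numerically. A minimal sketch, assuming symbols are encoded as integers \(0,\dots,|\mathcal X|-1\) and every symbol appears at least once in a non-final position (so no row of counts is zero):

```python
import numpy as np

def transition_estimate(sequences, num_symbols):
    """Estimate Theta-hat by normalizing bigram counts by unigram counts."""
    counts = np.zeros((num_symbols, num_symbols))  # counts[x, x'] = n_{x, x'}
    for seq in sequences:
        for k in range(len(seq) - 1):
            counts[seq[k], seq[k + 1]] += 1
    n_x = counts.sum(axis=1)  # n_x: occurrences of x in non-final positions
    return counts / n_x[:, None]  # row x is n_{x, .} / n_x
```

The rows of the result sum to 1 by construction, matching the stochastic-matrix observation.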

  • Note that \(\hat\Theta\) is a stochastic matrix, meaning that the rows sum to 1 and all entries are nonnegative

Prediction

Fact 1: Predicting \(\hat x_{t+1}=\arg\max_x (\hat \Theta^\top e_{x_t})_x\) is equivalent to bigram prediction, i.e. choosing the mode of \(\mathcal S = \{x_{k+1,i} | x_{k,i} =  x_{t}\}\)

  • \((\hat \Theta^\top e_{x_t})_x\) = entry \(x\) of row \(x_t\) of \(\hat\Theta\) = \(\hat\Theta_{x_t,x} = \frac{n_{x_t,x}}{n_{x_t}} \)
  • Therefore the maximization is equivalent to \(\arg\max_x n_{x_t,x}\)
  • In other words, choosing the symbol \(x\) which appears
    most frequently following \(x_t\)
  • This is exactly the mode of \(\mathcal S\)

Generation

Fact 2: Sampling \(\hat x_{t+1}\sim \hat \Theta^\top e_{ x_t}\) is equivalent to bigram generation, i.e. choosing uniformly at random from \(\mathcal S=\{x_{k+1,i} | x_{k,i} =  x_{t}\}\)

  • \(\hat \Theta^\top e_{x_t}\) = row \(x_t\) of \(\hat\Theta\) = \(  \begin{bmatrix} \frac{n_{x_t,1}}{n_{x_t}} & \dots & \frac{n_{x_t,|\mathcal X|}}{n_{x_t}} \end{bmatrix}\)
  • In other words, choosing the symbol \(x\) with probability \(\frac{n_{x_t,x}}{n_{x_t}}\)
  • Notice that \(n_{x_t,x}\) is the number of times \(x\) appears in \(\mathcal S\) and
    \(n_{x_t}\) is the overall size of \(\mathcal S\)
  • Thus, this is equivalent to selecting uniformly at random from \(\mathcal S\)

Incomplete data

  • What if \(\mathcal S\) is empty because we have never seen \(x_t\) before?
  • Typical "back-off" technique
    • Generation: pick a symbol (uniformly) at random from the dataset
    • Prediction: predict the most frequent symbol in the dataset
  • The minimum (Frobenius) norm solution \(\hat \Theta\) would have row \(x_t\) equal to zero $$\hat\Theta_{x_t,x} = \frac{n_{x_t,x}}{n_{x_t}} ~~\text{if}~~ n_{x_t}>0~~\text{else}~~0$$
  • The back-off technique can be represented by replacing row \(x_t\) with the empirical symbol frequencies $$\tilde\Theta_{x_t,x} = \frac{n_{x_t,x}}{n_{x_t}} ~~\text{if}~~ n_{x_t}>0~~\text{else}~~ \frac{n_{x}}{ \sum_{x'\in\mathcal X} n_{x'} }$$
  • Challenge: is \(\tilde \Theta\) the solution to a min-norm least squares problem for a different norm? Or with the added constraint that \(\Theta\) must be a stochastic matrix?
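The back-off matrix \(\tilde\Theta\) can be sketched as follows (hypothetical helper, integer-coded symbols as before); rows for unseen symbols are replaced by the empirical symbol frequencies:

```python
import numpy as np

def backoff_transition(sequences, num_symbols):
    """Theta-tilde: normalized bigram counts, with rows for never-seen
    symbols replaced by the empirical symbol frequencies."""
    counts = np.zeros((num_symbols, num_symbols))
    for seq in sequences:
        for k in range(len(seq) - 1):
            counts[seq[k], seq[k + 1]] += 1
    n = counts.sum(axis=1)   # n_x, per the definition above
    freq = n / n.sum()       # empirical symbol frequencies n_x / sum_x' n_x'
    # Rows with n_x > 0 are normalized counts; rows with n_x = 0 get freq
    return np.where(n[:, None] > 0, counts / np.maximum(n, 1)[:, None], freq)
```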

Markov chains

Fact 3: Sampling \( x_{t+1}\sim  \Theta^\top e_{ x_t}\) \(\iff\) finite state Markov
chain with transition matrix \(\Theta\)

  • A Markov chain describes a stochastic sequence of states in which the distribution of the state at \(t+1\) depends only on the state attained at \(t\)
  • Mathematically, $$\mathbb P\{X_{t+1} = x \mid X_0=x_0,\dots,X_t=x_t\} = \mathbb P\{X_{t+1}=x \mid X_t=x_t\} = \Theta_{x_t,x}$$

Transition matrix

  • The transition matrix \(P\in\mathbb R^{N\times N}\) is a stochastic matrix defined as $$P_{ij}=\mathbb P\{X_{t+1}=j|X_t=i\} $$
  • Fact 4: The \(n\) step transition probabilities are given by matrix powers $$(P^n)_{ij}=\mathbb P\{X_{t+n}=j|X_t=i\} $$
  • Fact 5: The state probability distribution evolves according to $$ \mu_{t+1} = P^\top \mu_t ,\qquad [\mu_t]_i = \mathbb P\{X_t = i\}$$
  • Fact 6: The maximum eigenvalue of \(P^\top\) is equal to \(1\). Each associated eigenvector \(\mu_\star\), normalized to be a probability vector, is a stationary distribution \( \mu_\star = P^\top \mu_\star\)
  • Fact 7: If the second largest-magnitude eigenvalue \(\rho=|\lambda_2(P^\top)|<1\), then the stationary distribution is unique and for any initial distribution \(\mu_0\) $$ \|\mu_t-\mu_\star\|_1 \leq \rho^t \|\mu_0-\mu_\star\|_1 $$
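Facts 5 through 7 can be verified numerically on a small chain. A sketch with an arbitrary two-state \(P\) chosen for illustration:

```python
import numpy as np

# An arbitrary two-state transition matrix (rows sum to 1)
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])

# Fact 6: P^T has eigenvalue 1; normalizing its eigenvector gives mu_star
eigvals, eigvecs = np.linalg.eig(P.T)
i = np.argmax(eigvals.real)          # the eigenvalue equal to 1
mu_star = eigvecs[:, i].real
mu_star /= mu_star.sum()             # normalize to a probability vector
assert np.allclose(P.T @ mu_star, mu_star)

# Facts 5 and 7: iterating mu_{t+1} = P^T mu_t converges to mu_star
mu = np.array([1.0, 0.0])
for _ in range(100):
    mu = P.T @ mu
assert np.allclose(mu, mu_star)
```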

Markov chain properties

  • Definition: state \(j\) is accessible from \(i\), written \(i\rightarrow j\), if there exists an \(n\geq 0\) with \((P^n)_{ij} > 0\)
  • Definition: A Markov chain is irreducible if \(i\leftrightarrow j\) for all \(i,j\)
    • example: the two-state chain below is irreducible if \(p_1<1\) and \( p_2<1\)
  • Definition: The period of state \(i\) is \(d_i = \mathrm{gcd}\{n\geq 1 \,|\, (P^n)_{ii}>0\}\), the greatest common divisor of possible return times. A state is aperiodic if \(d_i=1\); a Markov chain is aperiodic if all its states are
    • example: the two-state chain below is aperiodic if \(p_1>0\) and \( p_2>0\)
  • Fact 8: An irreducible and aperiodic (aka ergodic) finite Markov chain has second largest-magnitude eigenvalue
    \(\rho=|\lambda_2(P^\top)|<1\) (and thus converges geometrically to a unique stationary distribution)

Example: two-state Markov chain with states \(1\) and \(2\) and transition matrix \(P = \begin{bmatrix} p_{1} & 1-p_1 \\ 1-p_2 & p_2 \end{bmatrix}\)
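For this two-state chain the eigenvalues of \(P\) are \(1\) and \(p_1+p_2-1\), so \(\rho<1\) exactly when the chain is ergodic. A quick numerical check (sketch):

```python
import numpy as np

def second_eigenvalue(p1, p2):
    """|lambda_2| for the two-state chain P = [[p1, 1-p1], [1-p2, p2]]."""
    P = np.array([[p1, 1 - p1], [1 - p2, p2]])
    eigvals = np.sort(np.abs(np.linalg.eigvals(P.T)))[::-1]
    return eigvals[1]

# Ergodic chain: 0 < p1, p2 < 1 gives |lambda_2| = |p1 + p2 - 1| < 1
assert np.isclose(second_eigenvalue(0.9, 0.6), 0.5)
# Periodic chain: p1 = p2 = 0 alternates forever, so |lambda_2| = 1
assert np.isclose(second_eigenvalue(0.0, 0.0), 1.0)
```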

Summary: Markov models

  • Setting: finite alphabet (state) sequences
  • Bigram model \(\iff\) least squares with
    • Categorical "one hot" embeddings
    • Prediction from a single past time step
  • Markov chains formalize single-time-step dependence
    • transition matrix, stationary distribution, mixing time

Recap

  • Bigram prediction and generation
  • Markov chains

Next time: continuous state models

Announcements

03 - Sequential Data: Markov Models - ML in Feedback Sys F25

By Sarah Dean
