Prof Sarah Dean
Reference: Bof, Carli, Schenato, "Lyapunov Theory for Discrete Time Systems"
Consider the dynamics of stochastic gradient descent on a twice differentiable function \(g:\mathbb R^d\to\mathbb R\)
\(\theta_{t+1} = \theta_t - \alpha g_t\)
\(= \theta_t - \alpha \nabla g(\theta_t) + \underbrace{\alpha (\nabla g(\theta_t) - g_t) }_{w_t} \)
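As a quick illustration of this decomposition, here is a minimal numpy sketch; the quadratic objective, data, and minibatch size are assumptions chosen only for concreteness.

```python
import numpy as np

# Assumed example: g(theta) is an average of squared errors, so the full gradient
# is X^T (X theta - y) / n and a minibatch gradient g_t is the same expression on
# a random subset. The gap alpha * (grad g - g_t) plays the role of w_t.
rng = np.random.default_rng(0)
n, d, alpha = 200, 5, 0.1
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

theta = np.zeros(d)
for t in range(100):
    full_grad = X.T @ (X @ theta - y) / n                   # nabla g(theta_t)
    batch = rng.choice(n, size=10, replace=False)
    g_t = X[batch].T @ (X[batch] @ theta - y[batch]) / 10   # stochastic gradient g_t
    w_t = alpha * (full_grad - g_t)                         # the disturbance w_t
    theta = theta - alpha * full_grad + w_t                 # identical to theta - alpha * g_t
```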
state \(s\):
\(s_{t+1} = F(s_t, w_t)\)
\(y_t = G(s_t)\)
input signal \(w_t\): external phenomena
output signal \(y_t\): measurements
state \(s\):
\(s_{t+1} = F(s_t, w_t)\)
\(y_t = G(s_t)+v_t\)
input signal \(w_t\): external phenomena
output signal \(y_t\): noisy measurements, with measurement noise \(v_t\)
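A generic simulation loop for this model might look as follows; the particular \(F\), \(G\), and noise scales below are placeholder assumptions.

```python
import numpy as np

# Generic simulation of s_{t+1} = F(s_t, w_t), y_t = G(s_t) + v_t.
def F(s, w):
    return 0.9 * s + w             # assumed example dynamics

def G(s):
    return s[0]                    # assumed example: measure the first coordinate

rng = np.random.default_rng(1)
s = np.zeros(2)
ys = []
for t in range(50):
    v = 0.1 * rng.normal()         # measurement noise v_t
    ys.append(G(s) + v)            # we only ever see the noisy outputs y_t
    w = 0.05 * rng.normal(size=2)  # input / process noise w_t
    s = F(s, w)                    # state update
```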
The map \(\Phi_{F,G}\) takes the initial state \(s_0\) and the input signal \(w_t\) to the output signal \(y_t\), corrupted by measurement noise \(v_t\):
\((y_0, y_1,...) = \Phi_{F,G}[(s_0, w_0, w_1,...)]+(v_0,v_1,...)\)
If \(F,G\) are known (e.g. from physics), then we can compute \(y_{t+1}\) given \(s_t\) and \(w_t\).
But we only observe \(y_{0:t}\)!
The state \(s=[\theta,\omega]\), output \(y=\theta\), and
$$\theta_{t+1} = 0.9\theta_t + 0.1 \omega_t,\quad \omega_{t+1} = 0.9 \omega_t$$
Linear dynamics and measurement
\(s_{t+1} = As_t + w_t\)
\(y_t = Cs_t + v_t\)
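For instance, the example above corresponds to \(A = \begin{bmatrix}0.9 & 0.1\\ 0 & 0.9\end{bmatrix}\) and \(C = \begin{bmatrix}1 & 0\end{bmatrix}\); a short simulation sketch (the initial state and noise scales are assumptions):

```python
import numpy as np

# The example above in linear form: s = [theta, omega], y = theta.
A = np.array([[0.9, 0.1],
              [0.0, 0.9]])
C = np.array([[1.0, 0.0]])

rng = np.random.default_rng(2)
s = np.array([1.0, 1.0])               # assumed initial state
states, measurements = [], []
for t in range(50):
    v = 0.05 * rng.normal(size=1)      # measurement noise v_t (scale assumed)
    measurements.append(C @ s + v)     # y_t = C s_t + v_t
    states.append(s)
    w = 0.01 * rng.normal(size=2)      # process noise w_t (scale assumed)
    s = A @ s + w                      # s_{t+1} = A s_t + w_t
```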
At time \(t\), we have observed \(y_{0:t}\), which we can use to estimate \(\hat s_{k\mid t}\), our guess of \(s_k\) given observations up to time \(t\)
At time \(t\), our observations and models tell us that
$$\begin{bmatrix} y_0 \\ 0 \\ y_1 \\ \vdots \\ 0 \\ y_t \end{bmatrix} = \begin{bmatrix} C\\ A&-I \\ &C\\ & A&-I \\ &&\ddots \\ &&&C \end{bmatrix} \textcolor{yellow}{\begin{bmatrix}s_0\\ s_1 \\ \vdots \\ s_t \end{bmatrix}} + \textcolor{yellow}{\begin{bmatrix} v_0\\ w_0\\ v_1 \\ \vdots \\ w_{t-1} \\ v_t\end{bmatrix}}$$
The least squares estimator minimizes the squared error
$$\min_{\textcolor{yellow}{s,v,w}} ~~~\textcolor{yellow}{\sum_{k=0}^{t-1} \left(\|w_k\|_2^2 + \|v_k\|_2^2\right) + \|v_t\|_2^2+\|s_0\|_2^2} $$
$$\text{s.t.}~~~\begin{bmatrix} y_0 \\ 0 \\ y_1 \\ \vdots \\ 0 \\ y_t \end{bmatrix} = \underbrace{\begin{bmatrix} C\\ A&-I \\ &C \\ & A&-I \\ &&\ddots \\ &&&C\end{bmatrix}}_{A_C} \textcolor{yellow}{\begin{bmatrix}s_0\\ s_1 \\ \vdots \\ s_t \end{bmatrix}} + \textcolor{yellow}{\begin{bmatrix} v_0\\ w_0\\ v_1 \\ \vdots \\ w_{t-1} \\ v_t\end{bmatrix}}$$
The least squares estimator minimizes the squared error
$$ \begin{bmatrix}\hat s_{0\mid t}\\ \hat s_{1\mid t} \\ \vdots \\ \hat s_{t\mid t} \end{bmatrix} = (A_C^\top A_C)^{-1} A_C^\top \begin{bmatrix} y_0 \\ 0 \\ y_1 \\ \vdots \\ 0 \\ y_t \end{bmatrix}$$
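A direct implementation of this estimator might look like the sketch below: it builds \(A_C\) and the stacked measurement vector and solves the unregularized least squares problem, matching the closed form above (the \(\|s_0\|_2^2\) term is omitted here). The function name is a hypothetical helper, not from the slides.

```python
import numpy as np

def batch_state_estimate(A, C, ys):
    """Stack the measurement and dynamics equations into A_C and bar-y, then
    return the least squares estimates s_hat_{0|t}, ..., s_hat_{t|t}."""
    n = A.shape[0]           # state dimension
    p = C.shape[0]           # measurement dimension
    T = len(ys)              # number of measurements y_0, ..., y_t
    rows = p * T + n * (T - 1)
    A_C = np.zeros((rows, n * T))
    bar_y = np.zeros(rows)
    r = 0
    for k in range(T):
        # measurement block: y_k = C s_k + v_k
        A_C[r:r + p, k * n:(k + 1) * n] = C
        bar_y[r:r + p] = ys[k]
        r += p
        if k < T - 1:
            # dynamics block: 0 = A s_k - s_{k+1} + w_k
            A_C[r:r + n, k * n:(k + 1) * n] = A
            A_C[r:r + n, (k + 1) * n:(k + 2) * n] = -np.eye(n)
            r += n
    s_hat, *_ = np.linalg.lstsq(A_C, bar_y, rcond=None)
    return s_hat.reshape(T, n)
```

For the scalar heart-rate example below (with \(A = C = 1\)), this reproduces the weighted averages shown next.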
1. Classic least squares regression
Suppose we take noisy measurements of heart rate \(y_t = s_t + v_t\)
2. Least squares filtering
Suppose we take noisy measurements of heart rate \(y_t = s_t + v_t\)
$$ y_0 = s_0 + v_0$$
\(\hat s_{0\mid 0} = y_0\)
2. Least squares filtering
Suppose we take noisy measurements of heart rate \(y_t = s_t + v_t\)
$$ \begin{bmatrix} y_0\\ 0 \\ y_1\end{bmatrix} = \begin{bmatrix} 1 \\ 1 & -1 \\ & 1 \end{bmatrix} \begin{bmatrix}s_0\\s_1\end{bmatrix} + \begin{bmatrix} v_0 \\ w_0 \\ v_1\end{bmatrix}$$
$$ \begin{bmatrix}\hat s_{0\mid 1}\\ \hat s_{1\mid 1}\end{bmatrix} = \begin{bmatrix} 2 & -1 \\ -1 & 2\end{bmatrix}^{-1} \begin{bmatrix} 1 & 1\\ & -1 & 1 \end{bmatrix} \begin{bmatrix} y_0\\ 0 \\ y_1\end{bmatrix} $$
\(\hat s_{0\mid 1} = \frac{2y_0+y_1}{3}\)
\(\hat s_{1\mid 1} = \frac{y_0+2y_1}{3}\)
2. Least squares filtering
Suppose we take noisy measurements of heart rate \(y_t = s_t + v_t\)
$$ \begin{bmatrix} y_0\\ 0 \\ y_1\\ 0 \\ y_2 \end{bmatrix} = \begin{bmatrix} 1 \\ 1 & -1 \\ & 1\\ & 1 & -1 \\ && 1 \end{bmatrix} \begin{bmatrix}s_0\\s_1\\ s_2\end{bmatrix} + \begin{bmatrix} v_0 \\ w_0 \\ v_1\\w_1\\v_2\end{bmatrix}$$
\(\hat s_{0\mid 2} = \frac{5y_0+2y_1+1y_2}{8}\)
\(\hat s_{1\mid 2} = \frac{2y_0+4y_1+2y_2}{8}\)
\(\hat s_{2\mid 2} = \frac{y_0+2y_1+5y_2}{8}\)
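These coefficients can be checked numerically with the `batch_state_estimate` sketch above (a hypothetical helper, not from the slides): by linearity, feeding in \(y_0 = 1, y_1 = y_2 = 0\) recovers the weights on \(y_0\).

```python
import numpy as np

# Check the t = 2 coefficients for the scalar random-walk model (A = 1, C = 1),
# using the batch_state_estimate sketch defined earlier.
A = np.eye(1)
C = np.eye(1)
y = [np.array([1.0]), np.array([0.0]), np.array([0.0])]  # picks out the y_0 column
s_hat = batch_state_estimate(A, C, y)
print(s_hat.ravel())   # expected: [5/8, 2/8, 1/8]
```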
Suppose we take noisy measurements of heart rate \(y_t = s_t + v_t\)
1. Classic least squares regression
2. Least squares filtering
$$ \begin{bmatrix}\hat s_{0\mid t}\\ \hat s_{1\mid t} \\ \vdots \\ \hat s_{t\mid t} \end{bmatrix} = (A_C^\top A_C)^{-1} A_C^\top \begin{bmatrix} y_0 \\ 0 \\ y_1 \\ \vdots \\ 0 \\ y_t \end{bmatrix}$$
$$\begin{bmatrix}\hat s_{0\mid t}\\ \hat s_{1\mid t} \\ \vdots \\ \hat s_{t\mid t} \end{bmatrix} =\begin{bmatrix}C^\top C + A^\top A & -A^\top \\ -A & C^\top C+A^\top A + I & -A^\top \\ & -A & \ddots & \\ &&& C^\top C+I\end{bmatrix}^{-1}\begin{bmatrix}C^\top y_0 \\ \vdots \\ C^\top y_t \end{bmatrix} $$
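This structure can also be verified numerically; the sketch below builds \(A_C\) for a small random \(A\) and \(C\) (arbitrary choices) and checks the diagonal and off-diagonal blocks of \(A_C^\top A_C\).

```python
import numpy as np

# Check the block tri-diagonal structure of A_C^T A_C for t = 3 with random A, C.
rng = np.random.default_rng(3)
n, p, T = 2, 1, 4                       # T = t + 1 = 4 states
A = rng.normal(size=(n, n))
C = rng.normal(size=(p, n))

# build A_C as in the stacked system
rows = p * T + n * (T - 1)
A_C = np.zeros((rows, n * T))
r = 0
for k in range(T):
    A_C[r:r + p, k * n:(k + 1) * n] = C
    r += p
    if k < T - 1:
        A_C[r:r + n, k * n:(k + 1) * n] = A
        A_C[r:r + n, (k + 1) * n:(k + 2) * n] = -np.eye(n)
        r += n

G = A_C.T @ A_C
# diagonal blocks: C^T C + A^T A (first), C^T C + A^T A + I (middle), C^T C + I (last)
assert np.allclose(G[:n, :n], C.T @ C + A.T @ A)
assert np.allclose(G[n:2 * n, n:2 * n], C.T @ C + A.T @ A + np.eye(n))
assert np.allclose(G[-n:, -n:], C.T @ C + np.eye(n))
# off-diagonal blocks: -A^T just above the diagonal, zero beyond that
assert np.allclose(G[:n, n:2 * n], -A.T)
assert np.allclose(G[:n, 2 * n:3 * n], 0.0)
```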
Block tri-diagonal matrix inverse
$$\begin{bmatrix}\hat s_{0\mid t}\\ \hat s_{1\mid t} \\ \vdots \\ \hat s_{t\mid t} \end{bmatrix} =\begin{bmatrix}D_1 & -A^\top \\ -A &D_2 & -A^\top \\ & -A & \ddots & \\ &&&D_3\end{bmatrix}^{-1}\begin{bmatrix}C^\top y_0 \\ \vdots \\ C^\top y_t \end{bmatrix} $$
$$\textcolor{yellow}{\begin{bmatrix}\hat s_{0\mid t+1}\\ \hat s_{1\mid t+1} \\ \vdots \\ \hat s_{t\mid t+1} \\ \hat s_{t+1\mid t+1} \end{bmatrix}} =\begin{bmatrix}D_1 & -A^\top \\ -A & D_2 & -A^\top \\ & -A & \ddots & \\ &&& D_3+\textcolor{yellow}{A^\top A} & \textcolor{yellow}{-A^\top}\\ &&& \textcolor{yellow}{-A} &\textcolor{yellow}{ C^\top C + I}\end{bmatrix}^{-1}\begin{bmatrix}C^\top y_0 \\ \vdots \\ C^\top y_t \\ \textcolor{yellow}{C^\top y_{t+1}}\end{bmatrix} $$
Block tri-diagonal matrix inverse
Possible to write \(\hat s_{t+1\mid t+1}\) as a linear combination of \(\hat s_{t\mid t}\) and \(y_{t+1}\)
\(\hat s_{t+1} = A\hat s_t + L_t(y_{t+1} -CA\hat s_t)\)
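To illustrate the predict/correct form of this recursion, here is a minimal sketch on the scalar random-walk example; the constant gain \(L = 0.5\) and the noise scales are arbitrary illustrative choices, not the optimal gain.

```python
import numpy as np

# Predict/correct recursion s_hat_{t+1} = A s_hat_t + L (y_{t+1} - C A s_hat_t)
# on the scalar random-walk example (A = C = 1), with an assumed fixed gain.
A, C, L = 1.0, 1.0, 0.5
rng = np.random.default_rng(4)
s, s_hat = 0.0, 0.0
for t in range(100):
    s = A * s + 0.1 * rng.normal()       # true state: s_{t+1} = A s_t + w_t
    y = C * s + 0.3 * rng.normal()       # measurement: y_{t+1} = C s_{t+1} + v_{t+1}
    pred = A * s_hat                     # predict using the model
    s_hat = pred + L * (y - C * pred)    # correct using the new measurement
```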
Kalman Filter
\(s_{t+1} = As_t + w_t\)
\(y_t = Cs_t + v_t\)
\(\hat s_{t+1} = A\hat s_t + L_t(y_{t+1} -CA\hat s_t)\)
[Block diagram: the system with state \(s\), driven by process noise \(w_t\) and measured through \(C\) with noise \(v_t\); the filter produces the estimate \(\hat s\) using gain \(L_t\), and the estimation error evolves according to \(A-L_tCA\).]
\(\displaystyle \min_{s,v,w} ~~~\sum_{k=0}^{t-1} \left(\|\Sigma_{w,k}^{-1/2}w_k\|_2^2 + \|\Sigma_{v,k}^{-1/2}v_k\|_2^2\right)+\|\Sigma_{v,t}^{-1/2}v_t\|_2^2+\|\Sigma_{s}^{-1/2}s_0\|_2^2 \)
$$\text{s.t.}~~~\bar{y}_{0:t} = A_C s_{0:t}+ \bar w_{0:t} + \bar v_{0:t}$$
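The recursive solution of this weighted problem is the Kalman filter; below is a sketch of the standard time-varying gain computation (stated without derivation, and not taken from these slides), where \(\Sigma_w\), \(\Sigma_v\), \(\Sigma_s\) are the covariances appearing in the weighted objective.

```python
import numpy as np

# Standard Kalman filter recursion: given an initial estimate s0_hat with prior
# covariance Sigma_s, process each new measurement y_1, y_2, ... in turn.
def kalman_filter(A, C, Sigma_w, Sigma_v, Sigma_s, ys, s0_hat):
    n = A.shape[0]
    s_hat, P = s0_hat, Sigma_s
    estimates = []
    for y in ys:
        # predict through the dynamics
        s_pred = A @ s_hat
        P_pred = A @ P @ A.T + Sigma_w
        # gain and measurement update
        L = P_pred @ C.T @ np.linalg.inv(C @ P_pred @ C.T + Sigma_v)
        s_hat = s_pred + L @ (y - C @ s_pred)    # = A s_hat + L (y - C A s_hat)
        P = (np.eye(n) - L @ C) @ P_pred
        estimates.append(s_hat)
    return estimates
```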
Next time: when is estimation possible? What if dynamics matrices are unknown?