Assume a generative chain: the system dynamics evolve a true (hidden) state, the state generates the output, and the output generates the reward.
The goal is to learn a state representation that accurately predicts the reward dynamics.
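A minimal way to write this chain (the notation is assumed here, not fixed by the slides):

\[
x_{t+1} = f(x_t, u_t), \qquad y_t = g(x_t), \qquad r_t = h(y_t, u_t),
\]

where x_t is the hidden true state, u_t the input, y_t the output, and r_t the reward.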
Output-feedback control problem (with unknown dynamics)
Optimal Output Feedback
Input-Output Plant Model (Unknown)
Output-Based Reward: the reward is a function of the output (and input), not of the hidden state.
Optimal Control based on Output-Feedback
Find actions that minimize the cumulative sum of rewards (the reward here plays the role of a per-step cost).
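In symbols, one hedged formalization (the horizon and policy class are assumed here):

\[
\min_{\pi} \; \sum_{t=0}^{T} r_t \qquad \text{subject to} \qquad u_t = \pi(y_{0:t}, u_{0:t-1}),
\]

i.e., the controller may act only on the observed input-output history, never on the internal state.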
Assumption: there exists some internal state variable that generates the output, and its dynamics are first-order (memoryless given the current state, i.e., Markov).
Input-Output / State-Space Equivalence
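For the linear time-invariant case (used here purely as an illustration):

\[
x_{t+1} = A x_t + B u_t, \quad y_t = C x_t
\;\Longrightarrow\;
y_t = C A^t x_0 + \sum_{k=0}^{t-1} C A^{t-1-k} B u_k,
\]

so the input-output map is determined by the Markov parameters \(\{C A^k B\}\); conversely, any such input-output map admits a (non-unique) state-space realization.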
Recall that we don't know the dynamics. Traditional SysID learns state-space model parameters by fitting input-output data (typically supervised learning / least squares on input-output pairs).
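A toy sketch of that least-squares flavor (the system, model order, and noise level below are illustrative assumptions, not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)

    # Simulate a simple SISO ARX system, unknown to the learner:
    #   y[t] = a1*y[t-1] + a2*y[t-2] + b1*u[t-1] + noise
    a1, a2, b1 = 0.6, -0.2, 0.5
    T = 500
    u = rng.normal(size=T)
    y = np.zeros(T)
    for t in range(2, T):
        y[t] = a1 * y[t - 1] + a2 * y[t - 2] + b1 * u[t - 1] + 0.01 * rng.normal()

    # Least squares on input-output data:
    # regress y[t] on the regressors (y[t-1], y[t-2], u[t-1]).
    Phi = np.column_stack([y[1:-1], y[:-2], u[1:-1]])
    theta, *_ = np.linalg.lstsq(Phi, y[2:], rcond=None)
    print("estimated (a1, a2, b1):", theta)  # should be close to (0.6, -0.2, 0.5)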
When the output is very high-dimensional and nonlinear, learning an output model is hard. Let's try learning a model for the reward instead.
Reward Dynamics
Claim: There exists a state-space dynamical system that generates rewards. The goal is to estimate its parameters.
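One concrete (assumed) linear parameterization of the claim:

\[
z_{t+1} = A z_t + B u_t, \qquad r_t = C z_t + D u_t,
\]

with parameters (A, B, C, D) to be estimated from input-reward data alone, bypassing the high-dimensional output.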
Latent Space Interpretation
z is a latent-space encoding of the output history.
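A toy subspace-style sketch of this idea (the linear system, window sizes, and rank below are all illustrative assumptions): fit a multi-step history-to-reward predictor by least squares, then factor it so that a low-dimensional z = enc(history) suffices to predict future rewards.

    import numpy as np

    rng = np.random.default_rng(1)

    # Toy ground truth: rewards generated by a hidden 2-dimensional linear state.
    A = np.array([[0.8, 0.2],
                  [0.0, 0.7]])
    b = np.array([1.0, 0.5])
    c = np.array([1.0, -1.0])
    T, k, m = 3000, 10, 10          # trajectory length, past window, future window
    u = rng.normal(size=T)
    x = np.zeros((T, 2))
    r = np.zeros(T)
    for t in range(T - 1):
        r[t] = c @ x[t]
        x[t + 1] = A @ x[t] + b * u[t]

    # Past input/reward windows h_t and future rewards f_t (no output y needed).
    idx = range(k, T - m)
    H = np.stack([np.concatenate([u[t - k:t], r[t - k:t]]) for t in idx])  # (N, 2k)
    F = np.stack([r[t:t + m] for t in idx])                                # (N, m)

    # Multi-step least-squares predictor F ~ H @ Theta, then a rank-2
    # factorization: z = H @ enc is a latent encoding of the history that
    # predicts future rewards through dec.
    Theta, *_ = np.linalg.lstsq(H, F, rcond=None)
    U, S, Vt = np.linalg.svd(Theta, full_matrices=False)
    enc = U[:, :2] * S[:2]          # history -> latent z (2-dim)
    dec = Vt[:2]                    # latent z -> future rewards
    z = H @ enc
    print("full LS error :", np.linalg.norm(H @ Theta - F) / np.linalg.norm(F))
    print("rank-2 error  :", np.linalg.norm(z @ dec - F) / np.linalg.norm(F))

The two printed errors should nearly match, showing that a 2-dimensional z captures everything the full history offers for reward prediction.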
Observer Interpretation
z can equivalently be computed recursively, like a state observer: updated online from the current input and measurement rather than re-encoding the whole output history at each step.
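One standard observer-style update that makes this concrete (a Luenberger-type form, assumed here for illustration rather than taken from the slides):

\[
\hat r_t = C z_t + D u_t, \qquad z_{t+1} = A z_t + B u_t + L\,(r_t - \hat r_t),
\]

so z is propagated recursively and corrected by the reward prediction error, playing the role of an innovation signal.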
One-Step Further
Claim:
Optimal Projection Equations
Claim: There exists a state-space dynamical system that generates rewards. Finding the best fixed-order such system connects to the optimal projection equations (cf. Hyland & Bernstein), which characterize optimal fixed-order compensators through coupled modified Riccati and Lyapunov equations.