Reward
Assume the true state generates the system dynamics, the state generates the output, the output generates the reward, and so on.
The goal is to learn a state representation that can accurately predict the dynamics of rewards.
Latent LQR
Output-feedback control problem (w/ unknown dynamics)
Optimal Output Feedback
Input-Output Plant Model
(Unknown)
Output-Based Reward:
Optimal Control based on Output-Feedback
Find actions that minimize the sum of rewards (the reward plays the role of a cost here).
There exists some internal state variable that generates the output and whose dynamics are first-order (memoryless).
Input-Output / State-Space Equivalence
Recall that we don't know the dynamics. Traditional SysID learns the parameters of a state-space model by fitting input-output data (typically supervised learning / least squares on input-output pairs).
When the output is very high-dimensional and nonlinear, it is hard to learn a model for the output. Let's try learning a model for the reward instead.
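As a reminder of what the traditional least-squares fit looks like, here is a minimal sketch on a toy problem. The ARX model structure, the helper name `fit_arx`, and the second-order plant coefficients are all assumptions made for illustration; they are not taken from the slides.

```python
import numpy as np

def fit_arx(u, y, order=2):
    """Least-squares fit of y[t] = sum_i a[i]*y[t-1-i] + sum_i b[i]*u[t-1-i]."""
    rows, targets = [], []
    for t in range(order, len(y)):
        past_y = y[t - order:t][::-1]  # y[t-1], ..., y[t-order]
        past_u = u[t - order:t][::-1]  # u[t-1], ..., u[t-order]
        rows.append(np.concatenate([past_y, past_u]))
        targets.append(y[t])
    # Solve the linear regression problem in the least-squares sense.
    theta, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
    return theta[:order], theta[order:]

# Toy input-output data from a known, stable second-order plant (hypothetical).
rng = np.random.default_rng(0)
u = rng.standard_normal(500)
y = np.zeros(500)
for t in range(2, 500):
    y[t] = 1.5 * y[t - 1] - 0.7 * y[t - 2] + 1.0 * u[t - 1] + 0.5 * u[t - 2]

a_hat, b_hat = fit_arx(u, y, order=2)
print("a_hat:", a_hat, "b_hat:", b_hat)
```

With noise-free data and a persistently exciting input, the regression recovers the plant coefficients; with a high-dimensional nonlinear output this kind of direct fit is what becomes impractical.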
Reward Dynamics
Claim: There exists a state-space dynamical system that generates rewards. The goal is to estimate its parameters.
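One way to write this claim down, in notation that is assumed here rather than taken from the slide: a latent state z_t, input u_t, reward r_t, and parameters θ to be estimated from input and reward data.

```latex
z_{t+1} = f_\theta(z_t, u_t), \qquad r_t = g_\theta(z_t, u_t)
```

Under this reading, estimating the parameters means choosing θ so that the rewards predicted by rolling the model forward match the observed rewards.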
Latent Space Interpretation
z is a latent-space encoding of the output history.
Observer Interpretation
z is a latent-space encoding of the output history.
One-Step Further
Claim:
- There exists a pretty wide class of systems where this latent variable both has linear dynamics and generates a quadratic cost (this is a stronger requirement than the Koopman idea).
- We rely on the nonlinear observer / encoder to find this space.
- Once we find this embedding, run LQR online.
- Can relax the cost function to be convex and change this to convex MPC.
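The last two steps can be sketched as follows, assuming the embedding has already been found and yields latent dynamics z' = A z + B u with cost z'Qz + u'Ru. The matrices, the `encoder` placeholder, and the function name `dlqr_gain` are hypothetical stand-ins for illustration, not the learned quantities from the talk.

```python
import numpy as np

def dlqr_gain(A, B, Q, R, iters=500):
    """Infinite-horizon discrete-time LQR gain via Riccati iteration."""
    P = Q.copy()
    for _ in range(iters):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)  # Riccati recursion
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return K, P

# Hypothetical latent model: linear dynamics and quadratic cost (assumed values).
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)
R = np.array([[0.1]])

K, P = dlqr_gain(A, B, Q, R)

def encoder(history):
    # Placeholder for the learned nonlinear observer / encoder:
    # here it simply returns the last two outputs as the latent state.
    return np.asarray(history[-2:], dtype=float)

# Online step: encode the output history, then apply linear feedback.
z = encoder([0.0, 1.0, 0.5])
u = -K @ z
print("gain:", K, "action:", u)
```

Replacing the quadratic cost with a general convex one would swap the closed-form gain for a convex program solved at each step, which is the convex MPC variant mentioned above.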
Optimal Projection Equations
Claim: There exists a state-space dynamical system that generates rewards.
LatentLQR
By Terry Suh