Assume a generative chain: the system dynamics evolve a true (hidden) state, the state generates the output, and the output generates the reward.
The goal is to learn a state representation that accurately predicts the reward dynamics.
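A minimal way to write this chain (the notation is assumed here, not fixed by the slides):

\[
x_{t+1} = f(x_t, u_t), \qquad y_t = g(x_t), \qquad r_t = h(y_t, u_t),
\]

where x_t is the hidden true state, u_t the input, y_t the output, and r_t the reward.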
Output-feedback control problem (with unknown dynamics)
Optimal Output Feedback
Input-Output Plant Model (Unknown)
Output-Based Reward: the reward is a function of the output (and input), not of the hidden state.
Optimal Control based on Output-Feedback
Find actions that minimize the cumulative sum of rewards (the reward here plays the role of a per-step cost).
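In symbols, one hedged formalization (the horizon and policy class are assumed here):

\[
\min_{\pi} \; \sum_{t=0}^{T} r_t \qquad \text{subject to} \qquad u_t = \pi(y_{0:t}, u_{0:t-1}),
\]

i.e., the controller may act only on the observed input-output history, never on the internal state.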
Assumption: there exists some internal state variable that generates the output, and its dynamics are first-order (memoryless given the current state, i.e., Markov).
Input-Output / State-Space Equivalence
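For the linear time-invariant case (used here purely as an illustration):

\[
x_{t+1} = A x_t + B u_t, \quad y_t = C x_t
\;\Longrightarrow\;
y_t = C A^t x_0 + \sum_{k=0}^{t-1} C A^{t-1-k} B u_k,
\]

so the input-output map is determined by the Markov parameters \(\{C A^k B\}\); conversely, any such input-output map admits a (non-unique) state-space realization.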
Recall that we don't know the dynamics. Traditional SysID learns state-space model parameters by fitting input-output data (typically supervised learning / least squares on input-output pairs).
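A toy sketch of that least-squares flavor (the system, model order, and noise level below are illustrative assumptions, not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)

    # Simulate a simple SISO ARX system, unknown to the learner:
    #   y[t] = a1*y[t-1] + a2*y[t-2] + b1*u[t-1] + noise
    a1, a2, b1 = 0.6, -0.2, 0.5
    T = 500
    u = rng.normal(size=T)
    y = np.zeros(T)
    for t in range(2, T):
        y[t] = a1 * y[t - 1] + a2 * y[t - 2] + b1 * u[t - 1] + 0.01 * rng.normal()

    # Least squares on input-output data:
    # regress y[t] on the regressors (y[t-1], y[t-2], u[t-1]).
    Phi = np.column_stack([y[1:-1], y[:-2], u[1:-1]])
    theta, *_ = np.linalg.lstsq(Phi, y[2:], rcond=None)
    print("estimated (a1, a2, b1):", theta)  # should be close to (0.6, -0.2, 0.5)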
When the output is very high-dimensional and nonlinear, learning an output model is hard. Let's try learning a model for the reward instead.
Reward Dynamics
Claim: There exists a state-space dynamical system that generates rewards. The goal is to estimate its parameters.
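One concrete (assumed) linear parameterization of the claim:

\[
z_{t+1} = A z_t + B u_t, \qquad r_t = C z_t + D u_t,
\]

with parameters (A, B, C, D) to be estimated from input-reward data alone, bypassing the high-dimensional output.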
Latent Space Interpretation
z is a latent-space encoding of the output history.
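A toy subspace-style sketch of this idea (the linear system, window sizes, and rank below are all illustrative assumptions): fit a multi-step history-to-reward predictor by least squares, then factor it so that a low-dimensional z = enc(history) suffices to predict future rewards.

    import numpy as np

    rng = np.random.default_rng(1)

    # Toy ground truth: rewards generated by a hidden 2-dimensional linear state.
    A = np.array([[0.8, 0.2],
                  [0.0, 0.7]])
    b = np.array([1.0, 0.5])
    c = np.array([1.0, -1.0])
    T, k, m = 3000, 10, 10          # trajectory length, past window, future window
    u = rng.normal(size=T)
    x = np.zeros((T, 2))
    r = np.zeros(T)
    for t in range(T - 1):
        r[t] = c @ x[t]
        x[t + 1] = A @ x[t] + b * u[t]

    # Past input/reward windows h_t and future rewards f_t (no output y needed).
    idx = range(k, T - m)
    H = np.stack([np.concatenate([u[t - k:t], r[t - k:t]]) for t in idx])  # (N, 2k)
    F = np.stack([r[t:t + m] for t in idx])                                # (N, m)

    # Multi-step least-squares predictor F ~ H @ Theta, then a rank-2
    # factorization: z = H @ enc is a latent encoding of the history that
    # predicts future rewards through dec.
    Theta, *_ = np.linalg.lstsq(H, F, rcond=None)
    U, S, Vt = np.linalg.svd(Theta, full_matrices=False)
    enc = U[:, :2] * S[:2]          # history -> latent z (2-dim)
    dec = Vt[:2]                    # latent z -> future rewards
    z = H @ enc
    print("full LS error :", np.linalg.norm(H @ Theta - F) / np.linalg.norm(F))
    print("rank-2 error  :", np.linalg.norm(z @ dec - F) / np.linalg.norm(F))

The two printed errors should nearly match, showing that a 2-dimensional z captures everything the full history offers for reward prediction.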
Observer Interpretation
z can equivalently be computed recursively, like a state observer: updated online from the current input and measurement rather than re-encoding the whole output history at each step.
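One standard observer-style update that makes this concrete (a Luenberger-type form, assumed here for illustration rather than taken from the slides):

\[
\hat r_t = C z_t + D u_t, \qquad z_{t+1} = A z_t + B u_t + L\,(r_t - \hat r_t),
\]

so z is propagated recursively and corrected by the reward prediction error, playing the role of an innovation signal.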
One-Step Further
Claim:
Optimal Projection Equations
Claim: There exists a state-space dynamical system that generates rewards. Finding the best fixed-order such system connects to the optimal projection equations (cf. Hyland & Bernstein), which characterize optimal fixed-order compensators through coupled modified Riccati and Lyapunov equations.