Hardness of Carrots

State Collapse in Reward Symmetry

[Diagram: a left/right symmetric pair of image–reward observations (\mathbf{I}^L_t, r^L_t), (\mathbf{I}^R_t, r^R_t) and their successors (\mathbf{I}^L_{t+1}, r^L_{t+1}), (\mathbf{I}^R_{t+1}, r^R_{t+1}), each mapped by the shared encoder \sigma to latent states z^L_t, z^R_t, z^L_{t+1}, z^R_{t+1}.]
\begin{aligned} \hat{z}^L_{t+1} & =\mathbf{A}z^L_t + \mathbf{B}u_t \\ \hat{r}^L_{t} & =\mathbf{C}z^L_t + \mathbf{D} u_t \end{aligned}
\begin{aligned} \hat{z}^R_{t+1} & =\mathbf{A}z^R_t + \mathbf{B}u_t \\ \hat{r}^R_{t} & =\mathbf{C}z^R_t + \mathbf{D} u_t \end{aligned}

Will it learn a radially symmetric z?

Hypothesis:

1. The requirement to predict the image differentiates states in radially symmetric regions.

2. But if we only predict the reward, AIS has no incentive to differentiate radially symmetric states in the latent space.

3. If the reward and the dynamics are the same, there is no reason to distinguish them (the states are bisimilar).
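A minimal sketch of point 3, with made-up matrices \mathbf{A}, \mathbf{B}, \mathbf{C}, \mathbf{D} chosen so that two radially symmetric latents produce identical rewards under every input: a reward-only objective then gives the encoder no signal to keep them apart.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2-D latent system (all matrices are assumptions for this sketch).
A = np.array([[0.9, 0.0], [0.0, 0.9]])
B = np.array([[1.0], [1.0]])
C = np.array([[1.0, 1.0]])   # reward reads z[0] + z[1]: blind to the symmetry
D = np.array([[0.0]])

# A radially symmetric latent pair: same reward, and the dynamics
# preserve the symmetry (they are bisimilar).
z  = np.array([[1.0], [0.0]])
zp = np.array([[0.0], [1.0]])

for _ in range(5):
    u = rng.normal(size=(1, 1))
    r, rp = C @ z + D @ u, C @ zp + D @ u
    assert np.allclose(r, rp)        # rewards agree at every step...
    z, zp = A @ z + B @ u, A @ zp + B @ u
# ...so a reward-prediction loss alone cannot distinguish z from z'.
```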


Sensitivity Analysis

Take two images, one perturbed by a very small disturbance. What is the sensitivity of the reward prediction to this disturbance?

[Diagram: the nominal image \mathbf{I}_t and the perturbed image \tilde{\mathbf{I}}_t=\mathbf{I}_t + \mathbf{W} pass through the same encoder \sigma, yielding z_t and \tilde{z}_t=z_t+w.]
\begin{aligned} \hat{z}_{t+1} & =\mathbf{A}z_t+\mathbf{B}u_t \\ \hat{r}_t & = \mathbf{C}z_t + \mathbf{D}u_t \end{aligned}
\begin{aligned} \tilde{z}_{t+1} & =\mathbf{A}z_t + \mathbf{B}u_t + \mathbf{A}w \\ \tilde{r}_t & = \mathbf{C}z_t + \mathbf{D}u_t + \mathbf{C}w \end{aligned}
|r_t - \tilde{r}_t| = |\mathbf{C}w|
|r_{t+1} - \tilde{r}_{t+1}| = |\mathbf{CA}w|

|r_{t+2} - \tilde{r}_{t+2}| = |\mathbf{CA}^2w|

Keeping w small is necessary for accurate reward prediction over long horizons, since the error |\mathbf{CA}^k w| compounds through the dynamics. Whether AIS actually learns such a representation is a separate question.
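A quick numerical check of how the perturbation propagates. The matrices below are stand-ins; the point is that |\mathbf{CA}^k w| decays when the spectral radius of \mathbf{A} is below one and grows when it is above one.

```python
import numpy as np

# Stand-in readout and small latent disturbance induced by the image noise.
C = np.array([[1.0, 0.5]])
w = np.array([[1e-3], [1e-3]])

for rho in (0.8, 1.2):           # spectral radius of A: stable vs. unstable
    A = rho * np.eye(2)
    # Reward-prediction error k steps ahead: |C A^k w|.
    errs = [abs((C @ np.linalg.matrix_power(A, k) @ w).item()) for k in range(5)]
    print(rho, errs)
# rho = 0.8: the error sequence decays geometrically.
# rho = 1.2: the error grows with the horizon, so a small w is essential
# for long-horizon reward prediction.
```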

Metric Comparison?

\begin{aligned} z_t & = \sigma(y_t) \\ \hat{z}_{t+1} & =\mathbf{A}z_t+\mathbf{B}u_t \\ \hat{r}_t & = \mathbf{C}z_t + \mathbf{D}u_t \end{aligned}
r-r' = \mathbf{C}(z - z')
|r-r'| \leq \|\mathbf{C}\|_2 \|z - z'\|_2
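The bound above is just Cauchy–Schwarz with the spectral norm of \mathbf{C}; a sketch verifying it on random draws (dimensions and matrices are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# Random linear reward readout; its Lipschitz constant w.r.t. the
# latent Euclidean metric is the spectral norm ||C||_2.
C = rng.normal(size=(1, 4))
lip = np.linalg.norm(C, 2)

for _ in range(1000):
    z, zp = rng.normal(size=(4,)), rng.normal(size=(4,))
    gap = abs((C @ (z - zp)).item())          # |r - r'|
    assert gap <= lip * np.linalg.norm(z - zp) + 1e-12
# The reward gap is controlled by the latent distance: the latent metric
# upper-bounds reward differences with constant ||C||_2.
```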

Numerical Issues with Scale

\begin{aligned} \hat{z}_{t+1} & =\mathbf{A}z_t+\mathbf{B}u_t \\ \hat{r}_t & = \mathbf{C}z_t + \mathbf{D}u_t \end{aligned}

We know that the AIS representation isn't unique, and this non-uniqueness leads to issues with numerical scaling.

 

Consider two networks whose latent states differ by a scale factor \lambda. They have exactly the same capability to predict rewards, but numerical issues arise for arbitrarily large values of \lambda.

z_k=\sigma(\mathbf{I}_k)
\tilde{z}_k=\tilde{\sigma}(\mathbf{I}_k)=\lambda z_k
\begin{aligned} \tilde{z}_{t+1} & = \mathbf{A}\tilde{z}_t+\lambda\mathbf{B}u_t \\ \tilde{r}_t & = \frac{1}{\lambda}\mathbf{C}\tilde{z}_t + \mathbf{D}u_t \end{aligned}
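A sketch of this scale ambiguity with random stand-in matrices: rescaling the latent by \lambda while using (\mathbf{A}, \lambda\mathbf{B}, \tfrac{1}{\lambda}\mathbf{C}, \mathbf{D}) reproduces the rewards exactly, so reward prediction alone cannot pin down the scale.

```python
import numpy as np

rng = np.random.default_rng(3)

# Random stand-in AIS matrices (illustrative, not from a trained model).
n, m = 3, 2
A = 0.5 * rng.normal(size=(n, n))
B = rng.normal(size=(n, m))
C = rng.normal(size=(1, n))
D = rng.normal(size=(1, m))

lam = 1e6                          # arbitrarily large latent rescaling
z = rng.normal(size=(n, 1))
zt = lam * z                       # \tilde z = lam * z

for _ in range(10):
    u = rng.normal(size=(m, 1))
    r  = C @ z + D @ u                       # nominal reward
    rt = (1.0 / lam) * C @ zt + D @ u        # rescaled readout (1/lam) C
    assert np.allclose(r, rt)
    z  = A @ z + B @ u
    zt = A @ zt + lam * (B @ u)              # rescaled input matrix lam * B
# Identical rewards, but \tilde z lives at scale ~1e6: benign in this
# float64 toy, yet a real optimizer or float32 pipeline can hit
# conditioning problems at such scales.
```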

Big Problems

Distribution Shift

State-Space Coverage

(Robust Learning)

Distributionally Robust Optimization

Data Augmentation

Regularization

Robust Control

Uncertainty Quantification

MLE 

Variance Penalized MPC

Uncertainty Aware Control

Bootstrapping

Online RL

Chance-Constrained MPC


By Terry Suh