Joint work with Elvis Dohmatob and Jeremie Mary
[Agent-environment interaction diagram: the agent observes the current state, takes an action, and the environment returns a reward and the next state.]
Environment is unknown!
Learn safely under uncertainty
Prepare for the worst case
But not too conservative!
Safety w.r.t. finite amount of experience
Approximation for continuous action space
SOTA
Dynamic Programming
$$M := (\mathcal{S}, \mathcal{A}, P, r, \gamma)$$
How to act in a state
How good is a state
\(\pi(a|s)\) is the probability of choosing action \(a\) in state \(s\)
Bellman operator
Policy and Value Iteration
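For reference, the two standard Bellman operators for the MDP \(M\) above; both are \(\gamma\)-contractions, which is why their fixed points exist, are unique, and are reached by iteration:

$$(T^\pi V)(s) = \sum_{a} \pi(a|s) \Big[ r(s,a) + \gamma \sum_{s'} P(s'|s,a)\, V(s') \Big], \qquad (T^* V)(s) = \max_{a} \Big[ r(s,a) + \gamma \sum_{s'} P(s'|s,a)\, V(s') \Big]$$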
SOTA: Actor-Critic algorithms
Policy Gradient Theorem
REINFORCE
SOTA: Trust Region Policy Optimization
Q-value of an action * how often this action is taken
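Written out, the policy gradient theorem makes the bullet above precise: the gradient weights each action's Q-value by how often the policy visits the state and takes that action,

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta(\cdot|s)} \big[ Q^{\pi_\theta}(s,a)\, \nabla_\theta \log \pi_\theta(a|s) \big]$$

REINFORCE replaces \(Q^{\pi_\theta}(s,a)\) with a sampled return; actor-critic methods estimate it with a learned critic.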
Exists and unique
Useful to show convergence
Greedy policy chooses the action that maximizes the Q-value
\(m\) = number of evaluation steps per greedy improvement:
\(m = \infty\): policy iteration, \(m = 1\): value iteration
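A minimal tabular sketch of this \(m\)-step scheme, assuming a known transition tensor `P` of shape `(S, A, S)` and reward matrix `r` of shape `(S, A)` as NumPy arrays (names are illustrative, not from the paper's code):

```python
import numpy as np

def m_step_policy_iteration(P, r, gamma, m, iters=100):
    """Tabular sketch: m = 1 behaves like value iteration, large m like policy iteration."""
    S, A, _ = P.shape
    V = np.zeros(S)
    for _ in range(iters):
        # greedy improvement: pi(s) = argmax_a [ r(s,a) + gamma * E[V(s')] ]
        q = r + gamma * P @ V                  # shape (S, A)
        pi = q.argmax(axis=1)
        # partial evaluation: apply the Bellman operator T^pi  m times
        r_pi = r[np.arange(S), pi]             # shape (S,)
        P_pi = P[np.arange(S), pi]             # shape (S, S)
        for _ in range(m):
            V = r_pi + gamma * P_pi @ V
    return V, pi
```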
Errors
Optimization ~ mountain descent
RL ~ mountain descent in bad weather
Fastest descent might be dangerous due to uncertainty over the landscape...
Errors
finite sample of data
neural network value function
Difference between the exact Bellman operator (BO) and what we actually compute
Lower bound
Upper bound
Risk-averse
Exact BO
Overly optimistic
adversarial temperature
1-D convex optimization (scipy.optimize.bisect)
Too conservative
Too optimistic
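A minimal sketch of that 1-D step, assuming the KL-ball dual objective has the form \(g(\lambda) = \lambda \log \mathbb{E}[e^{-V/\lambda}] + \lambda \epsilon\) (the exact constants in the paper may differ); `dual_objective` and `solve_adversarial_temperature` are illustrative names. Since \(g\) is convex in \(\lambda\), its minimizer can be located by bisecting on the sign of a numerical derivative:

```python
import numpy as np
from scipy.optimize import bisect
from scipy.special import logsumexp

def dual_objective(lam, v_samples, eps):
    # g(lambda) = lambda * log E[exp(-V / lambda)] + lambda * eps   (convex in lambda > 0)
    log_mean_exp = logsumexp(-v_samples / lam) - np.log(len(v_samples))
    return lam * log_mean_exp + lam * eps

def solve_adversarial_temperature(v_samples, eps, lo=1e-2, hi=1e3, h=1e-4):
    # bisect on the numerical derivative of the convex dual
    # (assumes the derivative changes sign on [lo, hi])
    dg = lambda lam: (dual_objective(lam + h, v_samples, eps)
                      - dual_objective(lam - h, v_samples, eps)) / (2.0 * h)
    return bisect(dg, lo, hi)

# usage with dummy next-state value samples
lam_star = solve_adversarial_temperature(np.random.randn(64), eps=0.1)
```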
Define the decrease rate \( \rho = \lim_{N \rightarrow \infty} \epsilon_N / \epsilon_{N-1}\).
where \(V_t\) is the value function computed via the exact evaluation step.
The normalization constant of the adversarial policy is an intractable integral
Risk-neutral \(\lambda \rightarrow \infty\)
Risk-averse \(\lambda > 0\)
log-moment generating function
Taylor series of logsumexp up to 2nd order as \(\lambda \rightarrow \infty\)
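Concretely, expanding the log-moment generating function to second order in \(1/\lambda\) gives

$$-\lambda \log \mathbb{E}\big[ e^{-V/\lambda} \big] \;\approx\; \mathbb{E}[V] \;-\; \frac{\mathrm{Var}[V]}{2\lambda},$$

which is exactly the "BO + variance penalty" reading below.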
ABO ~ BO + variance penalty = reward change
For \(\lambda > 0\), this encourages visiting states with smaller variance
Using a Taylor expansion of Q-values around the mean action
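For a Gaussian policy \(a \sim \mathcal{N}(\mu(s), \Sigma(s))\), that expansion yields a cheap variance estimate from the gradient of Q at the mean action (a first-order sketch of the idea; the second-order version used in the code may add further terms):

$$\sigma_Q(s) \;\approx\; \sqrt{ \nabla_a Q(s,\mu(s))^\top\, \Sigma(s)\, \nabla_a Q(s,\mu(s)) }$$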
Cautious short-term and optimistic long-term!
Entropy-regularized SOTA
Simple reward modification
Longer episodes
Stable learning score
SAC (Baseline) vs. Safe SAC (Proposed)
Quite stable already
| | Hopper | Walker2D |
|---|---|---|
| Return Avg | Similar | Similar |
| Return Std | -76% +/- 21 | -78% +/- 48 |
| Episode Len Avg | Similar | Similar |
| Episode Len Std | -76% +/- 13 | -77% +/- 42 |
Percent change w.r.t. SAC
def _get_adv_reward_cor(self, q1_mu, q1_mu_targ, mu, mu_targ, std, std_targ):
    # state visit counter
    n_s = self._n_s_ph if self._use_n_s else self._total_timestep_ph
    # size of the uncertainty set, shrinking with the visit count
    adv_eps = tf.divide(self._adv_c, tf.pow(tf.cast(n_s, tf.float32), self._adv_eta))
    # approximate standard deviation of Q-values at the current and next states
    g0 = tf.gradients(q1_mu, mu)[0]
    g0_targ = tf.gradients(q1_mu_targ, mu_targ)[0]
    approx_q_std = self._approx_q_std_2_order(g0, q1_mu, mu, std, self._observations_ph)
    approx_q_std_targ = self._approx_q_std_2_order(g0_targ, q1_mu_targ, mu_targ, std_targ,
                                                   self._next_observations_ph)
    # approximate adversarial temperature lambda
    adv_lambda = tf.divide(approx_q_std, tf.sqrt(2 * adv_eps))
    # safe reward correction (simplified by substituting the lambda approximation)
    adv_reward_cor = (1. / (2 * adv_lambda) *
                      (self._discount * approx_q_std_targ - approx_q_std))
    return adv_reward_cor
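Presumably this correction is what realizes the "simple reward modification" above: it would be added to the environment reward inside the critic update (an assumption about the surrounding SAC training loop, which is not shown here).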
lower bound w.r.t. estimation errors
using convex duality
to the optimal policy
approximation for continuous control
exploration strategy