Report by Pavel Temirchev
Deep RL reading group
Motivation
Contents
Some remarks
Background: MDP
where \( a = \pi(x) \)
and \( P(x' \mid x, a) \) denotes the transition probabilities
Background: MDP
- \( X \) is a finite discrete state space
- \( A \) is a finite discrete action space
- each feature in \( x \) has a finite discrete domain
- each feature in \( a \) has a finite discrete domain
\( x \) and \( a \) are then one-hot encoded
Background:
Linear Programming task for MDP
s.t.
We have LP solvers.
Is this task tractable?
1) Too many variables to minimize
2) Too many summation terms
3) Too many terms in expectations
4) We cannot even store the transition probabilities
5) Exponential number of constraints
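For reference, the exact LP whose scaling issues are listed above has the standard form (notation assumed: \( \alpha \) an initial-state distribution, \( \gamma \) a discount factor, \( R \) the reward function):

$$ \min_{V} \sum_{x} \alpha(x) V(x) \quad \text{s.t.} \quad V(x) \ge R(x, a) + \gamma \sum_{x'} P(x' \mid x, a) V(x') \quad \forall x, a $$

One variable per state and one constraint per state-action pair makes both counts exponential in the number of features.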
Background: Factored MDP
We need a concise representation of the transition probabilities
Let:
And further:
Let's use a Dynamic Bayesian Network (DBN) representation
4) We cannot even store the transition probabilities
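Under the DBN representation, the transition distribution factors over state features (standard factored-MDP form, with \( \mathrm{pa}(x_i') \) denoting the DBN parents of feature \( x_i' \)):

$$ P(x' \mid x, a) = \prod_{i} P_i\big( x_i' \mid \mathrm{pa}(x_i') \big) $$

Only small conditional probability tables over the parent sets need to be stored, which resolves the storage issue.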
Background:
Approximate Linear Programming
1) Too many variables to minimize
2) Too many summation terms
Let
And let's denote
So
Where
- \( h_i \) are some basis functions
Background:
Approximate Linear Programming
If we assume the same initial distribution factorization
We will get a new LP task:
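In the standard ALP form (notation assumed, consistent with basis functions \( h_i \) and weights \( w \)):

$$ V_w(x) = \sum_i w_i h_i(x), \qquad \min_{w} \sum_i w_i \, \mathbb{E}_{x \sim \alpha}\big[ h_i(x) \big] \quad \text{s.t.} \quad V_w(x) \ge R(x,a) + \gamma \, \mathbb{E}_{x' \sim P(\cdot \mid x,a)}\big[ V_w(x') \big] \quad \forall x, a $$

The number of LP variables drops from \( |X| \) to the number of basis functions, addressing issues 1) and 2); the constraint count is still exponential.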
Background:
Approximate Linear Programming
PROOF:
Background: ALP + Factored MDP
3) Too many terms in expectations
Constraints for the LP problem:
It may be rewritten as:
And even further decompose rewards as:
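The key object in the rewriting above is the backprojection of each basis function through the transition model (standard factored-ALP construction):

$$ g_i(x, a) = \sum_{x'} P(x' \mid x, a)\, h_i(x') $$

When \( h_i \) depends only on a small subset of state features, the DBN structure makes \( g_i \) a function of a small subset of \( (x, a) \), so each expectation involves only a few terms.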
Background: ALP + Factored MDP
PROOF:
Background: Constraint Generation
5) Exponential number of constraints
MCV (maximal constraint violation) search:
Use the black-box solver SCIP
for Mixed Integer Programming (MIP)
In our case, \( x \) and \( a \) are Boolean vectors
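The slides hand this maximization to SCIP as a MIP; as an illustration only, a brute-force stand-in (hypothetical helper names `q_fn` and `v_fn`, tractable only for tiny feature counts) could look like:

```python
from itertools import product

def max_constraint_violation(q_fn, v_fn, n_x, n_a):
    """Brute-force MCV search over Boolean state/action vectors.

    q_fn(x, a) is the constraint's right-hand side Q_w(x, a) and
    v_fn(x) the approximate value V_w(x); the violation of the
    constraint V_w(x) >= Q_w(x, a) is their difference.  In the paper
    this maximization is posed as a MIP and handed to SCIP; plain
    enumeration is feasible only for very small n_x, n_a.
    """
    best_cv, best_pair = float("-inf"), None
    for x in product([0, 1], repeat=n_x):
        for a in product([0, 1], repeat=n_a):
            cv = q_fn(x, a) - v_fn(x)
            if cv > best_cv:
                best_cv, best_pair = cv, (x, a)
    return best_cv, best_pair
```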
Background: Logistic Regression
Text
Need more? Try Google, it's free
Logistic Regression
[Diagram: an MDP with state, action, and response variables at timesteps \( t \) and \( t+1 \)]
Logistic Markov Decision Processes
We allow the response \( \phi^t \) to influence the user's state at timestep \( t+1 \)
Factored Logistic MDP
Transition Dynamics:
Reward function:
Hence, our backprojections \( g_i \) now depend on the complete \( x \) and \( a \) vectors
Let's rewrite \( Q(x, a) \) as:
ALP for Logistic MDP
Let's denote:
Then ALP task may be reformulated as:
Constraints are now nonlinear, since \( p(\phi|x,a) \) is nonlinear. MCV search is no longer a MIP problem
s.t.
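Concretely, the response probability follows the logistic regression model over the joint feature vector (with \( u \) its weight vector):

$$ p(\phi = 1 \mid x, a) = \sigma\big( u^{\top}(x, a) \big) = \frac{1}{1 + e^{-u^{\top}(x, a)}} $$

The sigmoid of a linear function of \( (x, a) \) is what the constant approximations that follow replace with fixed values.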
ALP for Logistic MDP
We will denote
And
Constant Approximation
Where \(\sigma^*\) is some constant
We consider two subsets of possible \( (x, a) \) pairs:
where the constraint is non-decreasing with \(\sigma^*\) growth
where the constraint is non-increasing with \(\sigma^*\) growth
We denote by \(U^u\) the solution for
s.t.
And by \( U^l \) the solution with \( \sigma_l \) and \( H^- \) instead of \( \sigma_u \) and \( H^+ \)
Constant Approximation
Same thing for the \( U^l \) and \( C^l \) in the subset \( (x,a) \in [f_l,f_u] \cap H^- \)
Hence,
is an upper bound on MCV in \( [f_l, f_u] \)
is a lower bound on MCV in \( [f_l, f_u] \)
Constant Approximation
CV
\( \sigma \)
\( C(x^{(2)}, a^{(2)}, \sigma) \)
\( C(x^{(1)}, a^{(1)}, \sigma) \)
\( U^u \)
\( \sigma_l \)
\( \sigma_u \)
\( \sigma(f(x^{(2)}, a^{(2)})) \)
\( \sigma(f(x^{(1)}, a^{(1)})) \)
The degree of CV for two state-action pairs as a function of \( \sigma \)
MCV search in ALP-SEARCH
1) Solve two MIP tasks for some interval \( [f_l, f_u] \):
s.t.
s.t.
2) If \( U^* < \epsilon \), then there is no constraint violation in \( [f_l, f_u] \) and we terminate
3) If \( U^* - C^* < \epsilon \), then we report that \( C^* \) is the MCV in \( [f_l, f_u] \) and we terminate
4) If the incumbent \( C' \) from another interval is larger than \( U^* \), we prune this interval
If none of the above holds, we divide the interval in two and recurse
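The steps above can be sketched as a recursive branch-and-bound over logit intervals (a simplified sketch: `upper` and `incumbent` are hypothetical callables standing in for the two MIP solves):

```python
def mcv_search(lo, hi, upper, incumbent, eps=1e-4, best=0.0):
    """Recursive MCV search over the logit interval [lo, hi].

    upper(lo, hi):     an upper bound U* on the max violation in [lo, hi].
    incumbent(lo, hi): a true, possibly suboptimal, violation C* in [lo, hi].
    """
    u_star = upper(lo, hi)
    if u_star < eps or u_star <= best:   # steps 2 and 4: nothing to find here
        return best
    c_star = incumbent(lo, hi)
    best = max(best, c_star)
    if u_star - c_star < eps:            # step 3: bound is tight, C* is the MCV
        return best
    mid = 0.5 * (lo + hi)                # otherwise split and recurse
    best = mcv_search(lo, mid, upper, incumbent, eps, best)
    return mcv_search(mid, hi, upper, incumbent, eps, best)
```

On a toy violation function with a valid upper bound, the search returns the true maximum without enumerating the whole interval.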
Piecewise-Constant Approximation
A piecewise-constant approximation of the sigmoid
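A minimal sketch of such an approximation (the midpoint is an illustrative choice of representative value per interval, not the error-optimal one):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def pwc_sigmoid(breaks):
    """Return a piecewise-constant approximation of the sigmoid.

    breaks: sorted breakpoints [d_0, ..., d_k] in logit space; on each
    interval [d_{i-1}, d_i] the sigmoid is replaced by its value at the
    interval midpoint.
    """
    def approx(z):
        for lo, hi in zip(breaks, breaks[1:]):
            if lo <= z <= hi:
                return sigmoid(0.5 * (lo + hi))
        return sigmoid(z)  # outside the covered range, fall through
    return approx
```

Narrower intervals give a smaller worst-case error, since the sigmoid varies by roughly its slope times the interval width within each piece.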
MCV search in ALP-APPROX
s.t.
where
Then we calculate:
- the true CV, though possibly not the maximal one, in \([\delta_{i-1}, \delta_i]\)
- an estimate of the maximal CV in \([f_l, f_u]\)
Approximation error in ALP-APPROX
THEOREM
A bounded log-relative error for the logistic regression (assuming features with finite domains) can be achieved with \( O(\frac{1}{\epsilon} ||u||_1) \) intervals in logit space, where \(u\) is the logistic regression weight vector
THEOREM
Given an interval \( [a, b] \) in logit space, the value \(\sigma(x)\) with $$ x = \ln \frac{e^{a+b} + e^b}{1 + e^b} $$ minimizes the log-relative error over the interval.
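A quick numerical sketch of the theorem's representative point (checking only that it lands strictly inside the interval; the error-optimality itself is the theorem's claim):

```python
import math

def representative_logit(a, b):
    """Representative point x in logit space for the interval [a, b],
    per the theorem: x = ln( (e^{a+b} + e^b) / (1 + e^b) )."""
    return math.log((math.exp(a + b) + math.exp(b)) / (1.0 + math.exp(b)))
```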
Experiments
Experiments
Model sizes:
Experiments
Extensions
Partition-free CG
which is equivalent to
s.t.
The simple idea is to iteratively alternate between two steps:
But it will almost surely get stuck in local optima
Partition-free CG
Another approach is to consider Lagrangian relaxation:
Primal-Dual alternating optimization:
Initialize the primal and dual variables; then, in a loop, alternate the primal update and the dual update until convergence
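As a toy illustration of such a scheme (generic gradient descent-ascent on an assumed example Lagrangian, not the paper's update rules):

```python
def primal_dual(grad_w, grad_lam, w0, lam0, lr=0.05, iters=2000):
    """Alternate a primal gradient-descent step on w with a projected
    dual gradient-ascent step on lam >= 0 for a Lagrangian L(w, lam)."""
    w, lam = w0, lam0
    for _ in range(iters):
        w = w - lr * grad_w(w, lam)                   # primal descent
        lam = max(0.0, lam + lr * grad_lam(w, lam))   # dual ascent + projection
    return w, lam

# Toy problem: minimize w^2 subject to w >= 1,
# with L(w, lam) = w^2 + lam * (1 - w); the saddle point is w = 1, lam = 2.
w, lam = primal_dual(grad_w=lambda w, l: 2.0 * w - l,
                     grad_lam=lambda w, l: 1.0 - w,
                     w0=0.0, lam0=0.0)
```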
Non-linear Response Model
We consider a wide-and-deep response model:
If the non-linearity in the DNN can be expressed so that the input to the final logistic output is a linear-like function of \( (x, a) \), then CG optimization works the same way as for the Logistic Regression response model.
The ReLU non-linearity can be expressed this way using just one or two indicator functions per unit
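A minimal sketch of this idea (a toy identity, not the paper's MIP encoding): once the indicator is fixed, each unit's output is linear in its input.

```python
def relu(z):
    return max(0.0, z)

def relu_via_indicator(z):
    """ReLU written with one indicator: relu(z) = 1[z > 0] * z.
    With the Boolean indicator fixed (e.g. as a MIP variable encoding
    the unit's activation pattern), the expression is linear in z."""
    indicator = 1.0 if z > 0 else 0.0
    return indicator * z
```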
Thanks for your
attention!