Approximate Linear Programming for Markov Decision Processes
Report by Pavel Temirchev
Deep RL reading group
Motivation
- We want to model interactions with users
- The user state is the environment state
- Our actions are ads, recommendations, etc.
- Myopic predictions are the current approach
- We want to maximize the return over long-term interactions
- And we want to reuse pretrained myopic models (such as logistic regression)
- The state-action space is very large, usually discrete and sparse!
Contents
- Background
- MDP, Factored MDP
- Approximate Linear Programming
- Logistic Regression
- Logistic MDP
- Factored Logistic MDP
- ALP for Logistic MDP
- Exact Sequential Approach
- Piece-Wise Constant Approximation
- Error Analysis
- Experiments
- Extensions
Some remarks
- Model-Based method
- We do not learn the transition dynamics; they are assumed to be given
- Not really RL
- Not really Deep
- Work in progress
Background: MDP
An MDP is given by a state space, an action space, transition probabilities \( P(x' \mid x, a) \), a reward function \( R(x, a) \), and a discount factor \( \gamma \); actions are chosen by a policy, \( a = \pi(x) \).
Background: MDP
- \( X \) is a finite discrete state space
- \( A \) is a finite discrete action space
- each feature in \( x \) has a finite discrete domain
- each feature in \( a \) has a finite discrete domain
- \( x, a \) are then one-hot encoded
Background:
Linear Programming task for MDP
\[ \min_{V} \;\sum_x \alpha(x)\, V(x) \]
s.t.
\[ V(x) \;\ge\; R(x, a) + \gamma \sum_{x'} P(x' \mid x, a)\, V(x') \quad \forall x, a \]
We have LP solvers.
Is this task tractable?
1) Too many variables to minimize
2) Too many summation terms
3) Too many terms in expectations
4) We can't even store the transition probabilities
5) Exponential number of constraints
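A minimal, self-contained sketch of the LP above for a tiny random MDP (a toy illustration, not the talk's code; scipy.optimize.linprog is just one convenient solver choice). Even here the LP has \( |X| \) variables and \( |X|\,|A| \) constraints, which is exactly what blows up for large factored spaces:

```python
# Exact LP for a tiny random MDP, solved with scipy.optimize.linprog (toy sketch).
import numpy as np
from scipy.optimize import linprog

n_states, n_actions, gamma = 5, 3, 0.9
rng = np.random.default_rng(0)

P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[x, a, x']
R = rng.random((n_states, n_actions))                              # R[x, a]
alpha = np.full(n_states, 1.0 / n_states)                          # initial-state distribution

# Constraint V(x) >= R(x,a) + gamma * sum_x' P(x'|x,a) V(x') for all (x, a),
# written in linprog form:  (gamma * P[x,a,:] - e_x) @ V <= -R[x,a]
A_ub, b_ub = [], []
for x in range(n_states):
    for a in range(n_actions):
        row = gamma * P[x, a]
        row[x] -= 1.0
        A_ub.append(row)
        b_ub.append(-R[x, a])

res = linprog(c=alpha, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * n_states)
print("Optimal value function V*:", res.x)
```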
Background: Factored MDP
We need a concise representation of the transition probabilities.
Let the state and action be feature vectors: \( x = (x_1, \dots, x_n) \), \( a = (a_1, \dots, a_m) \).
And further: let each next-state feature depend only on a small set of parent features.
Let's use a Dynamic Bayesian Network (DBN) representation.
4) We can't even store the transition probabilities
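For reference, the standard factored (DBN) transition form this points to (a sketch; the parent-set notation \( \Gamma_i \) is an assumption, not taken from the slides):

\[
  P(x' \mid x, a) \;=\; \prod_{i=1}^{n} P_i\bigl(x'_i \,\big|\, x[\Gamma_i],\, a[\Gamma_i]\bigr)
\]

Only the small conditional tables \( P_i \) need to be stored, not the full transition matrix.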
Background:
Approximate Linear Programming
1) Too many variables to minimize
2) Too many summation terms
Let \( V(x) \approx V_w(x) = \sum_i w_i \beta_i(x) \)
And let's denote \( \alpha_i = \sum_x \alpha(x)\, \beta_i(x) \)
So the objective becomes \( \sum_x \alpha(x)\, V_w(x) = \sum_i w_i \alpha_i \)
Where
- \( \beta_i(x) \) are some basis functions
Background:
Approximate Linear Programming
If we assume the same factorization of the initial distribution,
we will get a new LP task:
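For reference, the standard form of the resulting ALP in this notation (a sketch; the original slide formula may be arranged differently):

\[
  \min_{w} \;\sum_i w_i\, \alpha_i
  \quad\text{s.t.}\quad
  \sum_i w_i\, \beta_i(x) \;\ge\; R(x, a) + \gamma \sum_{x'} P(x' \mid x, a) \sum_i w_i\, \beta_i(x'), \quad \forall x, a
\]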
Background:
Approximate Linear Programming
PROOF:
Background: ALP + Factored MDP
3) Too many terms in expectations
Constraints for the LP problem:
They may be rewritten as:
And we can further decompose the rewards as:
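In standard factored-ALP notation (a sketch; the scope notation \( x[j], a[j] \) is assumed), the backprojection of a basis function and the rewritten constraint are:

\[
  g_i(x, a) \;=\; \sum_{x'} P(x' \mid x, a)\, \beta_i(x'),
  \qquad
  \sum_i w_i \bigl(\beta_i(x) - \gamma\, g_i(x, a)\bigr) \;\ge\; \sum_j R_j\bigl(x[j], a[j]\bigr), \quad \forall x, a
\]

Because each \( \beta_i \) and \( R_j \) has a small scope, every term touches only a few features.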
Background: ALP + Factored MDP
PROOF:
Background: Constraint Generation
5) Exponential number of constraints
- Solve the master LP problem for a subset of constraints using GLOP; get the optimal \( w \) values
- Find a maximally violated constraint among those not yet added to the master LP
- Add it to the master LP if the violation is positive, else break
- Repeat
MVC search:
Use the black-box solver SCIP for Mixed Integer Programming.
In our case \( x, a \) are Boolean vectors.
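A self-contained sketch of this loop on toy data (illustrative, not the talk's code): the master LP is solved with GLOP via OR-Tools, while the MVC search is brute force over a small enumerated pool, standing in for the SCIP MIP used in the talk:

```python
# Constraint generation on a toy factored-ALP instance. Each candidate constraint is
# summarised by basis values beta(x), backprojections g(x, a) and the reward R(x, a):
#     sum_i w_i * (beta_i(x) - gamma * g_i(x, a)) >= R(x, a)
import numpy as np
from ortools.linear_solver import pywraplp

rng = np.random.default_rng(0)
n_basis, gamma, eps = 4, 0.9, 1e-6
alpha = rng.random(n_basis)                       # objective weights E_alpha[beta_i]
pool = [(rng.random(n_basis), rng.random(n_basis), rng.random()) for _ in range(50)]

def violation(w, beta, g, r):
    """Shortfall of one constraint under the current weights w (positive = violated)."""
    return float(r) - float(np.dot(w, beta - gamma * g))

def solve_master(active):
    """Solve the master LP over the currently active constraints with GLOP."""
    solver = pywraplp.Solver.CreateSolver("GLOP")
    w = [solver.NumVar(-100.0, 100.0, f"w{i}") for i in range(n_basis)]  # box keeps LP bounded
    solver.Minimize(sum(float(alpha[i]) * w[i] for i in range(n_basis)))
    for beta, g, r in active:
        coeffs = beta - gamma * g
        solver.Add(sum(float(coeffs[i]) * w[i] for i in range(n_basis)) >= float(r))
    solver.Solve()
    return np.array([wi.solution_value() for wi in w])

active = [pool[0]]
while True:
    w = solve_master(active)
    mvc = max(pool, key=lambda c: violation(w, *c))   # brute-force stand-in for the SCIP MIP
    if violation(w, *mvc) <= eps:
        break                                         # no violated constraint remains
    active.append(mvc)
print("weights:", w, "| active constraints:", len(active))
```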
Background: Logistic Regression
Need more? Try Google, it's free
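As a reminder (a standard formulation; the feature map \( \psi \) and weight vector \( u \) are assumed notation):

\[
  p(\phi = 1 \mid x, a) \;=\; \sigma\bigl(u^\top \psi(x, a)\bigr),
  \qquad
  \sigma(z) = \frac{1}{1 + e^{-z}}
\]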
Logistic Regression
[Diagram: STATE and ACTION at time \( t \) produce a RESPONSE; in the MDP the state transitions from \( t \) to \( t+1 \).]
Logistic Markov Decision Processes
We allow the response \( \phi^t \) to influence the user's state at timestep \( t+1 \)
Factored Logistic MDP
Transition Dynamics:
Reward function:
Hence, our backprojections \( g_i \) now depend on the complete \( x \) and \( a \) vectors
Let's rewrite \( Q(x, a) \) as:
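A sketch of one natural way to write this (assumed notation: binary response \( \phi \), logit \( f(x, a) = u^\top \psi(x, a) \), and response-conditioned backprojections \( g_i^\phi \)):

\[
  Q_w(x, a) \;=\; R(x, a) + \gamma \sum_{\phi \in \{0, 1\}} p(\phi \mid x, a) \sum_i w_i\, g_i^{\phi}(x, a),
  \qquad
  p(\phi = 1 \mid x, a) = \sigma\bigl(f(x, a)\bigr)
\]

The sigmoid inside the constraints is what makes the ALP constraints nonlinear below.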
ALP for Logistic MDP
Let's denote:
Then the ALP task may be reformulated as:
s.t.
The constraints are now nonlinear, since \( p(\phi \mid x, a) \) is nonlinear; the MVC search is no longer a MIP problem.
ALP for Logistic MDP
We will denote
And
Constant Approximation
Where \(\sigma^*\) is some constant
We consider two subsets of possible \( (x, a) \) pairs:
- \( H^+ \): where the constraint violation is non-decreasing as \( \sigma^* \) grows
- \( H^- \): where the constraint violation is non-increasing as \( \sigma^* \) grows
We denote by \( U^u \) the solution of the MIP that maximizes the constraint violation over \( (x, a) \in H^+ \) with \( \sigma^* = \sigma_u \),
s.t. \( f(x, a) \in [f_l, f_u] \),
and by \( U^l \) the solution with \( \sigma_l \) and \( H^- \) instead of \( \sigma_u \) and \( H^+ \).
Constant Approximation
- \( U^u \) is an upper bound on the maximal constraint violation (CV) in the subset \( (x,a) \in [f_l, f_u] \cap H^+ \)
- The true CV at that point, \( C^u = C(x^u, a^u, w) \), is a lower bound on the maximal CV in this subset.
The same holds for \( U^l \) and \( C^l \) in the subset \( (x, a) \in [f_l, f_u] \cap H^- \).
Hence,
\( U^* = \max(U^u, U^l) \) is an upper bound on the MCV in \( [f_l, f_u] \),
\( C^* = \max(C^u, C^l) \) is a lower bound on the MCV in \( [f_l, f_u] \).
Constant Approximation
[Figure: the degree of CV for two state-action pairs, \( C(x^{(1)}, a^{(1)}, \sigma) \) and \( C(x^{(2)}, a^{(2)}, \sigma) \), as a function of \( \sigma \), with \( \sigma_l \), \( \sigma_u \), \( \sigma(f(x^{(1)}, a^{(1)})) \), \( \sigma(f(x^{(2)}, a^{(2)})) \), and the bound \( U^u \) marked.]
MVC search in ALP-SEARCH
1) Solve two MIP tasks for some interval \( [f_l, f_u] \) :
(the two MIPs defining \( U^u \) and \( U^l \) above, each s.t. \( f(x, a) \in [f_l, f_u] \))
2) If \( U^* < \epsilon \), then there is no constraint violation in \( [f_l, f_u] \) and we terminate
3) If \( U^* - C^* < \epsilon \), then we report \( C^* \) as the MCV in \( [f_l, f_u] \) and terminate
4) If a \( C' \) found in another interval is larger than \( U^* \), then this interval cannot contain the MVC and we stop searching it
5) If none of the above holds, we divide the interval into two and recurse
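A toy sketch of this interval search (illustrative only, not the talk's code: the inner maximizations are brute force over an enumerated pool with a violation that is affine, hence monotone, in \( \sigma \), standing in for the SCIP MIPs):

```python
# Branch-and-bound over logit intervals, mimicking the ALP-SEARCH termination rules.
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Each toy (x, a) pair is summarised by its logit f(x, a) and an affine violation model
# CV(x, a, sigma) = c0 + c1 * sigma (monotone in sigma, as in the H+/H- split).
pool = [(rng.uniform(-4, 4), rng.normal(), rng.normal()) for _ in range(200)]

def cv(item, s):
    f, c0, c1 = item
    return c0 + c1 * s

def search(f_l, f_u, eps=1e-3, best=-np.inf):
    """Estimate the maximal constraint violation over the logit interval [f_l, f_u]."""
    inside = [it for it in pool if f_l <= it[0] <= f_u]
    if not inside:
        return best
    s_l, s_u = sigmoid(f_l), sigmoid(f_u)
    U = max(max(cv(it, s_l), cv(it, s_u)) for it in inside)   # upper bound: best constant sigma
    C = max(cv(it, sigmoid(it[0])) for it in inside)          # lower bound: true CV at own sigma
    if U < eps or U - C < eps or U <= best:
        return max(best, C)      # no violation / bounds tight / interval dominated by another
    mid = 0.5 * (f_l + f_u)      # otherwise split the interval and recurse
    best = search(f_l, mid, eps, max(best, C))
    return search(mid, f_u, eps, best)

print("estimated maximal constraint violation:", search(-4.0, 4.0))
```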
Piece-Wise Constant Approximation
A piece-wise constant approximation of the sigmoid
MVC search in ALP-APPROX
s.t.
where
Then we calculate:
- the true CV, possibly not the maximal one, in \([\delta_{i-1}, \delta_i]\)
- an estimate of the maximal CV in \([f_l, f_u]\)
Approximation error in ALP-APPROX
THEOREM
A log-relative error bounded by \( \epsilon \) for the logistic regression (assuming features with finite domains) can be achieved with \( O(\frac{1}{\epsilon} \|u\|_1) \) intervals in logit space, where \( u \) is the logistic regression weight vector.
THEOREM
Given an interval \( [a, b] \) in logit space, the value \(\sigma(x)\) with $$ x = \ln \frac{e^{a+b} + e^b}{1 + e^b} $$ minimizes the log-relative error over the interval.
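A small sketch of this construction (the equal-width logit intervals below are a simplification; the talk sizes the intervals to meet a target log-relative error):

```python
# Piece-wise constant sigmoid approximation using the theorem's per-interval minimizer.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def best_point(a, b):
    # The theorem's minimizer of the log-relative error on the logit interval [a, b]:
    #   x = ln( (e^{a+b} + e^b) / (1 + e^b) )
    return np.log((np.exp(a + b) + np.exp(b)) / (1.0 + np.exp(b)))

edges = np.linspace(-6.0, 6.0, 25)                                  # 24 intervals in logit space
levels = np.array([sigmoid(best_point(a, b)) for a, b in zip(edges[:-1], edges[1:])])

def sigma_pwc(z):
    """Piece-wise constant approximation of sigmoid(z) on [-6, 6]."""
    i = np.clip(np.searchsorted(edges, z) - 1, 0, len(levels) - 1)
    return levels[i]

zs = np.linspace(-6.0, 6.0, 1001)
print("max log-relative error:", np.max(np.abs(np.log(sigma_pwc(zs) / sigmoid(zs)))))
```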
Experiments
- Advertising task
- Reward: 1 for click, 0 otherwise
- The aim is to maximize Cumulative Click-Through Rate (CCTR)
- Features are one-hot encoded
- Features are divided into three categories:
- User state (static or dynamic) - state variable
- Ad description - action variable
- User-Ad interaction - action variable
- The transition dynamics are simple: either the identity function or a Bernoulli distribution on moving to the next bucket in the feature domain
- Logistic regression pretrained on 300M examples
Experiments
Model sizes:
- Tiny: 2 state features (48 binarized), 1 action feature (7 binarized)
- Small: 6 state features (71 binarized), 4 action features (15 binarized)
- Medium: 11 state features (251 binarized), 8 action features (170 binarized)
- Large: 12 state features (2630 binarized), 11 action features (224 binarized)
Extensions
- Relaxation of the CG optimization
- Cross-product features
- Multiple response variables
- Partition-free CG
- Non-linear response model
Partition-free CG
which is equivalent to maximizing over \( (x, a, y) \)
s.t. \( y = f(x, a) \)
The simple idea is to iteratively alternate between two steps:
- maximize over \( x, a \) (with \( y \) fixed) using a MIP solver
- set \( y = f(x, a) \)
But this will get stuck in local optima almost surely.
Partition-free CG
Another approach is to consider Lagrangian relaxation:
Primal-Dual alternating optimization:
Initialize the dual variables; then, in a loop, alternate a primal step (maximize over \( x, a \)) with a dual step (update the multipliers).
Non-linear Response Model
We consider a wide-n-deep response model:
- some features from \( x \) and \( a \) are fed directly into the final logistic output unit
- some features are passed through a DNN with several layers of non-linear units
If we can express the non-linearity in the DNN in such a way that the input to the final logistic output unit is a linear-like function of \( (x, a) \), then the CG optimization stays the same as for the logistic regression response model.
The ReLU non-linearity can be expressed this way using just one or two indicator functions per unit (see the sketch below).
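A sketch of the standard mixed-integer encoding this alludes to (one binary indicator \( z \) per unit; \( M \) is a big-M constant bounding the pre-activation \( y = w^\top h + b \); the talk's exact encoding may differ):

\[
  o \ge 0, \quad o \ge y, \quad o \le M z, \quad o \le y + M (1 - z), \quad z \in \{0, 1\}
\]

These constraints force \( o = \max(0, y) = \mathrm{ReLU}(y) \), so each unit's output stays a linear-like function of its inputs inside the MIP.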
Thanks for your attention!