Approximate Linear Programming for Markov Decision Processes

Report by Pavel Temirchev

Deep RL reading group

Motivation

  • We want to model interaction with users
  • The user's state is the environment state
  • Our actions are ads, recommendations, etc.
  • Myopic predictions are the current approach
  • We want to maximize return over long-term interactions
  • And we want to reuse pretrained myopic models (such as logistic regression)
  • The state-action space is very large, usually discrete and sparse!

Contents

  • Background
    • MDP, Factored MDP
    • Approximate Linear Programming
    • Logistic Regression
  • Logistic MDP
    • Factored Logistic MDP
  • ALP for Logistic MDP
    • Exact Sequential Approach
    • Piece-Wise Constant Approximation
    • Error Analysis
  • Experiments
  • Extensions

Some remarks

  • Model-Based method
  • We do not learn the transition dynamics; they are assumed to be given
  • Not really RL
  • Not really Deep
  • Work in progress

Background: MDP


$$V^{\pi}(x) = r_x^a + \gamma \sum_{x'} p(x'|x, a) V^{\pi}(x')$$
$$Q^{\pi}(x,a) = r_x^a + \gamma \sum_{x'} p(x'|x, a) V^{\pi}(x')$$

where \( a = \pi(x) \)

$$\pi^*(x) = \arg\max_a Q^*(x, a)$$

where \( p(x'|x, a) \) are the transition probabilities.

Background: MDP

\( x \in X \) is a finite discrete state space

\( a \in A \) is a finite discrete action space

\( x_i \in Dom(X_i) \) is a finite discrete domain for each feature of \( x \)

\( a_i \in Dom(A_i) \) is a finite discrete domain for each feature of \( a \)

\( x, a \) are then one-hot encoded

Background: Linear Programming for MDPs

$$\min_{v} \sum_x \alpha(x)v(x)$$

s.t.

$$v(x) \geq Q^v(x, a) = r_x^a + \gamma \sum_{x'} p(x'|x,a)v(x') \quad \forall x \in X, \forall a \in A$$

 We have LP solvers.

Is this task tractable?

1) Too many variables to minimize

2) Too many summation terms

3) Too many terms in the expectations

4) We cannot even store the transition probabilities

5) An exponential number of constraints
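To make the LP concrete before we start approximating it, here is a minimal sketch of the exact LP on a toy MDP. It uses scipy's linprog rather than GLOP, and all numbers (P, R, gamma, alpha) are illustrative, not from the paper:

```python
# Exact LP for a toy MDP: min_v sum_x alpha(x) v(x)
# s.t. v(x) >= R[x,a] + gamma * sum_x' P[a,x,x'] v(x') for all (x, a).
import numpy as np
from scipy.optimize import linprog

n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))  # P[a, x, x']
R = rng.random((n_states, n_actions))                             # R[x, a]
alpha = np.ones(n_states) / n_states                              # state-relevance weights

# Each constraint is rewritten as (gamma * P[a,x,:] - e_x) @ v <= -R[x,a].
A_ub, b_ub = [], []
for x in range(n_states):
    for a in range(n_actions):
        row = gamma * P[a, x].copy()
        row[x] -= 1.0
        A_ub.append(row)
        b_ub.append(-R[x, a])

res = linprog(c=alpha, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * n_states)
print("optimal value function:", res.x)
```

For a state space exponential in the number of features, neither the variables nor the constraint rows fit in memory, which is exactly issues 1)–5).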

Background: Factored MDP

We need a concise representation of the transition probabilities

Let:

$$p(x' | x, a) = \prod_i p(x_i'|x,a)$$

And further:

$$p(x' | x, a) = \prod_i p(x_i'|par_i), \quad par_i = (x[Par_i], a[Par_i]), \quad Par_i \subseteq X \cup A$$

We use a Dynamic Bayesian Network (DBN) representation.

4) We cannot even store the transition probabilities

Background: Approximate Linear Programming

$$\min_{v} \sum_x \alpha(x)v(x)$$

1) Too many variables to minimize

2) Too many summation terms

Let

$$v(x) := \sum_{i=0}^k w_i\beta_i(x)$$

And let's denote

$$x[B_i] = b_i, \quad B_i \subseteq X$$

So

$$v(x) := \sum_{i=0}^k w_i\beta_i(b_i)$$

where \( \beta_i \) are some basis functions
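A minimal sketch of how such a factored linear value function is evaluated; the scopes, tables, and weights below are toy values of my own, not the paper's:

```python
# v(x) = sum_i w_i * beta_i(x[B_i]): each basis function sees only its scope B_i.
import numpy as np

scopes = [(0, 1), (2, 3)]                      # B_0, B_1 over 4 binary features
betas = [np.array([[0.0, 1.0], [2.0, 3.0]]),   # beta_0 as a table over x[B_0]
         np.array([[0.5, 0.0], [0.0, 0.5]])]   # beta_1 as a table over x[B_1]
w = np.array([1.0, -2.0])                      # the weights the LP optimizes

def v(x):
    """Evaluate the factored linear value function at a binary state x."""
    return sum(w_i * beta[tuple(x[j] for j in scope)]
               for w_i, beta, scope in zip(w, betas, scopes))

print(v(np.array([1, 0, 0, 1])))  # 1.0*beta_0(1,0) + (-2.0)*beta_1(0,1) = 2.0
```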

Background: Approximate Linear Programming

If we assume the initial distribution factorizes over the same scopes,

$$\alpha(x) := \prod_{i=0}^k \alpha(b_i)$$

we get a new LP task:

$$\min_w \sum_{i=0}^k \sum_{b_i} \alpha(b_i)w_i \beta_i(b_i)$$

Background: Approximate Linear Programming

PROOF:

$$\sum_x \alpha(x)v(x) = \sum_x \alpha(x)\sum_i w_i\beta_i(x[B_i]) = \sum_{b_0,\dots,b_k} \Big[\prod_{j=0}^k \alpha(b_j) \Big]\sum_i w_i\beta_i(b_i) =$$
$$= \sum_i \sum_{b_0,\dots,b_k} \Big[\prod_{j=0}^k \alpha(b_j) \Big] w_i\beta_i(b_i) = \sum_i \sum_{b_i} \alpha(b_i)\Big[\sum_{b_j:j\neq i} \prod_{j\neq i} \alpha(b_j) \Big] w_i\beta_i(b_i) = \sum_i \sum_{b_i} \alpha(b_i)w_i\beta_i(b_i)$$

The last step uses that each factor \( \alpha(b_j) \) is a distribution summing to one, so the bracketed term equals 1.
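The identity is easy to check numerically. A brute-force verification on the toy value function from the sketch above, assuming \( \alpha \) factorizes over the same disjoint scopes and each factor sums to one:

```python
# Check: sum_x alpha(x) v(x) == sum_i sum_{b_i} alpha_i(b_i) w_i beta_i(b_i).
# Reuses scopes, betas, w and v() from the previous sketch.
import itertools
import numpy as np

alphas = [np.array([[0.1, 0.2], [0.3, 0.4]]),     # alpha_0 over x[B_0], sums to 1
          np.array([[0.25, 0.25], [0.25, 0.25]])] # alpha_1 over x[B_1], sums to 1

lhs = 0.0
for bits in itertools.product([0, 1], repeat=4):
    x = np.array(bits)
    a_x = np.prod([alpha[tuple(x[j] for j in scope)]
                   for alpha, scope in zip(alphas, scopes)])
    lhs += a_x * v(x)

rhs = sum(w_i * np.sum(alpha * beta)              # sum_{b_i} alpha_i(b_i) beta_i(b_i)
          for w_i, alpha, beta in zip(w, alphas, betas))
print(np.isclose(lhs, rhs))  # True: the other factors marginalize to 1
```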

Background: ALP + Factored MDP

3) Too many terms in the expectations

The constraint of the LP problem

$$v(x) \geq Q^v(x, a) = r_x^a + \gamma \sum_{x'} p(x'|x,a)v(x')$$

may be rewritten as:

$$\sum_{i=0}^k w_i \big[\gamma g_i - \beta_i(b_i)\big] + r_x^a \leq 0$$
$$g_i = \sum_{b_i'}\beta_i(b_i')p(b_i'|par_{B_i}), \quad par_{B_i} = \bigcup_{j: X_j\in B_i} par_j$$

And we can decompose the rewards even further:

$$r_x^a = \sum_{j=0}^r \rho_j(x[R_j], a[R_j])$$

Background: ALP + Factored MDP

$$v(x) \geq r_x^a + \gamma \sum_{x'} p(x'|x,a)v(x')$$
$$\sum_{i=0}^k w_i \big[\gamma g_i(par_{B_i}) - \beta_i(b_i)\big] + r_x^a \leq 0, \quad g_i(par_{B_i}) = \sum_{b_i'}\beta_i(b_i')p(b_i'|par_{B_i})$$

PROOF:

$$\sum_i w_i\beta_i(b_i) \geq r_x^a + \gamma \sum_{b_0',\dots,b_k'} \prod_{j=0}^k p(b_j'|par_{B_j})\sum_i w_i \beta_i(b_i')$$
$$\sum_i w_i\beta_i(b_i) \geq r_x^a + \gamma \sum_i \Big[\sum_{b_j':j\neq i} \prod_{j\neq i} p(b_j'|par_{B_j})\Big] \sum_{b_i'} p(b_i'|par_{B_i}) w_i \beta_i(b_i')$$

Again, the bracketed term equals 1 because the transition factors are normalized.

Background: Constraint Generation

5) An exponential number of constraints

  • Solve the master LP for a subset of the constraints using GLOP; get the optimal \( w \) values
  • Find the maximally violated constraint (MVC) among those not yet added to the master LP (see the sketch after this list)
  • If its violation is positive, add it to the master LP; otherwise stop
  • Repeat

MVC search:

$$\max_{x, a} \sum_{i=0}^k w_i \big[\gamma g_i(par_{B_i}) - \beta_i(b_i)\big] + r_x^a$$

Use the black-box solver SCIP for Mixed Integer Programming. In our case \( x, a \) are Boolean vectors.
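A minimal sketch of this loop on the toy MDP from the first code sketch. The master LP is scipy's linprog instead of GLOP, and the MVC search is a brute-force enumeration instead of a SCIP MIP, which only works because the toy state space is tiny:

```python
# Constraint generation: start from the constraints of the fixed policy a=0
# (this keeps the master LP bounded), then repeatedly add the MVC.
import numpy as np
from scipy.optimize import linprog

def row_rhs(x, a):
    row = gamma * P[a, x].copy()
    row[x] -= 1.0
    return row, -R[x, a]

rows, rhs = zip(*(row_rhs(x, 0) for x in range(n_states)))
rows, rhs = list(rows), list(rhs)
while True:
    res = linprog(c=alpha, A_ub=np.array(rows), b_ub=np.array(rhs),
                  bounds=[(None, None)] * n_states)
    v = res.x
    # Brute-force MVC search: violation(x, a) = Q^v(x, a) - v(x).
    viol = R + gamma * np.einsum('axy,y->xa', P, v) - v[:, None]
    x_star, a_star = np.unravel_index(np.argmax(viol), viol.shape)
    if viol[x_star, a_star] <= 1e-8:
        break                         # no violated constraint remains
    r_new, b_new = row_rhs(x_star, a_star)
    rows.append(r_new); rhs.append(b_new)
print("constraint-generation solution:", v)
```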

Background: Logistic Regression


$$p(\phi = 1 | x, a) = \sigma (x^T u_x + a^T u_a), \quad \sigma(z) = \frac{1}{1 + \exp(-z)}$$

Need more? Try Google, it's free

Logistic Regression

(Diagram: a logistic response node \( \phi \) with state features \( X_1, \dots, X_n \) and action features \( A_1, \dots, A_m \) as inputs; below, the MDP transition diagram from timestep \( t \) to \( t+1 \).)

Logistic Markov Decision Processes

(DBN diagram: at timestep \( t \), the state features \( X_1, \dots, X_n \) and action features \( A_1, \dots, A_m \) determine the response \( \phi \), which together with them determines the state at \( t+1 \).)

$$p(x^{t+1}| x^{t}, a^t) = \mathbb{E}_{p(\phi|x^t, a^t)}\, p(x^{t+1}| x^{t}, a^t, \phi^t)$$

We allow the response \( \phi^t \) to influence the user's state at timestep \( t+1 \)

Factored Logistic MDP

Transition dynamics:

$$p(x^{t+1}|x^t, a^t) = \sum_{\phi\in\{0,1\}}\prod_i p(x^{t+1}_i|par_i,\phi)\, p(\phi|x^t,a^t)$$

Reward function:

$$r_x^a(\phi) = \sum_{j=0}^r \rho_j(x[R_j],a[R_j],\phi)$$

Hence, our backprojections \( g_i \) now depend on the complete \( x \) and \( a \) vectors.

Let's rewrite \( Q(x, a) \) as:

$$Q(x, a) = \mathbb{E}_{\phi} \big[ r_x^a(\phi) + \gamma\sum_i w_i g_i(par_{B_i}, \phi) \big]$$
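A minimal sketch of this Q-value as an expectation over the binary response \( \phi \). The parameters u_x, u_a play the role of the pretrained logistic regression; the reward terms and backprojections are collapsed into toy lookup values of my own:

```python
# Q(x, a) = E_phi[ r(x, a, phi) + gamma * sum_i w_i g_i(par_i, phi) ].
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def Q(x, a, u_x, u_a, r_phi, g_phi, w, gamma=0.9):
    p1 = sigmoid(x @ u_x + a @ u_a)         # p(phi = 1 | x, a)
    vals = {phi: r_phi[phi] + gamma * w @ g_phi[phi] for phi in (0, 1)}
    return (1.0 - p1) * vals[0] + p1 * vals[1]

x, a = np.array([1.0, 0.0]), np.array([0.0, 1.0])
u_x, u_a = np.array([0.3, -0.1]), np.array([0.2, 0.5])
r_phi = {0: 0.0, 1: 1.0}                    # e.g. reward 1 on a click (phi = 1)
g_phi = {0: np.array([0.1, 0.2]), 1: np.array([0.4, 0.0])}  # toy backprojections
w = np.array([1.0, 2.0])
print(Q(x, a, u_x, u_a, r_phi, g_phi, w))
```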

ALP for Logistic MDP

Let's denote:

$$h(x,a,\phi,w) = \sum_{j=0}^r \rho_j(x[R_j],a[R_j],\phi) + \sum_i w_i\big(\gamma g_i (par_{B_i},\phi)-\beta_i(b_i) \big)$$

Then the ALP task may be reformulated as:

$$\min_w \sum_i \sum_{b_i} w_i \alpha(b_i)\beta_i(b_i)$$

s.t.

$$0 \geq C(x,a,w) = \sum_{\phi \in \{0,1\}}p(\phi|x, a)\, h(x,a,\phi,w) \quad \forall x \in X, \forall a \in A$$

The constraints are now nonlinear, since \( p(\phi|x,a) \) is nonlinear; the MVC search is no longer a MIP problem.

ALP for Logistic MDP

$$\max_{x,a} \sum_{\phi \in \{0,1\}}p(\phi|x, a)\, h(x,a,\phi,w)$$

We will denote

$$p(\phi = 1|x,a) = \sigma(f(x,a))$$

and

$$[f_l, f_u] = \{ (x,a) : f_l \leq f(x,a) \leq f_u \}, \quad \sigma_l = \sigma(f_l), \quad \sigma_u = \sigma(f_u)$$

Constant Approximation

Fix some constant \( \sigma^* \) and consider

$$\max_{x,a}\; \sigma^* h(x,a,\phi=1) + (1 - \sigma^*) h(x,a,\phi=0)$$

We consider two subsets of the possible \( (x, a) \) pairs:

$$H^+ = \{ (x,a): h(x,a,\phi=1) - h(x,a,\phi=0) \geq 0\},$$

where the constraint value is non-decreasing in \( \sigma^* \), and

$$H^- = \{ (x,a): h(x,a,\phi=1) - h(x,a,\phi=0) < 0\},$$

where it is non-increasing in \( \sigma^* \).

We denote by \( U^u \) the solution of

$$\max_{x,a}\; \sigma_u h(x,a,\phi=1) + (1 - \sigma_u) h(x,a,\phi=0) \quad \text{s.t.} \quad (x,a)\in [f_l, f_u] \cap H^+$$

and by \( U^l \) the solution with \( \sigma_l \) and \( H^- \) in place of \( \sigma_u \) and \( H^+ \).

Constant Approximation

  • \( U^u \) is an upper bound on the maximal constraint violation (CV) over the subset \( (x,a) \in [f_l, f_u] \cap H^+ \)
  • The true CV at the maximizer, \( C^u = C(x^u, a^u, w) \), is a lower bound on the maximal CV over this subset

The same holds for \( U^l \) and \( C^l \) over the subset \( (x,a) \in [f_l,f_u] \cap H^- \).

Hence,

$$U^* = \max(U^l, U^u)$$

is an upper bound on the maximal CV in \( [f_l, f_u] \), and

$$C^* = \max(C^l, C^u)$$

is a lower bound on it.
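A brute-force sketch of these bounds for one interval, assuming the \( (x, a) \) space is small enough to enumerate (in the paper each inner maximization is a MIP). The callables f, h1, h0 stand for \( f(x,a) \), \( h(x,a,\phi=1,w) \), and \( h(x,a,\phi=0,w) \):

```python
# Bounds on the maximal constraint violation over the box [f_l, f_u].
import numpy as np

def sig(z):
    return 1.0 / (1.0 + np.exp(-z))

def bounds_on_interval(pairs, f, h1, h0, f_l, f_u):
    C = lambda xa: sig(f(xa)) * h1(xa) + (1.0 - sig(f(xa))) * h0(xa)  # true CV
    box = [xa for xa in pairs if f_l <= f(xa) <= f_u]
    Hp = [xa for xa in box if h1(xa) >= h0(xa)]   # CV non-decreasing in sigma
    Hm = [xa for xa in box if h1(xa) < h0(xa)]    # CV non-increasing in sigma
    s_l, s_u = sig(f_l), sig(f_u)
    xa_u = max(Hp, key=lambda xa: s_u * h1(xa) + (1 - s_u) * h0(xa), default=None)
    xa_l = max(Hm, key=lambda xa: s_l * h1(xa) + (1 - s_l) * h0(xa), default=None)
    U_u = -np.inf if xa_u is None else s_u * h1(xa_u) + (1 - s_u) * h0(xa_u)
    U_l = -np.inf if xa_l is None else s_l * h1(xa_l) + (1 - s_l) * h0(xa_l)
    cands = [xa for xa in (xa_u, xa_l) if xa is not None]
    C_star = max((C(xa) for xa in cands), default=-np.inf)  # lower bound
    return max(U_u, U_l), C_star                            # (U*, C*)
```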

Constant Approximation

(Figure: the CV of two state-action pairs, \( C(x^{(1)}, a^{(1)}, \sigma) \) and \( C(x^{(2)}, a^{(2)}, \sigma) \), plotted against \( \sigma \) over \( [\sigma_l, \sigma_u] \), with the upper bound \( U^u \) marked.)

The degree of CV for two state-action pairs as a function of \( \sigma \) 

MVC search in ALP-SEARCH

1) Solve two MIP tasks for some interval \( [f_l, f_u] \):

$$\max_{x,a}\; \sigma_u h(x,a,\phi=1) + (1 - \sigma_u) h(x,a,\phi=0) \quad \text{s.t.} \quad (x,a)\in [f_l, f_u] \cap H^+$$

$$\max_{x,a}\; \sigma_l h(x,a,\phi=1) + (1 - \sigma_l) h(x,a,\phi=0) \quad \text{s.t.} \quad (x,a)\in [f_l, f_u] \cap H^-$$

2) If \( U^* < \epsilon \), there are no constraint violations in \( [f_l, f_u] \) and we terminate

3) If \( U^* - C^* < \epsilon \), we report \( C^* \) as the maximal CV in \( [f_l, f_u] \) and terminate

4) If the lower bound \( C' \) of another interval exceeds \( U^* \), this interval cannot contain the MVC and we discard it

If none of the above holds, we split the interval in two and recurse (a sketch follows).
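A recursive sketch of this refinement, built on the bounds_on_interval helper from the earlier sketch; the cross-interval pruning of step 4) is omitted for brevity:

```python
# ALP-SEARCH style refinement: tighten [f_l, f_u] until U* and C* agree.
def mvc_search(pairs, f, h1, h0, f_l, f_u, eps=1e-3):
    U, C = bounds_on_interval(pairs, f, h1, h0, f_l, f_u)
    if U < eps:
        return None                    # no violated constraint in this interval
    if U - C < eps or f_u - f_l < 1e-9:
        return C                       # bounds are tight: C is (near-)maximal
    mid = 0.5 * (f_l + f_u)            # otherwise split and recurse
    halves = [mvc_search(pairs, f, h1, h0, f_l, mid, eps),
              mvc_search(pairs, f, h1, h0, mid, f_u, eps)]
    return max((c for c in halves if c is not None), default=None)
```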

Piece-Wise Constant Approximation

(Figure: a piece-wise constant approximation of the sigmoid.)

MVC search in ALP-APPROX

For each interval \( [\delta_{i-1}, \delta_i] \), solve

$$\max_{x,a}\; \sigma_i h(x,a,\phi=1) + (1 - \sigma_i) h(x,a,\phi=0) \quad \text{s.t.} \quad (x,a) \in [\delta_{i-1}, \delta_i]$$

where

$$\sigma_i = \sigma(f_i), \quad \delta_{i-1} \leq f_i \leq \delta_i$$

Then we calculate

$$C^i = C(x^i, a^i, \sigma(f(x^i,a^i)))$$

the true CV at the maximizer (not necessarily the maximal one in \( [\delta_{i-1}, \delta_i] \)), and

$$C^* = \max_i C^i$$

an estimate of the maximal CV in \( [f_l, f_u] \).

Approximation error in ALP-APPROX

THEOREM

A bounded log-relative error for the logistic regression (assuming features with finite domains) can be achieved with \( O(\frac{1}{\epsilon} \|u\|_1) \) intervals in logit space, where \( u \) is the logistic regression weight vector

THEOREM

Given an interval \( [a, b] \) in logit space, the value \(\sigma(x)\) with $$ x = \ln \frac{e^{a+b} + e^b}{1 + e^b} $$ minimizes the log-relative error over the interval.
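A small sketch of this representative point; for a symmetric interval such as \( [-1, 1] \) it returns the midpoint 0:

```python
# Representative logit for [a, b] minimizing log-relative error (theorem above).
import numpy as np

def best_logit(a, b):
    return np.log((np.exp(a + b) + np.exp(b)) / (1.0 + np.exp(b)))

a, b = -1.0, 1.0
x = best_logit(a, b)
print(a <= x <= b, 1.0 / (1.0 + np.exp(-x)))  # True, sigma = 0.5 for [-1, 1]
```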

Experiments


  • Advertising task
  • Reward: 1 for click, 0 otherwise
  • The aim is to maximize Cumulative Click-Through Rate (CCTR)
  • Features are one-hot encoded
  • Features are divided into three categories:
    • User state (static or dynamic) - state variable
    • Ad description - action variable
    • User-Ad interaction - action variable
  • Transition dynamics are simple: either the identity function or a Bernoulli distribution over moving to the next bucket of a feature's domain
  • Logistic regression pretrained on 300M examples

Experiments


Model sizes:

  • Tiny:
    • 2 state features (48 binarized)
    • 1 action feature (7 binarized)
  • Small:
    • 6 state features (71 binarized)
    • 4 action features (15 binarized)
  • Medium:
    • 11 state features (251 binarized)
    • 8 action features (170 binarized)
  • Large:
    • 12 state features (2630 binarized)
    • 11 action features (224 binarized)


Extensions


  • Relaxation of the CG optimization
  • Cross-product features
  • Multiple response variables
  • Partition-free CG
  • Non-linear response model

Partition-free CG

$$\max_{x,a}\; \sigma(f(x,a))\, h(x,a,\phi=1) + \sigma(-f(x,a))\, h(x,a,\phi=0)$$

which is equivalent to

$$\max_{x,a,y}\; \sigma(y)\, h(x,a,\phi=1) + \sigma(-y)\, h(x,a,\phi=0) \quad \text{s.t.} \quad y=f(x,a)$$

The simple idea is to iteratively alternate between two steps:

  • maximize over \( x, a \) using a MIP solver
  • set \( y = f(x,a) \)

But this will almost surely get stuck in local optima

Partition-free CG

Another approach is to consider a Lagrangian relaxation:

$$\min_{\lambda}\max_{x,a,y}\; \sigma(y)\, h(x,a,\phi=1) + \sigma(-y)\, h(x,a,\phi=0) - \lambda f(x,a) + \lambda y$$

Primal-dual alternating optimization:

Initialize \( \lambda, x, a, y \)

for \( t = 1,\dots,T \) do:

1) \( y^{(t+1)} = y^{(t)} + \eta_t \nabla^{(t)}_y \)

2) \( (x,a)^{(t+1)} = \arg\max_{x,a}\; \sigma(y^{(t+1)})\, h(x,a,\phi=1) + \sigma(-y^{(t+1)})\, h(x,a,\phi=0) \)

3) \( \lambda^{(t+1)} = \lambda^{(t)} - \hat{\eta}_t \big[ y^{(t+1)} - f((x,a)^{(t+1)}) \big] \)

end for
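A minimal sketch of this loop. The inner \( \arg\max \) is a MIP in the paper; here it is a brute-force maximum over an enumerated pair set, and it includes the \( -\lambda f(x,a) \) term of the Lagrangian. Step sizes are illustrative:

```python
# Primal-dual alternation for the Lagrangian relaxation of y = f(x, a).
import numpy as np

def sig(z):
    return 1.0 / (1.0 + np.exp(-z))

def primal_dual(pairs, f, h1, h0, T=100, eta=0.1, eta_hat=0.05):
    lam, y, xa = 0.0, 0.0, pairs[0]
    for _ in range(T):
        # 1) gradient ascent on y of sig(y)*h1 + sig(-y)*h0 + lam*y
        grad_y = sig(y) * (1.0 - sig(y)) * (h1(xa) - h0(xa)) + lam
        y += eta * grad_y
        # 2) maximize over (x, a) with y fixed (a MIP in the paper)
        xa = max(pairs, key=lambda p: sig(y) * h1(p) + sig(-y) * h0(p)
                                      - lam * f(p))
        # 3) dual step on the multiplier of the constraint y = f(x, a)
        lam -= eta_hat * (y - f(xa))
    return xa, y, lam
```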

Non-linear Response Model


We consider a wide-and-deep response model:

  • some features from \( x \) and \( a \) are used directly as inputs to the final logistic output unit
  • other features are passed through a DNN with several layers of non-linear units

If the DNN non-linearity can be expressed so that the input to the final logistic output is a linear-like function of \( (x, a) \), then the CG optimization is the same as for the logistic regression response model.

 

A ReLU non-linearity can be expressed this way using just one or two indicator functions per unit
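For reference, one standard way to do this (a textbook big-M encoding, my assumption rather than necessarily the paper's exact construction) introduces a binary indicator \( \delta \) per unit, where \( M \) bounds \( |w^\top x + b| \):

$$z \geq w^\top x + b, \qquad z \geq 0, \qquad z \leq w^\top x + b + M(1-\delta), \qquad z \leq M\delta, \qquad \delta \in \{0,1\}$$

With \( \delta = 1 \) these force \( z = w^\top x + b \geq 0 \); with \( \delta = 0 \) they force \( z = 0 \) and \( w^\top x + b \leq 0 \), so \( z = \max(0, w^\top x + b) \) exactly.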

Thanks for your attention!
