Prof. Sarah Dean, Assistant Professor in CS at Cornell
[Diagram: online prediction loop. A model $f_t : \mathcal{X} \to \mathcal{Y}$ maps each observation $x_t$ to a prediction $\hat y_t$, and the learner accumulates data $\{(x_t, y_t)\}$ to update the model.]
[Diagram: online decision loop. A policy $\pi_t : \mathcal{X} \to \mathcal{A}$ maps each observation $x_t$ to an action $a_t$, and the learner accumulates data $\{(x_t, a_t, r_t)\}$ to update the policy.]
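A minimal Python sketch of the decision loop above; the `env` interface (with `observe` and `step`) and the `policy` callable are illustrative assumptions, not part of the lecture.

```python
def interaction_loop(env, policy, T):
    """Generic online decision protocol: observe x_t, act a_t = pi_t(x_t), receive r_t, accumulate data."""
    data = []                            # accumulated {(x_t, a_t, r_t)}
    for t in range(T):
        x_t = env.observe()              # observation (context)
        a_t = policy(x_t, data)          # action from the current policy pi_t
        r_t = env.step(a_t)              # reward feedback
        data.append((x_t, a_t, r_t))     # accumulate data for updating the policy
    return data
```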
Linear Contextual Bandits
Related Goals:
How much should I trust my predicted reward $\hat\theta^\top\varphi$ if observed rewards are corrupted by Gaussian noise?
$\hat\theta_t = V_t^{-1}\sum_{k=1}^{t}\varphi_k r_k, \qquad V_t = \sum_{k=1}^{t}\varphi_k\varphi_k^\top$
The prediction error $(\hat\theta_t - \theta_\star)^\top\varphi = \sum_{k=1}^{t}\varepsilon_k (V_t^{-1}\varphi_k)^\top\varphi \sim \mathcal{N}\bigl(0,\; \sigma^2\|\varphi\|_{V_t^{-1}}^2\bigr)$, where $\|\varphi\|_{V_t^{-1}}^2 = \varphi^\top V_t^{-1}\varphi$
With probability $1-\delta$, we have $|(\hat\theta_t - \theta_\star)^\top\varphi| \le \sigma\sqrt{2\log(2/\delta)}\,\|\varphi\|_{V_t^{-1}}$
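A small numpy sketch of this estimate and the resulting confidence width; `Phi` stacks the features $\varphi_k$ as rows, and invertibility of $V_t$ is assumed.

```python
import numpy as np

def least_squares(Phi, r):
    """theta_hat = V^{-1} sum_k phi_k r_k with V = sum_k phi_k phi_k^T (Phi has phi_k as rows)."""
    V = Phi.T @ Phi
    theta_hat = np.linalg.solve(V, Phi.T @ r)
    return theta_hat, V

def prediction_confidence_width(phi, V, sigma, delta):
    """With probability 1 - delta, |(theta_hat - theta_star)^T phi| is at most this value."""
    phi_norm = np.sqrt(phi @ np.linalg.solve(V, phi))   # ||phi||_{V^{-1}}
    return sigma * np.sqrt(2 * np.log(2 / delta)) * phi_norm
```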
Correction: last lecture, expressions like $\|Mx\|^2$ should have instead been $x^\top M x = \|M^{1/2}x\|^2 =: \|x\|_M^2$
For symmetric matrices with non-negative eigenvalues (i.e. "positive semi-definite matrices"), we know that $M = V\Lambda V^\top$
So the "square root" of a PSD matrix is $M^{1/2} = \Lambda^{1/2} V^\top$
For a diagonal matrix, $D^{1/2} = \mathrm{diag}(\sqrt{\lambda_1},\ldots,\sqrt{\lambda_n})$
where the notation $M^{1/2}$ means a matrix such that $(M^{1/2})^\top M^{1/2} = M$
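A quick numerical check of this square-root construction (the matrix and vector are illustrative values):

```python
import numpy as np

def psd_sqrt(M):
    """M^{1/2} = Lambda^{1/2} V^T from the eigendecomposition M = V Lambda V^T of a PSD matrix."""
    lam, V = np.linalg.eigh(M)
    return np.diag(np.sqrt(np.clip(lam, 0.0, None))) @ V.T

M = np.array([[2.0, 1.0], [1.0, 2.0]])   # an example PSD matrix
x = np.array([1.0, -3.0])
M_half = psd_sqrt(M)
assert np.allclose(M_half.T @ M_half, M)                        # (M^{1/2})^T M^{1/2} = M
assert np.isclose(x @ M @ x, np.linalg.norm(M_half @ x) ** 2)   # x^T M x = ||M^{1/2} x||^2
```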
How much should I trust the estimate $\hat\theta$?
Define the confidence ellipsoid $C_t = \{\theta \in \mathbb{R}^d \mid \|\theta - \hat\theta_t\|_{V_t}^2 \le \beta_t\}$
For the right choice of $\beta_t$, it is possible to guarantee $\theta_\star \in C_t$ with high probability
Exercise: For a fixed action $\varphi$, show that $\max_{\theta\in C_t}\theta^\top\varphi \le \hat\theta^\top\varphi + \sqrt{\beta_t}\,\|\varphi\|_{V_t^{-1}}$
and $\min_{\theta\in C_t}\theta^\top\varphi \ge \hat\theta^\top\varphi - \sqrt{\beta_t}\,\|\varphi\|_{V_t^{-1}}$
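One way to sanity-check the exercise numerically: sample points on the boundary of $C_t$ and compare against the closed-form maximum (all numerical values are illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)
d, beta = 3, 2.0
A = rng.standard_normal((d, d))
V = A @ A.T + np.eye(d)                      # a positive definite V_t
theta_hat = rng.standard_normal(d)
phi = rng.standard_normal(d)

V_inv = np.linalg.inv(V)
closed_form = theta_hat @ phi + np.sqrt(beta) * np.sqrt(phi @ V_inv @ phi)

# theta = theta_hat + sqrt(beta) * V^{-1/2} u with ||u|| = 1 lies on the boundary of C_t
lam, Q = np.linalg.eigh(V)
V_inv_half = Q @ np.diag(1.0 / np.sqrt(lam)) @ Q.T   # symmetric inverse square root
u = rng.standard_normal((100_000, d))
u /= np.linalg.norm(u, axis=1, keepdims=True)
sampled_max = ((theta_hat + np.sqrt(beta) * u @ V_inv_half) @ phi).max()

print(closed_form, sampled_max)   # sampled max approaches, but never exceeds, the closed form
```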
example: $K=2$ arms, where we've pulled the arms 2 and 1 times respectively, i.e. $\{\varphi_k\} = \{[1,0]^\top, [1,0]^\top, [0,1]^\top\}$
$V_t = \begin{bmatrix} 2 & 0 \\ 0 & 1 \end{bmatrix}$
Point estimate $(\hat\mu_1, \hat\mu_2)$
Confidence set: $2(\mu_1 - \hat\mu_1)^2 + (\mu_2 - \hat\mu_2)^2 \le \beta_t$
Pulling arm 1: $\hat\mu_1 \pm \sqrt{\beta_t/2}$
Pulling arm 2: $\hat\mu_2 \pm \sqrt{\beta_t}$
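A quick numeric check of these widths, with an illustrative value of $\beta_t$:

```python
import numpy as np

beta = 1.5                                   # illustrative beta_t
V = np.diag([2.0, 1.0])                      # arm 1 pulled twice, arm 2 once
V_inv = np.linalg.inv(V)
e1, e2 = np.eye(2)                           # feature vectors of the two arms
width_1 = np.sqrt(beta) * np.sqrt(e1 @ V_inv @ e1)
width_2 = np.sqrt(beta) * np.sqrt(e2 @ V_inv @ e2)
print(width_1, np.sqrt(beta / 2))            # arm 1: +/- sqrt(beta_t / 2)
print(width_2, np.sqrt(beta))                # arm 2: +/- sqrt(beta_t)
```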
example: $d=2$ linear bandits with features $\{\varphi_k\} = \left\{\begin{bmatrix}1\\1\end{bmatrix}, \begin{bmatrix}-1\\1\end{bmatrix}, \begin{bmatrix}-1\\-1\end{bmatrix}\right\}$
$V_t = \begin{bmatrix} 3 & 1 \\ 1 & 3 \end{bmatrix}, \qquad V_t^{-1} = \frac{1}{8}\begin{bmatrix} 3 & -1 \\ -1 & 3 \end{bmatrix}$
Point estimate $\hat\theta$
Trying action $a = [0, 1]^\top$:
Exercise: For fixed $\varphi$, show that the best/worst case elements of $C_t$ are given by $\theta = \hat\theta_t \pm \frac{\sqrt{\beta_t}}{\|\varphi\|_{V_t^{-1}}} V_t^{-1}\varphi$
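For this example, the exercise can be checked directly; the values of $\hat\theta$ and $\beta_t$ below are illustrative.

```python
import numpy as np

Phi = np.array([[1.0, 1.0], [-1.0, 1.0], [-1.0, -1.0]])   # the three feature vectors
V = Phi.T @ Phi                                            # [[3, 1], [1, 3]]
V_inv = np.linalg.inv(V)                                   # (1/8) [[3, -1], [-1, 3]]

beta = 1.0
theta_hat = np.array([0.5, 0.5])                           # illustrative point estimate
a = np.array([0.0, 1.0])                                   # the action being tried
a_norm = np.sqrt(a @ V_inv @ a)                            # ||a||_{V^{-1}}
theta_best = theta_hat + np.sqrt(beta) / a_norm * (V_inv @ a)
theta_worst = theta_hat - np.sqrt(beta) / a_norm * (V_inv @ a)

# The extremes match the max/min from the previous exercise:
print(theta_best @ a, theta_hat @ a + np.sqrt(beta) * a_norm)
print(theta_worst @ a, theta_hat @ a - np.sqrt(beta) * a_norm)
```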
To handle cases where $\{\varphi_k\}$ are not full rank, we consider regularized least squares
$\hat\theta_t = \arg\min_{\theta} \sum_{k=1}^{t} (\theta^\top\varphi_k - r_k)^2 + \lambda\|\theta\|_2^2$
Now we have
$\hat\theta_t = V_t^{-1}\sum_{k=1}^{t}\varphi_k r_k, \qquad V_t = \lambda I + \sum_{k=1}^{t}\varphi_k\varphi_k^\top$
The confidence ellipsoid takes the same form $C_t = \{\theta\in\mathbb{R}^d \mid \|\theta - \hat\theta_t\|_{V_t}^2 \le \beta_t\}$
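A sketch of the regularized estimator in numpy; the function name and interface are illustrative.

```python
import numpy as np

def ridge_estimate(Phi, r, lam):
    """Regularized least squares: theta_hat = (lam*I + Phi^T Phi)^{-1} Phi^T r.
    V_t is invertible for any lam > 0, even if the features {phi_k} are not full rank."""
    d = Phi.shape[1]
    V = lam * np.eye(d) + Phi.T @ Phi
    theta_hat = np.linalg.solve(V, Phi.T @ r)
    return theta_hat, V
```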
How does estimation error affect suboptimality?
The suboptimality is $\theta_\star^\top\varphi_t^\star - \theta_\star^\top\hat\varphi_t$, where $\varphi_t^\star = \arg\max_{\varphi\in\mathcal{A}_t}\theta_\star^\top\varphi$ is the optimal action and $\hat\varphi_t = \arg\max_{\varphi\in\mathcal{A}_t}\hat\theta_t^\top\varphi$ is the greedy action under the estimate
This perspective motivates techniques in experiment design
When $\{\varphi_k\}_{k=1}^{N}$ are chosen at random from a "nice" distribution,
$\|V^{-1}\| = \frac{1}{\lambda_{\min}(V)} \lesssim \frac{d}{N}$ with high probability.
Informal Theorem: Let the norm of all $\varphi\in\mathcal{A}_t$ be bounded by $B$ for all $t$. Suppose that $\{\varphi_k\}_{k=1}^{N}$ are chosen at random from a "nice" distribution. Then with high probability, the suboptimality is at most $2B\sqrt{\beta_N d / N}$.
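A sketch of the argument, assuming $\theta_\star \in C_N$ and writing $\hat\varphi$ for the greedy action under the estimate:

$$\begin{aligned}
\theta_\star^\top\varphi^\star - \theta_\star^\top\hat\varphi
&= (\theta_\star - \hat\theta)^\top\varphi^\star
  + \underbrace{\hat\theta^\top\varphi^\star - \hat\theta^\top\hat\varphi}_{\le\, 0 \text{ by greediness}}
  + (\hat\theta - \theta_\star)^\top\hat\varphi \\
&\le \sqrt{\beta_N}\bigl(\|\varphi^\star\|_{V_N^{-1}} + \|\hat\varphi\|_{V_N^{-1}}\bigr)
 \le 2\sqrt{\beta_N}\, B\sqrt{\|V_N^{-1}\|}
 \lesssim 2B\sqrt{\beta_N d / N}.
\end{aligned}$$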
For a fixed interaction horizon $T$, how should we trade off exploration and exploitation?
Design algorithms with low regret $R(T) = \sum_{t=1}^{T} \max_{a\in\mathcal{A}} r(x_t, a) - r(x_t, a_t)$
In the linear setting, $R(T) = \sum_{t=1}^{T} \max_{\varphi\in\mathcal{A}_t} \theta_\star^\top\varphi - \theta_\star^\top\varphi_t$
ETC (Explore-then-Commit)
The regret has two components:
$R(T) = \underbrace{\sum_{t=1}^{N} \max_{\varphi\in\mathcal{A}_t}\theta_\star^\top\varphi - \theta_\star^\top\varphi_t}_{R_1} + \underbrace{\sum_{t=N+1}^{T} \max_{\varphi\in\mathcal{A}_t}\theta_\star^\top\varphi - \theta_\star^\top\varphi_t}_{R_2}$
Suppose that $B$ bounds the norm of $\varphi$ and $\|\theta_\star\| \le 1$.
Then we have $R_1 \le 2BN$
Using the sub-optimality result, with high probability $R_2 \lesssim (T-N)\, 2B\sqrt{\beta_N d / N}$
Suppose that $\max_t \beta_t \le \beta$.
The regret is bounded with high probability by
$R(T) \lesssim 2BN + 2BT\sqrt{\beta d / N}$
Choosing $N = T^{2/3}$ leads to sublinear regret, on the order of $T^{2/3}$
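A minimal explore-then-commit sketch in numpy; the environment (a fixed finite action set with Gaussian reward noise) and all parameter choices are illustrative assumptions.

```python
import numpy as np

def explore_then_commit(actions, theta_star, T, N, lam=1.0, sigma=0.1, seed=0):
    """Explore with uniformly random actions for N rounds, estimate theta via ridge
    regression, then commit to the greedy action for the remaining T - N rounds.
    Returns the cumulative pseudo-regret."""
    rng = np.random.default_rng(seed)
    d = actions.shape[1]
    V, b = lam * np.eye(d), np.zeros(d)
    best = np.max(actions @ theta_star)          # value of the optimal action

    # Exploration phase: random actions, accumulate V and b
    regret = 0.0
    for _ in range(N):
        phi = actions[rng.integers(len(actions))]
        r = phi @ theta_star + sigma * rng.standard_normal()
        V += np.outer(phi, phi)
        b += phi * r
        regret += best - phi @ theta_star

    # Commit phase: play the greedy action under the estimate
    theta_hat = np.linalg.solve(V, b)
    phi = actions[np.argmax(actions @ theta_hat)]
    regret += (T - N) * (best - phi @ theta_star)
    return regret

# Example: d = 2 actions on the unit circle, N = T^{2/3}
angles = np.linspace(0, 2 * np.pi, 20, endpoint=False)
actions = np.column_stack([np.cos(angles), np.sin(angles)])
T = 10_000
print(explore_then_commit(actions, np.array([1.0, 0.5]), T, N=int(T ** (2 / 3))))
```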
Adaptive perspective: optimism in the face of uncertainty
Instead of exploring randomly, focus exploration on promising actions
UCB (Upper Confidence Bound)
The suboptimality at each step is $\theta_\star^\top\varphi_t^\star - \theta_\star^\top\varphi_t$, where $\varphi_t$ is the chosen action
Proof Sketch: $R(T) = \sum_{t=1}^{T} \theta_\star^\top\varphi_t^\star - \theta_\star^\top\varphi_t$; by optimism, each term is at most $2\sqrt{\beta_t}\,\|\varphi_t\|_{V_{t-1}^{-1}}$, and summing these terms bounds the total
The regret is bounded with high probability by
$R(T) \lesssim \sqrt{T}$ (up to dimension and logarithmic factors)
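A matching LinUCB-style sketch under the same illustrative setup, with $\beta$ treated as a fixed constant for simplicity (in the analysis $\beta_t$ grows logarithmically).

```python
import numpy as np

def lin_ucb(actions, theta_star, T, beta=2.0, lam=1.0, sigma=0.1, seed=0):
    """Optimism in the face of uncertainty: play the action maximizing the upper
    confidence bound theta_hat^T phi + sqrt(beta) * ||phi||_{V^{-1}}."""
    rng = np.random.default_rng(seed)
    d = actions.shape[1]
    V, b = lam * np.eye(d), np.zeros(d)
    best = np.max(actions @ theta_star)
    regret = 0.0
    for t in range(T):
        V_inv = np.linalg.inv(V)
        theta_hat = V_inv @ b
        widths = np.sqrt(np.einsum("ij,jk,ik->i", actions, V_inv, actions))  # ||phi||_{V^{-1}} per action
        phi = actions[np.argmax(actions @ theta_hat + np.sqrt(beta) * widths)]
        r = phi @ theta_star + sigma * rng.standard_normal()
        V += np.outer(phi, phi)
        b += phi * r
        regret += best - phi @ theta_star
    return regret
```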
After fall break: action in a dynamical world (optimal control)
Reference: Chapters 19-20 in Bandit Algorithms by Lattimore & Szepesvári