Presented by:
Prabhas Reddy Onteru
Guided by Prof. Ambedkar Dukkipati
June 16, 2025
Background
Problem Statement
Methodology
Experiments
Results
Conclusion
An MDP is defined by the tuple \(\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{T}, r, \rho, \gamma)\), where \(\mathcal{S}\) is the state space, \(\mathcal{A}\) the action space, \(\mathcal{T}\) the transition dynamics, \(r\) the reward function, \(\rho\) the initial state distribution, and \(\gamma\) the discount factor.
In practice, learned policies may be sub-optimal, and agents are often allowed limited, cost-constrained interactions with the environment to collect additional trajectories.
For instance, in autonomous vehicles, the budget may correspond to fuel consumption, while for robotic agents, it may involve constraints such as time and effort.
\[\log(\sigma(v \cdot v^+)) + \log(1 - \sigma(v \cdot v^-)) - \lambda \| \hat{v}^+ - v^+ \|^2,\]
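As a rough illustration, a minimal PyTorch-style sketch of this objective is given below; the names v, v_pos, v_neg, v_pos_hat, and lam are hypothetical stand-ins for the anchor, positive, negative, and predicted-positive embeddings and the weight \(\lambda\), not the exact variables of the actual implementation.

import torch
import torch.nn.functional as F

def contrastive_objective(v, v_pos, v_neg, v_pos_hat, lam=1.0):
    # log sigma(v . v+): pulls the anchor toward the positive embedding
    pos_term = F.logsigmoid((v * v_pos).sum(dim=-1))
    # log(1 - sigma(v . v-)): pushes the anchor away from the negative embedding
    neg_term = torch.log1p(-torch.sigmoid((v * v_neg).sum(dim=-1)))
    # lambda * ||v+_hat - v+||^2: penalizes prediction error on the positive embedding
    reg = lam * (v_pos_hat - v_pos).pow(2).sum(dim=-1)
    # The slide states a score to be maximized; negate it to use it as a training loss.
    return (pos_term + neg_term - reg).mean()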
\[\text{Uncertainty}(s_i) = \max_{k, k'} \left\| E^k_s(s_i) - E^{k'}_s(s_i) \right\|_2.\]
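A minimal sketch of this disagreement score, assuming an ensemble of state encoders (the list encoders below is a hypothetical stand-in for the \(E^k_s\)):

import itertools
import torch

def uncertainty(state, encoders):
    # Embed the state with every ensemble member E^k_s.
    embeddings = [enc(state) for enc in encoders]
    # Uncertainty(s_i) = maximum pairwise L2 distance between ensemble embeddings.
    return max(
        torch.linalg.vector_norm(e1 - e2).item()
        for e1, e2 in itertools.combinations(embeddings, 2)
    )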
Figure: (a) Offline dataset, (b) Uncertainty map, (c) Baseline, (d) Ours.
The agent is a legged quadrupedal robot with 4 × 3 DOF.
3. This leads to the following optimization problem:
\[\max_{C_{\text{sub}} \subseteq C} \sum_{s_i \in C_{\text{sub}}} U_i \quad \text{s.t.} \quad \text{Cost}(C_{\text{sub}}) \leq B\]
4. This formulation is analogous to the classical 0/1 knapsack problem.
5. This optimization problem can be solved with dynamic programming, yielding the selected subset of candidate states \( C_{\text{sub}} \) (see the sketch below).
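A minimal sketch of this selection step, assuming integer per-state collection costs cost[i], uncertainty scores U[i], and a total budget B (all names here are illustrative):

def select_states(U, cost, B):
    # Classical 0/1 knapsack via dynamic programming:
    # dp[b] = best total uncertainty achievable with budget b.
    n = len(U)
    dp = [0.0] * (B + 1)
    keep = [[False] * (B + 1) for _ in range(n)]
    for i in range(n):
        for b in range(B, cost[i] - 1, -1):
            if dp[b - cost[i]] + U[i] > dp[b]:
                dp[b] = dp[b - cost[i]] + U[i]
                keep[i][b] = True
    # Backtrack to recover the indices of the selected states C_sub.
    C_sub, b = [], B
    for i in range(n - 1, -1, -1):
        if keep[i][b]:
            C_sub.append(i)
            b -= cost[i]
    return C_sub, dp[B]

The dynamic program runs in O(|C| · B) time, i.e., pseudo-polynomial in the budget.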
Figure: (a) Offline data, (b) Random collection, (c) Greedy collection, (d) Ours.
Figure: results on maze2d-medium, maze2d-hard, antmaze-play, and hopper-medium.
\[E'_{\text{Cost}}(s, s_i) = E_{\text{Cost}}(s, s_i) + \frac{\alpha}{E_{\text{Cost}}(s_i, c_k) + \varepsilon}\]
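A minimal sketch of this adjustment, assuming E_cost(a, b) estimates the traversal cost between two states and c_k denotes the candidate currently targeted (function and variable names are illustrative):

def adjusted_cost(E_cost, s, s_i, c_k, alpha=1.0, eps=1e-6):
    # Base cost of reaching s_i from the current state s, plus a term that
    # grows as the cost between s_i and c_k shrinks (the alpha / (. + eps) bonus).
    return E_cost(s, s_i) + alpha / (E_cost(s_i, c_k) + eps)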
Figure: Baseline vs. Ours on antmaze-medium and maze2d-hard.
We introduced a novel data acquisition framework for offline RL where the objective is to collect informative trajectories while effectively utilizing the budget provided to the agent in the environment.
Unlike traditional methods, our approach jointly solves selection and planning, enabling efficient trajectory collection in an optimal sequence.
Empirical results show consistent policy improvement across environments, demonstrating effectiveness even under non-uniform cost settings.