Q-RAG: Long Context Multi‑Step Retrieval via Value‑Based Embedder Training

Artyom Sorokin\(^{1,2}\) Nazar Buzun\(^{1,3}\) Alexander Anokhin\(^{2}\) Egor Vedernikov\(^{2}\) Petr Anokhin\(^{1}\) Mikhail Burtsev\(^{4}\) Evgeny Burnaev\(^{1,2}\)

\(^1\)AXXX, Moscow, Russia

\(^2\)Applied AI Institute, Moscow, Russia

\(^3\)Innopolis University,  Innopolis, Russia
\(^4\)London Institute for Mathematical Sciences, London, UK


Motivation

Long contexts are still challenging for Large Language Models.

Common approaches to the long-context challenge:

  • SSMs, Transformers with recurrence, linear attention
    • Cannot be combined with the best existing LLMs
  • Agents that build knowledge graphs
    • Slow inference for long-context processing
  • Multi-step RAG and RAG agents
    • Require expensive LLM fine-tuning

A popular direction for multi-step RAG is to fine-tune an LLM to use retrieval as a tool.

But fine-tuning LLMs can be expensive :(

Q-RAG: Main Idea

Main Idea:

  • Train the embedder instead of the LLM for multi-step search tasks
  • Search queries are generated directly in embedding space
  • Formulate multi-step RAG as an MDP

 

Q-RAG: Retrieval as an RL Problem

Multi-step retrieval as an RL problem:

  • State: query + retrieved information
  • Actions: chunks available for retrieval
  • Reward: 1.0 if all required chunks are found
  • Termination: exhausting the retrieval budget

 

[Figure: a retrieval episode unrolled over timesteps 0 and 1]
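A minimal sketch of this MDP as a rollout loop. Here `select_chunk` stands in for the learned policy defined in the next section, and the 5-step budget is illustrative, not a value from the paper:

```python
def retrieval_episode(query, chunks, required, budget=5):
    """One episode of multi-step retrieval framed as an MDP."""
    retrieved = []
    for _ in range(budget):                        # terminate when the budget is exhausted
        state = (query, tuple(retrieved))          # state: query + retrieved chunks
        candidates = [i for i in range(len(chunks)) if i not in retrieved]
        action = select_chunk(state, candidates)   # action: pick one candidate chunk
        retrieved.append(action)
        if required.issubset(retrieved):           # reward 1.0 iff all required chunks found
            return retrieved, 1.0
    return retrieved, 0.0
```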

Q-RAG: Training

We train embedders to approximate the \(Q\)-function as the inner product between a state embedding and a chunk embedding:

\begin{equation*} \langle E_s(s; \theta_1),\, E_a(a^i, i; \theta_2) \rangle = Q_\theta(s, a^i) \approx Q^{*}(s, a^i) \end{equation*}

NCE training: \(\langle s , a \rangle \rightarrow\) semantic similarity
Q-value approximation: \(\langle s , a \rangle \rightarrow\) usefulness of \(a\) given query \(s = \mathrm{prompt}(q, a_{t-1}, \dots, a_{0})\)

Maximum-entropy value functions encourage exploration:

 

\begin{align*} Q^{\pi}(s,a)&= r(s,a) + \gamma V^{\pi}(s'=p(s,a)) \\ V^{\pi}(s) &= \mathbb{E}_{a \sim \pi(\cdot|s)} \left[ Q^{\pi}(s,a) - \alpha \log \pi(a|s) \right] \end{align*}
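For the Boltzmann policy defined below, the expectation in \(V^{\pi}\) has a closed form, the standard log-sum-exp soft value (this intermediate step is spelled out here for clarity; it is implied rather than stated by the formulas above):

\begin{equation*} V^{\pi}(s) = \alpha \log \sum_{a\in\mathcal{A}} \exp \frac{Q^{\pi}(s,a)}{\alpha} \end{equation*}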

Given \(Q_\theta\), the chunk selection probability is computed using a Boltzmann policy:

\begin{equation*} \pi(a_t|s_t) = \frac{\exp \frac{1}{\alpha} \left( Q_\theta(s_t,a_t) - q \right)}{\sum_{a\in\mathcal{A}_t} \exp \frac{1}{\alpha} \left( Q_\theta(s_t,a) - q \right)}, \quad \text{where } q = \max_{a\in\mathcal{A}_t} Q_\theta(s_t,a). \end{equation*}
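A minimal NumPy sketch of this policy. Subtracting \(q\) (the max) leaves the softmax unchanged and is the usual numerical-stability trick:

```python
import numpy as np

def boltzmann_policy(state_emb, chunk_embs, alpha=1.0):
    """Selection probabilities over candidate chunks.

    Q_theta(s, a) is the inner product of the state and chunk embeddings;
    alpha is the temperature from the formula above.
    """
    q_values = chunk_embs @ state_emb             # Q_theta(s, a) for every candidate
    logits = (q_values - q_values.max()) / alpha  # subtract q = max_a Q for stability
    probs = np.exp(logits)
    return probs / probs.sum()
```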

Q-RAG: Architecture

Additional Q-RAG details:

  • PQN backbone:
    • No replay buffer
  • Separate embedders for states and chunks/actions
  • Adding target networks improves stability
  • \(\lambda\)-return used as the target for the Q-function (see the sketch below)
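A sketch of \(\lambda\)-return targets computed by backward recursion, in the Peng's-Q(\(\lambda\)) style associated with PQN; the \(\gamma\) and \(\lambda\) values here are illustrative, not taken from the paper:

```python
import numpy as np

def lambda_return_targets(rewards, next_q_max, dones, gamma=0.99, lam=0.9):
    """Targets G_t = r_t + gamma * [(1 - lam) * max_a Q(s_{t+1}, a) + lam * G_{t+1}],
    computed backward over one trajectory. `next_q_max[t]` is max_a Q(s_{t+1}, a)
    from the target network; `dones[t]` masks bootstrapping at termination.
    """
    targets = np.zeros_like(rewards)
    g = next_q_max[-1]                              # bootstrap from the final state
    for t in reversed(range(len(rewards))):
        bootstrap = (1 - lam) * next_q_max[t] + lam * g
        g = rewards[t] + gamma * (1.0 - dones[t]) * bootstrap
        targets[t] = g
    return targets
```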


Relative Positional Encoding

s_{1} = \texttt{[query, chunk\#40]}
s_{2} = \texttt{[query, chunk\#15, chunk\#40]}
a_{1} = \texttt{chunk\#15}

Q-RAG assigns each candidate chunk a continuous relative position \(\rho_t(i)\), which is injected into the chunk embedding through RoPE:

\begin{equation*} \langle E_s(s), E_a(a^i)\rangle \;\Rightarrow\; \big\langle E_{s}(s),\, R_{\rho_{t}(i)}\, E_a(a^i)\big\rangle, \qquad \rho_t(i) = j\,\delta + \ell\,\frac{i - b_j}{\,b_{j+1} - b_j\,}, \end{equation*}

where \(b_{j} < i < b_{j+1}\), \(b_{j}\) is the closest retrieved chunk to the left, and \(b_{j+1}\) is the closest retrieved chunk to the right.

  • RoPE with angle \(\rho_t(i)\) is used to rotate chunk embeddings
  • Encodes position both within the document and relative to previously retrieved chunks

Properties:

  • The number of chunks does not influence the maximum positional value
  • The maximum positional value depends only on the number of retrieval steps
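A NumPy sketch of \(\rho_t(i)\) and the RoPE rotation. The boundary handling (treating the document start/end as extra boundaries) and the values of \(\delta\) and \(\ell\) are assumptions for illustration, not specified in the source:

```python
import numpy as np

def relative_position(i, retrieved, num_chunks, delta=1.0, ell=0.5):
    """rho_t(i) = j * delta + ell * (i - b_j) / (b_{j+1} - b_j).

    Retrieved chunk indices (plus assumed document boundaries) split the
    document into segments; a candidate in segment j gets offset j * delta
    plus its fractional position inside that segment, scaled by ell.
    """
    bounds = sorted({-1, num_chunks, *retrieved})      # hypothetical boundary choice
    j = max(k for k, b in enumerate(bounds) if b < i)  # segment with b_j < i < b_{j+1}
    b_lo, b_hi = bounds[j], bounds[j + 1]
    return j * delta + ell * (i - b_lo) / (b_hi - b_lo)

def rope_rotate(emb, rho, base=10000.0):
    """Rotate an even-dimensional chunk embedding by angle rho, RoPE-style."""
    d = emb.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)          # one frequency per 2-D pair
    cos, sin = np.cos(rho * freqs), np.sin(rho * freqs)
    x1, x2 = emb[..., 0::2], emb[..., 1::2]
    out = np.empty_like(emb)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

Since \(j\) never exceeds the number of retrieved chunks, the maximum value of \(\rho_t(i)\) depends only on the number of retrieval steps, matching the properties listed above.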

 

Results: BabiLong

BabiLong is a commonsense and temporal reasoning benchmark for ultra-long contexts.

  • Q-RAG achieves state-of-the-art performance (from 128k to 10M tokens)
  • The gap widens on harder tasks such as QA3

Results: RULER

RULER benchmark:

  • Different types of NIAH (needle-in-a-haystack) tasks
  • A couple of open-domain QA tasks

Results:

  • Near-perfect retrieval on NIAH
  • Generalizes to 1M tokens
  • Best on multi-hop QA
  • Only mild degradation as context length grows

 

Results: Open Domain QA

Results on open-domain QA:

  • Competitive results on HotpotQA and MuSiQue
  • Cheaper training than LLM fine-tuning methods
  • Faster inference than Beam Retriever and graph-based methods

 

Training and Inference Cost

Q-RAG fine-tunes only the retriever/embedder and keeps the answering LLM frozen.

  • Main experiments fit on a single A100-80GB GPU
  • Fine-tuning one embedder model takes around 12 hours
  • Faster inference than Beam Retriever and graph-based methods

 

Conclusion

  • Fine-tuning the embedder for multi-step retrieval is cheaper than fine-tuning the LLM
  • Competitive quality on open-domain QA
  • State of the art on BabiLong and RULER
  • The main bottleneck is the dependence on support-chunk annotations

An open question for future work:

  • Can generated data or LLM-based rewards remove the dependence on support-chunk annotations?

 

 

 

 

Thanks For Your Attention!

CODE

PAPER

CHECKPOINTS

Contacts: griver29@gmail.com or n.buzun@seevia.ai
