Q-RAG: Long Context Multi‑Step Retrieval via Value‑Based Embedder Training

Artyom Sorokin\(^{1,2}\) Nazar Buzun\(^{1,3}\) Alexander Anokhin\(^{2}\) Egor Vedernikov\(^{2}\) Petr Anokhin\(^{1}\) Mikhail Burtsev\(^{4}\) Evgeny Burnaev\(^{1,2}\)

\(^1\)AXXX, Moscow, Russia

\(^2\)Applied AI Institute, Moscow, Russia

\(^3\)Innopolis University,  Innopolis, Russia
\(^4\)London Institute for Mathematical Sciences, London, UK


Motivation

Long contexts are still challenging for Large Language Models.

Common approaches to the long-context challenge:

  • SSMs, Transformers with recurrence, linear attention
    • Cannot be combined with the best existing LLMs
  • Agents that build knowledge graphs
    • Slow inference for long-context processing
  • Multi-step RAG and RAG agents
    • Require expensive LLM fine-tuning

A popular direction for multi-step RAG is to fine-tune an LLM to use retrieval as a tool.

But fine-tuning LLMs can be expensive :(

Q-RAG: Main Idea

Main Idea:

  • Train the embedder instead of the LLM for multi-step search tasks
  • Search queries are generated directly in embedding space
  • Formulate multi-step RAG as an MDP

 

Q-RAG: Retrieval as an RL Problem

Multi-step retrieval as an RL problem:

  • State: query + retrieved information
  • Actions: chunks available for retrieval
  • Reward: 1.0 if all required chunks are found
  • Termination: exhausting the retrieval budget

 

[Figure: a retrieval episode unrolled over timesteps 0 and 1]
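A minimal sketch of this MDP as a rollout loop. Here `select_chunk` stands in for the learned policy defined in the next section, and the 5-step budget is illustrative, not a value from the paper:

```python
def retrieval_episode(query, chunks, required, budget=5):
    """One episode of multi-step retrieval framed as an MDP."""
    retrieved = []
    for _ in range(budget):                        # terminate when the budget is exhausted
        state = (query, tuple(retrieved))          # state: query + retrieved chunks
        candidates = [i for i in range(len(chunks)) if i not in retrieved]
        action = select_chunk(state, candidates)   # action: pick one candidate chunk
        retrieved.append(action)
        if required.issubset(retrieved):           # reward 1.0 iff all required chunks found
            return retrieved, 1.0
    return retrieved, 0.0
```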

Q-RAG: Training

We train embedders to approximate the \(Q\)-function as the inner product between a state embedding and a chunk embedding:

\begin{equation*} \langle E_s(s; \theta_1),\, E_a(a^i, i; \theta_2) \rangle = Q_\theta(s, a^i) \approx Q^{*}(s, a^i) \end{equation*}

NCE training: \(\langle s , a \rangle \rightarrow\) semantic similarity
Q-value approximation: \(\langle s , a \rangle \rightarrow\) usefulness of \(a\) given query \(s = \mathrm{prompt}(q, a_{t-1}, \dots, a_{0})\)

Maximum-entropy value functions encourage exploration:

 

\begin{align*} Q^{\pi}(s,a)&= r(s,a) + \gamma V^{\pi}(s'=p(s,a)) \\ V^{\pi}(s) &= \mathbb{E}_{a \sim \pi(\cdot|s)} \left[ Q^{\pi}(s,a) - \alpha \log \pi(a|s) \right] \end{align*}
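For the Boltzmann policy defined below, the expectation in \(V^{\pi}\) has a closed form, the standard log-sum-exp soft value (this intermediate step is spelled out here for clarity; it is implied rather than stated by the formulas above):

\begin{equation*} V^{\pi}(s) = \alpha \log \sum_{a\in\mathcal{A}} \exp \frac{Q^{\pi}(s,a)}{\alpha} \end{equation*}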

Given \(Q_\theta\), the chunk selection probability is computed using a Boltzmann policy:

\begin{equation*} \pi(a_t|s_t) = \frac{\exp \frac{1}{\alpha} \left( Q_\theta(s_t,a_t) - q \right)}{\sum_{a\in\mathcal{A}_t} \exp \frac{1}{\alpha} \left( Q_\theta(s_t,a) - q \right)}, \quad \text{where } q = \max_{a\in\mathcal{A}_t} Q_\theta(s_t,a). \end{equation*}
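A minimal NumPy sketch of this policy. Subtracting \(q\) (the max) leaves the softmax unchanged and is the usual numerical-stability trick:

```python
import numpy as np

def boltzmann_policy(state_emb, chunk_embs, alpha=1.0):
    """Selection probabilities over candidate chunks.

    Q_theta(s, a) is the inner product of the state and chunk embeddings;
    alpha is the temperature from the formula above.
    """
    q_values = chunk_embs @ state_emb             # Q_theta(s, a) for every candidate
    logits = (q_values - q_values.max()) / alpha  # subtract q = max_a Q for stability
    probs = np.exp(logits)
    return probs / probs.sum()
```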

Q-RAG: Architecture

Additional Q-RAG details:

  • PQN backbone:
    • No replay buffer
  • Separate embedders for states and chunks/actions
  • Adding target networks improves stability
  • \(\lambda\)-return used as the target for the Q-function (see the sketch below)
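A sketch of \(\lambda\)-return targets computed by backward recursion, in the Peng's-Q(\(\lambda\)) style associated with PQN; the \(\gamma\) and \(\lambda\) values here are illustrative, not taken from the paper:

```python
import numpy as np

def lambda_return_targets(rewards, next_q_max, dones, gamma=0.99, lam=0.9):
    """Targets G_t = r_t + gamma * [(1 - lam) * max_a Q(s_{t+1}, a) + lam * G_{t+1}],
    computed backward over one trajectory. `next_q_max[t]` is max_a Q(s_{t+1}, a)
    from the target network; `dones[t]` masks bootstrapping at termination.
    """
    targets = np.zeros_like(rewards)
    g = next_q_max[-1]                              # bootstrap from the final state
    for t in reversed(range(len(rewards))):
        bootstrap = (1 - lam) * next_q_max[t] + lam * g
        g = rewards[t] + gamma * (1.0 - dones[t]) * bootstrap
        targets[t] = g
    return targets
```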


Relative Positional Encoding

s_{1} = \texttt{[query, chunk\#40]}
s_{2} = \texttt{[query, chunk\#15, chunk\#40]}
a_{1} = \texttt{chunk\#15}

Q-RAG assigns each candidate chunk a continuous relative position \(\rho_t(i)\), which is injected into the chunk embedding through RoPE:

\begin{equation*} \langle E_s(s), E_a(a^i)\rangle \;\Rightarrow\; \big\langle E_{s}(s),\, R_{\rho_{t}(i)}\, E_a(a^i)\big\rangle, \qquad \rho_t(i) = j\,\delta + \ell\,\frac{i - b_j}{\,b_{j+1} - b_j\,}, \end{equation*}

where \(b_{j} < i < b_{j+1}\), \(b_{j}\) is the closest retrieved chunk to the left, and \(b_{j+1}\) is the closest retrieved chunk to the right.

  • RoPE with angle \(\rho_t(i)\) is used to rotate chunk embeddings
  • Encodes position both within the document and relative to previously retrieved chunks

Properties:

  • The number of chunks does not influence the maximum positional value
  • The maximum positional value depends only on the number of retrieval steps
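A NumPy sketch of \(\rho_t(i)\) and the RoPE rotation. The boundary handling (treating the document start/end as extra boundaries) and the values of \(\delta\) and \(\ell\) are assumptions for illustration, not specified in the source:

```python
import numpy as np

def relative_position(i, retrieved, num_chunks, delta=1.0, ell=0.5):
    """rho_t(i) = j * delta + ell * (i - b_j) / (b_{j+1} - b_j).

    Retrieved chunk indices (plus assumed document boundaries) split the
    document into segments; a candidate in segment j gets offset j * delta
    plus its fractional position inside that segment, scaled by ell.
    """
    bounds = sorted({-1, num_chunks, *retrieved})      # hypothetical boundary choice
    j = max(k for k, b in enumerate(bounds) if b < i)  # segment with b_j < i < b_{j+1}
    b_lo, b_hi = bounds[j], bounds[j + 1]
    return j * delta + ell * (i - b_lo) / (b_hi - b_lo)

def rope_rotate(emb, rho, base=10000.0):
    """Rotate an even-dimensional chunk embedding by angle rho, RoPE-style."""
    d = emb.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)          # one frequency per 2-D pair
    cos, sin = np.cos(rho * freqs), np.sin(rho * freqs)
    x1, x2 = emb[..., 0::2], emb[..., 1::2]
    out = np.empty_like(emb)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

Since \(j\) never exceeds the number of retrieved chunks, the maximum value of \(\rho_t(i)\) depends only on the number of retrieval steps, matching the properties listed above.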

 

Results: BabiLong

BabiLong is a commonsense and temporal reasoning benchmark for ultra-long contexts.

  • Q-RAG achieves state-of-the-art performance (from 128k to 10M tokens)
  • The gap widens on harder tasks such as QA3

Results: RULER

RULER benchmark:

  • Different types of NIAH (needle-in-a-haystack) tasks
  • A couple of open-domain QA tasks

Results:

  • Near-perfect retrieval on NIAH
  • Generalizes to 1M tokens
  • Best on multi-hop QA
  • Only mild degradation as context length grows

 

Results: Open Domain QA

Results on open-domain QA:

  • Competitive results on HotpotQA and MuSiQue
  • Cheaper training than LLM fine-tuning methods
  • Faster inference than Beam Retriever and graph-based methods

 

Training and Inference Cost

Q-RAG fine-tunes only the retriever/embedder and keeps the answering LLM frozen.

  • Main experiments fit on a single A100-80GB GPU
  • Fine-tuning one embedder model takes around 12 hours
  • Faster inference than Beam Retriever and graph-based methods

 

Conclusion

  • Fine-tuning the embedder for multi-step retrieval is cheaper than fine-tuning the LLM
  • Competitive quality on open-domain QA
  • State of the art on BabiLong and RULER
  • The main bottleneck is the dependence on support-chunk annotations

An open question for future work:

  • Can generated data or LLM-based rewards remove the dependence on support-chunk annotations?

 

 

 

 

Thanks For Your Attention!

CODE

PAPER

CHECKPOINTS

Contacts: griver29@gmail.com or n.buzun@seevia.ai
