Logic-RL

Unleashing LLM Reasoning with
Rule-Based Reinforcement Learning

Amin - March 5 2025

LLM Reasoning Reading Group

Overview

  • Train a pre-trained LLM on a class of Logical Puzzles (Knights & Knaves) using Rule-Based Reinforcement Learning.
     
  • The model learns to solve these puzzles using behaviours such as backtracking, exploration, and verification.
     
  • The learnt behaviours appear to generalize to other reasoning tasks, notably the AIME and AMC math benchmarks.
     
  • Ablation studies raise interesting follow-up research questions.


Setup: Problem

  • Instances of K&K can be procedurally generated.
     
  • Instances can have varying levels of difficulty.
     
  • Instances are easy to verify (each has a single, unambiguous solution).

Setup: Reward

  • To prevent reward hacking:
    • The model is asked to reason within <think>...</think> tags and to answer within <answer>...</answer> tags.
    • The reward is computed from both the format and the correctness of the response.

  • Format reward:
    $$S_{\text{format}} = \begin{cases} 1, & \text{if format is correct} \\ -1, & \text{if format is incorrect} \end{cases}$$
  • Answer reward:
    $$S_{\text{answer}} = \begin{cases} 2, & \text{if the answer fully matches the ground truth} \\ -1.5, & \text{if the answer partially mismatches the ground truth} \\ -2, & \text{if the answer cannot be parsed or is missing} \end{cases}$$
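
A minimal sketch of how such a rule-based reward could be computed for a K&K response. The tag format and score values match the slide above; the parsing regexes, the ground-truth representation, and the exact partial-credit handling are assumptions, not the paper's code:

```python
import re

def compute_reward(response: str, ground_truth: dict) -> float:
    """Rule-based reward = format score + answer score (a sketch, not the paper's code)."""
    # Format check: exactly one <think>...</think> block followed by one <answer>...</answer> block.
    format_ok = re.fullmatch(
        r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*", response, flags=re.DOTALL
    ) is not None
    s_format = 1.0 if format_ok else -1.0

    # Answer check: extract "X is a knight/knave" statements and compare to the ground truth.
    answer = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    if answer is None:
        return s_format - 2.0  # answer cannot be parsed or is missing
    parsed = {
        name.lower(): role.lower()
        for name, role in re.findall(
            r"(\w+)\s+is\s+a\s+(knight|knave)", answer.group(1), flags=re.IGNORECASE
        )
    }
    if parsed == ground_truth:
        s_answer = 2.0   # fully matches the ground truth
    elif parsed:
        s_answer = -1.5  # parseable but (partially) wrong
    else:
        s_answer = -2.0  # nothing parseable inside the answer tags
    return s_format + s_answer

# Example (hypothetical puzzle with two characters):
# compute_reward("<think>...</think><answer>Zoey is a knight. Oliver is a knave.</answer>",
#                {"zoey": "knight", "oliver": "knave"})  -> 3.0
```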

Setup: RL Algorithm

  • The algorithm of choice is REINFORCE++ with slight modifications (there are ablations against PPO and GRPO).
     
  • Modification 1: The KL divergence between the RL policy and the SFT reference is computed per token and added directly to the loss (as in GRPO), rather than being folded into the reward.
     
  • Modification 2: An unbiased, non-negative KL estimator similar to GRPO's is used (see the sketch below).
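
A small sketch of the per-token KL term under these two modifications, using the non-negative "k3"-style estimator popularized by GRPO; the tensor names, shapes, and the loss-combination line are assumptions:

```python
import torch

def per_token_kl(logp_rl: torch.Tensor, logp_sft: torch.Tensor) -> torch.Tensor:
    """Unbiased, non-negative per-token KL estimate (GRPO-style 'k3' estimator).

    logp_rl, logp_sft: log-probabilities of the sampled tokens under the current RL
    policy and the frozen SFT reference, shape (batch, seq_len).
    """
    log_ratio = logp_sft - logp_rl                 # log(pi_sft / pi_rl) per token
    return torch.exp(log_ratio) - log_ratio - 1.0  # >= 0, estimates KL(pi_rl || pi_sft)

# Modification 1: the KL term enters the loss directly instead of the reward, e.g.
#   loss = policy_loss + beta * per_token_kl(logp_rl, logp_sft).mean()
```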

Setup: Hyperparameters

  • The model (Qwen2.5-7B) is trained for 3,600 steps with a constant learning rate of \(4 \times 10^{-7}\) and a sampling temperature of 0.7, on puzzles of mixed difficulty involving 3 to 7 people.
     
  • Hyper-parameters used for SFT are not discussed.
     
  • There is no ablation on hyper-parameters.
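
For concreteness, the stated training configuration can be collected into a single config object; only the values mentioned above come from the talk, and the key names are illustrative:

```python
train_config = {
    "model": "Qwen2.5-7B",
    "algorithm": "REINFORCE++ (modified)",
    "train_steps": 3600,
    "learning_rate": 4e-7,             # constant schedule
    "temperature": 0.7,                # sampling temperature for rollouts
    "people_per_puzzle": range(3, 8),  # 3 to 7 characters, mixed difficulty
}
```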

Background: REINFORCE

  • Definition of the discounted cumulative reward (return):
    $$G_t = \sum_{k=t+1}^{T} \gamma^{k-t-1} r_k$$
  • The policy gradient is computed as:
    $$\nabla_{\theta} J(\theta) = \mathbb{E}_{\pi} \left[ G_t \nabla_{\theta} \log \pi_{\theta} (A_t \mid S_t) \right]$$
  • The parameters are updated as:
    $$\theta \gets \theta + \alpha \nabla_{\theta} J(\theta)$$
  • A simple, direct policy-gradient algorithm, but it suffers from high variance in its gradient estimates (see the sketch below).
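
A minimal sketch of one REINFORCE update for an episodic task, illustrating the three equations above (a generic illustration, not the paper's training loop):

```python
import torch

def reinforce_update(optimizer, log_probs, rewards, gamma=0.99):
    """One vanilla REINFORCE step for a single episode.

    log_probs: list of log pi_theta(A_t | S_t) tensors collected during the episode.
    rewards:   list of scalar rewards r_1..r_T received in the episode.
    """
    # Discounted returns G_t, computed backwards over the episode.
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)

    # Policy-gradient objective: maximize E[G_t * log pi(A_t|S_t)]  =>  minimize the negative.
    loss = -(torch.stack(log_probs) * returns).sum()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()  # theta <- theta + alpha * grad J(theta)
```

Because the raw return G_t scales every gradient term, these updates have high variance, which motivates the clipping and normalization machinery that REINFORCE++ adds on the next slide.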

Background: REINFORCE++

  • Addition of a token-level KL penalty (w.r.t. the SFT reference) to the reward:
    $$\operatorname{KL}(t) = \log \left( \frac{\pi_{\theta_{\text{old}}}^{\text{RL}}(a_t \mid s_t)}{\pi^{\text{SFT}}(a_t \mid s_t)} \right)$$
  • Token-level advantage: the sequence reward minus the accumulated future KL penalty:
    $$A_t(s_t, a_t) = r(x, y) - \beta \cdot \sum_{i=t}^{T} \operatorname{KL}(i)$$
  • PPO-Clip integration, with probability ratio \(r_t(\theta) = \pi_{\theta}(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)\):
    $$L^{CLIP}(\theta) = \mathbb{E}_t \left[ \min \left( r_t(\theta) \hat{A}_t, \operatorname{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_t \right) \right]$$
  • Reward and advantage normalization for stability, e.g. standardizing advantages across the batch:
    $$\hat{A}_t = \frac{A_t - \mu_A}{\sigma_A}$$
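
Putting these pieces together, a compact sketch of a REINFORCE++-style loss for a single response; the tensor shapes, the normalization granularity, and the hyper-parameter values are assumptions:

```python
import torch

def reinforce_pp_loss(logp_new, logp_old, logp_sft, reward, beta=0.01, eps=0.2):
    """REINFORCE++-style clipped loss for one response of length T (all inputs shape (T,)).

    logp_new: token log-probs under the current policy (requires grad).
    logp_old: token log-probs under the rollout policy (no grad).
    logp_sft: token log-probs under the frozen SFT reference (no grad).
    reward:   scalar rule-based reward for the whole response.
    """
    # Token-level KL penalty w.r.t. the SFT reference: KL(t) = log(pi_old / pi_sft).
    kl = logp_old - logp_sft

    # Advantage: sequence reward minus the accumulated future KL penalty (reverse cumsum).
    future_kl = torch.flip(torch.cumsum(torch.flip(kl, [0]), 0), [0])
    A = reward - beta * future_kl

    # Advantage normalization (zero mean, unit variance).
    A = (A - A.mean()) / (A.std() + 1e-8)

    # PPO-style clipped surrogate with ratio r_t(theta) = pi_new / pi_old.
    ratio = torch.exp(logp_new - logp_old)
    surrogate = torch.min(ratio * A, torch.clamp(ratio, 1 - eps, 1 + eps) * A)
    return -surrogate.mean()
```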

Results

RQ-1: The RL Algorithm

RQ-2: Thinking Tokens

RQ-3: Is there an Aha moment?

RQ-4: Does Logic-RL Generalize?

RQ-5: Robustness to Perturbation - SFT vs RL


  • Memorization score used to compare SFT against RL, where Acc is accuracy on the original puzzles and CR is the consistency ratio under local perturbations:
    $$\text{LiMem}(f; \mathcal{D}) = \text{Acc}(f; \mathcal{D}) \cdot (1 - \text{CR}(f; \mathcal{D}))$$
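
As a worked example (illustrative numbers, not results from the paper), the score is easy to compute once accuracy and consistency ratio are measured:

```python
def limem(acc: float, cr: float) -> float:
    """Local-inconsistency memorization score: high accuracy combined with low
    consistency under perturbation suggests memorization rather than reasoning."""
    return acc * (1.0 - cr)

# A model solving 90% of the original puzzles but staying consistent on only 40% of
# their perturbed variants scores limem(0.9, 0.4) = 0.54, i.e. heavy memorization.
```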

RQ-6: Effectiveness of Curriculum Learning 

RQ-7: Longer Responses vs Reasoning Performance

Thanks!
