Unleashing LLM Reasoning with
Rule-Based Reinforcement Learning
Amin - March 5 2025
LLM Reasoning Reading Group
Overview
Train a pre-trained LLM on a class of logical puzzles (Knights & Knaves) using rule-based reinforcement learning.
The model learns to solve these puzzles using techniques such as backtracking, exploration, and verification.
The learned behaviours appear to generalize to other reasoning tasks, notably the AIME and AMC benchmarks.
Ablation studies raise interesting research questions.
Setup: Problem
Instances of K&K can be procedurally generated.
Instances can have varying levels of difficulty.
Instances are easy to verify (single, unambiguous solutions).
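As a concrete illustration, a well-formed Knights & Knaves instance can be checked by brute force over all knight/knave assignments. The sketch below is not the paper's generator; the helper and the two-person example are hypothetical, meant only to show why verification is cheap and unambiguous.

```python
from itertools import product

def solve_kk(statements):
    """Brute-force all knight/knave assignments. A knight's statement must be
    true and a knave's statement must be false; a well-formed instance has
    exactly one consistent assignment."""
    n = len(statements)
    solutions = []
    for assignment in product([True, False], repeat=n):
        if all(stmt(assignment) == is_knight
               for stmt, is_knight in zip(statements, assignment)):
            solutions.append(assignment)
    return solutions

# Example with 2 people:
#   A says: "B is a knave."      B says: "We are both knights."
statements = [
    lambda a: not a[1],          # A's claim
    lambda a: a[0] and a[1],     # B's claim
]
print(solve_kk(statements))      # [(True, False)] -> A is a knight, B is a knave
```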
Setup: Reward
To prevent reward hacking:
Model is asked to think within <think>...</think> tags and answer within <answer>...</answer> tags.
Reward is calculated based on the format and correctness of the response.
$$S_{\text{format}} =
\begin{cases}
1, & \text{if the format is correct} \\
-1, & \text{if the format is incorrect}
\end{cases}$$

$$S_{\text{answer}} =
\begin{cases}
2, & \text{if the answer fully matches the ground truth} \\
-1.5, & \text{if the answer partially mismatches the ground truth} \\
-2, & \text{if the answer cannot be parsed or is missing}
\end{cases}$$
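A minimal sketch of how such a rule-based reward could be computed, assuming the tags above, a simple "Name is a knight/knave" answer format, and that the two scores are summed; the parsing details are assumptions, not the paper's exact implementation.

```python
import re

def compute_reward(response: str, ground_truth: dict) -> float:
    """Rule-based reward sketch: format score + answer score."""
    # Format score: one <think>...</think> block followed by one <answer>...</answer> block.
    match = re.search(r"<think>.*?</think>\s*<answer>(.*?)</answer>\s*$",
                      response, flags=re.DOTALL)
    s_format = 1.0 if match else -1.0

    # Answer score: extract "Name is a knight/knave" assignments from the answer block.
    if match:
        parsed = dict(re.findall(r"(\w+) is a (knight|knave)", match.group(1)))
        if not parsed:
            s_answer = -2.0          # nothing parseable in the answer
        elif parsed == ground_truth:
            s_answer = 2.0           # fully matches the ground truth
        else:
            s_answer = -1.5          # parseable but (partially) wrong
    else:
        s_answer = -2.0              # answer missing or cannot be parsed

    return s_format + s_answer
```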
Setup: RL Algorithm
The algorithm of choice is REINFORCE++, with slight modifications (there are ablations against PPO and GRPO).
Modification 1: The KL divergence between the response distributions of the RL model and the SFT model is computed on a per-token basis and incorporated into the loss (as in GRPO), rather than being part of the reward.
Modification 2: An unbiased KL estimator similar to that of GRPO is used.
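A sketch of what this per-token, in-loss KL term could look like using the unbiased, always-non-negative estimator popularized by GRPO; the tensor names and the coefficient in the usage comment are assumptions.

```python
import torch

def per_token_kl(logp_rl: torch.Tensor, logp_sft: torch.Tensor) -> torch.Tensor:
    """Unbiased per-token KL estimate of KL(pi_RL || pi_SFT) for the sampled tokens.
    logp_rl / logp_sft: log-probs under the RL policy and the frozen SFT reference,
    shape (batch, seq_len). Returns a non-negative tensor of the same shape."""
    log_ratio = logp_sft - logp_rl
    return torch.exp(log_ratio) - log_ratio - 1.0   # zero iff the two policies agree

# Usage sketch: total_loss = policy_loss + kl_coef * per_token_kl(logp_rl, logp_sft).mean()
```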
Setup: Hyperparameters
The model (Qwen2.5-7B) is trained for 3600 steps with a constant learning rate of \(4 \cdot 10^{-7}\) and a temperature of 0.7, on puzzles of mixed difficulty involving 3 to 7 people.
Hyper-parameters used for SFT are not discussed.
There is no ablation on hyper-parameters.
Background: REINFORCE
Definition of the discounted cumulative return:
$$G_t = \sum_{k=t+1}^{T} \gamma^{k-t-1} r_k$$
The policy gradient is computed as:
$$\nabla_{\theta} J(\theta) = \mathbb{E}_{\pi} \left[ G_t \nabla_{\theta} \log \pi_{\theta} (A_t \mid S_t) \right]$$
The parameters are updated as:
$$\theta \gets \theta + \alpha \nabla_{\theta} J(\theta)$$
A simple, direct policy-gradient algorithm, but it suffers from high variance in its gradient estimates.
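For reference, a minimal sketch of the REINFORCE loss for a single sampled trajectory (variable names are illustrative, not from the paper); one optimizer step on this loss realizes the update \(\theta \gets \theta + \alpha \nabla_{\theta} J(\theta)\).

```python
import torch

def reinforce_loss(logps: torch.Tensor, rewards: torch.Tensor, gamma: float = 1.0) -> torch.Tensor:
    """REINFORCE objective for one trajectory.
    logps:   log pi_theta(A_t | S_t) at each step, shape (T,)
    rewards: reward received after each step, shape (T,)
    Returns a scalar loss whose negative gradient is G_t * grad log pi_theta(A_t | S_t)."""
    T = rewards.shape[0]
    returns = torch.zeros(T)
    running = 0.0
    for t in reversed(range(T)):              # G_t = r_{t+1} + gamma * G_{t+1}
        running = rewards[t] + gamma * running
        returns[t] = running
    return -(returns.detach() * logps).sum()  # minimizing this ascends the policy gradient
```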
Background: REINFORCE++
Addition of a token-level KL penalty to the reward:
$$\operatorname{KL}(t) = \log \left( \frac{\pi_{\theta_{\text{old}}}^{\text{RL}}(a_t \mid s_t)}{\pi^{\text{SFT}}(a_t \mid s_t)} \right)$$
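A sketch of how this per-token KL penalty could be folded into the token-level reward in vanilla REINFORCE++, before the modifications discussed above; the function name, coefficient, and the convention of crediting the rule-based reward at the final token are assumptions.

```python
import torch

def shaped_token_rewards(logp_rl_old: torch.Tensor,
                         logp_sft: torch.Tensor,
                         sequence_reward: float,
                         kl_coef: float = 0.01) -> torch.Tensor:
    """Per-token shaped rewards for one response (inputs assumed detached).
    logp_rl_old / logp_sft: log-probs of the sampled tokens under the old RL
    policy and the SFT reference, shape (seq_len,)."""
    kl_t = logp_rl_old - logp_sft                 # KL(t) estimate per token (log ratio)
    rewards = -kl_coef * kl_t                     # KL penalty applied at every token
    rewards[-1] = rewards[-1] + sequence_reward   # rule-based reward only at the last token
    return rewards
```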