
ICLR 2026 (8664)
Background
- Recently, several studies have applied Reinforcement Learning with Verifiable Rewards (RLVR) to the multimodal domain to enhance the reasoning abilities of VLMs.
- However, these works largely overlook the enhancement of multimodal perception capabilities in VLMs, which serve as a core prerequisite and foundational component of complex multimodal reasoning.

Problem Formulation

Each data sample x_i in the dataset D comprises a visual input V (e.g., an image), a textual query Q, and the corresponding ground-truth answer a.
Given a data sample x_i ∈ D as input, the VLM is required to generate a textual token sequence y that aims to reach the ground-truth answer a.
RLVR
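At a high level (this formulation is a generic sketch, not copied from the paper), RLVR optimizes the policy π_θ to maximize a verifiable reward on the training data, typically with a KL penalty toward a reference model:

\max_{\theta}\;\mathbb{E}_{x_i \sim D,\; y \sim \pi_\theta(\cdot \mid x_i)}\big[\, r(x_i, y) \,\big] \;-\; \beta\,\mathrm{KL}\big[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big]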
Reward Functions:
The reward functions consist of two components:
(1). Format Reward encourages MLLMs to generate in a structured “think-then-answer” format, with the reasoning process enclosed in <think>...</think> tags and the answer enclosed in <answer>...</answer> tags.
(2). Accuracy Reward drives the reasoning optimization in RLVR training by evaluating the correctness of the predicted answer (a minimal implementation sketch follows below).
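A minimal sketch of how these two rewards could be implemented (the tag-matching rules, reward values, and helper names here are illustrative assumptions, not the paper's exact implementation):

import re

# Full response must follow the think-then-answer template.
THINK_ANSWER_PATTERN = re.compile(
    r"^\s*<think>.*?</think>\s*<answer>.*?</answer>\s*$", re.DOTALL
)

def format_reward(response: str) -> float:
    # 1.0 if the response matches the structured template, else 0.0.
    return 1.0 if THINK_ANSWER_PATTERN.match(response) else 0.0

def accuracy_reward(response: str, ground_truth: str) -> float:
    # Extract the <answer> content and compare with the ground truth after
    # simple normalization (exact match here; the actual verifier may use
    # math-specific equivalence checking).
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if match is None:
        return 0.0
    predicted = match.group(1).strip().lower()
    return 1.0 if predicted == ground_truth.strip().lower() else 0.0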

Method: PERCEPTION-R1


Final reward
r_p is a repetition penalty reward that discourages repetitive behavior during the MLLM's generation.
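The equation itself is not reproduced in this deck; one plausible form, assuming the final reward simply combines the components above (the exact weighting and aggregation are an assumption):

r(y_i) \;=\; r_{\mathrm{format}}(y_i) \;+\; r_{\mathrm{acc}}(y_i, a) \;+\; r_{\mathrm{percep}}(y_i) \;+\; r_p(y_i)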
How Visual Perception Score Is Computed

Given the image and query, they first prompt a frontier LLM (Gemini 2.5 Pro) to generate reference visual perceptions.

During training, they use an LLM-as-Judge to determine whether each reference visual perception above is present in the sampled trajectories y_i.
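A rough sketch of this judging step (the prompt wording and the judge_llm callable are illustrative assumptions, not the paper's implementation):

def build_judge_prompt(trajectory: str, reference_perception: str) -> str:
    # Ask the judge model for a binary decision on whether the reference
    # visual perception is accurately reflected in the sampled trajectory.
    return (
        "You are a strict judge.\n"
        f"Reference visual perception: {reference_perception}\n"
        f"Model trajectory: {trajectory}\n"
        "Answer 'yes' if the trajectory accurately reflects the reference "
        "perception, otherwise answer 'no'."
    )

def judge_perception(trajectory: str, reference_perception: str, judge_llm) -> int:
    # judge_llm is any callable mapping a prompt string to a text reply
    # (e.g., Qwen2.5-32B-IT served behind a simple wrapper).
    reply = judge_llm(build_judge_prompt(trajectory, reference_perception))
    return 1 if reply.strip().lower().startswith("yes") else 0  # this is o_{i,j}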

where o_{i,j} ∈ {0, 1} indicates whether the reference visual perception v_j is accurately reflected in y_i or not.
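The aggregation is not shown in this deck; a natural form, assuming the visual perception score is the fraction of the m reference perceptions v_1, ..., v_m judged present in y_i (an assumption, not confirmed here):

r_{\mathrm{percep}}(y_i) \;=\; \frac{1}{m}\sum_{j=1}^{m} o_{i,j}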
Experiments
Training Dataset: Geometry3K
LLM-as-judge model: Qwen2.5-32B-IT
Benchmarks and Evaluation Settings: MathVista, MathVerse, WeMath, and more
Training Model: Qwen2-VL-7B-IT
Results

Ablation

More Ablations

Thx
By Yao