ICLR 2026 (8664)

Background

  • Recently, several studies have applied Reinforcement Learning with Verifiable Rewards (RLVR) to the multimodal domain in order to enhance the reasoning abilities of VLMs.
  • However, these works largely overlook the enhancement of multimodal perception capabilities in VLMs, which serve as a core prerequisite and foundational component of complex multimodal reasoning.

Problem Formulation

Each data sample x_i ∈ D comprises a visual input V (e.g., an image), a textual query Q, and the corresponding ground-truth answer a.

 

Given a data sample x_i ∈ D as input, the VLM is required to generate a textual token sequence y that recovers the ground-truth answer a.
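In notation (a paraphrase of this setup; the policy symbol π_θ is an assumption, not taken from the slide): each sample is x_i = (V_i, Q_i, a_i), the VLM policy π_θ samples y_i ~ π_θ(· | V_i, Q_i), and y_i is rewarded when its final answer matches a_i.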

RLVR

Reward Functions: 

The reward functions consist of two components:

(1) Format Reward encourages MLLMs to generate in a structured “think-then-answer” format, with the reasoning process enclosed in <think> tags and the answer enclosed in <answer> tags.

(2) Accuracy Reward drives the reasoning optimization in RLVR training by evaluating the correctness of the predicted answer.
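A minimal sketch of these two rewards, assuming the common <think>/<answer> tag convention and an exact string match for the answer check (the actual implementation likely uses a math-aware verifier):

import re

def format_reward(response: str) -> float:
    # 1.0 if the response follows <think>...</think><answer>...</answer>, else 0.0.
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, response.strip(), flags=re.DOTALL) else 0.0

def accuracy_reward(response: str, ground_truth: str) -> float:
    # 1.0 if the answer extracted from the <answer> tags matches the ground truth, else 0.0.
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    predicted = match.group(1).strip() if match else ""
    return 1.0 if predicted == ground_truth.strip() else 0.0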

Method: PERCEPTION-R1

Final reward

r_p is the repetition penalty reward that discourages repetitive behavior during MLLMs’ generation.
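The slide does not show how the terms are combined; assuming a simple additive form (an assumption, not stated in the deck), the final reward would be r = r_acc + r_fmt + r_vp + r_p, where r_vp is the visual perception score described below.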

How the Visual Perception Score Is Computed

Given an image and a query, they first prompt a frontier LLM (Gemini 2.5 Pro) to generate reference visual perceptions.

During training, they use an LLM-as-Judge to determine whether each reference visual perception above is present in the sampled trajectory y_i,

where o_{i,j} ∈ {0, 1} indicates whether v_j is accurately reflected in y_i or not.
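A sketch of how the perception score for a trajectory might then be computed, assuming it is the fraction of reference perceptions v_1, ..., v_M judged present (both the averaging and the judge callable below are assumptions; judge would wrap a call to the LLM-as-Judge model):

def perception_reward(trajectory: str, reference_perceptions: list[str], judge) -> float:
    # judge(trajectory, v_j) is an assumed LLM-as-Judge call returning o_{i,j}:
    # 1 if reference perception v_j is accurately reflected in trajectory y_i, else 0.
    if not reference_perceptions:
        return 0.0
    indicators = [judge(trajectory, v_j) for v_j in reference_perceptions]
    # Assumed aggregation: the fraction of reference perceptions covered.
    return sum(indicators) / len(indicators)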

Experiments

Training Dataset: Geometry3K

LLM-as-Judge Model: Qwen2.5-32B-IT

Benchmarks and Evaluation Settings: MathVista, MathVerse, WeMath, and more

Training Model: Qwen2-VL-7B-IT

Results


Ablation

More Ablations

Thx

By Yao