From arXiv
For example, the total reward can be the sum of an answer-accuracy reward and a format-match reward.
This demonstrates a fundamental limitation of GRPO's advantage calculation in multi-reward optimization: it over-compresses the rich group-wise reward signal.
Intuitively, (0, 2) should produce a stronger learning signal than (0, 1), because a total reward of 2 indicates simultaneous satisfaction of both rewards, whereas a reward of 1 corresponds to achieving only one.
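To make the over-compression concrete, here is a minimal sketch (the function name and the epsilon term are our own choices, not from the source) of GRPO's group-wise advantage normalization applied to the two groups above:

```python
import numpy as np

# GRPO normalizes each response's total reward within its group:
# A_i = (r_i - mean(r)) / std(r)
def grpo_advantages(rewards):
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)  # epsilon only to avoid divide-by-zero

# Group where the better response satisfies both rewards (accuracy + format):
print(grpo_advantages([0, 2]))  # -> approximately [-1., 1.]
# Group where the better response satisfies only one reward:
print(grpo_advantages([0, 1]))  # -> approximately [-1., 1.]
```

Both groups collapse to the same ±1 advantages, so under GRPO the response that satisfies both rewards receives no stronger update than one that satisfies only a single reward.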
BFCL-v3 evaluation
Model: DeepSeek-R1-1.5B
Dataset: DeepScaler
We see that GRPO training starts to destabilize after 400 steps, with the correctness reward score gradually decreasing, while GDPO continues to improve the correctness score.