Name: Xiao Yao (PhD 2Y)
Education: 2013-2017 Tianjin U
2019-2022 Shanghai JiaoTong U
2023-present SUTD
Supervisors: Roy Lee, Li Xiaoli
Current interests: Reasoning, Alignment
Date: 0110
Scaling Samples of Preference Data Construction for DPO
Xiao Yao
To be submitted to ACL 2025
Background

- Human annotation and LLM annotation of preference data are expensive
- e.g., UltraFeedback, HH-RLHF
- More and more reward models are available on Hugging Face
- e.g., ArmoRM, Skywork RM
A Conventional Preference Data Construction Strategy
A Failure Case of the Conventional Strategy

Models: Mistral-7B-v1/Instruct, Llama3-8B/Instruct
RM: ArmoRM
Prompts: UltraFeedback
Evaluation: AlpacaEval 2
Question

How should we construct preference data for DPO given a sufficient sample budget?
What Does the Reward Look Like?
Implementation:
- Model: Llama3-8B-instruct
- Prompts: 1,000 prompts from UltraFeedback
- Samples: 400 samples per prompt
- RMs: ArmoRM / Skywork RM (see the sampling sketch below)
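A minimal sketch of this sampling setup, assuming the Hugging Face `transformers` generation API; model and dataset names follow the slide, while `score_responses` is a hypothetical placeholder, since the actual scoring call depends on the chosen reward model's interface (ArmoRM and Skywork RM differ):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Meta-Llama-3-8B-Instruct"  # Llama3-8B-instruct from the slide
N_SAMPLES = 400   # samples per prompt
BATCH = 16        # generate in chunks to keep memory manageable

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto"
)

def sample_responses(prompt: str) -> list[str]:
    """Draw N_SAMPLES responses for one prompt with temperature sampling."""
    messages = [{"role": "user", "content": prompt}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    responses = []
    for _ in range(N_SAMPLES // BATCH):
        out = model.generate(
            input_ids,
            do_sample=True, temperature=0.8, top_p=0.95,
            max_new_tokens=512, num_return_sequences=BATCH,
            pad_token_id=tokenizer.eos_token_id,
        )
        # keep only the newly generated tokens
        responses += tokenizer.batch_decode(
            out[:, input_ids.shape[1]:], skip_special_tokens=True
        )
    return responses

def score_responses(prompt: str, responses: list[str]) -> list[float]:
    """Hypothetical hook: score each (prompt, response) with ArmoRM or Skywork RM."""
    raise NotImplementedError
```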
What Does the Reward Look Like?

- The reward scores per prompt exhibit a Gaussian distribution
- The response rewards of approximately 20% of prompts perfectly pass the Kolmogorov-Smirnov test for normality (see the sketch below)
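A minimal sketch of this normality check, assuming each prompt's 400 reward scores are collected in a NumPy array; here the Kolmogorov-Smirnov test is run against a Gaussian fitted to the sample, though the paper's exact test configuration may differ:

```python
import numpy as np
from scipy import stats

def passes_ks_test(rewards: np.ndarray, alpha: float = 0.05) -> bool:
    """Kolmogorov-Smirnov test of per-prompt reward scores against a normal
    distribution parameterized by the sample mean and standard deviation."""
    mu, sigma = rewards.mean(), rewards.std(ddof=1)
    _, p_value = stats.kstest(rewards, "norm", args=(mu, sigma))
    return p_value > alpha  # fail to reject normality

# Example (assumed data layout): one array of 400 scores per prompt
# fraction_gaussian = np.mean([passes_ks_test(r) for r in all_prompt_rewards])
```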
Reward Points

- 7 reward points per prompt: \left\{\min,\ \mu \pm 2\sigma,\ \mu \pm \sigma,\ \mu,\ \max\right\}
- Candidate pairs per prompt: \binom{7}{2} = 21
- In practice, we select the sample whose reward score is closest to each value in the set.
- We construct 21 preference pairs per prompt, following the principle that the reward of the chosen response must be higher than that of the rejected response (see the sketch below).
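A sketch of this construction under the slide's definitions (the helper names are mine, not from the paper): pick the sample closest to each of the seven reward points, then enumerate all \binom{7}{2} = 21 (chosen, rejected) pairs with the higher-reward response as chosen.

```python
from itertools import combinations
import numpy as np

def select_reward_points(rewards: np.ndarray) -> list[int]:
    """Indices of samples closest to {min, mu-2s, mu-s, mu, mu+s, mu+2s, max}."""
    mu, s = rewards.mean(), rewards.std()
    targets = [rewards.min(), mu - 2 * s, mu - s, mu, mu + s, mu + 2 * s, rewards.max()]
    return [int(np.abs(rewards - t).argmin()) for t in targets]

def build_pairs(responses: list[str], rewards: np.ndarray) -> list[tuple[str, str]]:
    """All 21 (chosen, rejected) pairs over the 7 selected points,
    with the higher-reward response as chosen."""
    idx = select_reward_points(rewards)
    pairs = []
    for i, j in combinations(idx, 2):
        hi, lo = (i, j) if rewards[i] >= rewards[j] else (j, i)
        pairs.append((responses[hi], responses[lo]))
    return pairs
```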
AlpacaEval 2 Results

Main Results
- To achieve superior performance, the chosen response should be selected from \left\{\max,\ \mu+2\sigma\right\}, while the rejected response should be selected from \left\{\mu-2\sigma\right\}.
- Preference pairs with small reward margins usually perform poorly: if the reward of the chosen response is only slightly higher than that of the rejected response, models trained on them do not achieve satisfactory performance.
- When the rejected responses are appropriately selected, the performance of the trained models improves as the reward of the chosen responses increases.

- None of the preference pairs degrade the performance of the SFT checkpoint, which indicates the robustness of DPO training.
Training Dynamics for Interpretation

- Increasing the reward margin between the chosen and rejected responses facilitates training: the training loss converges to a lower value as the reward margin increases.
- There is a strong correlation between the converged loss and model performance: models that reach lower loss values tend to perform better, indicating that minimizing the loss effectively enhances model capabilities (the DPO objective is recalled below for reference).
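For reference, this is consistent with the standard DPO objective (Rafailov et al., 2023): the per-pair loss term -log σ(·) can only approach zero when the implicit reward margin inside the sigmoid becomes large, and pairs with a larger RM score margin appear to make that easier to reach.

\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]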
Proposed Preference Dataset Construction Strategy
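A minimal sketch of the construction strategy suggested by the findings above (my reading of the slides, not the paper's exact algorithm, and assuming per-prompt reward arrays as before): take the chosen response from the top of the reward distribution (the max, near μ+2σ) and the rejected response from around μ−2σ.

```python
import numpy as np

def construct_preference_pair(responses: list[str], rewards: np.ndarray) -> tuple[str, str]:
    """One (chosen, rejected) pair per prompt:
    chosen   = highest-reward sample (near the max / mu + 2*sigma region),
    rejected = sample whose reward is closest to mu - 2*sigma."""
    mu, sigma = rewards.mean(), rewards.std()
    chosen_idx = int(rewards.argmax())
    rejected_idx = int(np.abs(rewards - (mu - 2 * sigma)).argmin())
    return responses[chosen_idx], responses[rejected_idx]
```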

Results

Reward Model: ArmoRM
The performance of trained models steadily improves as we increase the number of samples per prompt from 5 to 200, although with diminishing returns in some cases.
Results with Skywork RM

Model: Llama3-8B-instruct
Results on Academic Benchmarks

No performance drop
Thanks