Name: Xiao Yao (2nd-year PhD)
Education: 2013-2017 Tianjin U
2019-2022 Shanghai Jiao Tong U
2023-present SUTD
Supervisors: Roy Lee, Li Xiaoli
Current interests: Reasoning, Alignment
Date: 0110
Xiao Yao
To be submitted to ACL 2025
A conventional preference data construction pipeline (sketched below)
Models: Mistral-7B-v1/instruct, Llama3-8B/instruct
Reward model: ArmoRM
Prompts: UltraFeedback
Evaluation: AlpacaEval 2
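A minimal sketch of the conventional pipeline named above: sample several responses per prompt, score them with the reward model, and keep the best/worst as the (chosen, rejected) pair. This assumes Hugging Face `transformers`; `reward_fn` is a hypothetical stand-in for ArmoRM scoring, and the decoding settings and sample count are illustrative, not the exact experimental configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def build_preference_pair(prompt, policy, tokenizer, reward_fn, n_samples=8):
    """Sample n responses for one prompt, score them, keep best/worst as a pair."""
    inputs = tokenizer(prompt, return_tensors="pt").to(policy.device)
    with torch.no_grad():
        outputs = policy.generate(
            **inputs,
            do_sample=True,              # stochastic decoding -> diverse samples
            temperature=0.8,
            max_new_tokens=512,
            num_return_sequences=n_samples,
        )
    prompt_len = inputs["input_ids"].shape[1]
    responses = [
        tokenizer.decode(o[prompt_len:], skip_special_tokens=True) for o in outputs
    ]
    scores = [reward_fn(prompt, r) for r in responses]  # e.g. ArmoRM scores
    chosen = responses[scores.index(max(scores))]
    rejected = responses[scores.index(min(scores))]
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

# Usage (illustrative):
# tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
# lm = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
# pair = build_preference_pair("Explain DPO.", lm, tok, my_reward_fn)
```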
How should we construct preference data for DPO given a sufficient sampling budget?
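For reference, the constructed pairs are consumed by the standard DPO objective (Rafailov et al., 2023), where y_w is the chosen and y_l the rejected response:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
```

Here σ is the logistic function and β controls how far the policy π_θ may drift from the reference policy π_ref; the quality of the (y_w, y_l) pairs directly shapes this objective, which is why the construction question above matters.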
Implementation (pair-selection sketch below):
- Position
- Pairs
- Reward model: ArmoRM
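One possible reading of the "Position" and "Pairs" knobs on this slide: "position" picks which ranks in the reward-sorted sample list form a pair, and "pairs" sets how many pairs are kept per prompt. The functions below are an illustrative sketch under that assumption, not the authors' exact implementation.

```python
def pair_at_positions(responses, scores, chosen_rank=0, rejected_rank=-1):
    """Pick one (chosen, rejected) pair by rank in the reward-sorted list."""
    ranked = [r for _, r in sorted(zip(scores, responses), key=lambda t: -t[0])]
    return ranked[chosen_rank], ranked[rejected_rank]

def select_pairs(responses, scores, n_pairs=1):
    """Pair the i-th best response with the i-th worst, for n_pairs pairs."""
    ranked = [r for _, r in sorted(zip(scores, responses), key=lambda t: -t[0])]
    return [(ranked[i], ranked[-(i + 1)]) for i in range(n_pairs)]
```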
The performance of trained models improves steadily as the number of samples increases from 5 to 200, though with diminishing returns in some cases.
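A toy illustration of why diminishing returns are expected, under the simplifying assumption that per-sample rewards behave like i.i.d. normal draws: the expected best-of-N reward grows only roughly like sqrt(2 ln N), so extra samples help less and less.

```python
import random
import statistics

def expected_best_of_n(n, trials=2000):
    """Monte Carlo estimate of E[max of n standard-normal reward draws]."""
    return statistics.mean(
        max(random.gauss(0.0, 1.0) for _ in range(n)) for _ in range(trials)
    )

for n in (5, 25, 50, 100, 200):
    print(n, round(expected_best_of_n(n), 3))
# The 100 -> 200 increment is far smaller than 5 -> 25: diminishing returns.
```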
Model: Llama3-8B-instruct
No performance drop is observed.