Xiao Yao
0915
Training language models to follow instructions with human feedback (NeurIPS 2022)

Pipeline: the policy model 🔥 samples responses to instructions drawn from the dataset; a reward model scores each response, and the policy is updated with PPO (🔥 marks the model being trained).
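For reference, a minimal sketch of the clipped PPO update at the core of this pipeline. Everything here is a toy stand-in (the linear `policy` and the random `states`, `actions`, and `rewards` are illustrative, not the paper's setup):

```python
import torch

torch.manual_seed(0)
vocab, hidden = 8, 16
policy = torch.nn.Linear(hidden, vocab)    # toy stand-in for the policy LM
states = torch.randn(32, hidden)           # toy prompt/context representations
actions = torch.randint(0, vocab, (32,))   # sampled response tokens
old_logprobs = (torch.log_softmax(policy(states), -1)
                .gather(1, actions[:, None]).squeeze(1).detach())
rewards = torch.randn(32)                  # stand-in for reward-model scores
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
for _ in range(4):                         # a few PPO epochs on the same batch
    logprobs = (torch.log_softmax(policy(states), -1)
                .gather(1, actions[:, None]).squeeze(1))
    ratio = torch.exp(logprobs - old_logprobs)  # importance ratio pi / pi_old
    clipped = torch.clamp(ratio, 0.8, 1.2)      # clip to [1-eps, 1+eps], eps = 0.2
    loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

(The real pipeline also penalizes KL divergence from the SFT model in the reward; that term is omitted here for brevity.)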
The Wisdom of Hindsight Makes Language Models Better Instruction Followers (ICML 2023)
Pipeline: the policy model 🔥 samples responses to instructions drawn from the dataset, and a judgement step marks each response as correct or wrong. The instruction is then relabeled in hindsight to match the outcome, yielding "Generate a correct answer to this problem" + response or "Generate a wrong answer to this problem" + response, and the relabeled pairs are used for SFT.
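The relabeling step is simple enough to sketch directly. The `relabel` helper and the boolean correctness flag below are hypothetical names, not from the paper:

```python
def relabel(instruction: str, response: str, correct: bool) -> dict:
    """Rewrite the instruction in hindsight so it matches the sampled outcome."""
    prefix = ("Generate a correct answer to this problem" if correct
              else "Generate a wrong answer to this problem")
    # Both correct and wrong samples become usable SFT pairs after relabeling.
    return {"prompt": f"{prefix}: {instruction}", "completion": response}

# Usage: turn judged (instruction, response, correct?) triples into SFT data.
samples = [("What is 2 + 2?", "4", True),
           ("What is 2 + 2?", "5", False)]
sft_data = [relabel(i, r, ok) for i, r, ok in samples]
```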
Direct Preference Optimization: Your Language Model is Secretly a Reward Model (NeurIPS 2023)
Stage 1: SFT the policy model 🔥 on the instruction-response pairs.
Stage 2: optimize the policy model 🔥 directly on preference data, where each instruction is paired with two responses (Instruction + Response 1, Instruction + Response 2) and the preferred response is pushed up relative to the other.
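Stage 2 uses the DPO loss from the paper; the sketch below assumes precomputed per-sequence log-probabilities under the policy and the frozen SFT reference model (tensor names are illustrative):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta: float = 0.1):
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_logratio = policy_chosen_lp - ref_chosen_lp
    rejected_logratio = policy_rejected_lp - ref_rejected_lp
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Toy usage with random log-probs standing in for real model outputs.
batch = lambda: torch.randn(4)
loss = dpo_loss(batch(), batch(), batch(), batch())
```

Here beta controls how far the policy is allowed to drift from the SFT reference.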
Ours:
Stage 1: SFT the policy model 🔥 on the instruction-response pairs.
Stage 2: prepend a hindsight tag to each instruction and train the policy model 🔥 on all four tagged sequences:
- Instruction + [GOOD] + Response 1
- Instruction + [GOOD] + Response 2
- Instruction + [BAD] + Response 1
- Instruction + [BAD] + Response 2
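A sketch of how one preference record might be expanded into the four tagged sequences above; the field names and the `tag_pair` helper are hypothetical, and the slide does not show how the four sequences are weighted in the loss:

```python
def tag_pair(record: dict) -> list[dict]:
    """Expand one preference pair into the four tagged training sequences."""
    out = []
    for tag in ("[GOOD]", "[BAD]"):
        for response in (record["chosen"], record["rejected"]):
            out.append({"prompt": f"{record['instruction']} {tag}",
                        "completion": response})
    return out

# Usage: one preference record yields four (prompt, completion) examples.
example = {"instruction": "Explain photosynthesis.",
           "chosen": "Plants convert light into chemical energy...",
           "rejected": "Photosynthesis is when animals eat plants."}
tagged = tag_pair(example)
```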
Training Dataset: HH, published by Anthropic (160k preference pairs)
Model: Llama 3 8B
Evaluation: AlpacaEval 2
| Method | Win Rate (%) | LC (Length-Controlled) Win Rate (%) |
|---|---|---|
| SFT | 0.9 | 1.45 |
| DPO | 4.08 | 4.88 |
| Ours | 4.08 | 5.61 |
Training Dataset: UltraFeedback, published by Tsinghua (60k preference pairs)
Model: Llama 3 8B
Evaluation: AlpacaEval 2