Xiao Yao
0915
Training language models to follow instructions with human feedback (NeurIPS 2022)

Pipeline: the policy model 🔥 samples responses to instructions drawn from the dataset; a reward model scores each response, and the policy is updated with PPO (🔥 marks the model being trained).
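For reference, a minimal sketch of the clipped PPO update at the core of this pipeline. Everything here is a toy stand-in (the linear `policy` and the random `states`, `actions`, and `rewards` are illustrative, not the paper's setup):

```python
import torch

torch.manual_seed(0)
vocab, hidden = 8, 16
policy = torch.nn.Linear(hidden, vocab)    # toy stand-in for the policy LM
states = torch.randn(32, hidden)           # toy prompt/context representations
actions = torch.randint(0, vocab, (32,))   # sampled response tokens
old_logprobs = (torch.log_softmax(policy(states), -1)
                .gather(1, actions[:, None]).squeeze(1).detach())
rewards = torch.randn(32)                  # stand-in for reward-model scores
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
for _ in range(4):                         # a few PPO epochs on the same batch
    logprobs = (torch.log_softmax(policy(states), -1)
                .gather(1, actions[:, None]).squeeze(1))
    ratio = torch.exp(logprobs - old_logprobs)  # importance ratio pi / pi_old
    clipped = torch.clamp(ratio, 0.8, 1.2)      # clip to [1-eps, 1+eps], eps = 0.2
    loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

(The real pipeline also penalizes KL divergence from the SFT model in the reward; that term is omitted here for brevity.)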
The Wisdom of Hindsight Makes Language Models Better Instruction Followers (ICML 2023)
Pipeline: the policy model 🔥 samples responses to instructions drawn from the dataset, and a judgement step marks each response as correct or wrong. The instruction is then relabeled in hindsight to match the outcome, yielding "Generate a correct answer to this problem" + response or "Generate a wrong answer to this problem" + response, and the relabeled pairs are used for SFT.
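The relabeling step is simple enough to sketch directly. The `relabel` helper and the boolean correctness flag below are hypothetical names, not from the paper:

```python
def relabel(instruction: str, response: str, correct: bool) -> dict:
    """Rewrite the instruction in hindsight so it matches the sampled outcome."""
    prefix = ("Generate a correct answer to this problem" if correct
              else "Generate a wrong answer to this problem")
    # Both correct and wrong samples become usable SFT pairs after relabeling.
    return {"prompt": f"{prefix}: {instruction}", "completion": response}

# Usage: turn judged (instruction, response, correct?) triples into SFT data.
samples = [("What is 2 + 2?", "4", True),
           ("What is 2 + 2?", "5", False)]
sft_data = [relabel(i, r, ok) for i, r, ok in samples]
```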
Direct Preference Optimization: Your Language Model is Secretly a Reward Model (NeurIPS 2023)
Stage 1: SFT the policy model 🔥 on the instruction-response pairs.
Stage 2: optimize the policy model 🔥 directly on preference data, where each instruction is paired with two responses (Instruction + Response 1, Instruction + Response 2) and the preferred response is pushed up relative to the other.
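Stage 2 uses the DPO loss from the paper; the sketch below assumes precomputed per-sequence log-probabilities under the policy and the frozen SFT reference model (tensor names are illustrative):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta: float = 0.1):
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_logratio = policy_chosen_lp - ref_chosen_lp
    rejected_logratio = policy_rejected_lp - ref_rejected_lp
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Toy usage with random log-probs standing in for real model outputs.
batch = lambda: torch.randn(4)
loss = dpo_loss(batch(), batch(), batch(), batch())
```

Here beta controls how far the policy is allowed to drift from the SFT reference.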
Ours:
Stage 1: SFT the policy model 🔥 on the instruction-response pairs.
Stage 2: prepend a hindsight tag to each instruction and train the policy model 🔥 on all four tagged sequences:
- Instruction + [GOOD] + Response 1
- Instruction + [GOOD] + Response 2
- Instruction + [BAD] + Response 1
- Instruction + [BAD] + Response 2
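A sketch of how one preference record might be expanded into the four tagged sequences above; the field names and the `tag_pair` helper are hypothetical, and the slide does not show how the four sequences are weighted in the loss:

```python
def tag_pair(record: dict) -> list[dict]:
    """Expand one preference pair into the four tagged training sequences."""
    out = []
    for tag in ("[GOOD]", "[BAD]"):
        for response in (record["chosen"], record["rejected"]):
            out.append({"prompt": f"{record['instruction']} {tag}",
                        "completion": response})
    return out

# Usage: one preference record yields four (prompt, completion) examples.
example = {"instruction": "Explain photosynthesis.",
           "chosen": "Plants convert light into chemical energy...",
           "rejected": "Photosynthesis is when animals eat plants."}
tagged = tag_pair(example)
```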
Training Dataset: HH, published by Anthropic (160k preference pairs)
Model: Llama 3 8B
Evaluation: AlpacaEval 2
| Method | Win Rate (%) | LC (Length-Controlled) Win Rate (%) |
|---|---|---|
| SFT | 0.9 | 1.45 |
| DPO | 4.08 | 4.88 |
| Ours | 4.08 | 5.61 |
Training Dataset: UltraFeedback, published by Tsinghua (60k preference pairs)
Model: Llama 3 8B
Evaluation: AlpacaEval 2