Adversarial Alignment

Xiao Yao

0915

Background: RLHF

[Pipeline diagram] An instruction from the dataset is fed to the policy model 🔥 (trainable), which samples responses; a reward model scores the responses, and the policy is updated with PPO.

Training language models to follow instructions with human feedback (NeurIPS 2022)
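A minimal PyTorch sketch of the PPO-style update on reward-model scores (not the paper's implementation; the KL-penalty reward shaping and the absence of a value network are simplifying assumptions):

```python
import torch

def ppo_clipped_loss(logp_new, logp_old, logp_ref, rewards, clip_eps=0.2, kl_coef=0.1):
    """PPO-style clipped surrogate on reward-model scores with a KL penalty
    to the frozen reference (SFT) policy. All inputs are per-response sums
    of token log-probabilities, shape [batch]."""
    # KL-penalised reward, as in common RLHF setups
    shaped_reward = rewards - kl_coef * (logp_new.detach() - logp_ref)
    # Simplification: advantage = centred reward (no value network in this sketch)
    advantage = shaped_reward - shaped_reward.mean()
    ratio = torch.exp(logp_new - logp_old)            # importance ratio pi_new / pi_old
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantage
    return -torch.min(unclipped, clipped).mean()      # maximise surrogate => minimise negative

# Dummy example: 4 sampled responses
logp_new = torch.randn(4, requires_grad=True)
logp_old = logp_new.detach() + 0.05 * torch.randn(4)
logp_ref = logp_new.detach() + 0.10 * torch.randn(4)
rewards = torch.randn(4)                              # reward-model scores
loss = ppo_clipped_loss(logp_new, logp_old, logp_ref, rewards)
loss.backward()
```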

Background: Hindsight

The Wisdom of Hindsight Makes Language Models Better Instruction Followers (ICML 2023)

[Pipeline diagram] An instruction from the dataset is fed to the policy model 🔥, which samples responses. A judgement step labels each response: correct responses are relabeled as "Generate a correct answer to this problem" + response, wrong responses as "Generate a wrong answer to this problem" + response, and the model is then trained with SFT on the relabeled pairs.
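A minimal sketch of the hindsight relabeling step, assuming a placeholder judge() that decides whether a sampled response is correct (prompt templates and field names are illustrative, not the paper's code):

```python
def relabel_with_hindsight(instruction, responses, judge):
    """Hindsight relabeling: keep every sampled response, but rewrite the
    instruction according to whether the response was judged correct.
    The relabeled pairs are then used for ordinary SFT."""
    sft_examples = []
    for response in responses:
        if judge(instruction, response):
            prompt = f"Generate a correct answer to this problem: {instruction}"
        else:
            prompt = f"Generate a wrong answer to this problem: {instruction}"
        sft_examples.append({"prompt": prompt, "completion": response})
    return sft_examples

# Toy usage with a trivial judge
examples = relabel_with_hindsight(
    "What is 2 + 2?",
    ["4", "5"],
    judge=lambda q, r: r.strip() == "4",
)
```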

Background: DPO

Direct Preference Optimization: Your Language Model is Secretly a Reward Model (NeurIPS 2023)

Stage 1: SFT on instruction-response pairs with the policy model 🔥.

Stage 2: The policy model 🔥 is trained on preference pairs, each instruction paired with two candidate responses:
  Instruction + Response 1
  Instruction + Response 2
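For reference, the Stage 2 objective can be sketched as the standard DPO loss on (chosen, rejected) pairs; the tensors below are dummy per-sequence log-probabilities:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct Preference Optimization loss. Inputs are summed log-probabilities
    of the chosen / rejected responses under the trainable policy and the
    frozen reference (Stage 1 SFT) model."""
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Dummy example: 4 preference pairs
pol_c = torch.randn(4, requires_grad=True)
pol_r = torch.randn(4, requires_grad=True)
ref_c, ref_r = torch.randn(4), torch.randn(4)
loss = dpo_loss(pol_c, pol_r, ref_c, ref_r)
loss.backward()
```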

My Trial

Stage 1: SFT on instruction-response pairs with the policy model 🔥.

Stage 2: The policy model 🔥 is trained on conditioned sequences, expanding each preference pair into every combination of a control token and a response:
  Instruction + [GOOD] + Response 1
  Instruction + [GOOD] + Response 2
  Instruction + [BAD] + Response 1
  Instruction + [BAD] + Response 2
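The slide only shows the Stage 2 inputs, so the sketch below covers just the data construction: each preference pair is expanded into four sequences by inserting a [GOOD] or [BAD] control token after the instruction. The token strings, field names, and separator are assumptions, and the loss applied to these sequences is not specified here.

```python
def build_conditioned_pairs(instruction, response_1, response_2,
                            good_tok="[GOOD]", bad_tok="[BAD]"):
    """Expand one preference pair into the four conditioned sequences shown
    on the slide: every (control token, response) combination is kept."""
    return [
        {"prompt": f"{instruction} {good_tok}", "response": response_1},
        {"prompt": f"{instruction} {good_tok}", "response": response_2},
        {"prompt": f"{instruction} {bad_tok}",  "response": response_1},
        {"prompt": f"{instruction} {bad_tok}",  "response": response_2},
    ]

pairs = build_conditioned_pairs(
    "Explain photosynthesis.",
    "Plants convert light into chemical energy ...",
    "Photosynthesis is when animals eat plants ...",
)
```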

My Preliminary Experiment

Training Dataset: HH-RLHF, published by Anthropic (160K preference pairs)

Model: Llama3 8B

Evaluation: AlpacaEval 2

Method   Win Rate (%)   LC Win Rate (%)
SFT      0.9            1.45
DPO      4.08           4.88
Ours     4.08           5.61

My Preliminary Experiment

Training Dataset: UltraFeedback, from Tsinghua (60K preference pairs)

Model: Llama3 8B

Evaluation: AlpacaEval 2
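A hedged loading sketch for both training sets via the Hugging Face Hub; the repository IDs Anthropic/hh-rlhf and openbmb/UltraFeedback are the commonly used releases, and the exact column layout should be inspected rather than assumed:

```python
from datasets import load_dataset

# ~160K preference pairs with "chosen"/"rejected" text columns
hh = load_dataset("Anthropic/hh-rlhf", split="train")
print(hh.column_names, len(hh))

# ~60K prompts with multiple rated completions; these need to be binarized
# into (chosen, rejected) pairs before DPO-style training
uf = load_dataset("openbmb/UltraFeedback", split="train")
print(uf.column_names, len(uf))
```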
