Supervised by Luca Maria Aiello
Despite the emerging capabilities of large language models (LLMs) bringing AI into the public discourse, the technology itself often struggles to engage responsively in such conversational settings. Being primarily pre-trained to solve tasks in fully realised contexts, LLMs often have difficulty interpreting the partial observations inherent to these low-latency inference settings. While previous work has explored more efficient adaptations to address these challenges, such methods often rely on extensive prompt engineering and/or imitation learning, making them less scalable and less robust to novel settings. Meanwhile, recent reinforcement learning (RL) practices have successfully begun to ground LLM agents in online environments; however, whether this performance holds under partial observations remains unknown. We propose Hold The Line (HTL), a novel RL environment tailored to test how input fragmentation affects the performance of an LLM agent tasked with patiently navigating an Interactive Voice Response (IVR) system based on a user goal. We find that receiving partial observations has a significant negative impact on performance: the agent peaks at only 13% terminal accuracy, compared to 91% with full observations. We discuss how several challenging factors, such as sparsity, reasoning derailment, and reward sensitivity, might drive this gap, and suggest future directions to help mitigate them.
Attention
Computational Requirements
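Presumably this refers to standard scaled dot-product attention, whose compute and memory grow quadratically with context length, a key constraint for low-latency inference:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,
\qquad \text{cost } \mathcal{O}(n^2 d) \text{ for a length-}n\text{ context}
```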
Low-latency Environments
Examples of Low-latency Environments
Implications of Low-latency Environments
High frequency inference
Input fragmentation
Embodied
Full-duplex Dialogue
What is Full-duplex Dialogue?
What are partial observations exactly?
[Figure: the sentence "What are partial observations exactly?" shown once as a full observation and once arriving as partial fragments ("What are" / "partial observations" / "exactly?"), with the agent taking an Action after each fragment.]
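As a toy illustration (not the thesis' actual chunking code), the same sentence can be served as partial observations by yielding it a few words at a time:

```python
def fragment(sentence: str, chunk_size: int = 2):
    """Yield partial observations: the text arrives a few words at a time,
    and the agent must choose an action after every fragment."""
    words = sentence.split()
    for i in range(0, len(words), chunk_size):
        yield " ".join(words[i:i + chunk_size])

# list(fragment("What are partial observations exactly?"))
# -> ['What are', 'partial observations', 'exactly?']
```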
Hypothesis
Full-duplex Dialogue
Evolution of Spoken Dialogue Systems
Intelligent Assistants
such as Google Assistant, Siri and Alexa
Dialogue State Tracking
Google Duplex
DulVRS
POI attribution acquisition by Baidu
A Full-duplex Speech Dialogue Scheme Based On Large Language Model
Wang et al., 2024
Prompt Engineering
Reinforcement Learning
Proximal Policy Optimisation Algorithms (PPO)
Schulman et al., 2017 (OpenAI)
[Diagram: PPO actor-critic loop (Schulman et al., 2017) — the Actor selects actions, the Critic estimates a Value, and rewards (+1 / -1) are compared against that Value to form an Advantage, which is added to or subtracted from the Actor's update.]
[Slide: PPO's Clipped Surrogate Objective (Schulman et al., 2017), annotated with the Policy Probability Ratio and the Advantage Function.]
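For reference, the clipped surrogate objective from Schulman et al., 2017, where the policy probability ratio r_t(θ) is weighted by the advantage estimate Â_t and clipped to keep updates conservative:

```latex
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)},
\qquad
L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\big(r_t(\theta)\,\hat{A}_t,\;
\operatorname{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\right]
```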
Actions are single probabilities
LLM Compatibility Problem
Grounding Large Language Models in Interactive Environments with Online Reinforcement Learning (GLAM)
Carta et al., 2023
Action Probability as the Product of its Token Probabilities (sketch below)
Small Encoder-Decoder Model (FLAN-T5 780M)
Value head (Critic) added to the Model
Minimalistic System Prompt
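A minimal sketch of the idea (GLAM itself uses an encoder-decoder FLAN-T5; this assumes a generic Hugging Face-style causal LM interface for brevity): the probability of a multi-token action is the product of its token probabilities, i.e. the sum of their log-probabilities. The resulting per-action scores over the environment's action set can then be softmax-normalised into a policy distribution.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def action_log_prob(model, tokenizer, prompt: str, action: str) -> torch.Tensor:
    """log P(action | prompt) = sum of token log-probs (a product in probability space)."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    action_ids = tokenizer(action, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, action_ids], dim=-1)

    logits = model(input_ids).logits                       # (1, seq_len, vocab)
    # Logits at position t predict token t+1, so take the slice covering the action tokens.
    n = action_ids.shape[-1]
    log_probs = F.log_softmax(logits[:, -n - 1:-1, :], dim=-1)
    token_log_probs = log_probs.gather(-1, action_ids.unsqueeze(-1)).squeeze(-1)
    return token_log_probs.sum(-1)
```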
True Knowledge Comes From Practice: Aligning LLMs With Embodied Environments Via Reinforcement Learning (TWOSOME)
Tan et al., 2024
Token & Word Normalisation (see below)
Decoder-only Model
LoRA Adapter
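Up to notation, the normalisation in Tan et al., 2024 divides the summed token log-probabilities by the action's length, so longer actions are not unfairly penalised, using either the token count or the word count:

```latex
\log \tilde{\pi}_{\text{tok}}(a \mid s) = \frac{1}{N_{\text{tok}}(a)} \sum_{k=1}^{N_{\text{tok}}(a)} \log P(w_k \mid s, w_{<k}),
\qquad
\log \tilde{\pi}_{\text{word}}(a \mid s) = \frac{1}{N_{\text{word}}(a)} \sum_{k=1}^{N_{\text{tok}}(a)} \log P(w_k \mid s, w_{<k})
```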
[Figure: positioning of prior work — Full Duplex Dialogue (Wang et al., LSLM, SyncLM, Moshi, SALMONN-omni): low latency, partial observations, imitation learning; RL using LLMs (GLAM, BAD/POAD, TWOSOME): turn based, full observations, RL in embodied environments.]
The Gap
Research Question
How do partial observations impact the fine-tuning of LLMs when using reinforcement learning?
"
"
Scope
Data Generation
o4-mini
Simulation Design
Fine-Tuning Methods
Equal-length Action Space
Integrated Value head (Critic)
Minimalistic System Prompt
Decoder-only Model
LoRA Adapter
Advantage Normalisation (see sketch below)
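A hypothetical sketch of two of these pieces, assuming a Hugging Face-style causal LM (names and shapes are illustrative, not the thesis' exact implementation): a scalar value head on the final hidden state acts as the integrated critic, and advantages are standardised per batch.

```python
import torch
import torch.nn as nn

class ActorCriticLM(nn.Module):
    """Decoder-only LM (e.g. with a LoRA adapter) plus a scalar value head as the critic."""
    def __init__(self, base_lm: nn.Module, hidden_size: int):
        super().__init__()
        self.lm = base_lm
        self.value_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids, attention_mask=None):
        out = self.lm(input_ids, attention_mask=attention_mask, output_hidden_states=True)
        last_hidden = out.hidden_states[-1]           # (batch, seq, hidden)
        value = self.value_head(last_hidden[:, -1])   # state value from the last token (assumes no right padding)
        return out.logits, value.squeeze(-1)

def normalise_advantages(adv: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Advantage normalisation: zero-mean, unit-variance advantages stabilise PPO updates."""
    return (adv - adv.mean()) / (adv.std() + eps)
```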
Methodological Advancements
Memory Optimisations
Key-Value Caching (sketch below)
Chat Completion
Optimised Action Sampling
Transceiver Pattern
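A minimal sketch of the key-value caching idea, assuming a Hugging Face-style past_key_values interface (the thesis' actual transceiver implementation may differ): the context processed so far is cached, and only the newly arrived tokens are fed on each high-frequency step.

```python
import torch

@torch.no_grad()
def step_with_cache(model, tokenizer, cache, new_text: str):
    """Process only the newly arrived observation chunk, reusing the cached keys/values."""
    new_ids = tokenizer(new_text, add_special_tokens=False, return_tensors="pt").input_ids
    out = model(new_ids, past_key_values=cache, use_cache=True)
    return out.logits, out.past_key_values

# Usage sketch: encode the system prompt once, then append each chunk as it arrives.
# out = model(prompt_ids, use_cache=True)
# logits, cache = step_with_cache(model, tokenizer, out.past_key_values, "For billing, ")
# logits, cache = step_with_cache(model, tokenizer, cache, "press 8.")
```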
Randomising Dial Numbers
Complete Input
84% Initial Terminal Accuracy
Correctly guessing between 9 Dial Options 2 consecutive times: (1/9)^2 ≈ 1.2% Terminal Accuracy
Full Observations
50% Initial Terminal Accuracy
91% Peak Terminal Accuracy
Partial Observations
1.1% Initial Terminal Accuracy
13.0% Peak Terminal Accuracy
Full Observations
91% Peak Terminal Accuracy
Perfect Wait for Welcome Ability
Appropriate Patience
Partial Observations
Only 13% Peak Terminal Accuracy
10x Improvement
Chance-level Initial Terminal Accuracy
Impact of Observational Settings
Can we trust the Evaluation?
Data Quality
Impact of Overfitting
Non-Deterministic Evaluation
Yes (for the most part)
Investigating Partial Observations
What if we extended the fine-tuning?
Sparsity
Hypothesis
Sparse rewards confuse the policy and critic about the objective of the environment
Sparsity
Experiment 1
Reduced Call Trees to a binary decision to test the impact of shorter episode lengths
Finding (1)
Potential sensitivity to the reward function
Finding (2)
The improvement in baseline performance indicated that cleaning the prompt context improved the model's underlying understanding ability
Sparsity
Experiment 2
Incrementally increased the chunking frequency to test its impact on baseline performance
Finding
Sparsity seems to hurt the model’s baseline understanding ability
Prompt Context Noise
Hypothesis
Previous observations and actions can negatively impact future reasoning
Prompt Context Noise
Experiment
Removed wait for input from the prompt context
[Figure: an example prompt context ("For billing ..." / "... technical support ..." / "press 8.") interleaved with the agent's Wait and Press 8 actions, shown before and after the wait-for-input entries are removed.]
Finding
Signs of reasoning derailment
Reward Sensitivity
Hypothesis
Slight variations in the reward function can cause dramatically different outcomes
Reward Sensitivity
Experiment
Slightly lowered the punishment of waiting on silence
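For illustration only (the values and structure below are hypothetical, not the thesis' actual reward function), the experiment amounts to tweaking a single penalty term and observing how strongly the fine-tuning outcome changes:

```python
def reward(action: str, ivr_is_silent: bool, reached_correct_terminal: bool,
           silence_wait_penalty: float = -0.1) -> float:
    """Hypothetical HTL-style reward: the experiment slightly lowers silence_wait_penalty
    (e.g. -0.1 -> -0.05); all other intermediate steps stay unrewarded (sparse)."""
    if reached_correct_terminal:
        return 1.0                                            # success at the goal terminal
    if action == "Wait":
        return silence_wait_penalty if ivr_is_silent else 0.0 # only waiting on silence is punished
    return 0.0                                                # other intermediate actions: no immediate reward
```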
Finding
The fine-tuning seemed to be too sensitive to the reward function
Conclusion
Confirmed that Partial Observations hurt RL fine-tuning
Sparsity
Reasoning Derailment
Reward Sensitivity
Future Directions
HTL: A Novel RL Environment
Patience Is All You Need