How GenAI is used in the legal sector
Chris Price
Evaluate an LLM (Large Language Model) based QA (Question Answering) system focussed on reviewing legal documents (e.g. contracts)
Intentionally naive implementation, to set a performance baseline: 50%
For comparison, existing manual techniques: >95%
Picking the best performing technique per sample: 85%
Timescale: 6 months
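To make the 85% figure concrete: a minimal sketch of the "oracle" calculation, where a sample counts as correct if any one of the candidate techniques answered it correctly. The technique names and results here are illustrative, not our actual data.

```python
# Hypothetical per-sample results: True where a technique answered a
# sample correctly. Real data would come from the evaluation harness.
results = {
    "naive_prompt":      [True, False, False, True],
    "decomposed_prompt": [False, True, False, True],
    "rag":               [True, True, False, False],
}

n_samples = len(next(iter(results.values())))

# Oracle ensemble: a sample counts as correct if *any* technique got it
# right. This is the ceiling you would reach by picking the best
# performing technique per sample.
oracle_correct = sum(
    any(results[t][i] for t in results) for i in range(n_samples)
)
print(f"Oracle accuracy: {oracle_correct / n_samples:.0%}")
```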
Example question: "Who are the parties to these contracts?"
System user workflow (existing extraction-model approach):
Upload documents of interest
Specify extraction models of interest based upon the question
System extracts matching passages for each extraction model, for each document (sketched below)
Review extractions to evidence an answer to the question
Apply legal knowledge to refine the answer based on evidence
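The extraction step above, sketched in Python. The ExtractionModel interface (an extract() method returning matching passages) is hypothetical; it stands in for whatever extraction models the real system exposes.

```python
from dataclasses import dataclass

@dataclass
class Extraction:
    document: str
    model: str
    passage: str

def run_extractions(documents: dict[str, str], extraction_models) -> list[Extraction]:
    """Apply every extraction model to every document, collecting the
    matching passages for the user to review as evidence."""
    extractions = []
    for doc_name, doc_text in documents.items():
        for model in extraction_models:
            # extract() returning matching passages is a hypothetical interface
            for passage in model.extract(doc_text):
                extractions.append(Extraction(doc_name, model.name, passage))
    return extractions
```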
LLM-based workflow:
Upload documents of interest
Ask the question of the LLM based QA system to produce an initial, evidenced answer (a naive version is sketched below)
Apply legal knowledge to refine the answer based on evidence
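For contrast, a minimal sketch of the naive LLM-based flow using the OpenAI Python client. The prompt wording is illustrative; the model name is one mentioned later in this talk, but everything else here is an assumption, not the actual system.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer_question(documents: dict[str, str], question: str) -> str:
    """Naive baseline: stuff the full document text into the prompt and
    ask for an evidenced answer. Breaks down once the documents exceed
    the model's context window."""
    context = "\n\n".join(
        f"## {name}\n{text}" for name, text in documents.items()
    )
    resp = client.chat.completions.create(
        model="gpt-4-1106-preview",
        temperature=0,
        messages=[
            {"role": "system",
             "content": "Answer using only the documents provided. "
                        "Quote the passages that evidence your answer."},
            {"role": "user", "content": f"{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content

# e.g. answer_question(docs, "Who are the parties to these contracts?")
```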
We've been here before...
Does your problem require a bespoke AI solution now?
Can you directly empower the individuals within your team with this raw capability and observe the emergent behaviour?
Non-dev focused
Expectation management - Cherry-picked good examples rarely translate into consistently good performance.
Expectation management - More akin to customising a product than greenfield development.
Everyone engaged with model capability exploration (very low barrier to entry).
Expect rapid initial progress followed by prolonged and costly iteration. Significant investment in problem definition, testing and model calls.
Assessing QA performance is hard. (For a non-technical audience: by "performance" here we mean answer accuracy against a reference, not speed.)
Testing - Cycle time is fundamental to purposeful iteration.
Prompting models is slow, expensive & hard.
Prompting humans is slower, more expensive & harder.
Minimise human assessment. Actively engage with any human reviewers to understand the subtleties; there are parallels with old-fashioned opinions about separate test teams.
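One way to minimise human assessment: auto-mark the unambiguous cases and only escalate disagreements to a reviewer. A minimal sketch, assuming reference answers already exist; the normalisation rules are illustrative.

```python
import re

def normalise(answer: str) -> str:
    """Crude normalisation so trivially-different answers still match."""
    return re.sub(r"[^a-z0-9 ]", "", answer.lower()).strip()

def score(predictions: dict[str, str], references: dict[str, str]):
    """Auto-mark exact (normalised) matches; queue the rest for a human."""
    correct, needs_review = 0, []
    for sample_id, ref in references.items():
        if normalise(predictions[sample_id]) == normalise(ref):
            correct += 1
        else:
            needs_review.append(sample_id)
    return correct / len(references), needs_review
```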
Testing - Gathering sample data is hard.
Often better to decompose complex prompts/problems for better performance. This works around the model's "concentration" budget and bypasses much of the prompt engineering (at the cost of runtime); see the sketch below.
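A sketch of decomposition, reusing the hypothetical answer_question() from earlier: one focused sub-question per fact, with a final call to combine the findings. The sub-questions are illustrative.

```python
# One focused sub-question per fact we need, instead of a single
# compound prompt asking for everything at once. Illustrative only.
SUB_QUESTIONS = [
    "Which organisations are named as parties to this contract?",
    "Which individuals sign on behalf of each party?",
]

def decomposed_answer(documents: dict[str, str]) -> str:
    # Each call is simpler to reason about and to test in isolation,
    # at the cost of extra runtime and extra model calls.
    findings = [answer_question(documents, q) for q in SUB_QUESTIONS]
    return answer_question(
        {"findings": "\n\n".join(findings)},
        "Combine these findings into a single list of parties.",
    )
```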
A massive investment yields only a small amount of genuinely novel output, and it is easy for that knowledge to walk out of the door with people. Test data is the real value.
Test data is significantly more valuable than in non-AI development, and arguably more valuable than the implementation itself.
Ecosystem speed - gpt-4 (8k context) was cutting edge when we started; gpt-4-1106-preview (128k context) was available by the end.
Ecosystem speed - techniques are still evolving and acquiring names, e.g. RAG (Retrieval-Augmented Generation).
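RAG in miniature: retrieve the chunks most similar to the question and prompt over only those, so long documents fit within a limited context window. The embed() function below is a placeholder for whatever embedding model you choose; the retrieval logic is the part being illustrated.

```python
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    """Hypothetical: call your embedding model of choice and return
    one vector per input text."""
    raise NotImplementedError

def retrieve(question: str, chunks: list[str], k: int = 5) -> list[str]:
    """Rank chunks by cosine similarity to the question; keep the top k,
    which are then stuffed into the prompt instead of the full documents."""
    vectors = embed(chunks + [question])
    chunk_vecs, q_vec = vectors[:-1], vectors[-1]
    sims = chunk_vecs @ q_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q_vec)
    )
    top = np.argsort(sims)[-k:][::-1]
    return [chunks[i] for i in top]
```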
You will spend a lot of time developing things you would expect to already exist in a more typical development project; this is not yet a commoditised ecosystem.
Is waiting an option? Recall how the web landscape matured over time, and ask what the forcing function is for adopting AI now rather than later.
Dev focused
There's not much code to write, and the code that there is tends to be simplistic and inconsequential.
Langchain is a melting pot of good ideas.
Lots of on-the-surface minor changes that take a lot of time to implement.
Staying motivated is hard.
How will you spend your time as a dev of a QA system? Lots of frustration followed by short breakthrough periods. Breakthroughs are rewarding, and often genuinely new territory.
Easy to get distracted "writing code".
Immature tooling (is this the web in '95, or Spring Boot in 2015?)
User-focussed tooling (UX considerations, unpredictable/adversarial input) but visually uninspiring and unimpressive; more similar to a backend project than a frontend project.
From one developer to another, saying...
"it's a frustrating-to-use melting pot of good (and bad) ideas"
"there's very little value in it"
...is hard.