A GenAI case study
How GenAI is used in the legal sector
Chris Price
PROBLEM
Evaluate an LLM (Large Language Model) based QA (Question Answering) system focused on reviewing legal documents (e.g. contracts)

RESULTS
- Baseline: 50% (intentionally naive implementation to set a performance baseline)
- Target: >95% (performance comparable to existing manual techniques)
- Achieved: 85% (picking the best-performing technique per sample)
Background
The Engagement
- A Legal Technology Company
- Offering technology solutions to improve document creation, analysis, and management for legal teams
- A 6-month engagement

"Who are the parties to these contracts?"
– System User
The manual workflow:
1. Upload documents of interest
2. Specify extraction models of interest based upon the question
3. System extracts matching passages for each extraction model, for each document
4. Review extractions to evidence an answer to the question
5. Apply legal knowledge to refine the answer based on evidence
The LLM-assisted workflow:
1. Upload documents of interest
2-4. Ask a question of the LLM-based QA system to produce an initial evidenced answer (sketched below)
5. Apply legal knowledge to refine the answer based on evidence
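
The deck contains no code, but a minimal sketch of what steps 2-4 can look like may help. It assumes the openai Python package (>= 1.0), an OPENAI_API_KEY in the environment, and naive prompt stuffing; every name and prompt here is illustrative, not the engagement's actual implementation:

```python
# A minimal sketch of steps 2-4: a single LLM call producing an
# evidenced answer. Illustrative only.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY in the environment

def answer_with_evidence(question: str, documents: dict[str, str]) -> str:
    """Answer `question` from `documents`, quoting supporting passages."""
    # Naively stuff every document into the prompt. With the original
    # 8k-token gpt-4 this forces aggressive truncation; the later 128k
    # models make naive stuffing viable for many contracts.
    corpus = "\n\n".join(
        f"--- {name} ---\n{text}" for name, text in documents.items()
    )
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # deterministic-ish output helps later assessment
        messages=[
            {
                "role": "system",
                "content": (
                    "Answer the user's question using only the supplied "
                    "documents. Quote the exact passages that evidence "
                    "your answer, naming the source document for each."
                ),
            },
            {"role": "user", "content": f"{corpus}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content

# e.g. answer_with_evidence("Who are the parties to these contracts?",
#                           {"msa.pdf": msa_text, "sow.pdf": sow_text})
```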
Challenges

Expectations
- Linear progression
- Inaccessible tools

Positives
- Rapid initial progress
- Engaged stakeholders

Negatives
- Cherry-picked examples do not mean consistent performance
- Slow subsequent progress
Cycle Times

Positives
- Instant feedback
- Purposeful iteration

Negatives
- Slow models
- QA assessment is hard

More Negatives
- Slower humans
Relative Value

Positives
- Well-tested system
- Even split
- Undemanding code
- Prompt decomposition

Negatives
- Creating, finding, transforming, and refining test data
Ecosystem Maturity
- Best practices
- De-facto tools
- Little churn

Positives
- New techniques (RAG > HyDE > RRR; see the sketch below)
- New capabilities (gpt-4 8k > gpt-4 128k)

Negatives
- Evolving best practices
- Immature tools
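
To make the technique churn concrete, here is a rough sketch of the retrieval step in plain RAG versus HyDE. The embedding model, chunking, and prompts are illustrative assumptions, not the engagement's stack:

```python
# Plain RAG vs HyDE retrieval - a sketch, not a production implementation.
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY, as in the earlier sketch

def embed(texts: list[str]) -> np.ndarray:
    # Embed texts with an OpenAI embedding model (illustrative choice).
    resp = client.embeddings.create(
        model="text-embedding-ada-002", input=texts
    )
    return np.array([d.embedding for d in resp.data])

def top_k(query_vec, chunk_vecs, chunks, k=3):
    # Rank document chunks by cosine similarity to the query vector.
    sims = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    return [chunks[i] for i in np.argsort(sims)[::-1][:k]]

def rag_retrieve(question, chunks, chunk_vecs):
    # Plain RAG: retrieve with the embedding of the question itself.
    return top_k(embed([question])[0], chunk_vecs, chunks)

def hyde_retrieve(question, chunks, chunk_vecs):
    # HyDE: generate a *hypothetical* answer first, then retrieve with
    # its embedding - it tends to sit closer to the evidencing passages
    # than the bare question does.
    hypothetical = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": (
            f"Write a short contract passage that would answer: {question}"
        )}],
    ).choices[0].message.content
    return top_k(embed([hypothetical])[0], chunk_vecs, chunks)
```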
Lessons
Expectations
- Expect rapid initial progress followed by prolonged iteration
- Expect significant investment in problem definition and testing
Cycle Times
- Minimise human assessment and ensure active engagement
- Automated assessment is necessary for rapid iteration
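
A hedged sketch of what automated assessment can look like: `qa_system` is any callable that answers a question, and the containment scoring rule is deliberately crude, a stand-in for whatever metric the problem definition actually demands:

```python
# Minimal automated evaluation harness - illustrative only.
def matches(predicted: str, expected: str) -> bool:
    # Crude scoring: does the expected answer appear in the prediction?
    return expected.strip().lower() in predicted.strip().lower()

def evaluate(qa_system, test_set: list[tuple[str, str]]) -> float:
    # Fraction of (question, expected answer) pairs the system gets right.
    return sum(matches(qa_system(q), a) for q, a in test_set) / len(test_set)

# Run on every change: a score readable in minutes replaces a human
# review cycle measured in days.
# accuracy = evaluate(my_qa_system, [("Who are the parties?", "Acme Ltd"), ...])
```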
Relative Value
- Code development activities will be insignificant alongside testing
- The system's test data will be its most valuable commodity
Ecosystem Maturity
We've been here before...
Does your problem require a bespoke AI solution now?
Conclusion
Chris Price
Can you directly empower the individuals within your team with this raw capability and observe the emergent behaviour?
Non-dev focused
Expectation management - Cherry-picked good examples rarely translate into consistently good performance.
Expectation management - More akin to customising a product than greenfield development.
Everyone engaged with model capability exploration (very low barrier to entry).
Expect rapid initial progress followed by prolonged and costly iteration. Significant investment in problem definition, testing and model calls.
Assessing QA performance is hard (for a non-technical audience, spell out what we mean by "performance").
Testing - Cycle time is fundamental to purposeful iteration.
Prompting models is slow, expensive & hard.
Prompting humans is slower, more expensive & harder.
Minimise human assessment. Actively engage with any human reviewers to understand the subtleties; there are parallels with old-fashioned attitudes to separate test teams.
Testing - Gathering sample data is hard.
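
One common mitigation, offered here as an assumption rather than something the engagement necessarily did: draft candidate QA pairs from real documents with the model itself, then have a human refine them:

```python
# Drafting test cases with the model - a hedged sketch.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY, as in the earlier sketches

def draft_test_cases(document: str, n: int = 5) -> list[dict]:
    # Ask the model for candidate question/answer pairs over a real
    # document; the JSON-only instruction reduces (but doesn't remove)
    # the chance of unparseable output.
    raw = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": (
            f"Write {n} question/answer pairs a lawyer might ask about "
            f"this contract. Respond with only a JSON list of objects "
            f"with 'question' and 'answer' keys.\n\n{document}"
        )}],
    ).choices[0].message.content
    return json.loads(raw)  # drafts only: human review remains essential
```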
Often better to decompose complex prompts/problems for better performance. Decomposition works around the model's limited "concentration" budget and reduces the need for prompt engineering, at the cost of runtime. A sketch follows.
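
A sketch of that decomposition idea (the sub-questions are invented for illustration): three small, focused calls instead of one sprawling prompt, trading runtime for reliability:

```python
# Prompt decomposition - illustrative sketch.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY, as in the earlier sketches

def ask(prompt: str) -> str:
    return client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content

def summarise_parties(contract: str) -> str:
    # One demanding prompt ("find, classify and summarise the parties")
    # becomes three undemanding ones, each well within the model's
    # "concentration" budget.
    names = ask(f"List the names of the parties in this contract:\n{contract}")
    roles = ask(
        f"For each of these parties, state its role (e.g. supplier, "
        f"customer):\nParties: {names}\n\nContract:\n{contract}"
    )
    return ask(f"Summarise the parties and their roles in one sentence:\n{roles}")
```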
A small amount of entropy for a massive investment: the knowledge can easily walk out the door with the people. The test data is the real value.
Test data is significantly more valuable than in non-AI development, and arguably more valuable than the implementation itself.
Ecosystem speed - gpt-4 (8k) was cutting edge when we started; gpt-4-1106-preview (128k) was available by the end.
Ecosystem speed - techniques were still being invented and named, e.g. RAG.
You will spend a lot of time developing things you would expect to already exist in a more typical development project; this is not yet a commoditised ecosystem.
Is waiting an option? Recap the early web landscape: what is the forcing function for adopting AI now?
Dev focused
There's not much code to write, and what there is, is often simplistic and inconsequential.
Langchain is a melting pot of good ideas.
Lots of superficially minor changes that take a long time to implement.
Staying motivated is hard.
How will you spend your time as a developer of a QA system? Expect long stretches of frustration punctuated by short breakthrough periods; the breakthroughs are rewarding and often genuinely new territory.
Easy to get distracted "writing code".
Immature tooling (is this the web in '95, or Spring Boot in 2015?).
User-focused tooling (UX considerations, unpredictable/adversarial input) but visually uninspiring and unimpressive; more similar to a backend project than a frontend project.
The Inside Line
From one developer to another...
LangChain 🤬
A frustrating-to-use melting pot of good (and bad) ideas
Stop Writing Code
There's very little value in it
Staying Motivated...
...is hard.