How GenAI is used in the legal sector
Chris Price
Evaluate an LLM (Large Language Model) based QA (Question Answering) system focussed on reviewing legal documents (e.g. contracts)
Intentionally naive implementation, to set a performance baseline: 50%
For comparison, existing manual techniques: >95%
Picking the best performing technique per sample: 85%
Timescale: 6 months
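To make the 85% figure concrete: a minimal sketch of the "oracle" calculation, where a sample counts as correct if any one of the candidate techniques answered it correctly. The technique names and results here are illustrative, not our actual data.

```python
# Hypothetical per-sample results: True where a technique answered a
# sample correctly. Real data would come from the evaluation harness.
results = {
    "naive_prompt":      [True, False, False, True],
    "decomposed_prompt": [False, True, False, True],
    "rag":               [True, True, False, False],
}

n_samples = len(next(iter(results.values())))

# Oracle ensemble: a sample counts as correct if *any* technique got it
# right. This is the ceiling you would reach by picking the best
# performing technique per sample.
oracle_correct = sum(
    any(results[t][i] for t in results) for i in range(n_samples)
)
print(f"Oracle accuracy: {oracle_correct / n_samples:.0%}")
```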
Example question: "Who are the parties to these contracts?"
System user workflow (existing extraction-model approach):
Upload documents of interest
Specify extraction models of interest based upon the question
System extracts matching passages for each extraction model, for each document (sketched below)
Review extractions to evidence an answer to the question
Apply legal knowledge to refine the answer based on evidence
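The extraction step above, sketched in Python. The ExtractionModel interface (an extract() method returning matching passages) is hypothetical; it stands in for whatever extraction models the real system exposes.

```python
from dataclasses import dataclass

@dataclass
class Extraction:
    document: str
    model: str
    passage: str

def run_extractions(documents: dict[str, str], extraction_models) -> list[Extraction]:
    """Apply every extraction model to every document, collecting the
    matching passages for the user to review as evidence."""
    extractions = []
    for doc_name, doc_text in documents.items():
        for model in extraction_models:
            # extract() returning matching passages is a hypothetical interface
            for passage in model.extract(doc_text):
                extractions.append(Extraction(doc_name, model.name, passage))
    return extractions
```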
LLM-based workflow:
Upload documents of interest
Ask the question of the LLM based QA system to produce an initial, evidenced answer (a naive version is sketched below)
Apply legal knowledge to refine the answer based on evidence
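For contrast, a minimal sketch of the naive LLM-based flow using the OpenAI Python client. The prompt wording is illustrative; the model name is one mentioned later in this talk, but everything else here is an assumption, not the actual system.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer_question(documents: dict[str, str], question: str) -> str:
    """Naive baseline: stuff the full document text into the prompt and
    ask for an evidenced answer. Breaks down once the documents exceed
    the model's context window."""
    context = "\n\n".join(
        f"## {name}\n{text}" for name, text in documents.items()
    )
    resp = client.chat.completions.create(
        model="gpt-4-1106-preview",
        temperature=0,
        messages=[
            {"role": "system",
             "content": "Answer using only the documents provided. "
                        "Quote the passages that evidence your answer."},
            {"role": "user", "content": f"{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content

# e.g. answer_question(docs, "Who are the parties to these contracts?")
```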
We've been here before...
Does your problem require a bespoke AI solution now?
Can you directly empower the individuals within your team with this raw capability and observe the emergent behaviour?
Non-dev focused
Expectation management - Cherry-picked good examples rarely translate into consistently good performance.
Expectation management - More akin to customising a product than greenfield development.
Everyone engaged with model capability exploration (very low barrier to entry).
Expect rapid initial progress followed by prolonged and costly iteration. Significant investment in problem definition, testing and model calls.
Assessing QA performance is hard. (For a non-technical audience: by "performance" here we mean answer accuracy against a reference, not speed.)
Testing - Cycle time is fundamental to purposeful iteration.
Prompting models is slow, expensive & hard.
Prompting humans is slower, more expensive & harder.
Minimise human assessment. Actively engage with any human reviewers to understand the subtleties; there are parallels with old-fashioned opinions about separate test teams.
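One way to minimise human assessment: auto-mark the unambiguous cases and only escalate disagreements to a reviewer. A minimal sketch, assuming reference answers already exist; the normalisation rules are illustrative.

```python
import re

def normalise(answer: str) -> str:
    """Crude normalisation so trivially-different answers still match."""
    return re.sub(r"[^a-z0-9 ]", "", answer.lower()).strip()

def score(predictions: dict[str, str], references: dict[str, str]):
    """Auto-mark exact (normalised) matches; queue the rest for a human."""
    correct, needs_review = 0, []
    for sample_id, ref in references.items():
        if normalise(predictions[sample_id]) == normalise(ref):
            correct += 1
        else:
            needs_review.append(sample_id)
    return correct / len(references), needs_review
```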
Testing - Gathering sample data is hard.
Often better to decompose complex prompts/problems for better performance. This works around the model's "concentration" budget and bypasses much of the prompt engineering (at the cost of runtime); see the sketch below.
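A sketch of decomposition, reusing the hypothetical answer_question() from earlier: one focused sub-question per fact, with a final call to combine the findings. The sub-questions are illustrative.

```python
# One focused sub-question per fact we need, instead of a single
# compound prompt asking for everything at once. Illustrative only.
SUB_QUESTIONS = [
    "Which organisations are named as parties to this contract?",
    "Which individuals sign on behalf of each party?",
]

def decomposed_answer(documents: dict[str, str]) -> str:
    # Each call is simpler to reason about and to test in isolation,
    # at the cost of extra runtime and extra model calls.
    findings = [answer_question(documents, q) for q in SUB_QUESTIONS]
    return answer_question(
        {"findings": "\n\n".join(findings)},
        "Combine these findings into a single list of parties.",
    )
```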
A massive investment yields only a small amount of genuinely novel output, and it is easy for that knowledge to walk out of the door with people. Test data is the real value.
Test data is significantly more valuable than in non-AI development, and arguably more valuable than the implementation itself.
Ecosystem speed - gpt-4 (8k context) was cutting edge when we started; gpt-4-1106-preview (128k context) was available by the end.
Ecosystem speed - techniques are still evolving and acquiring names, e.g. RAG (Retrieval-Augmented Generation).
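RAG in miniature: retrieve the chunks most similar to the question and prompt over only those, so long documents fit within a limited context window. The embed() function below is a placeholder for whatever embedding model you choose; the retrieval logic is the part being illustrated.

```python
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    """Hypothetical: call your embedding model of choice and return
    one vector per input text."""
    raise NotImplementedError

def retrieve(question: str, chunks: list[str], k: int = 5) -> list[str]:
    """Rank chunks by cosine similarity to the question; keep the top k,
    which are then stuffed into the prompt instead of the full documents."""
    vectors = embed(chunks + [question])
    chunk_vecs, q_vec = vectors[:-1], vectors[-1]
    sims = chunk_vecs @ q_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q_vec)
    )
    top = np.argsort(sims)[-k:][::-1]
    return [chunks[i] for i in top]
```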
You will spend a lot of time developing things you would expect to already exist in a more typical development project; this is not yet a commoditised ecosystem.
Is waiting an option? Recall how the web landscape matured over time, and ask what the forcing function is for adopting AI now rather than later.
Dev focused
There's not much code to write, and the code that there is tends to be simplistic and inconsequential.
Langchain is a melting pot of good ideas.
Lots of on-the-surface minor changes that take a lot of time to implement.
Staying motivated is hard.
How will you spend your time as a dev of a QA system? Lots of frustration followed by short breakthrough periods. Breakthroughs are rewarding, and often genuinely new territory.
Easy to get distracted "writing code".
Immature tooling (is this the web in '95, or Spring Boot in 2015?)
User-focussed tooling (UX considerations, unpredictable/adversarial input) but visually uninspiring and unimpressive; more similar to a backend project than a frontend project.
From one developer to another, saying...
"it's a frustrating-to-use melting pot of good (and bad) ideas"
"there's very little value in it"
...is hard.