Garbage in → Pydantic → you're golden!
by Samuel Colvin
PyData London - 2023/06/03
https://london2023.pydata.org/cfp/talk/SFZLT7/
Today
- What is Pydantic, why do people seem to like it?
- Trouble in paradise
- Rust to the rescue - Good, Bad, Ugly
- Examples of how Rust helps Pydantic V2 solve your problems
- Live demo!
Pydantic
from datetime import datetime
from pydantic import BaseModel
class Talk(BaseModel):
    title: str
    when: datetime | None = None
    mistakes: list[str]
Just type hints get you:
- Validation
- Coercion/transformation
- Serialization
- JSON Schema
You people seemed to like it:
- 58m downloads/mo
- used by all FAANG companies
- 12% of pro web developers
30s - understand
3m - useful
300hr - usable
Empathy for the developers using our library
But there's a problem...
Pydantic V2
Priorities for V2:
- Performance - it was good, but could be better - think of the penguins!
- Strict Mode - live up to the name
- Composability - you don't always want a model
- Maintainability - I maintain Pydantic so I want maintaining Pydantic to be fun
Sad penguin, no snow
What would it look like if we started from scratch?
What about Rust?
The obvious advantages...
- Performance
- Multithreading - no GIL
- Reusing high quality Rust libraries
- More explicit error handling
(Maybe) less obvious advantages:
- Virtually zero cost customisation, even in hot code
- Arguably easier to maintain - the compiler picks up more mistakes
Rust - the good
But perhaps most pertinent to Pydantic...
from pydantic import BaseModel
class Qualification(BaseModel):
    name: str
    description: str
    required: bool
    value: int

class Student(BaseModel):
    id: int
    name: str
    qualifications: list[Qualification]
    friends: list[int]
[
...,
...,
...,
...,
...,
...,
...,
...,
...,
...,
...,
...,
]
Rust loves this
- Deeply recursive code - no stack frames
- Small modular components
How Rust?
What does that tree look like?
class Talk(BaseModel):
    title: Annotated[
        str,
        Maxlen(100)
    ]
    attendance: PosInt
    when: datetime | None = None
    mistakes: list[
        tuple[timedelta, str]
    ]
ModelValidator {
    cls: Talk,
    validator: TypedDictValidator [
        Field {
            key: "title",
            validator: StrValidator { max_len: 100 },
        },
        Field {
            key: "attendance",
            validator: IntValidator { min: 0 },
        },
        Field {
            key: "when",
            validator: UnionValidator [
                DateTimeValidator {},
                NoneValidator {},
            ],
            default: None,
        },
        Field {
            key: "mistakes",
            validator: ListValidator {
                item_validator: TupleValidator [
                    TimedeltaValidator {},
                    StrValidator {},
                ],
            },
        },
    ],
}
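The tree above can be imitated in pure Python to show why small, composable validators nest so naturally. This is only a rough sketch: the class names mirror the tree, but the constructors, the `validate` method, and the lax coercion rules are assumptions made for illustration, not pydantic-core's real implementation.

```python
from datetime import timedelta


class StrValidator:
    def __init__(self, max_len=None):
        self.max_len = max_len

    def validate(self, value):
        if not isinstance(value, str):
            raise ValueError(f'expected str, got {type(value).__name__}')
        if self.max_len is not None and len(value) > self.max_len:
            raise ValueError(f'string longer than {self.max_len} characters')
        return value


class IntValidator:
    def __init__(self, min_value=None):
        self.min_value = min_value

    def validate(self, value):
        number = int(value)  # lax coercion: '100' becomes 100
        if self.min_value is not None and number < self.min_value:
            raise ValueError(f'expected a value >= {self.min_value}')
        return number


class TimedeltaValidator:
    def validate(self, value):
        if isinstance(value, timedelta):
            return value
        # Assumes an 'HH:MM:SS' string for anything that is not a timedelta.
        hours, minutes, seconds = (int(part) for part in value.split(':'))
        return timedelta(hours=hours, minutes=minutes, seconds=seconds)


class TupleValidator:
    def __init__(self, item_validators):
        self.item_validators = item_validators

    def validate(self, value):
        if len(value) != len(self.item_validators):
            raise ValueError('wrong tuple length')
        return tuple(v.validate(item) for v, item in zip(self.item_validators, value))


class ListValidator:
    def __init__(self, item_validator):
        self.item_validator = item_validator

    def validate(self, value):
        return [self.item_validator.validate(item) for item in value]


class TypedDictValidator:
    def __init__(self, fields):
        self.fields = fields  # mapping of key -> validator

    def validate(self, value):
        return {key: v.validate(value[key]) for key, v in self.fields.items()}


# Each node only knows about its direct children, exactly like the tree.
talk_validator = TypedDictValidator({
    'title': StrValidator(max_len=100),
    'attendance': IntValidator(min_value=0),
    'mistakes': ListValidator(TupleValidator([TimedeltaValidator(), StrValidator()])),
})

result = talk_validator.validate({
    'title': 'Garbage in',
    'attendance': '100',
    'mistakes': [('00:00:30', 'Forgot to turn on the mic')],
})
print(result)
```

Validation is one recursive walk: every node validates its own value and delegates the rest to its children.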
Python Interface to Rust
from pydantic_core import SchemaValidator

class Talk:
    ...

talk_validator = SchemaValidator({
    'type': 'model',
    'cls': Talk,
    'schema': {
        'type': 'typed-dict',
        'fields': {
            'title': {'schema': {'type': 'str', 'max_length': 100}},
            'attendance': {'schema': {'type': 'int', 'ge': 0}},
            'when': {
                'schema': {
                    'type': 'default',
                    'schema': {'type': 'nullable', 'schema': {'type': 'datetime'}},
                    'default': None,
                }
            },
            'mistakes': {
                'schema': {
                    'type': 'list',
                    'items_schema': {
                        'type': 'tuple',
                        'mode': 'positional',
                        'items_schema': [{'type': 'timedelta'}, {'type': 'str'}]
                    }
                }
            },
        },
    }
})

some_data = {
    'title': "How Pydantic V2 leverages Rust's Superpowers",
    'attendance': '100',
    'when': '2023-04-22T12:15:00',
    'mistakes': [
        ('00:00:00', 'Screen mirroring confusion'),
        ('00:00:30', 'Forgot to turn on the mic'),
        ('00:25:00', 'Too short'),
        ('00:40:00', 'Too long!'),
    ],
}

talk = talk_validator.validate_python(some_data)
print(talk.mistakes)
"""
[
    (datetime.timedelta(0), 'Screen mirroring confusion'),
    (datetime.timedelta(seconds=30), 'Forgot to turn on the mic'),
    (datetime.timedelta(seconds=1500), 'Too short'),
    (datetime.timedelta(seconds=2400), 'Too long!')
]
"""
class Talk(BaseModel):
    title: Annotated[
        str,
        Maxlen(100)
    ]
    attendance: PosInt
    when: datetime | None = None
    mistakes: list[
        tuple[timedelta, str]
    ]
Pydantic V2 Architecture
Components: pydantic (pure python), pydantic-core (binary + stubs + core-schema)
Flow:
- Read type hints, construct a "core schema"
- Process core schema, return SchemaValidator
- Receive data, call schema_validator(data)
- Run validator, return the result of validation
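The first step of that flow, turning type hints into a schema, might look roughly like the sketch below. It uses only the standard `typing` helpers. The `type_to_schema` and `class_to_schema` functions and the `SIMPLE_TYPES` table are hypothetical names invented for this example, and the dict layout merely echoes the core schema shown earlier rather than reproducing what pydantic actually generates.

```python
from datetime import datetime
from typing import Optional, Union, get_args, get_origin, get_type_hints

# Hypothetical mapping from Python types to schema type names.
SIMPLE_TYPES = {str: 'str', int: 'int', float: 'float', bool: 'bool', datetime: 'datetime'}


def type_to_schema(tp):
    """Turn one type hint into a nested schema dict (illustrative only)."""
    if tp in SIMPLE_TYPES:
        return {'type': SIMPLE_TYPES[tp]}
    origin = get_origin(tp)
    args = get_args(tp)
    if origin is list:
        return {'type': 'list', 'items_schema': type_to_schema(args[0])}
    if origin is Union and type(None) in args:
        inner = [arg for arg in args if arg is not type(None)]
        if len(inner) == 1:
            return {'type': 'nullable', 'schema': type_to_schema(inner[0])}
    raise TypeError(f'unsupported type hint: {tp!r}')


def class_to_schema(cls):
    """Read the class's type hints and build one schema entry per field."""
    fields = {
        name: {'schema': type_to_schema(tp)}
        for name, tp in get_type_hints(cls).items()
    }
    return {'type': 'typed-dict', 'fields': fields}


class Talk:
    title: str
    when: Optional[datetime]
    mistakes: list[str]


talk_schema = class_to_schema(Talk)
print(talk_schema)
```

The resulting dict is plain data, which is what makes it easy to hand across the Python/Rust boundary.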
Rust - the bad
from __future__ import annotations
from pydantic import BaseModel

class Foo(BaseModel):
    a: int
    f: list[Foo]

f = {'a': 1, 'f': []}
f['f'].append(f)
Foo(**f)

fn main() {
    main();
}
RecursionError is bad, but no RecursionError is worse!
Also no multiple ownership.
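One way to keep a cyclic input like the one above from recursing forever is to remember which containers are currently being validated and bail out when one is seen again. The sketch below shows that idea in Python. The `validate_branch` function and `RecursionLoop` exception are hypothetical, and tracking objects by `id()` is an assumption of this example, not a description of pydantic-core's guard.

```python
class RecursionLoop(ValueError):
    """Raised when the input refers back to a container still being validated."""


def validate_branch(data, _active=None):
    """Validate {'length': ..., 'branches': [...]} while guarding against cycles."""
    active = set() if _active is None else _active
    if id(data) in active:
        raise RecursionLoop('Recursion error - cyclic reference detected')
    active.add(id(data))
    try:
        return {
            'length': float(data['length']),
            'branches': [validate_branch(b, active) for b in data.get('branches', [])],
        }
    finally:
        # Only objects on the current path count, so shared (non-cyclic)
        # children are still accepted.
        active.discard(id(data))


print(validate_branch({'length': 1, 'branches': [{'length': 2}]}))

cyclic = {'length': 1, 'branches': []}
cyclic['branches'].append(cyclic)

error_message = None
try:
    validate_branch(cyclic)
except RecursionLoop as exc:
    error_message = str(exc)
print(error_message)
```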
Rust - the ugly
class Box:
    def __init__(self, width):
        self.width = width

    def area(self):
        return self.width ** 2

    def __str__(self):
        return f'Box: {self.width}'

box = Box(42)
print(f'{box}, area {box.area()}')

use std::fmt;

struct Box {
    width: i64,
}

impl fmt::Display for Box {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "Box: {}", self.width)
    }
}

impl Box {
    fn new(width: i64) -> Self {
        Self { width }
    }

    fn area(&self) -> i64 {
        self.width * self.width
    }
}

fn main() {
    let b = Box::new(42);
    println!("{b}, area {}", b.area());
}
Rust is significantly more verbose.
Pydantic V2
Examples
Performance
import timeit
from pydantic import BaseModel, __version__

class Model(BaseModel):
    name: str
    age: int
    friends: list[int]
    settings: dict[str, float]

data = {
    'name': 'John',
    'age': 42,
    'friends': list(range(200)),
    'settings': {f'v_{i}': i / 2.0 for i in range(50)}
}

t = timeit.timeit(
    'Model(**data)',
    globals={'data': data, 'Model': Model},
    number=10_000,
)
print(f'version={__version__} time taken {t * 100:.2f}us')
version=1.10.4 time taken 179.81us
version=2.0a3 time taken 7.99us
22.5x speedup
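The `t * 100` in the benchmark can look odd at first glance. `timeit.timeit` returns the total time in seconds for all `number` runs, so dividing by 10,000 and multiplying by 1,000,000 to get microseconds per call collapses to multiplying by 100. A small worked check, where the timed statement is just a stand-in and not the Pydantic benchmark itself:

```python
import timeit

number = 10_000
# Stand-in workload; any statement would do for checking the arithmetic.
total_seconds = timeit.timeit('sum(range(100))', number=number)

# Seconds per call, converted to microseconds:
# total / 10_000 * 1_000_000 == total * 100
per_call_us = total_seconds / number * 1_000_000
print(f'{per_call_us:.2f}us per call')

# Speedup between the two figures quoted above.
v1_us, v2_us = 179.81, 7.99
speedup = v1_us / v2_us
print(f'{speedup:.1f}x speedup')
```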
Strict Mode
from pydantic import BaseModel, ValidationError

class Model(BaseModel):
    model_config = dict(strict=True)
    age: int
    friends: tuple[int, int]

try:
    Model(age='42', friends=[1, 2])
except ValidationError as e:
    print(e)
    """
    2 validation errors for Model
    age
      Input should be a valid integer [type=int_type,
        input_value='42', input_type=str]
    friends
      Input should be a valid tuple [type=tuple_type,
        input_value=[1, 2], input_type=list]
    """

print(Model(age=42, friends=(1, 2)))
#> age=42 friends=(1, 2)
AKA Pedant mode.
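The difference between lax and strict validation can be pictured as a single flag that decides whether coercion is attempted at all. The sketch below is a simplified illustration under that assumption. `validate_int` and `validate_int_pair` are hypothetical helpers, and the real rules in Pydantic cover many more cases.

```python
def validate_int(value, strict=False):
    """Accept ints as-is; in lax mode also try to coerce other inputs."""
    if isinstance(value, int):
        return value
    if strict:
        raise TypeError(f'Input should be a valid integer, got {type(value).__name__}')
    return int(value)  # lax mode: '42' -> 42


def validate_int_pair(value, strict=False):
    """Validate a two-item tuple of ints; lax mode also accepts other sequences."""
    if strict and not isinstance(value, tuple):
        raise TypeError(f'Input should be a valid tuple, got {type(value).__name__}')
    first, second = value
    return (validate_int(first, strict), validate_int(second, strict))


# Lax mode coerces the string and the list.
lax_result = (validate_int('42'), validate_int_pair([1, 2]))
print(lax_result)

# Strict mode refuses the same inputs.
strict_error = None
try:
    validate_int('42', strict=True)
except TypeError as exc:
    strict_error = str(exc)
print(strict_error)
```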
Builtin JSON parsing
from pydantic import BaseModel
class Model(BaseModel):
    model_config = dict(strict=True)
    age: int
    friends: tuple[int, int]
print(Model.model_validate_json('{"age": 1, "friends": [1, 2]}'))
#> age=1 friends=(1, 2)
If you're going to be a pedant, you'd better be right.
Also gives us:
- Big performance improvement without 3rd party parsing library
- Custom Errors (WIP)
- Line numbers in errors (in future)
Wrap Validators
from pydantic import BaseModel, field_validator

class Model(BaseModel):
    x: int

    @field_validator('x', mode='wrap')
    @classmethod
    def validate_x(cls, v, handler):
        if v == 'one':
            return 1
        try:
            return handler(v)
        except ValueError:
            return -999

print(Model(x='one'))
#> x=1
print(Model(x=2))
#> x=2
print(Model(x='three'))
#> x=-999
- Logic before
- Logic after
- Catch errors - new error, or default
AKA "The Onion"
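The onion idea can be reproduced without Pydantic: a wrap function receives the value plus a handler for the next layer in, and decides what to do before, after, or instead of calling it. The sketch below is a plain-Python illustration. `wrap`, `int_handler`, and `double_result` are hypothetical names, and this is not how Pydantic wires validators internally.

```python
def int_handler(value):
    """Stand-in for the innermost validator."""
    return int(value)


def wrap(inner, wrapper):
    """Return a validator that runs `wrapper` around `inner`."""
    def validate(value):
        return wrapper(value, inner)
    return validate


def validate_x(value, handler):
    if value == 'one':            # logic before the inner validator
        return 1
    try:
        result = handler(value)   # call the next layer in
    except ValueError:            # catch errors and fall back to a default
        return -999
    return result                 # logic after would go here


def double_result(value, handler):
    """A second layer, showing that wrappers nest like onion skins."""
    return handler(value) * 2


validator = wrap(int_handler, validate_x)
outer = wrap(validator, double_result)

print(validator('one'), validator(2), validator('three'))
print(outer('21'))
```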
Recursive Models
from __future__ import annotations
from pydantic import BaseModel, Field, ValidationError

class Branch(BaseModel):
    length: float
    branches: list[Branch] = Field(default_factory=list)

print(Branch(length=1, branches=[{'length': 2}]))
#> length=1.0 branches=[Branch(length=2.0, branches=[])]

b = {'length': 1, 'branches': []}
b['branches'].append(b)
try:
    Branch.model_validate(b)
except ValidationError as e:
    print(e)
    """
    1 validation error for Branch
    branches.0
      Recursion error - cyclic reference detected
        [type=recursion_loop,
         input_value={'length': 1, 'branches': [{...}]},
         input_type=dict]
    """
Alias Paths
from pydantic import BaseModel, Field, AliasPath, AliasChoices
class MyModel(BaseModel):
    a: int = Field(validation_alias=AliasPath('foo', 1, 'bar'))
    b: str = Field(validation_alias=AliasChoices('x', 'y'))

m = MyModel.model_validate(
    {
        'foo': [{'bar': 0}, {'bar': 1}],
        'y': 'Y',
    }
)
print(m)
#> a=1 b='Y'
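Conceptually, an alias path is a list of keys and indexes to follow through nested input, and alias choices are a list of keys to try in order. The sketch below illustrates both lookups in plain Python. `lookup_path`, `lookup_choices`, and the `MISSING` sentinel are hypothetical helpers for this example, not part of Pydantic's API.

```python
MISSING = object()  # sentinel meaning "nothing found at this alias"


def lookup_path(data, path):
    """Follow keys and indexes through nested dicts and lists."""
    current = data
    for step in path:
        try:
            current = current[step]
        except (KeyError, IndexError, TypeError):
            return MISSING
    return current


def lookup_choices(data, choices):
    """Return the value for the first key that is present."""
    for key in choices:
        if key in data:
            return data[key]
    return MISSING


data = {
    'foo': [{'bar': 0}, {'bar': 1}],
    'y': 'Y',
}
a = lookup_path(data, ('foo', 1, 'bar'))
b = lookup_choices(data, ('x', 'y'))
print(a, b)
```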
Generics
from typing import Generic, TypeVar
from pydantic import BaseModel

DataT = TypeVar('DataT')

class Response(BaseModel, Generic[DataT]):
    error: int | None = None
    data: DataT | None = None

class Profile(BaseModel):
    name: str
    email: str

def my_profile_view(id: int) -> Response[Profile]:
    if id == 42:
        return Response[Profile](data={'name': 'John', 'email': 'john@example.com'})
    else:
        return Response[Profile](error=404)

print(my_profile_view(42))
#> error=None data=Profile(name='John', email='john@example.com')

Favorite = tuple[int, str]

def my_favorites_view() -> Response[list[Favorite]]:
    return Response[list[Favorite]](data=[(1, 'a'), (2, 'b')])
Serialisation
from pydantic import BaseModel

class User(BaseModel):
    name: str
    age: int

class Profile(BaseModel):
    account_id: int
    user: User

user = User(name='Alice', age=1)
print(Profile(account_id=1, user=user).model_dump())
#> {'account_id': 1, 'user': {'name': 'Alice', 'age': 1}}

class AuthUser(User):
    password: str

auth_user = AuthUser(name='Bob', age=2, password='very secret')
print(Profile(account_id=2, user=auth_user).model_dump())
#> {'account_id': 2, 'user': {'name': 'Bob', 'age': 2}}
Solving the "don't ask the type" problem.
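The behaviour above comes from serializing according to the declared field type rather than the runtime type of the value. A rough standard-library analogy, using dataclasses in place of models: `dataclasses.asdict` asks the value for its fields and so leaks the subclass's `password`, whereas the hypothetical `dump` helper below walks the declared annotations. This is an illustration of the idea only, not Pydantic's serializer.

```python
from dataclasses import asdict, dataclass, fields, is_dataclass
from typing import get_type_hints


@dataclass
class User:
    name: str
    age: int


@dataclass
class AuthUser(User):
    password: str


@dataclass
class Profile:
    account_id: int
    user: User


def dump(value, declared_type):
    """Serialize `value` using the fields of `declared_type`, not type(value)."""
    if is_dataclass(declared_type):
        hints = get_type_hints(declared_type)
        return {
            f.name: dump(getattr(value, f.name), hints[f.name])
            for f in fields(declared_type)
        }
    return value


profile = Profile(account_id=2, user=AuthUser(name='Bob', age=2, password='very secret'))

by_runtime_type = asdict(profile)         # asks the value: includes 'password'
by_declared_type = dump(profile, Profile)  # asks the annotation: omits it
print(by_runtime_type)
print(by_declared_type)
```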
Without BaseModel
from dataclasses import dataclass
from pydantic import TypeAdapter

@dataclass
class Foo:
    a: int
    b: int

@dataclass
class Bar:
    c: int
    d: int

x = TypeAdapter(Foo | Bar)
d = x.validate_json('{"a": 1, "b": 2}')
print(d)
#> Foo(a=1, b=2)
print(x.dump_json(d))
#> b'{"a":1,"b":2}'
BaseModel is still here and widely used, but no longer essential.
Enter TypeAdapter.
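To give a feel for what validating against a union of plain dataclasses involves, the sketch below tries each member in turn and returns the first that fits. It uses only the standard library. `validate_union` and `validate_json` are hypothetical helpers, and the matching rule (exact field names, int values) is a deliberate simplification, not TypeAdapter's actual union logic.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class Foo:
    a: int
    b: int


@dataclass
class Bar:
    c: int
    d: int


def validate_union(data, choices):
    """Return an instance of the first dataclass whose fields match `data`."""
    for cls in choices:
        names = {f.name for f in fields(cls)}
        if set(data) == names and all(isinstance(v, int) for v in data.values()):
            return cls(**data)
    raise ValueError('input did not match any member of the union')


def validate_json(raw, choices):
    """Parse JSON text, then validate the result against the union."""
    return validate_union(json.loads(raw), choices)


foo = validate_json('{"a": 1, "b": 2}', (Foo, Bar))
bar = validate_json('{"c": 3, "d": 4}', (Foo, Bar))
print(foo, bar)
```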
Demo
- Needed to move off Google Analytics
- Record page views without a cookie
- Store in MongoDB
- End up with a big JSON file to analyse
- Want to see which pages are viewed most
Thank you
Twitter: @pydantic & @samuel_colvin
GitHub: /pydantic & /samuelcolvin
Docs: docs.pydantic.dev
We need your help:
- Try the Pydantic V2 beta before we release V2!
- Applications using Pydantic - come talk to me
- Are you using Pydantic to process lots of data? If so, we'd love to chat with you about the commercial platform we're building
Not Rust vs. Python
But rather: Python as the user* interface for Rust.
(* by user, I mean "application developer")
I'd love to see a generation of libraries for Python (and other high level languages) built in Rust.
Rust
HTTPS request lifecycle:
- Rust/C: TLS, Routing, HTTP parsing, Validation, DB query, Serializing
- Python: Application Logic
100% of Developer time = 1% of CPU cycles
...
Ok, some actual Rust...
Pydantic V2
#[enum_dispatch(CombinedValidator)]
trait Validator {
    const EXPECTED_TYPE: &'static str;

    fn build(schema: &PyDict, config: Option<&PyDict>) -> PyResult<CombinedValidator>;

    fn validate(&self, input: &impl Input, extra: &Extra) -> ValResult<PyObject>;
}

#[enum_dispatch]
enum CombinedValidator {
    Int(IntValidator),
    Str(StrValidator),
    TypedDict(TypedDictValidator),
    Union(UnionValidator),
    TaggedUnion(TaggedUnionValidator),
    Nullable(NullableValidator),
    // ... and 43 more
}

fn build_validator(schema: &PyDict, config: Option<&PyDict>) -> PyResult<CombinedValidator> {
    let schema_type: &str = schema.get_as_req("type")?;
    // really this is a clever macro to avoid the duplication
    match schema_type {
        IntValidator::EXPECTED_TYPE => IntValidator::build(schema, config),
        StrValidator::EXPECTED_TYPE => StrValidator::build(schema, config),
        TypedDictValidator::EXPECTED_TYPE => TypedDictValidator::build(schema, config),
        UnionValidator::EXPECTED_TYPE => UnionValidator::build(schema, config),
        TaggedUnionValidator::EXPECTED_TYPE => TaggedUnionValidator::build(schema, config),
        NullableValidator::EXPECTED_TYPE => NullableValidator::build(schema, config),
        // ... and 43 more
    }
}

trait Input<'a> {
    fn is_none(&self) -> bool;

    fn strict_str(&'a self) -> ValResult<&'a str>;

    fn lax_str(&'a self) -> ValResult<&'a str>;

    fn validate_date(&self, strict: bool) -> ValResult<PyDatetime>;

    fn strict_date(&self) -> ValResult<PyDatetime>;

    // ... and 53 more
}

impl<'a> Input<'a> for PyAny {
    // ...
}

impl<'a> Input<'a> for JsonInput {
    // ...
}

#[pyclass]
struct SchemaValidator {
    validator: CombinedValidator,
}

#[pymethods]
impl SchemaValidator {
    #[new]
    fn py_new(schema: &PyDict, config: Option<&PyDict>) -> PyResult<Self> {
        // We also do magic/evil schema validation using pydantic-core itself
        let validator = build_validator(schema, config)?;
        Ok(SchemaValidator { validator })
    }

    fn validate_python(&self, input: &PyAny, strict: Option<bool>) -> PyResult<PyObject> {
        self.validator.validate(input, &Extra::new(strict))
    }

    fn validate_json(
        &self,
        input_string: &PyString,
        strict: Option<bool>,
    ) -> PyResult<PyObject> {
        let input = parse_string(input_string)?;
        self.validator.validate(&input, &Extra::new(strict))
    }
}
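For readers more at home in Python, the `build_validator` match above behaves much like a lookup in a registry keyed on the schema's `type`, with each builder recursively building its children. The sketch below mimics that shape. The `register` decorator, the `VALIDATOR_BUILDERS` dict, and the closure-based validators are inventions for this example and say nothing about how the Rust code is organised beyond what is shown above.

```python
# Registry of builder functions keyed on the schema "type" string.
VALIDATOR_BUILDERS = {}


def register(expected_type):
    """Decorator that records a builder under its expected schema type."""
    def decorator(builder):
        VALIDATOR_BUILDERS[expected_type] = builder
        return builder
    return decorator


def build_validator(schema):
    """Dispatch on schema['type'], the Python analogue of the Rust match."""
    try:
        builder = VALIDATOR_BUILDERS[schema['type']]
    except KeyError:
        raise ValueError(f"unknown schema type: {schema.get('type')!r}") from None
    return builder(schema)


@register('int')
def build_int(schema):
    ge = schema.get('ge')

    def validate(value):
        number = int(value)
        if ge is not None and number < ge:
            raise ValueError(f'expected a value >= {ge}')
        return number
    return validate


@register('nullable')
def build_nullable(schema):
    inner = build_validator(schema['schema'])  # children are built once, up front

    def validate(value):
        return None if value is None else inner(value)
    return validate


@register('list')
def build_list(schema):
    item = build_validator(schema['items_schema'])

    def validate(value):
        return [item(element) for element in value]
    return validate


validate = build_validator({
    'type': 'list',
    'items_schema': {'type': 'nullable', 'schema': {'type': 'int', 'ge': 0}},
})
validated = validate(['1', None, 3])
print(validated)
```

The dispatch happens once, at build time, so the hot validation path only calls already-built validators.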