Native Rust modules for Python

whoami

Arthur Pastel (@art049)

Software Engineer, building CodSpeed

Open-source:

  • ODMantic (an "ORM" for MongoDB)
  • CPython typo contributor

Python:

  • Embedded/AI
  • Backend (FastAPI ❤️)

How I got started with Rust

  1. Rust by Example / rustlings / the Rust Book
  2. Advent of Code
  3. Small PoCs (like the example we will see)
  4. Shipped my first Rust code in production 🎉

Why Rust?

Pros:

  • Performance
  • The compiler

  • Memory safety

  • Package manager (cargo)

Cons:

  • Steep learning curve
  • Compiled language
    • Additional CI/CD steps
    • Compilation time
  • Verbosity

Rust Verbosity

Python:

def increment_all(nums):
    for i in range(len(nums)):
        nums[i] += 1

Rust:

fn increment_all(nums: &mut Vec<i32>) {
    for num in nums.iter_mut() {
        *num += 1;
    }
}

Using Rust within Python

  • Best of both worlds:
    • Python for expressiveness
    • Rust for performance/safety
  • Incremental transition to Rust

Let's build!

Challenge:

  • Parsing a graph
  • A huge graph: more than 50k nodes
  • In a reasonable amount of time


A DOT graph format parser

graph MyCoolGraph {
    a -- b -- c;
    b -- d;
}

Parsing steps

  • Tokenizing: spelling
  • Parsing: grammar
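The tokenizing step starts by cutting the raw text into words. A minimal sketch of such a splitter (the talk's Rust code later calls this `split_words`; the implementation here, padding single-character punctuation with spaces, is my own assumption):

```python
def split_words(text: str) -> list[str]:
    # Pad single-character punctuation with spaces so that
    # "a -- b;" splits into ["a", "--", "b", ";"].
    for punct in "{};":
        text = text.replace(punct, f" {punct} ")
    return text.split()

words = split_words("graph MyCoolGraph {\n    a -- b -- c;\n    b -- d;\n}")
# words starts with ["graph", "MyCoolGraph", "{", "a", "--", ...]
```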

The Tokenizer

The Tokens

graph CoolGraph {
    a -- b -- c ;
    b -- d ;
}

Token kinds: Graph, Identifier, Edge, SemiColon, LeftBracket, RightBracket

The Tokens

from dataclasses import dataclass
from enum import Enum, auto

class BasicToken(Enum):
    Graph = auto()
    LeftBracket = auto()
    RightBracket = auto()
    Semicolon = auto()
    Edge = auto()

@dataclass
class IdentifierToken:
    name: str

Token = BasicToken | IdentifierToken

The Tokens

enum Token {
    Graph,
    LeftBracket,
    RightBracket,
    Semicolon,
    Edge,
    Identifier(String),
}

Rust Compound Types

  • Tuples / Arrays / Vectors
  • Enum
    • Option<T>
    • Result<T, E>
  • Structs
// (i32, f64, u64)
let t = (500, 6.4, 1u64);

// [i32; 5] fixed size of 5 elements
let a = [1, 2, 3, 4, 5];

// Vec<i32> vector of i32s (dynamic)
let a = vec![1, 2, 3, 4, 5];


// This is a built-in enum
enum Option<T> {
    Some(T),
    None,
}

// Option<i32> type is inferred
let a = Some(3);
// The option type is specified
let b: Option<u32> = None;


// This is a built-in enum too
enum Result<T, E> {
    Ok(T),
    Err(E),
}

fn div(a: i32, b: i32) -> Result<i32, &'static str> {
    if b != 0 {
        Ok(a / b)
    } else {
        Err("Cannot divide by zero")
    }
}


struct User {
    active: bool,
    name: Option<String>,
    sign_in_count: u64,
}

// Option<User>
let u = Option::Some(
    User {
        active: true,
        name: Some("john".to_string()),
        sign_in_count: 1,
    }
);
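For Python readers, a rough analogy of the same three shapes (my own mapping, not from the talk, and deliberately loose): `Option<T>` plays the role of `Optional[T]`, `Result<T, E>` the role of returning a value or raising an exception, and a struct the role of a dataclass:

```python
from dataclasses import dataclass
from typing import Optional

# Result<T, E> ~ return a value on success, raise on failure
def div(a: int, b: int) -> int:
    if b == 0:
        raise ZeroDivisionError("Cannot divide by zero")
    return a // b

# struct ~ @dataclass, Option<T> ~ Optional[T]
@dataclass
class User:
    active: bool
    name: Optional[str]
    sign_in_count: int

u: Optional[User] = User(active=True, name="john", sign_in_count=1)
```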

Back to the Tokens

#[derive(Debug)]
enum Token {
    Graph,
    LeftBracket,
    RightBracket,
    Semicolon,
    Edge,
    Identifier(String),
}

The Tokens

def word_to_token(word: str) -> Token:
    if word == "graph":
        return BasicToken.Graph
    elif word == "{":
        return BasicToken.LeftBracket
    elif word == "}":
        return BasicToken.RightBracket
    elif word == ";":
        return BasicToken.Semicolon
    elif word == "--":
        return BasicToken.Edge
    else:
        return IdentifierToken(word)

The Tokens

def word_to_token(word: str) -> Token:
    match word:
        case "graph":
            return BasicToken.Graph
        case "{":
            return BasicToken.LeftBracket
        case "}":
            return BasicToken.RightBracket
        case ";":
            return BasicToken.Semicolon
        case "--":
            return BasicToken.Edge
        case _:
            return IdentifierToken(word)

Introduced with Python 3.10 (PEP 634)

⚠️ not backward compatible

The Tokens

fn word_to_token(word: &str) -> Token {
    match word {
        "graph" => Token::Graph,
        "{" => Token::LeftBracket,
        "}" => Token::RightBracket,
        ";" => Token::Semicolon,
        "->" => Token::DirectedEdgeOp,
        "--" => Token::UndirectedEdgeOp,
        _ => Token::Identifier(word.to_string()),
    }
}

We have the Tokens

What's next?


The Grammar


The State Machine

(state-machine diagram: transitions driven by the Graph, Identifier, LeftBracket, Edge, SemiColon, and RightBracket tokens)

Building the graph

class ParserState(Enum):
    Start = auto()
    ExpectGraphName = auto()
    ExpectLBracket = auto()
    ExpectNodeName = auto()
    ExpectEdgeOrSemicolon = auto()
    ExpectNodeNameOrRBracket = auto()
    End = auto()

class Parser:
    def __init__(self):
        self.state = ParserState.Start
        # ...

    def parse_token(self, token: Token):
        """Transition to the next state"""
        match (self.state, token):
            # first Graph token
            case (ParserState.Start, BasicToken.Graph):
                self.state = ParserState.ExpectGraphName
            
            # Identifier defining the name of the graph
            case (ParserState.ExpectGraphName, IdentifierToken(name)):
                self.graph_name = name
                self.state = ParserState.ExpectLBracket
                
            # ...

            # Error cases (unexpected tokens)
            case _:
                raise Exception(f"Unexpected {token} in state {self.state}")

Building the graph

#[derive(Debug, Clone, Copy)]
enum ParserState {
    Start,
    ExpectGraphName,
    ExpectLBracket,
    ExpectNodeName,
    ExpectEdgeOrSemicolon,
    ExpectNodeNameOrRBracket,
    End,
}

pub struct Parser {
    state: ParserState,
    // ...
}

impl Parser {
    pub fn new() -> Self {
        Self {
            state: ParserState::Start,
            // ...
        }
    }

    fn parse_token(&mut self, token: Token) {
        match (self.state, token) {
            // Parse the first Graph token
            (ParserState::Start, Token::Graph) => {
                self.state = ParserState::ExpectGraphName;
            }
            
            // Parse the graph name
            (ParserState::ExpectGraphName, Token::Identifier(name)) => {
                self.graph_name = name;
                self.state = ParserState::ExpectLBracket;
            }
            
            // ...

            // Error
            (state, token) => {
                panic!("Unexpected token {:?} in state {:?}", token, state);
            }
        }
    }
}

And... are we done?

(almost)

Bindings and Toolchain

  • Bindings:
    • PyO3

  • Build backend:
    • maturin: the standard way to ship Rust modules
    • rustimport: handy for quick PoCs

Bindings and Toolchain

somecode.rs:

// rustimport:pyo3
use pyo3::prelude::*;

#[pyfunction]
fn square(x: i32) -> i32 {
    x * x
}

>>> import rustimport.import_hook
>>> import somecode  # compiles the module on first import
>>> somecode.square(9)
81

Benefits

  • Test quickly
  • No extra build steps
  • Can still be configured
  • Easy migration to maturin

Wiring it up

#[pyclass]
pub struct Graph {
    #[pyo3(get)]
    pub graph_name: String,
    #[pyo3(get)]
    pub nodes: Vec<String>,
    #[pyo3(get)]
    pub adjacency: HashMap<usize, Vec<usize>>,
}

#[pyfunction]
pub fn parse_file(filepath: String) -> PyResult<Graph> {
    let text = std::fs::read_to_string(filepath)?;
    let words_it = split_words(&text).into_iter();
    let mut token_it = words_it.map(word_to_token);
    let mut parser = Parser::new();
    let graph = parser.parse(&mut token_it);
    Ok(graph)
}

Wiring it up (typing)

parser.pyi:

class Graph:
    graph_name: str
    nodes: list[str]
    adjacency: dict[int, list[int]]

def parse_file(filepath: str) -> Graph: ...

⚠️ not automated (yet)

Should I write tests in Rust?

import rustimport.import_hook
from pydot_rs import parse_file


def test_parsing_undir():
    graph = parse_file("samples/undir.dot")
    assert len(graph.nodes) == 4
    assert set(graph.nodes) == {"a", "b", "c", "d"}

Measuring performance

import rustimport.import_hook
from pydot_rs import parse_file


def test_parsing_undir(benchmark):
    graph = benchmark(parse_file, "samples/undir.dot")
    assert len(graph.nodes) == 4
    assert set(graph.nodes) == {"a", "b", "c", "d"}

with pytest-codspeed

Performance

  • Performance measurement before deployment
  • More stable than time-based measurements
  • CI and Pull Request integration
  • Free for Open-Source

Rust in Python Projects

  • pydantic: data validation library (with pydantic-core)
  • robyn: a web framework with a Rust runtime
  • tsdownsample: time series downsampling algorithms

 

pydantic-core

Initial speedup: (benchmark chart)

Improvements on ome-types: PR migrating to Pydantic V2

Thank you!

@art049

linkedin.com/in/arthurpastel

Arthur

Adrien

Come and chat with us!

Links

Native Rust modules for Python

By Arthur Pastel

EuroPython 2023 Talk