An intro to nom

parsing made easy for Rustaceans

Rust Dublin Meetup 2024-03-06

Roberto Gambuzzi - Luciano Mammino

๐Ÿš€ย Principal Software Engineer here and there

Let's connect!ย linktr.ee/gambuzzi

Hello, I'm Roberto...

๐Ÿ’ค Lazy by nature, Pythonic by choice

๐Ÿฆ€ Old enough to be rusted

๐Ÿ‘‹ I'm Lucianoย (๐Ÿ‡ฎ๐Ÿ‡น๐Ÿ•๐Ÿ๐ŸคŒ)

๐Ÿ‘จโ€๐Ÿ’ป Senior Architect @ fourTheorem

๐Ÿ“” Co-Author of Node.js Design Patternsย  ๐Ÿ‘‰

Let's connect!ย linktr.ee/loige

What the heck is a parser? ๐Ÿค“

What the heck is a parser? ๐Ÿค“

A program that can turn text (or bytes) into structured information

What the heck is a parser? ๐Ÿค“

A program that can turn text (or bytes) into structured information

"Point: (22, 17, -11)"

struct Point3D {
  x: i32,
  y: i32,
  z: i32,
}

Naive approach

String split like a mad-man ๐Ÿคช

[label, remainder] = split(input, ": ", 2)
input = "Point: (22, 17, -11)"

label

remainder

Naive approach

String split like a mad-man ๐Ÿคช

[label, remainder] = split(input, ": ", 2)
input = "Point:ย (22, 17, -11)"
[, remainder] = split(remainder, "(", 2)

Naive approach

String split like a mad-man ๐Ÿคช

[label, remainder] = split(input, ": ", 2)
input = "Point: (22, 17, -11)"
[, remainder] = split(remainder, "(", 2)
[numbers,] = split(remainder, ")", 2)

Naive approach

String split like a mad-man ๐Ÿคช

[label, remainder] = split(input, ": ", 2)
input = "Point: (22, 17, -11)"
[, remainder] = split(remainder, "(", 2)
[numbers,] = split(remainder, ")", 2)
[x, y, z] = split(numbers, ", ", 3)

Problems with this approach

  1. Not always suitable (you need "delimiters")
  2. Simple for simple stuff, increasingly hard for "realistic" use cases
  3. Hard to handle errors consistently

Using a Regex ๐Ÿง 

Point: (22, 17, -11)
/ /

Using a Regex ๐Ÿง 

Point: (22, 17, -11)
/^Point: /

Using a Regex ๐Ÿง 

Point: (22, 17, -11)
/^Point: \(/

Using a Regex ๐Ÿง 

Point: (22, 17, -11)
/^Point: \((-?\d+)/

Using a Regex ๐Ÿง 

Point: (22, 17, -11)
/^Point: \((-?\d+), (-?\d+)/

Using a Regex ๐Ÿง 

Point: (22, 17, -11)
/^Point: \((-?\d+), (-?\d+), (-?\d+)/

Using a Regex ๐Ÿง 

Point: (22, 17, -11)
/^Point: \((-?\d+), (-?\d+), (-?\d+)\)$/

Using a Regex ๐Ÿง 

Point: (22, 17, -11)
/^Point: \((-?\d+), (-?\d+), (-?\d+)\)$/

Capture groups

Using a Regex ๐Ÿง 

Point: (22, 17, -11)
/^Point: \((?<x>-?\d+), (?<y>-?\d+), (?<z>-?\d+)\)$/

Named capture groups

x

y

z

$ cargo add regex

yes, the Rust standard library doesn't have built-in support for regex (yet)! ๐Ÿคทโ€โ™€๏ธ

*

use regex::Regex;

#[derive(Debug)]
struct Point3D {
    x: i32,
    y: i32,
    z: i32,
}

fn main() {
    let re = Regex::new(r"^Point: \((?<x>-?\d+), (?<y>-?\d+), (?<z>-?\d+)\)$").unwrap();
    let caps = re.captures("Point: (22, 17, -11)").unwrap();

    let point = Point3D {
        x: caps["x"].parse().unwrap(),
        y: caps["y"].parse().unwrap(),
        z: caps["z"].parse().unwrap(),
    };

    println!("{:?}", point);
}
use regex::Regex;

#[derive(Debug)]
struct Point3D {
    x: i32,
    y: i32,
    z: i32,
}

fn main() {
    let re = Regex::new(r"^Point: \((?<x>-?\d+), (?<y>-?\d+), (?<z>-?\d+)\)$").unwrap();
    let caps = re.captures("Point: (22, 17, -11)").unwrap();

    let point = Point3D {
        x: caps["x"].parse().unwrap(),
        y: caps["y"].parse().unwrap(),
        z: caps["z"].parse().unwrap(),
    };

    println!("{:?}", point);
}
use regex::Regex;

#[derive(Debug)]
struct Point3D {
    x: i32,
    y: i32,
    z: i32,
}

fn main() {
    let re = Regex::new(r"^Point: \((?<x>-?\d+), (?<y>-?\d+), (?<z>-?\d+)\)$").unwrap();
    let caps = re.captures("Point: (22, 17, -11)").unwrap();

    let point = Point3D {
        x: caps["x"].parse().unwrap(),
        y: caps["y"].parse().unwrap(),
        z: caps["z"].parse().unwrap(),
    };

    println!("{:?}", point);
}

Problems with this approach

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

โ€” Jamie Zawinski

Problems with this approach

  • Regex are hard to debug:
    • it's either a full match (with captures)
    • or nothing!
  • Learning to write regex takes a lot of practice (possibly years!)
  • Regex might not be suitable for more advanced parsing cases
    • E.g. nested structures

Consuming parser

Tries to "eat and match" the source text (or bytes)

Consuming parser

Tries to "eat and match" the source text (or bytes)

input = "Point: (22, 17, -11)"

1. readLabel(input)

"Point", ": (22, 17, -11)"

Parsed data

Remainder string

Consuming parser

Tries to "eat and match" the source text (or bytes)

input = "Point: (22, 17, -11)"

1. readLabel(input)

"Point", ": (22, 17, -11)"

2. readSeparator(remainder)

":", " (22, 17, -11)"

Consuming parser

Tries to "eat and match" the source text (or bytes)

input = "Point:ย (22, 17, -11)"

1. readLabel(input)

"Point", ": (22, 17, -11)"

2. readSeparator(remainder)

":", " (22, 17, -11)"

3. readSpaces(remainder)

" ", "(22, 17, -11)"

Consuming parser

Tries to "eat and match" the source text (or bytes)

input = "Point:ย (22, 17, -11)"

1. readLabel(input)

"Point", ": (22, 17, -11)"

2. readSeparator(remainder)

":", " (22, 17, -11)"

3. readSpaces(remainder)

" ", "(22, 17, -11)"

4. readIntTuple(remainder)

[22,17,-11], ""

Consuming parser

Handling errors: if we fail to match we have to return an error

input = "Hello World"

readNumber(input)

"Cannot match number on 'Hello World'"
$ cargo add nom
fn main() {
    let s = "Point: (22, 17, -11)";
    
    let (_, point) = parse_point(s).unwrap();
    println!("{:?}", point);
}
fn parse_point(input: &str) -> IResult<&str, Point3D> {
    // ... parsing logic goes here

    Ok((input, Point3D { x, y, z }))
}

the value to parse from

A nom Result type

Input type

Output type

fn parse_point(input: &str) -> IResult<&str, Point3D> {
	let (input, _) = tag("Point: ")(input)?;
    
    // ...

    Ok((input, Point3D { x, y, z }))
}

nom combinator: exact match

what to match

the value to match against

If it fails to match

we propagate the error

the parsed value

(in this case "Point: ")

the remainder string

Point: (22, 17, -11)
fn parse_point(input: &str) -> IResult<&str, Point3D> {
	let (input, _) = tag("Point: ")(input)?;
    let (input, _) = tag("(")(input)?;

    // ...

    Ok((input, Point3D { x, y, z }))
}
Point: (22, 17, -11)
fn parse_point(input: &str) -> IResult<&str, Point3D> {
	let (input, _) = tag("Point: ")(input)?;
    let (input, _) = tag("(")(input)?;
    let (input, x) = i32(input)?;

    // ...

    Ok((input, Point3D { x, y, z }))
}

combinator that parses a sign (+ or -) and sequence of digits and converts them to a i32

first coordinate

Point: (22, 17, -11)
fn parse_point(input: &str) -> IResult<&str, Point3D> {
	let (input, _) = tag("Point: ")(input)?;
    let (input, _) = tag("(")(input)?;
    let (input, x) = i32(input)?;
	let (input, _) = tag(", ")(input)?;
    let (input, y) = i32(input)?;
    let (input, _) = tag(", ")(input)?;
    let (input, z) = i32(input)?;
    let (input, _) = tag(")")(input)?;

    Ok((input, Point3D { x, y, z }))
}
Point: (22, 17, -11)

With nom you generally break parsers into smaller and reusable parsers ๐Ÿค

/// parses "," followed by 0 or more spaces
fn separator(input: &str) -> IResult<&str, ()> {
    let (input, _) = pair(tag(","), space0)(input)?;
    Ok((input, ()))
}

nom combinator that applies 2 parsers in sequence and returns a tuple with the 2 parsed values

/// parses 3 numbers separated by a separator (e.g. "22, 17, -11")
fn parse_coordinates(input: &str) -> IResult<&str, (i32, i32, i32)> {
    let (input, (x, _, y, _, z)) = tuple((
        i32,
        separator,
        i32,
        separator,
        i32
    ))(input)?;

    Ok((input, (x, y, z)))
}

nom combinator that applies a sequence of parsers in sequence and returns a tuple with the parsed values

Note how we are reusing the parser we just created!

fn parse_point2(input: &str) -> IResult<&str, Point3D> {
    let (input, _) = tag("Point: ")(input)?;
    let (input, (x, y, z)) = delimited(
        tag("("),
        parse_coordinates,
        tag(")")
    )(input)?;
    
    Ok((input, Point3D { x, y, z }))
}

We already learned about these useful nomย combinators:

  • tag: exact string match
  • i32: parse a number from text as i32
  • space0: matches 0 or more spaces
  • pair: apply 2 parsers in sequence
  • tuple: apply N parsers in sequence
  • delimited: combines 3 parsers to extract a value wrapped around 2 delimiters

Let's learn more by trying to parse RESP
the Redis Serialization Protocol ๐Ÿค 

RESP Primer (spec)

  • Binary protocol that uses control sequences encoded in standard ASCII
  • The \r\nย (CRLF) is the protocol's terminator, which always separates its parts
  • The first byte in a payload identifies its type. Subsequent bytes constitute the type's contents.
  • Different data types (simple, bulk, or aggregate)

RESP Primer (spec)

RESP data type First byte
Simple strings +
Simple Errors -
Integers :
Bulk strings $
Arrays *
Nulls _
Booleans #
Doubles ,
Big numbers (
Bulk errors !
Verbatim strings =
Maps %
Sets ~
Pushes >

Example messages

+OK\r\n
-ERR unknown command 'asdf'\r\n
:1000\r\n
$5\r\nhello\r\n
*2\r\n$5\r\nhello\r\n$5\r\nworld\r\n
,1.23\r\n
(3492890328409238509324850943850943825024385\r\n
%2\r\n+first\r\n:1\r\n+second\r\n:2\r\n

Simple string

Simple error

Integer

Bulk string

Array

Double

Big number

Map

["hello", "world"]
{"first": 1, "second": 2}
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord, Clone)]
pub enum Value<'a> {
    SimpleString(&'a str),
    SimpleError(&'a str),
    Integer(i64),
    BulkString(&'a str),
    Array(Vec<Value<'a>>),
    Null,
    Boolean(bool),
    Double(String),
    BigNumber(&'a str),
    BulkError(&'a str),
    VerbatimString(&'a str, &'a str),
    Map(Vec<Value<'a>>, Vec<Value<'a>>),
    Set(BTreeSet<Value<'a>>),
    Pushes(Vec<Value<'a>>),
}

Recursive data type

pub fn parse_value(input: &str) -> IResult<&str, Value> {
    let (input, type_char) = one_of("+-:$*_#,(!=%~>")(input)?;
    let parser = match type_char {
        '+' => parse_simple_string,
        '-' => parse_simple_error,
        ':' => parse_integer,
        '$' => parse_bulk_string,
        '*' => parse_array,
        '_' => parse_null,
        '#' => parse_bool,
        ',' => parse_double,
        '(' => parse_bignumber,
        '!' => parse_bulk_error,
        '=' => parse_verbatim_string,
        '%' => parse_map,
        '~' => parse_set,
        '>' => parse_push,
        _ => unreachable!("Invalid type"),
    };

    parser(input)
}

Alt-ernative approach ๐Ÿ™„

pub fn parse_value(input: &str) -> IResult<&str, Value> {
    alt((
        parse_simple_string,
        parse_simple_error,
        parse_integer,
        parse_bulk_string,
        parse_array,
        parse_null,
        parse_bool,
        parse_double,
        parse_bignumber,
        parse_bulk_error,
        parse_verbatim_string,
        parse_map,
        parse_set,
        parse_pushes,
    ))(input)
}
fn crlf(input: &str) -> IResult<&str, &str> {
    tag("\r\n")(input)
}

fn parse_simple_string(input: &str) -> IResult<&str, Value> {
    let (input, _) = tag("+")(input)?;
    let (input, value) = terminated(
        take_while(|c| c != '\r' && c != '\n'), 
        crlf
    )(input)?;
    Ok((input, Value::SimpleString(value)))
}
+Simple String Example\r\n
fn parse_bulk_string(input: &str) -> IResult<&str, Value> {
    let (input, _) = tag("$")(input)?;
    let (input, length) = terminated(u32, crlf)(input)?;
    let (input, value) = terminated(
        take(length as usize),
        crlf
    )(input)?;
    Ok((input, Value::BulkString(value)))
}
$19\r\nBulk String Example\r\n
fn parse_array(input: &str) -> IResult<&str, Value> {
    let (input, _) = tag("*")(input)?;
    let (input, length) = terminated(u32, crlf)(input)?;
    let (input, values) = count(parse_value, length as usize)(input)?;
    Ok((input, Value::Array(values)))
}
*2\r\n$5\r\nExample\r\n$5\r\nArray\r\n

Other useful nomย combinators:

  • one_of: match one char from a list of allowed ones
  • terminated: applies a parser and then a second one on the remaining input but returns only the value of the first parser.
  • eof: checks that the remaining input has length 0
  • alt: tries a sequence of parsers until one matches
  • take_while: take characters until a given condition matches
  • take: takes N characters
  • count: repeats a parser N times and collects the results in a vector.

Can we parse binaryย file formats? ๐Ÿค”

UINT8[80]    โ€“ Header                 -     80 bytes
UINT32       โ€“ Number of triangles    -      4 bytes
foreach triangle                      - 50 bytes:
    REAL32[3] โ€“ Normal vector             - 12 bytes
    REAL32[3] โ€“ Vertex 1                  - 12 bytes
    REAL32[3] โ€“ Vertex 2                  - 12 bytes
    REAL32[3] โ€“ Vertex 3                  - 12 bytes
    UINT16    โ€“ Attribute byte count      -  2 bytes
end

STL file format

๐Ÿคทโ€โ™‚๏ธ Did you know there's a monumentย of this teapot in Dublin?

use std::{fs, io};

use nom::bytes::complete::take;
use nom::multi::count;
use nom::number::complete::{le_f32, le_u16, le_u32};
use nom::sequence::tuple;
use nom::IResult;

#[derive(Debug)]
struct Triangle {
    normal_vector: [f32; 3],   // REAL32[3] - Normal vector
    vertex_1: [f32; 3],        // REAL32[3] - Vertex 1
    vertex_2: [f32; 3],        // REAL32[3] - Vertex 2
    vertex_3: [f32; 3],        // REAL32[3] - Vertex 3
    attribute_byte_count: u16, // UINT16 - Attribute byte count
}

#[derive(Debug)]
pub struct StlFile<'a> {
    header: &'a [u8],         // UINT8[80] - Header
    number_of_triangles: u32, // UINT32 - Number of triangles
                              // little endian
    triangles: Vec<Triangle>, // Variable number of triangles
}

fn parse_triangle(input: &[u8]) -> IResult<&[u8], Triangle> {
    let mut three_floats = tuple((le_f32, le_f32, le_f32));
    let (input, normal_vector) = three_floats(input)?;
    let (input, vertex_1) = three_floats(input)?;
    let (input, vertex_2) = three_floats(input)?;
    let (input, vertex_3) = three_floats(input)?;
    let (input, attribute_byte_count) = le_u16(input)?;
    Ok((
        input,
        Triangle {
            normal_vector: [normal_vector.0, normal_vector.1, 
                            normal_vector.2],
            vertex_1: [vertex_1.0, vertex_1.1, vertex_1.2],
            vertex_2: [vertex_2.0, vertex_2.1, vertex_2.2],
            vertex_3: [vertex_3.0, vertex_3.1, vertex_3.2],
            attribute_byte_count,
        },
    ))
}

pub fn parse_stl(data: &[u8]) -> IResult<&[u8], StlFile> {
    let (data, header) = take(80usize)(data)?;
    let (data, number_of_triangles) = le_u32(data)?;
    let (data, triangles) = count(parse_triangle,
                            number_of_triangles as usize)(data)?;
    Ok((
        data,
        StlFile {
            header,
            number_of_triangles,
            triangles,
        },
    ))
}

fn main() -> io::Result<()> {
    let file_path = "example.stl";
    let data: Vec<u8> = fs::read(file_path)?;
    let bin_data: &[u8] = &data;
    let (_, stl_file) = parse_stl(bin_data).unwrap();
    dbg!("{:?}", stl_file);

    Ok(())
}

Result

[src/main.rs:61:5] stl_file = StlFile {
    header: [
        34,
        74,
        101,
        119,
        101,
    ...
        0,
        0,
        0,
        0,
        0,
        0,
    ],
    number_of_triangles: 2466,
    triangles: [
        Triangle {
            normal_vector: [
                -0.0,
                0.0,
                1.0,
            ],
            vertex_1: [
                146.41861,
                -0.44428897,
                0.0,
            ],
            vertex_2: [
                151.41861,
                -0.44428897,
                0.0,
            ],
            vertex_3: [
                146.56938,
                0.4107614,
                0.0,
            ],
            attribute_byte_count: 0,
        },
        Triangle {
...

It's a wrap! ๐ŸŒฏ

  • If you need to do parsing, string splitting is an option but it gets messy really quick
  • Regex can be a better approach, but they are hard to learn, debug, and to get right
  • `nom` offers a more interesting approach through parser combinators
  • Once you learn how to create a parser function and some of the combinators available you are good to go!
  • Nom also allows you to parse binary data!

Bonus content ๐Ÿ˜ฑ

Why is it called nom? ๐Ÿค”

Why is it called nom? ๐Ÿค”

Bonus content ๐Ÿ˜ฑ

nom will happily take a byte out of your files :)

Alternative parser combinator crates

  • winnow: fork of nom. It seems to be updated more frequently and the team claims they are working to provide a better DX.
  • chumsky: Better error reporting and error recovery. Favors a function chaining style rather than composition.

Links

THANK YOU!

Questions?

An intro to `nom`, parsing made easy for Rustaceans - Rust Dublin

By Luciano Mammino

An intro to `nom`, parsing made easy for Rustaceans - Rust Dublin

In this talk we will give you a pragmatic introduction to `nom` , a famous Rust parser combinator crate, which, in other words, is a tool that makes it easy to create parsers even for more advanced use cases. We will start by showcasing common suboptimal ways to do parsing including string splitting and using regexes. Then we will introduce nom and the main concepts it builds on, including its most commonly used functions. Finally, we will present several examples, from a few simple ones to building an entire parser for RESP, the Redis Serialization Protocol. If we do a good job, you'll probably forget about regex :)

  • 611