An intro to nom
parsing made easy for Rustaceans
Rust Dublin Meetup 2024-03-06
Roberto Gambuzzi - Luciano Mammino
๐ย Principal Software Engineer here and there
Let's connect!ย linktr.ee/gambuzzi
Hello, I'm Roberto...
๐ค Lazy by nature, Pythonic by choice
๐ฆ Old enough to be rusted
๐ I'm Lucianoย (๐ฎ๐น๐๐๐ค)
๐จโ๐ป Senior Architect @ fourTheorem
๐ Co-Author of Node.js Design Patternsย ๐
Let's connect!ย linktr.ee/loige
What the heck is a parser? ๐ค
What the heck is a parser? ๐ค
A program that can turn text (or bytes) into structured information
What the heck is a parser? ๐ค
A program that can turn text (or bytes) into structured information
"Point: (22, 17, -11)"
struct Point3D {
x: i32,
y: i32,
z: i32,
}
Naive approach
String split like a mad-man ๐คช
[label, remainder] = split(input, ": ", 2)
input = "Point: (22, 17, -11)"
label
remainder
Naive approach
String split like a mad-man ๐คช
[label, remainder] = split(input, ": ", 2)
input = "Point:ย (22, 17, -11)"
[, remainder] = split(remainder, "(", 2)
Naive approach
String split like a mad-man ๐คช
[label, remainder] = split(input, ": ", 2)
input = "Point: (22, 17, -11)"
[, remainder] = split(remainder, "(", 2)
[numbers,] = split(remainder, ")", 2)
Naive approach
String split like a mad-man ๐คช
[label, remainder] = split(input, ": ", 2)
input = "Point: (22, 17, -11)"
[, remainder] = split(remainder, "(", 2)
[numbers,] = split(remainder, ")", 2)
[x, y, z] = split(numbers, ", ", 3)
Problems with this approach
- Not always suitable (you need "delimiters")
- Simple for simple stuff, increasingly hard for "realistic" use cases
- Hard to handle errors consistently
Using a Regex ๐ง
Point: (22, 17, -11)
/ /
Using a Regex ๐ง
Point: (22, 17, -11)
/^Point: /
Using a Regex ๐ง
Point: (22, 17, -11)
/^Point: \(/
Using a Regex ๐ง
Point: (22, 17, -11)
/^Point: \((-?\d+)/
Using a Regex ๐ง
Point: (22, 17, -11)
/^Point: \((-?\d+), (-?\d+)/
Using a Regex ๐ง
Point: (22, 17, -11)
/^Point: \((-?\d+), (-?\d+), (-?\d+)/
Using a Regex ๐ง
Point: (22, 17, -11)
/^Point: \((-?\d+), (-?\d+), (-?\d+)\)$/
Using a Regex ๐ง
Point: (22, 17, -11)
/^Point: \((-?\d+), (-?\d+), (-?\d+)\)$/
Capture groups
Using a Regex ๐ง
Point: (22, 17, -11)
/^Point: \((?<x>-?\d+), (?<y>-?\d+), (?<z>-?\d+)\)$/
Named capture groups
x
y
z
$ cargo add regex
yes, the Rust standard library doesn't have built-in support for regex (yet)! ๐คทโโ๏ธ
*
use regex::Regex;
#[derive(Debug)]
struct Point3D {
x: i32,
y: i32,
z: i32,
}
fn main() {
let re = Regex::new(r"^Point: \((?<x>-?\d+), (?<y>-?\d+), (?<z>-?\d+)\)$").unwrap();
let caps = re.captures("Point: (22, 17, -11)").unwrap();
let point = Point3D {
x: caps["x"].parse().unwrap(),
y: caps["y"].parse().unwrap(),
z: caps["z"].parse().unwrap(),
};
println!("{:?}", point);
}
use regex::Regex;
#[derive(Debug)]
struct Point3D {
x: i32,
y: i32,
z: i32,
}
fn main() {
let re = Regex::new(r"^Point: \((?<x>-?\d+), (?<y>-?\d+), (?<z>-?\d+)\)$").unwrap();
let caps = re.captures("Point: (22, 17, -11)").unwrap();
let point = Point3D {
x: caps["x"].parse().unwrap(),
y: caps["y"].parse().unwrap(),
z: caps["z"].parse().unwrap(),
};
println!("{:?}", point);
}
use regex::Regex;
#[derive(Debug)]
struct Point3D {
x: i32,
y: i32,
z: i32,
}
fn main() {
let re = Regex::new(r"^Point: \((?<x>-?\d+), (?<y>-?\d+), (?<z>-?\d+)\)$").unwrap();
let caps = re.captures("Point: (22, 17, -11)").unwrap();
let point = Point3D {
x: caps["x"].parse().unwrap(),
y: caps["y"].parse().unwrap(),
z: caps["z"].parse().unwrap(),
};
println!("{:?}", point);
}
Problems with this approach
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
โ Jamie Zawinski
Problems with this approach
- Regex are hard to debug:
- it's either a full match (with captures)
- or nothing!
- Learning to write regex takes a lot of practice (possibly years!)
- Regex might not be suitable for more advanced parsing cases
- E.g. nested structures
Consuming parser
Tries to "eat and match" the source text (or bytes)
Consuming parser
Tries to "eat and match" the source text (or bytes)
input = "Point: (22, 17, -11)"
1. readLabel(input)
"Point", ": (22, 17, -11)"
Parsed data
Remainder string
Consuming parser
Tries to "eat and match" the source text (or bytes)
input = "Point: (22, 17, -11)"
1. readLabel(input)
"Point", ": (22, 17, -11)"
2. readSeparator(remainder)
":", " (22, 17, -11)"
Consuming parser
Tries to "eat and match" the source text (or bytes)
input = "Point:ย (22, 17, -11)"
1. readLabel(input)
"Point", ": (22, 17, -11)"
2. readSeparator(remainder)
":", " (22, 17, -11)"
3. readSpaces(remainder)
" ", "(22, 17, -11)"
Consuming parser
Tries to "eat and match" the source text (or bytes)
input = "Point:ย (22, 17, -11)"
1. readLabel(input)
"Point", ": (22, 17, -11)"
2. readSeparator(remainder)
":", " (22, 17, -11)"
3. readSpaces(remainder)
" ", "(22, 17, -11)"
4. readIntTuple(remainder)
[22,17,-11], ""
Consuming parser
Handling errors: if we fail to match we have to return an error
input = "Hello World"
readNumber(input)
"Cannot match number on 'Hello World'"
$ cargo add nom
fn main() {
let s = "Point: (22, 17, -11)";
let (_, point) = parse_point(s).unwrap();
println!("{:?}", point);
}
fn parse_point(input: &str) -> IResult<&str, Point3D> {
// ... parsing logic goes here
Ok((input, Point3D { x, y, z }))
}
the value to parse from
A nom Result type
Input type
Output type
fn parse_point(input: &str) -> IResult<&str, Point3D> {
let (input, _) = tag("Point: ")(input)?;
// ...
Ok((input, Point3D { x, y, z }))
}
nom combinator: exact match
what to match
the value to match against
If it fails to match
we propagate the error
the parsed value
(in this case "Point: ")
the remainder string
Point: (22, 17, -11)
fn parse_point(input: &str) -> IResult<&str, Point3D> {
let (input, _) = tag("Point: ")(input)?;
let (input, _) = tag("(")(input)?;
// ...
Ok((input, Point3D { x, y, z }))
}
Point: (22, 17, -11)
fn parse_point(input: &str) -> IResult<&str, Point3D> {
let (input, _) = tag("Point: ")(input)?;
let (input, _) = tag("(")(input)?;
let (input, x) = i32(input)?;
// ...
Ok((input, Point3D { x, y, z }))
}
combinator that parses a sign (+ or -) and sequence of digits and converts them to a i32
first coordinate
Point: (22, 17, -11)
fn parse_point(input: &str) -> IResult<&str, Point3D> {
let (input, _) = tag("Point: ")(input)?;
let (input, _) = tag("(")(input)?;
let (input, x) = i32(input)?;
let (input, _) = tag(", ")(input)?;
let (input, y) = i32(input)?;
let (input, _) = tag(", ")(input)?;
let (input, z) = i32(input)?;
let (input, _) = tag(")")(input)?;
Ok((input, Point3D { x, y, z }))
}
Point: (22, 17, -11)
With nom you generally break parsers into smaller and reusable parsers ๐ค
/// parses "," followed by 0 or more spaces
fn separator(input: &str) -> IResult<&str, ()> {
let (input, _) = pair(tag(","), space0)(input)?;
Ok((input, ()))
}
nom combinator that applies 2 parsers in sequence and returns a tuple with the 2 parsed values
/// parses 3 numbers separated by a separator (e.g. "22, 17, -11")
fn parse_coordinates(input: &str) -> IResult<&str, (i32, i32, i32)> {
let (input, (x, _, y, _, z)) = tuple((
i32,
separator,
i32,
separator,
i32
))(input)?;
Ok((input, (x, y, z)))
}
nom combinator that applies a sequence of parsers in sequence and returns a tuple with the parsed values
Note how we are reusing the parser we just created!
fn parse_point2(input: &str) -> IResult<&str, Point3D> {
let (input, _) = tag("Point: ")(input)?;
let (input, (x, y, z)) = delimited(
tag("("),
parse_coordinates,
tag(")")
)(input)?;
Ok((input, Point3D { x, y, z }))
}
We already learned about these useful nomย combinators:
- tag: exact string match
- i32: parse a number from text as i32
- space0: matches 0 or more spaces
- pair: apply 2 parsers in sequence
- tuple: apply N parsers in sequence
- delimited: combines 3 parsers to extract a value wrapped around 2 delimiters
Let's learn more by trying to parse RESP
the Redis Serialization Protocol ๐ค
RESP Primer (spec)
- Binary protocol that uses control sequences encoded in standard ASCII
- The \r\nย (CRLF) is the protocol's terminator, which always separates its parts
- The first byte in a payload identifies its type. Subsequent bytes constitute the type's contents.
- Different data types (simple, bulk, or aggregate)
RESP Primer (spec)
RESP data type | First byte |
---|---|
Simple strings | + |
Simple Errors | - |
Integers | : |
Bulk strings | $ |
Arrays | * |
Nulls | _ |
Booleans | # |
Doubles | , |
Big numbers | ( |
Bulk errors | ! |
Verbatim strings | = |
Maps | % |
Sets | ~ |
Pushes | > |
Example messages
+OK\r\n
-ERR unknown command 'asdf'\r\n
:1000\r\n
$5\r\nhello\r\n
*2\r\n$5\r\nhello\r\n$5\r\nworld\r\n
,1.23\r\n
(3492890328409238509324850943850943825024385\r\n
%2\r\n+first\r\n:1\r\n+second\r\n:2\r\n
Simple string
Simple error
Integer
Bulk string
Array
Double
Big number
Map
["hello", "world"]
{"first": 1, "second": 2}
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord, Clone)]
pub enum Value<'a> {
SimpleString(&'a str),
SimpleError(&'a str),
Integer(i64),
BulkString(&'a str),
Array(Vec<Value<'a>>),
Null,
Boolean(bool),
Double(String),
BigNumber(&'a str),
BulkError(&'a str),
VerbatimString(&'a str, &'a str),
Map(Vec<Value<'a>>, Vec<Value<'a>>),
Set(BTreeSet<Value<'a>>),
Pushes(Vec<Value<'a>>),
}
Recursive data type
pub fn parse_value(input: &str) -> IResult<&str, Value> {
let (input, type_char) = one_of("+-:$*_#,(!=%~>")(input)?;
let parser = match type_char {
'+' => parse_simple_string,
'-' => parse_simple_error,
':' => parse_integer,
'$' => parse_bulk_string,
'*' => parse_array,
'_' => parse_null,
'#' => parse_bool,
',' => parse_double,
'(' => parse_bignumber,
'!' => parse_bulk_error,
'=' => parse_verbatim_string,
'%' => parse_map,
'~' => parse_set,
'>' => parse_push,
_ => unreachable!("Invalid type"),
};
parser(input)
}
Alt-ernative approach ๐
pub fn parse_value(input: &str) -> IResult<&str, Value> {
alt((
parse_simple_string,
parse_simple_error,
parse_integer,
parse_bulk_string,
parse_array,
parse_null,
parse_bool,
parse_double,
parse_bignumber,
parse_bulk_error,
parse_verbatim_string,
parse_map,
parse_set,
parse_pushes,
))(input)
}
fn crlf(input: &str) -> IResult<&str, &str> {
tag("\r\n")(input)
}
fn parse_simple_string(input: &str) -> IResult<&str, Value> {
let (input, _) = tag("+")(input)?;
let (input, value) = terminated(
take_while(|c| c != '\r' && c != '\n'),
crlf
)(input)?;
Ok((input, Value::SimpleString(value)))
}
+Simple String Example\r\n
fn parse_bulk_string(input: &str) -> IResult<&str, Value> {
let (input, _) = tag("$")(input)?;
let (input, length) = terminated(u32, crlf)(input)?;
let (input, value) = terminated(
take(length as usize),
crlf
)(input)?;
Ok((input, Value::BulkString(value)))
}
$19\r\nBulk String Example\r\n
fn parse_array(input: &str) -> IResult<&str, Value> {
let (input, _) = tag("*")(input)?;
let (input, length) = terminated(u32, crlf)(input)?;
let (input, values) = count(parse_value, length as usize)(input)?;
Ok((input, Value::Array(values)))
}
*2\r\n$5\r\nExample\r\n$5\r\nArray\r\n
Other useful nomย combinators:
- one_of: match one char from a list of allowed ones
- terminated: applies a parser and then a second one on the remaining input but returns only the value of the first parser.
- eof: checks that the remaining input has length 0
- alt: tries a sequence of parsers until one matches
- take_while: take characters until a given condition matches
- take: takes N characters
- count: repeats a parser N times and collects the results in a vector.
Can we parse binaryย file formats? ๐ค
UINT8[80] โ Header - 80 bytes
UINT32 โ Number of triangles - 4 bytes
foreach triangle - 50 bytes:
REAL32[3] โ Normal vector - 12 bytes
REAL32[3] โ Vertex 1 - 12 bytes
REAL32[3] โ Vertex 2 - 12 bytes
REAL32[3] โ Vertex 3 - 12 bytes
UINT16 โ Attribute byte count - 2 bytes
end
STL file format
๐คทโโ๏ธ Did you know there's a monumentย of this teapot in Dublin?
use std::{fs, io};
use nom::bytes::complete::take;
use nom::multi::count;
use nom::number::complete::{le_f32, le_u16, le_u32};
use nom::sequence::tuple;
use nom::IResult;
#[derive(Debug)]
struct Triangle {
normal_vector: [f32; 3], // REAL32[3] - Normal vector
vertex_1: [f32; 3], // REAL32[3] - Vertex 1
vertex_2: [f32; 3], // REAL32[3] - Vertex 2
vertex_3: [f32; 3], // REAL32[3] - Vertex 3
attribute_byte_count: u16, // UINT16 - Attribute byte count
}
#[derive(Debug)]
pub struct StlFile<'a> {
header: &'a [u8], // UINT8[80] - Header
number_of_triangles: u32, // UINT32 - Number of triangles
// little endian
triangles: Vec<Triangle>, // Variable number of triangles
}
fn parse_triangle(input: &[u8]) -> IResult<&[u8], Triangle> {
let mut three_floats = tuple((le_f32, le_f32, le_f32));
let (input, normal_vector) = three_floats(input)?;
let (input, vertex_1) = three_floats(input)?;
let (input, vertex_2) = three_floats(input)?;
let (input, vertex_3) = three_floats(input)?;
let (input, attribute_byte_count) = le_u16(input)?;
Ok((
input,
Triangle {
normal_vector: [normal_vector.0, normal_vector.1,
normal_vector.2],
vertex_1: [vertex_1.0, vertex_1.1, vertex_1.2],
vertex_2: [vertex_2.0, vertex_2.1, vertex_2.2],
vertex_3: [vertex_3.0, vertex_3.1, vertex_3.2],
attribute_byte_count,
},
))
}
pub fn parse_stl(data: &[u8]) -> IResult<&[u8], StlFile> {
let (data, header) = take(80usize)(data)?;
let (data, number_of_triangles) = le_u32(data)?;
let (data, triangles) = count(parse_triangle,
number_of_triangles as usize)(data)?;
Ok((
data,
StlFile {
header,
number_of_triangles,
triangles,
},
))
}
fn main() -> io::Result<()> {
let file_path = "example.stl";
let data: Vec<u8> = fs::read(file_path)?;
let bin_data: &[u8] = &data;
let (_, stl_file) = parse_stl(bin_data).unwrap();
dbg!("{:?}", stl_file);
Ok(())
}
Result
[src/main.rs:61:5] stl_file = StlFile {
header: [
34,
74,
101,
119,
101,
...
0,
0,
0,
0,
0,
0,
],
number_of_triangles: 2466,
triangles: [
Triangle {
normal_vector: [
-0.0,
0.0,
1.0,
],
vertex_1: [
146.41861,
-0.44428897,
0.0,
],
vertex_2: [
151.41861,
-0.44428897,
0.0,
],
vertex_3: [
146.56938,
0.4107614,
0.0,
],
attribute_byte_count: 0,
},
Triangle {
...
It's a wrap! ๐ฏ
- If you need to do parsing, string splitting is an option but it gets messy really quick
- Regex can be a better approach, but they are hard to learn, debug, and to get right
- `nom` offers a more interesting approach through parser combinators
- Once you learn how to create a parser function and some of the combinators available you are good to go!
- Nom also allows you to parse binary data!
Bonus content ๐ฑ
Why is it called nom? ๐ค
Why is it called nom? ๐ค
Bonus content ๐ฑ
nom will happily take a byte out of your files :)
Alternative parser combinator crates
Links
- nom documentation: docs.rs/nom
- nom combinators cheatsheet
- nom error management cheatsheet
- nom official examples
- our Redis Protocol parser library: tinyresp
- our live-coding playlist on YouTube
- for more Rusty live-streams: Twitchย - YouTube