Luciano Mammino
Cloud developer, entrepreneur, fighter, butterfly maker! #nodejs #javascript - Author of https://www.nodejsdesignpatterns.com , Founder of https://fullstackbulletin.com
parsing made easy for Rustaceans
Rust Dublin Meetup 2024-03-06
Roberto Gambuzzi - Luciano Mammino
Hello, I'm Roberto...
Principal Software Engineer here and there
Lazy by nature, Pythonic by choice
Old enough to be rusted
Let's connect! linktr.ee/gambuzzi
I'm Luciano
Senior Architect @ fourTheorem
Co-Author of Node.js Design Patterns
Let's connect! linktr.ee/loige
A program that can turn text (or bytes) into structured information
"Point: (22, 17, -11)"
struct Point3D {
    x: i32,
    y: i32,
    z: i32,
}
String split like a mad-man
input = "Point: (22, 17, -11)"
[label, remainder] = split(input, ": ", 2)
[, remainder] = split(remainder, "(", 2)
[numbers,] = split(remainder, ")", 2)
[x, y, z] = split(numbers, ", ", 3)
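The pseudocode above could be translated to Rust roughly as follows. This is only a sketch using the standard library (`split_once` and `split`); the function name `parse_point_split` is hypothetical and not part of the talk.

```rust
#[derive(Debug, PartialEq)]
struct Point3D {
    x: i32,
    y: i32,
    z: i32,
}

/// Hypothetical helper: parses "Point: (22, 17, -11)" by repeated splitting.
/// Returns None as soon as an expected piece is missing or is not a number.
fn parse_point_split(input: &str) -> Option<Point3D> {
    // [label, remainder] = split(input, ": ", 2)
    let (label, remainder) = input.split_once(": ")?;
    if label != "Point" {
        return None;
    }
    // [, remainder] = split(remainder, "(", 2)
    let (_, remainder) = remainder.split_once('(')?;
    // [numbers,] = split(remainder, ")", 2)
    let (numbers, _) = remainder.split_once(')')?;
    // [x, y, z] = split(numbers, ", ", 3)
    let mut parts = numbers.split(", ");
    let x: i32 = parts.next()?.parse().ok()?;
    let y: i32 = parts.next()?.parse().ok()?;
    let z: i32 = parts.next()?.parse().ok()?;
    if parts.next().is_some() {
        return None; // more than three coordinates
    }
    Some(Point3D { x, y, z })
}
```

Calling `parse_point_split("Point: (22, 17, -11)")` yields the point from the slide. The amount of bookkeeping needed for such a small format is a hint that this approach does not scale well.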
Point: (22, 17, -11)
/^Point: /
/^Point: \(/
/^Point: \((-?\d+)/
/^Point: \((-?\d+), (-?\d+)/
/^Point: \((-?\d+), (-?\d+), (-?\d+)/
/^Point: \((-?\d+), (-?\d+), (-?\d+)\)$/
Capture groups: each (-?\d+)
Named capture groups: x, y, z
/^Point: \((?<x>-?\d+), (?<y>-?\d+), (?<z>-?\d+)\)$/
$ cargo add regex
Yes, the Rust standard library doesn't have built-in support for regex (yet)!
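use regex::Regex;

#[derive(Debug)]
struct Point3D {
    x: i32,
    y: i32,
    z: i32,
}

fn main() {
    // Compile the pattern with one named capture group per coordinate.
    let re = Regex::new(r"^Point: \((?<x>-?\d+), (?<y>-?\d+), (?<z>-?\d+)\)$").unwrap();
    // `captures` returns None if the input does not match the pattern.
    let caps = re.captures("Point: (22, 17, -11)").unwrap();
    // Each capture is still a string slice, so it has to be parsed into an i32.
    let point = Point3D {
        x: caps["x"].parse().unwrap(),
        y: caps["y"].parse().unwrap(),
        z: caps["z"].parse().unwrap(),
    };
    println!("{:?}", point);
}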
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
- Jamie Zawinski
Tries to "eat and match" the source text (or bytes)
input = "Point: (22, 17, -11)"
Each step returns the parsed data and the remainder string:
1. readLabel(input) -> "Point", ": (22, 17, -11)"
2. readSeparator(remainder) -> ":", " (22, 17, -11)"
3. readSpaces(remainder) -> " ", "(22, 17, -11)"
4. readIntTuple(remainder) -> [22,17,-11], ""
Handling errors: if we fail to match we have to return an error
input = "Hello World"
readNumber(input)
"Cannot match number on 'Hello World'"
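To make the "eat and match" idea concrete, here is a minimal hand-rolled sketch using only the standard library. The names `read_label` and `read_number` mirror the pseudocode steps; the `Step` alias and the `String` error type are assumptions made for illustration. Unlike the slide, each step returns the remainder first and the parsed data second, which is the order nom uses.

```rust
/// Result of one step: the remainder string plus the parsed data, or an error.
type Step<'a, T> = Result<(&'a str, T), String>;

/// Eats leading ASCII letters, e.g. "Point: (..." -> remainder ": (...", label "Point".
fn read_label<'a>(input: &'a str) -> Step<'a, &'a str> {
    let end = input
        .find(|c: char| !c.is_ascii_alphabetic())
        .unwrap_or(input.len());
    if end == 0 {
        return Err(format!("Cannot match label on '{input}'"));
    }
    Ok((&input[end..], &input[..end]))
}

/// Eats an optional '-' followed by ASCII digits and converts them to an i32.
fn read_number<'a>(input: &'a str) -> Step<'a, i32> {
    // Skip over the sign (if any) before looking for the end of the digits.
    let digits_start = usize::from(input.starts_with('-'));
    let end = input[digits_start..]
        .find(|c: char| !c.is_ascii_digit())
        .map_or(input.len(), |i| i + digits_start);
    match input[..end].parse::<i32>() {
        Ok(n) => Ok((&input[end..], n)),
        Err(_) => Err(format!("Cannot match number on '{input}'")),
    }
}
```

Every step has the same shape (input in, remainder and value out, or an error), which is what makes such small parsers easy to chain.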
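$ cargo add nom@7
The examples that follow use the nom 7 call style, parser(args)(input); nom 8 moved to a Parser trait and .parse(input).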
fn main() {
    let s = "Point: (22, 17, -11)";
    let (_, point) = parse_point(s).unwrap();
    println!("{:?}", point);
}

fn parse_point(input: &str) -> IResult<&str, Point3D> {
    // ... parsing logic goes here
    Ok((input, Point3D { x, y, z }))
}
input: the value to parse from
IResult<&str, Point3D>: a nom Result type, where &str is the input type and Point3D is the output type
fn parse_point(input: &str) -> IResult<&str, Point3D> {
    let (input, _) = tag("Point: ")(input)?;
    // ...
    Ok((input, Point3D { x, y, z }))
}
tag: nom combinator for an exact match
tag("Point: "): what to match
(input): the value to match against
?: if it fails to match, we propagate the error
_: the parsed value (in this case "Point: ")
input: the remainder string
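To show what an exact-match combinator does under the hood, here is a standard-library-only sketch. `tag_sketch` and `eat_prefix` are hypothetical names, and the `String` error is a simplification; nom's real `tag` is generic over input types and returns nom's own error type.

```rust
/// Hypothetical stand-in for an exact-match combinator: builds a parser that
/// succeeds only if the input starts with `expected`.
fn tag_sketch<'a>(
    expected: &'static str,
) -> impl Fn(&'a str) -> Result<(&'a str, &'a str), String> {
    move |input: &'a str| match input.strip_prefix(expected) {
        // (remainder, matched part): remainder first, as in nom
        Some(rest) => Ok((rest, &input[..expected.len()])),
        None => Err(format!("expected '{expected}' at '{input}'")),
    }
}

/// Same shape as the slide: eat "Point: " and hand back whatever is left.
fn eat_prefix(input: &str) -> Result<&str, String> {
    // `?` propagates the error if the prefix does not match.
    let (input, _) = tag_sketch("Point: ")(input)?;
    Ok(input)
}
```

Note the two-step call, `tag_sketch("Point: ")(input)`: the first call builds a parser, the second runs it on the input.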
fn parse_point(input: &str) -> IResult<&str, Point3D> {
    let (input, _) = tag("Point: ")(input)?;
    let (input, _) = tag("(")(input)?;
    let (input, x) = i32(input)?;
    // ...
    Ok((input, Point3D { x, y, z }))
}
i32: combinator that parses an optional sign (+ or -) followed by a sequence of digits and converts them to an i32
x: the first coordinate
use nom::{bytes::complete::tag, character::complete::i32, IResult};

fn parse_point(input: &str) -> IResult<&str, Point3D> {
    let (input, _) = tag("Point: ")(input)?;
    let (input, _) = tag("(")(input)?;
    let (input, x) = i32(input)?;
    let (input, _) = tag(", ")(input)?;
    let (input, y) = i32(input)?;
    let (input, _) = tag(", ")(input)?;
    let (input, z) = i32(input)?;
    let (input, _) = tag(")")(input)?;
    Ok((input, Point3D { x, y, z }))
}
use nom::{bytes::complete::tag, character::complete::space0, sequence::pair, IResult};

/// parses "," followed by 0 or more spaces
fn separator(input: &str) -> IResult<&str, ()> {
    let (input, _) = pair(tag(","), space0)(input)?;
    Ok((input, ()))
}
nom combinator that applies 2 parsers in sequence and returns a tuple with the 2 parsed values
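As an illustration of how a combinator that sequences two parsers can be built, here is a sketch with no dependencies. `pair_sketch`, `comma`, `spaces0` and the `PResult` alias are hypothetical names for this example, not nom's implementation.

```rust
type PResult<'a, T> = Result<(&'a str, T), String>;

/// Eats a single ','.
fn comma<'a>(input: &'a str) -> PResult<'a, char> {
    match input.strip_prefix(',') {
        Some(rest) => Ok((rest, ',')),
        None => Err(format!("expected ',' at '{input}'")),
    }
}

/// Eats zero or more spaces or tabs; it never fails.
fn spaces0<'a>(input: &'a str) -> PResult<'a, &'a str> {
    let end = input
        .find(|c: char| c != ' ' && c != '\t')
        .unwrap_or(input.len());
    Ok((&input[end..], &input[..end]))
}

/// Hypothetical stand-in for a pair combinator: run `first`, feed its
/// remainder to `second`, and return both outputs as a tuple.
fn pair_sketch<'a, A, B>(
    first: impl Fn(&'a str) -> PResult<'a, A>,
    second: impl Fn(&'a str) -> PResult<'a, B>,
) -> impl Fn(&'a str) -> PResult<'a, (A, B)> {
    move |input: &'a str| {
        let (input, a) = first(input)?;
        let (input, b) = second(input)?;
        Ok((input, (a, b)))
    }
}
```

Here `pair_sketch(comma, spaces0)` behaves like the separator parser: it accepts a comma followed by any amount of whitespace, including none.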
use nom::{character::complete::i32, sequence::tuple, IResult};

/// parses 3 numbers separated by a separator (e.g. "22, 17, -11")
fn parse_coordinates(input: &str) -> IResult<&str, (i32, i32, i32)> {
    let (input, (x, _, y, _, z)) = tuple((
        i32,
        separator,
        i32,
        separator,
        i32,
    ))(input)?;
    Ok((input, (x, y, z)))
}
tuple: nom combinator that applies several parsers in sequence and returns a tuple with the parsed values
Note how we are reusing the separator parser we just created!
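use nom::{bytes::complete::tag, sequence::delimited, IResult};

fn parse_point2(input: &str) -> IResult<&str, Point3D> {
    let (input, _) = tag("Point: ")(input)?;
    // delimited runs three parsers and keeps only the output of the middle one
    let (input, (x, y, z)) = delimited(
        tag("("),
        parse_coordinates,
        tag(")"),
    )(input)?;
    Ok((input, Point3D { x, y, z }))
}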
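The "open, content, close" pattern used in `parse_point2` can be sketched without any dependency as follows. All names here (`delimited_sketch`, `literal`, `until_paren`, `PResult`) are hypothetical and only illustrate the idea; they are not nom's implementation.

```rust
type PResult<'a, T> = Result<(&'a str, T), String>;

/// Builds a parser that eats exactly `expected` (a rough stand-in for `tag`).
fn literal<'a>(expected: &'static str) -> impl Fn(&'a str) -> PResult<'a, ()> {
    move |input: &'a str| match input.strip_prefix(expected) {
        Some(rest) => Ok((rest, ())),
        None => Err(format!("expected '{expected}' at '{input}'")),
    }
}

/// Hypothetical stand-in for `delimited`: run three parsers in sequence and
/// keep only the output of the middle one.
fn delimited_sketch<'a, O, M, C>(
    open: impl Fn(&'a str) -> PResult<'a, O>,
    middle: impl Fn(&'a str) -> PResult<'a, M>,
    close: impl Fn(&'a str) -> PResult<'a, C>,
) -> impl Fn(&'a str) -> PResult<'a, M> {
    move |input: &'a str| {
        let (input, _) = open(input)?;
        let (input, value) = middle(input)?;
        let (input, _) = close(input)?;
        Ok((input, value))
    }
}

/// Toy middle parser: everything up to (not including) the first ')'.
fn until_paren<'a>(input: &'a str) -> PResult<'a, &'a str> {
    match input.find(')') {
        Some(i) => Ok((&input[i..], &input[..i])),
        None => Err(format!("expected ')' somewhere in '{input}'")),
    }
}
```

With these pieces, `delimited_sketch(literal("("), until_paren, literal(")"))` applied to `"(22, 17, -11)"` returns the text between the parentheses and an empty remainder.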
We already learned about these useful nom combinators: tag, i32, space0, pair, tuple, delimited
Let's learn more by trying to parse RESP
the Redis Serialization Protocol
RESP Primer (spec)
RESP data type | First byte |
---|---|
Simple strings | + |
Simple Errors | - |
Integers | : |
Bulk strings | $ |
Arrays | * |
Nulls | _ |
Booleans | # |
Doubles | , |
Big numbers | ( |
Bulk errors | ! |
Verbatim strings | = |
Maps | % |
Sets | ~ |
Pushes | > |
Example messages
Simple string: +OK\r\n
Simple error: -ERR unknown command 'asdf'\r\n
Integer: :1000\r\n
Bulk string: $5\r\nhello\r\n
Array: *2\r\n$5\r\nhello\r\n$5\r\nworld\r\n (equivalent to ["hello", "world"])
Double: ,1.23\r\n
Big number: (3492890328409238509324850943850943825024385\r\n
Map: %2\r\n+first\r\n:1\r\n+second\r\n:2\r\n (equivalent to {"first": 1, "second": 2})
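Since the first byte alone identifies the data type, a parser can decide how to proceed by inspecting a single character. The sketch below (the function name `resp_type_name` is hypothetical) simply encodes the table above as a `match`.

```rust
/// Maps the first byte of a RESP message to the name of its data type,
/// following the table in the RESP primer. Returns None for an empty message
/// or an unknown first byte.
fn resp_type_name(message: &str) -> Option<&'static str> {
    let name = match message.chars().next()? {
        '+' => "Simple string",
        '-' => "Simple error",
        ':' => "Integer",
        '$' => "Bulk string",
        '*' => "Array",
        '_' => "Null",
        '#' => "Boolean",
        ',' => "Double",
        '(' => "Big number",
        '!' => "Bulk error",
        '=' => "Verbatim string",
        '%' => "Map",
        '~' => "Set",
        '>' => "Push",
        _ => return None,
    };
    Some(name)
}
```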
use std::collections::BTreeSet;

#[derive(Debug, PartialEq, Eq, PartialOrd, Ord, Clone)]
pub enum Value<'a> {
    SimpleString(&'a str),
    SimpleError(&'a str),
    Integer(i64),
    BulkString(&'a str),
    Array(Vec<Value<'a>>),
    Null,
    Boolean(bool),
    Double(String),
    BigNumber(&'a str),
    BulkError(&'a str),
    VerbatimString(&'a str, &'a str),
    Map(Vec<Value<'a>>, Vec<Value<'a>>),
    Set(BTreeSet<Value<'a>>),
    Pushes(Vec<Value<'a>>),
}
Recursive data type
pub fn parse_value(input: &str) -> IResult<&str, Value> {
    // one_of consumes the type byte, so the parser selected below
    // receives the input that follows it
    let (input, type_char) = one_of("+-:$*_#,(!=%~>")(input)?;
    let parser = match type_char {
        '+' => parse_simple_string,
        '-' => parse_simple_error,
        ':' => parse_integer,
        '$' => parse_bulk_string,
        '*' => parse_array,
        '_' => parse_null,
        '#' => parse_bool,
        ',' => parse_double,
        '(' => parse_bignumber,
        '!' => parse_bulk_error,
        '=' => parse_verbatim_string,
        '%' => parse_map,
        '~' => parse_set,
        '>' => parse_pushes,
        // one_of only succeeds on the characters listed above
        _ => unreachable!("Invalid type"),
    };
    parser(input)
}
Alt-ernative approach
pub fn parse_value(input: &str) -> IResult<&str, Value> {
    // alt tries each parser in order and returns the first success;
    // here every parser matches its own type byte
    alt((
        parse_simple_string,
        parse_simple_error,
        parse_integer,
        parse_bulk_string,
        parse_array,
        parse_null,
        parse_bool,
        parse_double,
        parse_bignumber,
        parse_bulk_error,
        parse_verbatim_string,
        parse_map,
        parse_set,
        parse_pushes,
    ))(input)
}
fn crlf(input: &str) -> IResult<&str, &str> {
    tag("\r\n")(input)
}

fn parse_simple_string(input: &str) -> IResult<&str, Value> {
    let (input, _) = tag("+")(input)?;
    let (input, value) = terminated(
        take_while(|c| c != '\r' && c != '\n'),
        crlf,
    )(input)?;
    Ok((input, Value::SimpleString(value)))
}
+Simple String Example\r\n
fn parse_bulk_string(input: &str) -> IResult<&str, Value> {
    let (input, _) = tag("$")(input)?;
    let (input, length) = terminated(u32, crlf)(input)?;
    let (input, value) = terminated(
        take(length as usize),
        crlf,
    )(input)?;
    Ok((input, Value::BulkString(value)))
}
$19\r\nBulk String Example\r\n
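The length-prefixed layout of a bulk string can also be illustrated with a dependency-free sketch. The name `parse_bulk_string_sketch` is hypothetical, the error messages are made up for this example, and the length is interpreted in bytes.

```rust
/// Parses a RESP bulk string such as "$5\r\nhello\r\n".
/// Returns (remainder, payload); the declared length is counted in bytes.
fn parse_bulk_string_sketch(input: &str) -> Result<(&str, &str), String> {
    let rest = input
        .strip_prefix('$')
        .ok_or_else(|| format!("expected a bulk string at '{input}'"))?;
    // The declared length is everything up to the first CRLF.
    let (len_text, rest) = rest
        .split_once("\r\n")
        .ok_or_else(|| "missing CRLF after length".to_string())?;
    let len: usize = len_text
        .parse()
        .map_err(|_| format!("invalid length '{len_text}'"))?;
    // `get` returns None if the payload is too short or if `len` would split
    // a multi-byte character, so this never panics.
    let payload = rest
        .get(..len)
        .ok_or_else(|| "payload shorter than declared length".to_string())?;
    // The payload must be followed immediately by CRLF.
    let rest = rest[len..]
        .strip_prefix("\r\n")
        .ok_or_else(|| "missing CRLF after payload".to_string())?;
    Ok((rest, payload))
}
```

Because the payload is read by length rather than by scanning for a terminator, a declared length that does not match the actual payload is reported as an error.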
fn parse_array(input: &str) -> IResult<&str, Value> {
    let (input, _) = tag("*")(input)?;
    let (input, length) = terminated(u32, crlf)(input)?;
    let (input, values) = count(parse_value, length as usize)(input)?;
    Ok((input, Value::Array(values)))
}
*2\r\n$7\r\nExample\r\n$5\r\nArray\r\n
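Arrays are where the recursive nature of the data type shows up: each element is itself a complete value, which may be another array. The sketch below shows that recursion with only the standard library and a reduced, hypothetical `Value` enum (integers and arrays only); `number_line` and `parse_value_sketch` are names made up for this example.

```rust
type PResult<'a, T> = Result<(&'a str, T), String>;

#[derive(Debug, PartialEq)]
enum Value {
    Integer(i64),
    Array(Vec<Value>),
}

/// Reads a number terminated by CRLF, e.g. "2\r\n..." -> (remainder, 2).
fn number_line<'a, T: std::str::FromStr>(input: &'a str) -> PResult<'a, T> {
    let (text, rest) = input
        .split_once("\r\n")
        .ok_or_else(|| format!("missing CRLF in '{input}'"))?;
    let number = text
        .parse::<T>()
        .map_err(|_| format!("invalid number '{text}'"))?;
    Ok((rest, number))
}

fn parse_value_sketch<'a>(input: &'a str) -> PResult<'a, Value> {
    if let Some(rest) = input.strip_prefix(':') {
        let (rest, n) = number_line::<i64>(rest)?;
        Ok((rest, Value::Integer(n)))
    } else if let Some(rest) = input.strip_prefix('*') {
        let (mut rest, len) = number_line::<usize>(rest)?;
        let mut items = Vec::new();
        for _ in 0..len {
            // Recursion: every element is parsed as a full value.
            let (next, item) = parse_value_sketch(rest)?;
            items.push(item);
            rest = next;
        }
        Ok((rest, Value::Array(items)))
    } else {
        Err(format!("unsupported type at '{input}'"))
    }
}
```

The loop plays the role of a "repeat this parser N times" combinator: it threads the remainder from one element to the next and stops at the first error.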
Other useful nom combinators:
STL file format
UINT8[80] - Header - 80 bytes
UINT32 - Number of triangles - 4 bytes
foreach triangle - 50 bytes:
    REAL32[3] - Normal vector - 12 bytes
    REAL32[3] - Vertex 1 - 12 bytes
    REAL32[3] - Vertex 2 - 12 bytes
    REAL32[3] - Vertex 3 - 12 bytes
    UINT16 - Attribute byte count - 2 bytes
end
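The layout above fixes the size of every section, so the total file size follows directly from the triangle count. As a small dependency-free sketch (the names `expected_stl_size` and `read_triangle_count` are hypothetical), the count can be read from the four little-endian bytes that follow the header:

```rust
const HEADER_LEN: usize = 80; // UINT8[80]
const COUNT_LEN: usize = 4; // UINT32
const TRIANGLE_LEN: usize = 50; // 4 x REAL32[3] (48 bytes) + UINT16 (2 bytes)

/// Expected size in bytes of a binary STL file holding `triangles` triangles.
fn expected_stl_size(triangles: u32) -> usize {
    HEADER_LEN + COUNT_LEN + TRIANGLE_LEN * triangles as usize
}

/// Reads the little-endian triangle count stored right after the header.
/// Returns None if the data is too short to contain it.
fn read_triangle_count(data: &[u8]) -> Option<u32> {
    let bytes = data.get(HEADER_LEN..HEADER_LEN + COUNT_LEN)?;
    Some(u32::from_le_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]))
}
```

Comparing `expected_stl_size(count)` with the actual length of the data is a cheap sanity check before parsing the triangles.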
Did you know there's a monument of this teapot in Dublin?
use std::{fs, io};

use nom::bytes::complete::take;
use nom::multi::count;
use nom::number::complete::{le_f32, le_u16, le_u32};
use nom::sequence::tuple;
use nom::IResult;

#[derive(Debug)]
struct Triangle {
    normal_vector: [f32; 3],   // REAL32[3] - Normal vector
    vertex_1: [f32; 3],        // REAL32[3] - Vertex 1
    vertex_2: [f32; 3],        // REAL32[3] - Vertex 2
    vertex_3: [f32; 3],        // REAL32[3] - Vertex 3
    attribute_byte_count: u16, // UINT16 - Attribute byte count
}

#[derive(Debug)]
pub struct StlFile<'a> {
    header: &'a [u8],         // UINT8[80] - Header
    number_of_triangles: u32, // UINT32 - Number of triangles (little endian)
    triangles: Vec<Triangle>, // Variable number of triangles
}
fn parse_triangle(input: &[u8]) -> IResult<&[u8], Triangle> {
    let mut three_floats = tuple((le_f32, le_f32, le_f32));
    let (input, normal_vector) = three_floats(input)?;
    let (input, vertex_1) = three_floats(input)?;
    let (input, vertex_2) = three_floats(input)?;
    let (input, vertex_3) = three_floats(input)?;
    let (input, attribute_byte_count) = le_u16(input)?;
    Ok((
        input,
        Triangle {
            normal_vector: [normal_vector.0, normal_vector.1, normal_vector.2],
            vertex_1: [vertex_1.0, vertex_1.1, vertex_1.2],
            vertex_2: [vertex_2.0, vertex_2.1, vertex_2.2],
            vertex_3: [vertex_3.0, vertex_3.1, vertex_3.2],
            attribute_byte_count,
        },
    ))
}
pub fn parse_stl(data: &[u8]) -> IResult<&[u8], StlFile> {
    let (data, header) = take(80usize)(data)?;
    let (data, number_of_triangles) = le_u32(data)?;
    let (data, triangles) = count(parse_triangle, number_of_triangles as usize)(data)?;
    Ok((
        data,
        StlFile {
            header,
            number_of_triangles,
            triangles,
        },
    ))
}
fn main() -> io::Result<()> {
    let file_path = "example.stl";
    let data: Vec<u8> = fs::read(file_path)?;
    let bin_data: &[u8] = &data;
    let (_, stl_file) = parse_stl(bin_data).unwrap();
    // dbg! takes expressions, not a format string
    dbg!(stl_file);
    Ok(())
}
[src/main.rs:61:5] stl_file = StlFile {
    header: [
        34,
        74,
        101,
        119,
        101,
        ...
        0,
        0,
        0,
        0,
        0,
        0,
    ],
    number_of_triangles: 2466,
    triangles: [
        Triangle {
            normal_vector: [
                -0.0,
                0.0,
                1.0,
            ],
            vertex_1: [
                146.41861,
                -0.44428897,
                0.0,
            ],
            vertex_2: [
                151.41861,
                -0.44428897,
                0.0,
            ],
            vertex_3: [
                146.56938,
                0.4107614,
                0.0,
            ],
            attribute_byte_count: 0,
        },
        Triangle {
            ...
nom will happily take a byte out of your files :)
By Luciano Mammino
In this talk we will give you a pragmatic introduction to `nom`, a famous Rust parser combinator crate; in other words, a tool that makes it easy to create parsers, even for more advanced use cases. We will start by showcasing common suboptimal ways to do parsing, including string splitting and regexes. Then we will introduce nom and the main concepts it builds on, including its most commonly used functions. Finally, we will present several examples, from a few simple ones to building an entire parser for RESP, the Redis Serialization Protocol. If we do a good job, you'll probably forget about regex :)