Pry - pragmatic parser combinators in D

Dmitry Olshansky

Dconf 2017

Setting up the stage

For me it all started out with std.regex in 2011

... aimed just to plug a hole in the ecosystem

it actually got us into the top regex libraries!

The tools that got us up on that hill are:

  1. Compile-time execution - building data structures
  2. Compile-time codegen - constructing the source code

Earned a lot of experience dealing with Unicode

crystallized in the new std.uni (2012)

Has been in the regex arms race ever since

Simplicity of regex

There is a pure simple beautiful subset of regex

It's the one that actually runs fast

Woefully underpowered though

And then there are extensions... ugly beasts

Lookaround and backreferences kill any optimizations

The power they add is marginal at best

All in all

Highly overused due to popularity (use a parser!)

Severe usability problems (100+ line regexes)

The challenge is to create and popularize parser generators

State of things

Parser generators are generally frowned upon

Most languages end up re-writing their parsers by hand

The general usability problems are:

  • Cumbersome extra build step
  • Poor error handling
  • Low performance

There are actually a number of parser generators in D

In particular Pegged integrates nicely with the language

But I find it idealistic and not performance-minded

Ideals and Goals

Want a parser generator that:

  • Is easy to use - less hassle than writing by hand
  • Performs on par with a handwritten parser
  • Has sensible error handling
  • Has a simple and composable implementation

Key principle - performance first, features second

If it's too slow nobody will use it

Can always add features later, unlike performance

Parser combinators

A parser is basically a function:

Input -> OneOf(Value, Error)

[Diagram: the parser consumes Input and yields either a Value plus the modified input, or an Error]

Parser combinators

Naturally parsers can be combined as a sequence

creating a new parser

[Diagram: two parsers chained; the combined parser maps Input to a Tuple(X, Y) or an Error]

If the first parser succeeds the next one is applied

the result is a tuple of both values

Parser combinators

Alternatively parsers can be combined as a choice

[Diagram: alternative parsers; the combined parser maps Input to an Algebraic(X, Y) - a sum type (union) - or an Error]

Only if the first parser fails is the next one in the chain tried; the result is naturally an Algebraic(X, Y)


Bits and pieces

Library generally provides

Atoms - basic building blocks:

token, literal, char class (à la regex), etc.

Combinators:

sequence, alternative, repetition, slice, delimited sequence, aa, lookahead, etc.

Grammar:

a module that constructs combinators from a textual DSL - a PEG grammar

New atoms and combinators can easily be written by the user (see the sketch below)
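
For instance, here is a hedged sketch of a user-defined atom in the same spirit as the library atoms shown next - any type providing this parse() method can act as a parser (the Eof name and the Stream template parameter are illustrative):

struct Eof(Stream) {
    alias Value = bool;

    bool parse(ref Stream stream, ref Value value, ref Stream.Error err) const {
        if(stream.empty) {
            value = true;
            return true;
        }
        err.location = stream.location;
        err.reason = "expected end of input";
        return false;
    }
}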

Show me the code!

Let's consider the most basic parser - a fixed token

struct Tk(alias c) {
    import std.conv : to;
    // Stream and std.range's ElementType are assumed to be in scope (elided on the slide)
    static immutable msg = "expected '" ~ to!string(c) ~ "'";
    alias Value = ElementType!Stream;

    bool parse(ref Stream stream, ref Value value, ref Stream.Error err) const {
        if(stream.empty) {
            err.location = stream.location;
            err.reason = "unexpected end of stream";
            return false;
        }
        if(stream.front == c){
            value = c;
            stream.popFront();
            return true;
        }
        else {
            err.location = stream.location;
            err.reason = msg;
            return false;
        }
    }
}

auto tk(alias c)(){ return Tk!c(); }
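
A hypothetical usage sketch, in the style of the calc unittest later in the talk (it assumes the string-level parse helper shown there):

unittest {
    with(parsers!string) {
        // a single-token parser yields the matched element
        assert("(".parse(tk!'(') == '(');
    }
}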

Char classes

A more serious building block - testing if a char belongs to a set

struct Set(alias set) {
    import std.uni;
    enum val = set.byInterval.length;
    static if(val <= 6) {
        // Generate optimal "binary search" of if/else clauses
        mixin("static " ~ set.toSourceCode("test"));
    }
    else {
        // This actually builds multi-staged lookup table at compile-time
        static immutable matcher = CharMatcher(set);

        static bool test(dchar ch){
            return matcher[ch];
        }
    }
    ... // same as tk save for the test
}

This leverages the same fast lookup tables as std.regex
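
A hedged usage sketch with a std.uni set - with only a few intervals the generated if/else test is mixed in, otherwise the table-based matcher kicks in:

import std.uni : CodepointSet;

// [0-9A-Fa-f] - three intervals, so Set uses the generated test function
enum CodepointSet hexDigits = CodepointSet('0', '9'+1, 'A', 'F'+1, 'a', 'f'+1);
alias HexDigit = Set!hexDigits;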

Sequence

Implementing a sequence with D's variadic templates

struct Seq(P...){
    alias Stream = ParserStream!(P[0]);
    alias Value = Tuple!(staticMap!(ParserValue, P));

    private P parsers;
    
    bool parse(ref Stream stream, ref Value value, ref Stream.Error err) const {
        auto save = stream.mark;
        foreach(i, ref p; parsers) {
            if(!p.parse(stream, value[i], err)){
                stream.restore(save); // rewind to the start if any element fails
                return false;
            }
        }
        }
        return true;
    }
}
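
A hedged usage sketch (same assumptions as before - the seq/tk factories and the string-level parse helper from the calc example):

unittest {
    with(parsers!string) {
        // a sequence of two token parsers yields a Tuple of both values
        auto v = "ab".parse(seq(tk!'a', tk!'b'));
        assert(v[0] == 'a' && v[1] == 'b');
    }
}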

Alternative

Again going to use a variadic template

struct Any(P...){
    alias Stream = ParserStream!(P[0]);
    alias Values = NoDuplicates!(staticMap!(ParserValue, P));
    alias Value = Algebraic!Values;
    private P parsers;

    bool parse(ref Stream stream, ref Value value, ref Stream.Error err) const {
        ...
    }
}

Alternative #2

bool parse(ref Stream stream, ref Value value, ref Stream.Error err) const {
    Stream.Error current;
    // try each alternative in turn, remembering the error that got the furthest
    foreach(i, ref p; parsers) {
        ParserValue!(P[i]) tmp;
        static if(i == 0){
            if(p.parse(stream, tmp, err)){
                value = tmp;
                return true;
            }
        }
        else {
            if(p.parse(stream, tmp, current)){
                value = tmp;
                return true;
            }
            // pick the deeper error
            if(err.location < current.location){
                err = current;
            }
        }
    }
    return false;
}
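
A hedged usage sketch - per the definition above the choice yields an Algebraic of the alternatives' value types (a single type here, since both are tokens):

unittest {
    with(parsers!string) {
        // the first alternative fails, the second one matches
        auto v = "b".parse(any(tk!'a', tk!'b'));
        assert(v == 'b');
    }
}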

Array

Repeatedly apply a parser and append the values to an array

struct ArrayImpl(size_t minTimes, size_t maxTimes, Parser){
    alias Stream = ParserStream!Parser;
    alias Value = ParserValue!Parser[];
    private Parser parser;

    bool parse(ref Stream stream, ref Value value, ref Stream.Error err) const {
        auto start = stream.mark;
        ParserValue!Parser tmp;
        size_t i = 0;
        value = null;
        // first the required minimum number of matches
        for(; i<minTimes; i++) {
            if(!parser.parse(stream, tmp, err)){
                stream.restore(start);
                return false;
            }
            value ~= tmp;
        }
        // then optional matches up to the maximum
        for(; i<maxTimes; i++){
            if(!parser.parse(stream, tmp, err)) break;
            value ~= tmp;
        }
        return true;
    }
}
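
This is what backs .rep in the calc example below - a hedged usage sketch (assuming rep requires at least one match by default):

unittest {
    with(parsers!string) {
        // collect a run of digits into an array of code points
        auto digits = range!('0', '9').rep;
        assert("123".parse(digits) == "123"d);
    }
}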

Forward reference

Sometimes we need the ability to do self-recursion

We have to reference a parser that is not yet fully constructed

interface DynamicParser(V) {
    bool parse(ref Stream stream, ref V value, ref Stream.Error err) const;
}

// Use __LINE__ & __FILE__ to provide unique types of Dynamic.
auto dynamic(V, size_t line=__LINE__, string file=__FILE__)(){
    static class Dynamic : DynamicParser!V {
        DynamicParser!V wrapped;
    final:
        void opAssign(P)(P parser)
        if(isParser!P && !is(P : Dynamic)){
            wrapped = wrap(parser);
        }

        bool parse(ref Stream stream, ref V value, ref Stream.Error err) const {
            assert(wrapped, "Use of empty dynamic parser");
            return wrapped.parse(stream, value, err);
        }
    }
    return new Dynamic();
}


Forward reference

And the second bit - wrapping any parser as dynamic

auto wrap(Parser)(Parser parser){
    alias V = ParserValue!Parser;
    static class Wrapped: DynamicParser!V {
        Parser p;

        this(Parser p){
            this.p = p;
        }

        bool parse(ref Stream stream, ref V value, ref Stream.Error err) const {
            return p.parse(stream, value, err);
        }
    }
    return new Wrapped(parser);
}

This may raise a valid concern about performance

Practical example

auto calc(){
    with(parsers!string) {
        auto expr = dynamic!int;
        auto primary = any(
            range!('0', '9').rep.map!(x => x.to!int),
            seq(tk!'(', expr, tk!')').map!(x => x[1])
        );
        auto term = dynamic!int;
        term = any(
            seq(primary, tk!'*', term).map!(x => x[0] * x[2]),
            seq(primary, tk!'/', term).map!(x => x[0] / x[2]),
            primary
        );
        expr = any(
            seq(term, tk!'+', expr).map!(x => x[0] + x[2]),
            seq(term, tk!'-', expr).map!(x => x[0] - x[2]),
            term
        );
        return expr;
    }
}
unittest {
	assert("2+4*(2+3)".parse(calc) == 22);
}

A simple arithmetic expression parser

Perf Consideration

A subtle problem: e.g. the following parser will call 'term' 3 times on the expression "42"

expr = any(
            seq(term, tk!'+', expr).map!(x => x[0] + x[2]),
            seq(term, tk!'-', expr).map!(x => x[0] - x[2]),
            term
        );

Each of those calls in turn invokes 'primary' 3 times:

term = any(
            seq(primary, tk!'*', term).map!(x => x[0] * x[2]),
            seq(primary, tk!'/', term).map!(x => x[0] / x[2]),
            primary
        );

In total the simple digit string is parsed 9 times!

Something went better than expected

Solutions

Packrat parsing approaches the problem in its full generality (a rough sketch follows below):

  • memoize each recursive call and the respective position in the input
  • each time, check the cache instead of calling the parser again

A truly academic achievement: O(n) parsing, but in O(n) space

In the real world not a single hand-written parser does this

...yet they don't degrade to exponential behavior

They do the unthinkable - they simply don't repeat the same work when it is shared between alternatives
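
A rough standalone illustration of the packrat idea (not Pry code, all names hypothetical): a recursive-descent rule over a string caches its outcome per input offset, so alternatives never re-run it at the same position.

import std.typecons : Tuple, tuple;

struct Memo {
    // (rule name, input offset) -> (success, value, offset after the match)
    Tuple!(bool, int, size_t)[Tuple!(string, size_t)] table;
}

// parse a run of digits starting at pos, consulting the memo table first
bool digits(string input, ref size_t pos, ref int value, ref Memo memo) {
    auto key = tuple("digits", pos);
    if(auto hit = key in memo.table) {
        value = (*hit)[1];
        pos = (*hit)[2];
        return (*hit)[0];
    }
    size_t start = pos;
    int acc = 0;
    while(pos < input.length && input[pos] >= '0' && input[pos] <= '9') {
        acc = acc * 10 + (input[pos] - '0');
        pos++;
    }
    bool ok = pos > start;
    memo.table[key] = tuple(ok, acc, pos);
    value = acc;
    return ok;
}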

Merging prefixes

The idea is to detect the following pattern:

auto x = any(
    seq(Prefix, Suffix1).map!(...),
    seq(Prefix, Suffix2).map!(...),
    ...
    Prefix.map!(...)  // potentially the lone prefix on its own
);

Conceptually transform it into:

auto x = seq(Prefix, any(
        Suffix1,
        Suffix2,
        ...
        Epsilon // empty parser
    )
).map!(...);

Can't do it literally like that, since map contains arbitrary code

Takes a bit of meta-programming - needs those unique types

Performance

Simple arithmetic expressions (looong ones)

  Kind           Time, ms   LOCs
  Handwritten    57         92
  Pry            67         23

JSON parsing of a ~33KB RPC message:

  Kind             Time, µs   LOCs
  std.json         1098       326
  stdx.data.json   688        ~1600*
  Pry              769        86

* after cutting out multi-line comments, unittests, etc.

Going to Grammar

Taking a page from the Pegged project here

unittest {
  mixin(grammar(`
    calc:
      expr : int <- 
        (term '+' expr) { return it[0] + it[2]; } 
        / (term '-' expr) { return it[0] - it[2]; }
        / term ;
      term : int <-
        (primary '*' term) { return it[0] * it[2]; }
        / (primary '/' term) { return it[0] / it[2]; }
        / primary ;
      primary <- 
        [0-9]+ { return to!int(it); } 
        / :'(' expr :')';
    `));
  assert(" ( 2 + 4) * 2".parse(calc) == 12);
}

1. Need to run a full parser of the PEG grammar at compile-time

2. Generate the appropriate sequence of calls to combinators

Parsing the PEG

Need to build a PEG parser that runs at compile-time

It produces an AST that is also processed at compile-time

Regex character classes are reused from std.regex*

Actually works!

Same combinators API utilized

~200LOCs with tests and such

*The pull request is still hanging in the queue

Tracking dependencies

PEG rules basically form a directed graph of dependencies

Need to establish an order of code generation

Some rules will have to be forward-referenced

[Diagram: dependency graph of rules A, B, C, D, E]

Tracking dependencies #2

Do a topological sort at compile-time (simpler than it sounds - see the sketch below)

Each detected cycle is broken; the back edge becomes a dynamic reference

[Diagram: the same rules A-E numbered 1-5 in traversal order; the broken back edge is marked Dynamic]

The generation order is by priority: 5-4-3-2-1

The codegen that follows is straightforward
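
A rough CTFE-friendly sketch of the idea (not Pry's actual code, names hypothetical): a DFS over the rule dependency graph emits each rule after its dependencies and records back edges - the references that would be generated as dynamic.

struct Sorted {
    size_t[] order;            // emit rules in this order, dependencies first
    size_t[2][] dynamicEdges;  // back edges broken by a dynamic forward reference
}

Sorted topoSort(const size_t[][] deps) {
    enum { unvisited, inProgress, done }
    auto state = new int[](deps.length);
    Sorted result;

    void visit(size_t n) {
        if(state[n] == done) return;
        state[n] = inProgress;
        foreach(d; deps[n]) {
            if(state[d] == inProgress) {   // cycle detected - break it here
                size_t[2] edge = [n, d];
                result.dynamicEdges ~= edge;
            }
            else if(state[d] == unvisited)
                visit(d);
        }
        state[n] = done;
        result.order ~= n;
    }
    foreach(n; 0 .. deps.length) visit(n);
    return result;
}

// runs at compile time thanks to CTFE
enum size_t[][] graph = [[1], [2], [0], []];  // 0 -> 1 -> 2 -> 0, plus a lone rule 3
enum sorted = topoSort(graph);
static assert(sorted.dynamicEdges.length == 1);  // the 2 -> 0 edge becomes dynamic
// sorted.order is [2, 1, 0, 3]: dependencies come before their dependents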

Open problems

To skip whitespace or not to skip whitespace - the current approach needs more thought

Error messages are not very helpful yet

Combinators tend to produce 10K+ byte long symbols, so when something goes wrong stack traces are unhelpful

Left recursion is neither detected nor supported - it will happily cause a stack overflow with a horrible stack trace

Future directions

On the combinators API:

  • Want to support "parsing" binary formats in the same fashion
  • Support allocators for "array" and "aa"

The grammar module is in very early development and still needs:

  • Proper type-checking with user-friendly errors
  • Left-recursion detection, maybe supporting it(?)
  • The ability to auto-generate sensible AST classes

Provide more real-world examples!

Document all things!

That's it!

Stay pragmatic

and get involved on Github

https://github.com/DmitryOlshansky/pry
