New Generation Compiler

by Alexander Tsepkov

10/20/2016

Beyond TypeScript, Babel, and RapydScript

What's a Compiler?

A compiler takes code written in one language and converts it into code in another language, optionally optimizing it.

There are Many Different Compilers

  • Human-readable code to machine code (or bytecode)
    • C/C++, Java, Go
  • One human-readable language to another human-readable language
    • RapydScript, CoffeeScript, Haxe, TypeScript
  • Same-language compiler
    • Babel, UglifyJS, 2to3

What's in a Compiler?

[Diagram: Input → Lexer → Parser → AST → Output stage → Output]

There can be more stuff, but it's not necessary

[Diagram: Input → Lexer → Parser → AST → Output stage → Output, with optional extras: Transformer, Linter, Optimizer, SourceMaps, REPL]

What's in a Compiler?

  • Lexer: splits text into tokens (words)
  • Parser: combines tokens into nodes (sentences)
  • AST: Abstract Syntax Tree (essay)
  • Output: prints AST back in new format (translator)

Some purists will separate transformer and output; I did not.

Sometimes you'll see "tokenizer" instead of "lexer"; they're similar.

Lexer vs Tokenizer

Mary had a little lamb

  • Tokenizer (think regex): token, token, token, token
  • Lexer (think syntax highlighter): verb, proper noun, adjective, noun

Lexer

function Person() {
  this.name = 'Jane';
}

{
  type: 'keyword',
  value: 'function',
  start: [0,0],
  end: [0,7]
}
{
  type: 'string',
  value: 'Jane',
  start: [1,14],
  end: [1,19]
}
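
As a rough illustration of how tokens like the ones above could be produced, here is a minimal lexer sketch. The pattern table, the keyword set, and the `lex` function name are assumptions made for this example, not taken from any real compiler; positions are [line, column] pairs with an inclusive end, as on the slide.

```javascript
// Hypothetical keyword list; a real lexer would define every keyword.
const KEYWORDS = new Set(['function', 'this', 'var', 'return']);

// Ordered token patterns. The 'y' (sticky) flag anchors each match at lastIndex.
const PATTERNS = [
  ['whitespace', /\s+/y],
  ['string', /'[^'\n]*'/y],
  ['name', /[A-Za-z_$][\w$]*/y],
  ['number', /\d+/y],
  ['punc', /[(){};.=]/y],
];

function lex(source) {
  const tokens = [];
  let index = 0, line = 0, col = 0;
  while (index < source.length) {
    let matched = null;
    for (const [type, regex] of PATTERNS) {
      regex.lastIndex = index;
      const m = regex.exec(source);
      if (m) { matched = { type, text: m[0] }; break; }
    }
    if (!matched) throw new SyntaxError(`Unexpected character at ${line}:${col}`);
    const start = [line, col];
    // Advance the line/column counters over the matched text.
    for (const ch of matched.text) {
      if (ch === '\n') { line += 1; col = 0; } else { col += 1; }
    }
    if (matched.type !== 'whitespace') {
      let { type, text } = matched;
      let value = text;
      if (type === 'name' && KEYWORDS.has(text)) type = 'keyword';
      if (type === 'string') value = text.slice(1, -1); // drop the quotes
      // Non-whitespace tokens never span lines here, so the inclusive end is one column back.
      tokens.push({ type, value, start, end: [line, col - 1] });
    }
    index += matched.text.length;
  }
  return tokens;
}
```

Run on the Person example, this sketch yields a keyword token for `function` spanning [0,0] to [0,7] and a string token for `Jane` spanning [1,14] to [1,19].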

Parser

function Person() {
  this.name = 'Jane';
}

Node {
  type: 'function',
  name: 'Person',
  arguments: [],
  block: [
    Node {
      type: 'assign',
      left: ...,
      right: ...
    }
  ]
}
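
To make the step from tokens to nodes concrete, here is a small recursive-descent sketch that builds a node of the shape shown above. The token list is written by hand, and every function name (`parse`, `parseFunction`, `parseDot`, and so on) is hypothetical; it handles only the Person example and similar input, and it compares tokens by value only.

```javascript
// Hand-written tokens for: function Person() { this.name = 'Jane'; }
const personTokens = [
  { type: 'keyword', value: 'function' }, { type: 'name', value: 'Person' },
  { type: 'punc', value: '(' }, { type: 'punc', value: ')' },
  { type: 'punc', value: '{' },
  { type: 'keyword', value: 'this' }, { type: 'punc', value: '.' },
  { type: 'name', value: 'name' }, { type: 'punc', value: '=' },
  { type: 'string', value: 'Jane' }, { type: 'punc', value: ';' },
  { type: 'punc', value: '}' },
];

function parse(tokens) {
  let pos = 0;
  const peek = () => tokens[pos];
  const next = () => tokens[pos++];
  const expect = (value) => {
    const tok = next();
    if (!tok || tok.value !== value) throw new SyntaxError(`Expected "${value}"`);
    return tok;
  };

  function parseFunction() {
    expect('function');
    const name = next().value;
    expect('(');
    expect(')'); // this sketch only handles an empty argument list
    expect('{');
    const block = [];
    while (peek() && peek().value !== '}') block.push(parseStatement());
    expect('}');
    return { type: 'function', name, arguments: [], block };
  }

  function parseStatement() {
    const tok = peek();
    if (tok.type === 'keyword' && tok.value === 'function') return parseFunction();
    // Anything else is treated as an assignment: <dot expression> = <atom>;
    const left = parseDot();
    expect('=');
    const right = parseAtom();
    expect(';');
    return { type: 'assign', left, right };
  }

  function parseDot() {
    let node = parseAtom();
    while (peek() && peek().value === '.') {
      next(); // consume the dot
      node = { type: 'dot', object: node, property: next().value };
    }
    return node;
  }

  function parseAtom() {
    const tok = next();
    if (!tok) throw new SyntaxError('Unexpected end of input');
    if (tok.type === 'string') return { type: 'string', value: tok.value };
    if (tok.type === 'name' || tok.value === 'this') return { type: 'name', value: tok.value };
    throw new SyntaxError(`Unexpected token "${tok.value}"`);
  }

  return parseStatement();
}
```

Calling `parse(personTokens)` returns a `function` node named `Person` whose `block` holds a single `assign` node. Note how `parseStatement` already has to know about every other parsing function it might call.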

AST

function Person() {
  this.name = 'Jane';
}

body
  function
    arguments
    body
      assign
        left
          dot
            object
            property
        right
          string

Output

function Person() {
  this.name = 'Jane';
}

class Person {
  constructor() {
    this.name = 'Jane';
  }
}

What I Noticed While Working with Compilers

  • The lexer/tokenizer is usually straightforward, although it does require defining every keyword, every operator, etc.
  • The AST is straightforward and often reads like a W3C spec.
  • Output is straightforward (but often repetitive to write), and you'll typically find a one-to-one mapping between print functions and AST nodes.
  • The parser is a can of worms...

Parsers...

  • Most parsers today are recursive-descent parsers.
  • They do not have a one-to-one mapping to AST because a single function can form multiple AST nodes or none at all.
  • They are full of long if/else chains and switch statements.
  • Whenever I add a new recursive call to another parser function, I'm only about 80% sure I'm calling the correct function.

This sounds scary until you realize that, because of how intertwined the logic is, chances are you'll still end up in the same place. You may do an extra hop or two, and you may run into an occasional bug.

Parsers...

Consider this example...

function makeFunc() {
  var name = "Mo" + "zilla";
  function displayName() {
    alert(name);
  }
  return displayName;
}

var myFunc = makeFunc();
myFunc();

Forget the parent function, I'm parsing this one now... until I get distracted by something else

Parsers...

...
case "function":
   return function_(AST_Defun);

case "if":
   return if_();

case "return":
   if (S.in_function == 0 && !options.bare_returns)
       croak("SyntaxError: 'return' outside of function");
...

This is actual UglifyJS source; there are about 50 cases in this switch statement.

Building a Better Parser (Attendance Analogy)

  • Teacher calls out a name
  • Student raises their hand
  • Teacher marks the student as present, moves on

Now imagine the attendance sheet was printed by a printer that couldn't handle newlines, spaces, or capital letters:

alexsmithjohndoehomersimpsonadalovelacebillgatesstevejobspaulallenstevewozniak

How would we take attendance then?

  • Ask everyone whose name starts with A to put their hand up.
  • Of those people, ask everyone whose second letter is not L to put their hand down.
  • By the 3rd letter, you shouldn't have more than a couple of students with their hand up.
  • By the time you're done reading the name, you've identified a single student who has it.

AST Attendance

  • Coroutines! (ES6 generators)
  • Each token is a letter.
  • AST nodes that can still form a coherent statement keep their "hand" up; AST nodes that can't return false.
  • For each sequence of tokens, one of 3 events will eventually occur:
    • Exactly one coroutine ran to completion (generated AST node)
    • All coroutines returned false (Syntax Error)
    • Multiple coroutines ran to completion (Ambiguity Error)
  • Nodes no longer care about logic in other nodes, just like students don't need to know names of other students.
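
A minimal sketch of this idea, using ES6 generators as the coroutines. The two node factories, the token shapes, and the `raceFactories` driver are all assumptions made for illustration; nested nodes are deliberately left out.

```javascript
// Each node factory is a generator: it receives one token per next() call,
// yields to ask for another token, and returns a node or false.
function* assignFactory() {
  const name = yield;
  if (name.type !== 'name') return false;
  const op = yield;
  if (op.value !== '=') return false;
  const value = yield;
  if (value.type !== 'number') return false;
  return { type: 'assign', left: name.value, right: value.value };
}

function* callFactory() {
  const name = yield;
  if (name.type !== 'name') return false;
  const open = yield;
  if (open.value !== '(') return false;
  const close = yield;
  if (close.value !== ')') return false;
  return { type: 'call', callee: name.value };
}

function raceFactories(tokens, factories) {
  // Start every coroutine and advance it to its first yield ("hands up").
  let running = factories.map((factory) => {
    const coroutine = factory();
    coroutine.next();
    return coroutine;
  });
  const finished = [];
  for (const token of tokens) {
    const stillRunning = [];
    for (const coroutine of running) {
      const step = coroutine.next(token);
      if (!step.done) stillRunning.push(coroutine);        // hand stays up
      else if (step.value !== false) finished.push(step.value); // produced a node
      // a returned false means the hand went down; the coroutine is dropped
    }
    running = stillRunning;
  }
  if (finished.length === 1) return finished[0];
  if (finished.length === 0) throw new SyntaxError('No node matched the tokens');
  throw new Error('Ambiguity: multiple nodes matched the tokens');
}
```

Neither factory refers to the other; the driver alone decides which of the three outcomes occurred.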

Pros

  • Parsing logic for each node is completely independent.
    • easier to maintain
    • easier to extend
    • easier to debug
  • With independent input AND output, we can split compiler by AST nodes rather than by stages (macros that put sweet.js to shame).
  • Compiler can be parallelized.

Cons

  • Redundancy: the same logic can be repeated in multiple coroutines.
  • A bit more overhead.

This isn't a Slam-Dunk Solution

  • We need to spawn new coroutines while existing ones are processing (nested nodes).
  • We need to know not just WHICH coroutine can start with this token, but which coroutine will produce a node that the parent will allow.
  • We need to be able to pause the current coroutine (skip its turn without terminating it) to handle nested nodes.
  • We need to be able to start a new copy of a coroutine while the old one is paused.

Addressing Issues

The first 3 problems can be solved by reusing the AST spec and having the coroutine itself pass the baton to another coroutine when it reaches a position where a node is expected:

class Class(Scope):
    properties = {
        name: "[SymbolDeclaration?] the name of this class",
        init: "[Function] constructor for the class",
        parent: "[Class?] parent class this class inherits from",
        static: "[string*] list of static methods",
        external: "[boolean] true if class is declared elsewhere, but within current scope at runtime",
        decorators: "[Decorator*] function decorators, if any",
        module_id: "[string] The id of the module this class is defined in",
        statements: "[Node*] list of statements in the class scope (excluding method definitions)",
    }

Addressing Issues

The 4th problem is more challenging, but recursive generators may be up to the task:

function *doStuff() {
  yield 1;
  yield 2;
  yield *doStuff();
}
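
As a hedged illustration of how `yield*` lets a paused coroutine hand the token stream to a fresh copy of itself, here is a sketch that parses nested array tokens. The `arrayFactory` generator, the `parseArray` driver, and the use of bare strings and numbers as tokens are assumptions for this example only.

```javascript
// A node factory for array literals. It assumes the opening '[' was already
// consumed by whoever started it.
function* arrayFactory() {
  const items = [];
  while (true) {
    const token = yield; // wait for the next token
    if (token === ']') return items;
    if (token === '[') {
      // Pause this copy and delegate incoming tokens to a new copy of the
      // same generator; yield* evaluates to the inner copy's return value.
      items.push(yield* arrayFactory());
    } else {
      items.push(token);
    }
  }
}

function parseArray(tokens) {
  if (tokens[0] !== '[') throw new SyntaxError('Expected "["');
  const factory = arrayFactory();
  factory.next(); // advance to the first yield
  for (const token of tokens.slice(1)) {
    const step = factory.next(token);
    if (step.done) return step.value;
  }
  throw new SyntaxError('Unexpected end of input');
}
```

With the tokens `'[', 1, '[', 2, 3, ']', ']'`, the outer copy stays paused while the inner copy consumes `2, 3, ']'`, and the result is the nested array `[1, [2, 3]]`.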

Addressing Issues

Finally, having addressed all parser problems, we can use templates to make output generation cleaner:

def _print(self):
  return `
    @${self.decorators.join('\n@')}
    function ${self.name}(${self.args.join(', ')}) {
      ${self.body.join(';\n')}
    }
  `
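
The same idea can be sketched in plain JavaScript with a template literal. The node shape (`decorators`, `name`, `args`, `body` as arrays of strings) and the `printFunction` name are assumptions for this example; the decorator line simply mirrors the template above and is skipped when the list is empty.

```javascript
// Hypothetical node shape:
// { decorators: string[], name: string, args: string[], body: string[] }
function printFunction(node) {
  // One "@decorator" line per decorator; an empty list prints nothing.
  const decorators = node.decorators.map((d) => `@${d}\n`).join('');
  // Each statement is indented and terminated with a semicolon.
  const body = node.body.map((stmt) => `  ${stmt};\n`).join('');
  return `${decorators}function ${node.name}(${node.args.join(', ')}) {\n${body}}`;
}
```

For a node with the name `Person`, no arguments, no decorators, and the single body statement `this.name = 'Jane'`, this prints the three-line `Person` function used throughout the earlier slides.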

Putting it All Together

  • Each AST node is completely independent and can be maintained by a single individual without knowing the rest of the compiler.
  • The main thread would broadcast each token to every AST node factory that's listening.
  • The resulting node would end up back in the main thread.
  • Each node would come with independent generation and output logic.

Typical AST Node

AST node:

  • Parser coroutine (node factory)
  • AST template
  • Output template
  • Transformer (developers can include independent transformers/optimizers)

A new type of compiler

[Diagram: the Main Thread sends Tokens to the Node Factories]

A new type of compiler

[Diagram: the Main Thread (lexer / AST) sends the token stream to Bob's AST Node, Jane's AST Node, Mary's AST Node, Greg's AST Node, and the Array Node]

  • Bob added support for functions
  • Mary added support for the spread operator
  • Greg added support for lambda expressions
  • The Array node is part of the original compiler
  • Neither Greg's nor Jane's node could satisfy the token stream

If you want to help me out with this, reach out:

atsepkov@gmail.com

@atsepkov

github.com/atsepkov
