Tokenizing, Parsing, and Static Analysis


Mike Sherov

twitter / github : @mikesherov

Principal Engineer - SkillShare

Maintainer: Esprima, ESLint, ESTree

How are programs validated for correctness?

  • Unit and Functional Testing
  • Bug reports ;-)
  • Static Analysis - **AST**

What is an AST?

  • Tree structure representing the abstract syntactic structure (but not the semantic meaning) of a given program.
  • Static Analysis tools use an AST to extract meaning.
  • Produced by a parser.

How do parsers turn source code into an AST?

  • Programmers -> Source File
  • Scanner -> Characters
  • Lexer -> Tokens
  • Parser -> **AST**

Parser Deep Dive!

While wearing floaties

Scanner

  • Where am I in the file?
  • What is the next character?
  • What are the next few chars?

Scanner

function Scanner(source) {
  this.source = source;
  this.position = 0;
}

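// look at the current character without consuming it
Scanner.prototype.peek = function() {
  return {
    value: this.source[this.position],
    position: this.position
  };
};

// return the current character and move to the next one
Scanner.prototype.advance = function() {
  var result = this.peek();
  this.position++;

  return result;
};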
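
The third question above ("What are the next few chars?") is not answered by `peek` and `advance` alone. A minimal sketch of one way to cover it, assuming a hypothetical `peekAhead(count)` method that does not appear on the slides:

```javascript
// Scanner as defined on the slide.
function Scanner(source) {
  this.source = source;
  this.position = 0;
}

Scanner.prototype.peek = function() {
  return {
    value: this.source[this.position],
    position: this.position
  };
};

Scanner.prototype.advance = function() {
  var result = this.peek();
  this.position++;
  return result;
};

// Hypothetical helper (not on the slides): look at the next `count`
// characters without consuming them.
Scanner.prototype.peekAhead = function(count) {
  return this.source.slice(this.position, this.position + count);
};

var scanner = new Scanner('a === b');
scanner.advance(); // consumes 'a'
scanner.advance(); // consumes ' '
var nextThree = scanner.peekAhead(3); // reads ahead, position is unchanged
```

A lexer could use a helper like this to decide between `=`, `==`, and `===` before consuming anything.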

Lexer

  • How are strings (string literals) delimited?
  • What counts as whitespace? comments?
  • What are the valid operators and keywords?
  • What is a valid identifier?
  • What are the rules for writing comments?
  • etc.

Lexer

function Lexer(scanner) {
  this.scanner = scanner;
}

Lexer.prototype.getToken = function() {
  var char = this.scanner.advance();

  if (char.value === '+') {
    var operator = '+';

    if (this.scanner.peek().value === '+') {
      operator = '++';
      this.scanner.advance(); // consume the peeked character
    }

    return {
      type: 'punctuator',
      value: operator,
      range: {
        start: char.position,
        end: char.position + operator.length
      }
    };
  }
  ...
}
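
The rest of `getToken` is elided on the slide. As a rough sketch of what a few more token rules could look like, the function below tokenizes a whole string with regular expressions rather than the character-by-character Scanner. The name `tokenize`, the `numeric` and `identifier` token types, and the short punctuator list are all assumptions made for illustration, and only whitespace, integers, identifiers, and a handful of punctuators are handled.

```javascript
// Hypothetical standalone tokenizer; not the Lexer from the slides.
// Longer punctuators come first so that '++' wins over '+'.
var PUNCTUATORS = ['===', '++', '+', '?', ':', ';'];

function tokenize(source) {
  var tokens = [];
  var pos = 0;

  while (pos < source.length) {
    var rest = source.slice(pos);
    var match;

    if ((match = /^\s+/.exec(rest))) {
      pos += match[0].length; // whitespace produces no token
      continue;
    }

    var type;
    var value;
    if ((match = /^[0-9]+/.exec(rest))) {
      type = 'numeric';
      value = match[0];
    } else if ((match = /^[A-Za-z_$][A-Za-z0-9_$]*/.exec(rest))) {
      type = 'identifier';
      value = match[0];
    } else {
      value = PUNCTUATORS.find(function(p) { return rest.startsWith(p); });
      if (value === undefined) {
        throw new Error('unexpected character at ' + pos + ': ' + rest[0]);
      }
      type = 'punctuator';
    }

    tokens.push({
      type: type,
      value: value,
      range: { start: pos, end: pos + value.length }
    });
    pos += value.length;
  }

  return tokens;
}

var tokens = tokenize('hasAnswer ? 42 : 0;');
```

Each token keeps the same `type`, `value`, and `range` shape as the punctuator token returned above, so a parser can treat them uniformly.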

Parser

  • What is the syntactic structure of a variable declaration?
  • Is it legal to have a try statement without a following block statement?
  • What tokens go in what order?

Parser

function Parser(lexer) {
  this.lexer = lexer;
  this.lex();
}

// advances to the next token
Parser.prototype.lex = function() {
  this.cur = this.lexer.getToken();
};

// token comparison
Parser.prototype.match = function(value) {
  return this.cur.value === value && 
         this.cur.type === 'punctuator';
};

// fails if the current token is not `value`,
// otherwise advances past it
Parser.prototype.expect = function(value) {
  if (!this.match(value)) {
    throw new Error('unexpected token: ' + this.cur.value);
  }
  this.lex();
};

Parser

Parser.prototype.parseCondExpr = function() {
  var expr = this.parseBinaryExpr();
  
  // is current token a `?`
  if (this.match('?')) {
    // advance past `?`
    this.lex();

    var consequent = this.parseAssignmentExpression();
    // move past :, but fail if it's not :
    this.expect(':');
    var alternate = this.parseAssignmentExpression();

    return {
      type: 'ConditionalExpression',
      test: expr,
      consequent: consequent,
      alternate: alternate
    };
  }

  return expr;
};

Grammar (EBNF)

CondExpr :
  BinaryExpr
  BinaryExpr ? AssignmentExpr : AssignmentExpr

Terminology

<test> ? <consequent> : <alternate>
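
To see the parser pieces run end to end, here is a self-contained sketch that feeds a hand-built token array through `parseCondExpr`. Several parts are assumptions, not taken from the slides: `ArrayLexer` is a hypothetical stand-in for the Lexer, `parseBinaryExpr` and `parseAssignmentExpression` are stubs that accept only a single identifier or number, the `identifier`, `numeric`, and `eof` token types are invented, and `expect` checks the current token before moving past it.

```javascript
// Hypothetical stand-in for the Lexer: hands out pre-built tokens.
function ArrayLexer(tokens) {
  this.tokens = tokens;
  this.index = 0;
}

ArrayLexer.prototype.getToken = function() {
  // past the end, keep returning an end-of-file token
  return this.tokens[this.index++] || { type: 'eof', value: '<eof>' };
};

function Parser(lexer) {
  this.lexer = lexer;
  this.lex();
}

Parser.prototype.lex = function() {
  this.cur = this.lexer.getToken();
};

Parser.prototype.match = function(value) {
  return this.cur.type === 'punctuator' && this.cur.value === value;
};

// Assumption: check the current token first, then move past it.
Parser.prototype.expect = function(value) {
  if (!this.match(value)) {
    throw new Error('unexpected token: ' + this.cur.value);
  }
  this.lex();
};

// Stub: only single identifiers and numbers, no operators.
Parser.prototype.parseBinaryExpr = function() {
  var token = this.cur;
  this.lex();
  if (token.type === 'identifier') {
    return { type: 'Identifier', name: token.value };
  }
  if (token.type === 'numeric') {
    return { type: 'Literal', value: Number(token.value), raw: token.value };
  }
  throw new Error('unexpected token: ' + token.value);
};

// Stub: an assignment expression is just a conditional expression here.
Parser.prototype.parseAssignmentExpression = function() {
  return this.parseCondExpr();
};

Parser.prototype.parseCondExpr = function() {
  var expr = this.parseBinaryExpr();
  if (this.match('?')) {
    this.lex(); // advance past `?`
    var consequent = this.parseAssignmentExpression();
    this.expect(':');
    var alternate = this.parseAssignmentExpression();
    return {
      type: 'ConditionalExpression',
      test: expr,
      consequent: consequent,
      alternate: alternate
    };
  }
  return expr;
};

// Tokens for `hasAnswer ? 42 : 0`
var ast = new Parser(new ArrayLexer([
  { type: 'identifier', value: 'hasAnswer' },
  { type: 'punctuator', value: '?' },
  { type: 'numeric', value: '42' },
  { type: 'punctuator', value: ':' },
  { type: 'numeric', value: '0' }
])).parseCondExpr();
```

The resulting `ast` has the `test`, `consequent`, and `alternate` fields named in the terminology above.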

Esprima

A fast recursive descent JavaScript parser

https://github.com/jquery/esprima

Finally, an AST!

hasAnswer ? 42 : 0;

{
  "type": "Program",
  "body": [
    {
      "type": "ExpressionStatement",
      "expression": {
        "type": "ConditionalExpression",
        "test": {
          "type": "Identifier",
          "name": "hasAnswer"
        },
        "consequent": {
          "type": "Literal",
          "value": 42,
          "raw": "42"
        },
        "alternate": {
          "type": "Literal",
          "value": 0,
          "raw": "0"
        }
      }
    }
  ]
}

Static Analysis

Disallowing Yoda Conditions

if (900 === yearsOldYouReach) {
  lookAsGood = youWillNot;
}

How do you describe this in a way that uses AST terminology?

Static Analysis

Disallowing Yoda Conditions

{
    "type": "BinaryExpression",
    "operator": "===",
    "left": {
        "type": "Literal",
        "value": 900,
        "raw": "900"
    },
    "right": {
        "type": "Identifier",
        "name": "yearsOldYouReach"
    }
}

A BinaryExpression whose operator is an equality operator, whose left-hand side is a literal, and whose right-hand side is not.

ESLint

A JavaScript Linter

https://github.com/eslint/eslint

ESLint Visitor Pattern

Calls callbacks for each visited node

module.exports = {
  create(context) {
    // the returned object maps node types to callbacks
    return {
      BinaryExpression(node) {
        // will be called once for each
        // binary expression in the tree
      }
    };
  }
};
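
As a rough sketch of how a linter could drive such callbacks, the code below walks a tree depth-first and calls the visitor method named after each node's `type`. The `traverse` function is hypothetical and is not ESLint's implementation; ESLint's real traversal is more involved and also supports exit callbacks and selectors.

```javascript
// Hypothetical depth-first walker; not ESLint's actual implementation.
function traverse(node, visitor) {
  if (typeof visitor[node.type] === 'function') {
    visitor[node.type](node);
  }

  Object.keys(node).forEach(function(key) {
    var child = node[key];
    var children = Array.isArray(child) ? child : [child];

    children.forEach(function(item) {
      // anything with a string `type` property is treated as an AST node
      if (item && typeof item.type === 'string') {
        traverse(item, visitor);
      }
    });
  });
}

// AST for `900 === yearsOldYouReach`, wrapped in a Program.
var program = {
  type: 'Program',
  body: [{
    type: 'ExpressionStatement',
    expression: {
      type: 'BinaryExpression',
      operator: '===',
      left: { type: 'Literal', value: 900, raw: '900' },
      right: { type: 'Identifier', name: 'yearsOldYouReach' }
    }
  }]
};

// Record which callbacks fire, and in what order.
var visited = [];
traverse(program, {
  BinaryExpression: function(node) { visited.push(node.operator); },
  Identifier: function(node) { visited.push(node.name); }
});
```

Node types without a matching callback (`Program`, `ExpressionStatement`, `Literal` here) are still walked, they just trigger nothing.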

ESLint context.report

API for reporting errors

context.report({
  node,
  message: "Expected literal to be on the right side of {{operator}}.",
  data: {
    operator: node.operator
  },
  fix: fixer => fixer.replaceText(node, getFixedString(node))
});
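
`getFixedString` is not defined on the slides. One possible sketch is below, assuming every node carries an Esprima-style `range: [start, end]` pair of offsets into the source text. The extra `source` parameter and the `flipped` table are assumptions added for illustration, and the sketch ignores comments and parentheses around the operands.

```javascript
// Operators whose direction must flip when the operands swap.
var flipped = { '<': '>', '>': '<', '<=': '>=', '>=': '<=' };

// Hypothetical: `source` is the full program text, and every node has an
// Esprima-style `range: [start, end]`.
function getFixedString(node, source) {
  var left = source.slice(node.left.range[0], node.left.range[1]);
  var right = source.slice(node.right.range[0], node.right.range[1]);
  var operator = flipped[node.operator] || node.operator;

  return right + ' ' + operator + ' ' + left;
}

var source = '900 === yearsOldYouReach';
var node = {
  type: 'BinaryExpression',
  operator: '===',
  range: [0, 24],
  left: { type: 'Literal', value: 900, raw: '900', range: [0, 3] },
  right: { type: 'Identifier', name: 'yearsOldYouReach', range: [8, 24] }
};

var fixed = getFixedString(node, source);
```

Equality operators read the same in both directions, so only the relational operators need an entry in `flipped`.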

ESLint: putting it all together

disallowYodaConditions

const comparators = new Set(['==', '===', '!=', '!==', '>', '<', '>=', '<=']);

module.exports = {
  create(context) {
    // the returned object maps node types to callbacks
    return {
      BinaryExpression(node) {
        // a comparison with a literal on the left is a Yoda condition
        if (comparators.has(node.operator) && node.left.type === 'Literal') {
          context.report({
            node,
            message: "Expected literal to be on the right side of {{operator}}.",
            data: {
              operator: node.operator
            }
          });
        }
      }
    };
  }
};
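
To try the rule without installing ESLint, the sketch below calls its `create` function with a stub `context` that only records what gets reported. The `stubContext` object and the hand-built nodes are hypothetical, and the rule is restated with `create` returning the visitor object; in a real project, ESLint's own `RuleTester` would be the usual tool.

```javascript
var comparators = new Set(['==', '===', '!=', '!==', '>', '<', '>=', '<=']);

var rule = {
  create: function(context) {
    return {
      BinaryExpression: function(node) {
        if (comparators.has(node.operator) && node.left.type === 'Literal') {
          context.report({
            node: node,
            message: 'Expected literal to be on the right side of {{operator}}.',
            data: { operator: node.operator }
          });
        }
      }
    };
  }
};

// Hypothetical stub: collects whatever the rule reports.
var reports = [];
var stubContext = {
  report: function(descriptor) { reports.push(descriptor); }
};

var visitor = rule.create(stubContext);

// `900 === yearsOldYouReach`: literal on the left, should be reported
visitor.BinaryExpression({
  type: 'BinaryExpression',
  operator: '===',
  left: { type: 'Literal', value: 900, raw: '900' },
  right: { type: 'Identifier', name: 'yearsOldYouReach' }
});

// `yearsOldYouReach === 900`: literal on the right, should pass
visitor.BinaryExpression({
  type: 'BinaryExpression',
  operator: '===',
  left: { type: 'Identifier', name: 'yearsOldYouReach' },
  right: { type: 'Literal', value: 900, raw: '900' }
});
```

After both calls, `reports` holds a single entry for the Yoda-style comparison.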

RESOURCES

What Questions Do You Have?
