Programming Languages


// How to write a programming language in 60 min or less!


Game Plan



Demystify the unicorns behind programming languages



Game Plan



Step 1: Lexer 
Step 2: Parser
Step 3: Type Checker
Step 4: Interpreter
Step 5: ???
Step 6: Profit!

MOTIVATION





If you don’t know how compilers work, then you don’t know how computers work. If you’re not 100% sure whether you know how compilers work, then you don’t know how they work.

Steve Yegge

Motivation









Step 1: Lexer

one slide summary








A lexer accepts some input text and uses regular expressions to output lexemes (or tokens)

Theory


Nondeterministic Finite Automata


Deterministic Finite Automata

Applied Theory

Deterministic Finite Automata


Regular Expressions

(1*0+11*)+(0(0|1)*) 

Regex Example

Python
>>> email_string = "mike@gmail.com, yadu@yahoo.org, chris@hotmail.edu"
>>> # Say I want to find the domains (e.g., gmail.com) in email_string
>>> import re
>>> re.findall(r'(?<=@)[A-Za-z]+\.(?:com|net|org|edu)', email_string)
['gmail.com', 'yahoo.org', 'hotmail.edu']
Bash
$ cat employees.txt
Name          Phone Number    Undergraduate School
Michael R     202 123 4567    Virginia Tech
Yadu R        703 321 4567    University of Virginia
Chris S       301 987 4567    CalTech
$ grep -o '^[A-z ][A-z ]*[0-9]\{3\} ' employees.txt | sed 's/[A-z ]*//g' | sort -u
202
301
703



A lexer is made up of regexes that take

let x = 5 in {
  print (x + 1)
};
and turn it into "lexemes" (or "tokens")
 LET
 IDENTIFIER x
 EQ
 INTEGER 5
 IN
 LBRACE
 IDENTIFIER print
 LPAREN
 IDENTIFIER x
 PLUS
 INTEGER 1
 RPAREN
 RBRACE
 SEMICOLON

Lexer

Below is a lexer definition (using OCamllex) for our super small language called Imp
{
open Parse
} 

let blank = [' ' '\012' '\r' '\t' '\n']

rule initial = parse
  "/*"  { let _ = comment lexbuf in initial lexbuf }
| "(*"  { let _ = comment2 lexbuf in initial lexbuf }
| "//"  { endline lexbuf }
| blank { initial lexbuf }
| '+'           { PLUS }
| '-'           { MINUS }
| '*'           { TIMES }
| "true"        { TRUE }
| "false"       { FALSE }
| "="           
| "=="          { EQ_TOK }
| "<="          { LE_TOK }
| '!'           { NOT }
| "&&"
| "/\\"         { AND }
| "||"
| "\\/"         { OR }
| "skip"        { SKIP }
| ":="          { SET }
| ';'           { SEMICOLON }
| "if"          { IF }
| "then"        { THEN }
| "else"        { ELSE }
| "while"       { WHILE }
| "do"          { DO }
| "let"         { LET }
| "in"          { IN }
| "print"       { PRINT }

| '('           { LPAREN }
| ')'           { RPAREN } 
| '{'           { LBRACE }
| '}'           { RBRACE } 

| ("0x")?['0'-'9']+ {
  let str = Lexing.lexeme lexbuf in 
  INT((int_of_string str)) }

| ['A'-'Z''a'-'z''_']['0'-'9''A'-'Z''a'-'z''_']* {
  let str = Lexing.lexeme lexbuf in 
  IDENTIFIER(str)
  } 
| '.' 
| eof     { EOF } 
| _       { 
  Printf.printf "invalid character '%s'\n" (Lexing.lexeme lexbuf) ;
  (* this is not the kind of error handling you want in real life *)
  exit 1 }

and comment = parse
      "*/"  { () }
|     '\n'  { comment lexbuf }
|     eof   { Printf.printf "unterminated /* comment\n" ; exit 1 }
|     _     { comment lexbuf }
and comment2 = parse
      "*)"  { () }
|     '\n'  { comment2 lexbuf }
|     "(*"  { (* ML-style comments can be nested *) 
              let _ = comment2 lexbuf in comment2 lexbuf }
|     eof   { Printf.printf "unterminated (* comment\n" ; exit 1 }
|     _     { comment2 lexbuf }
and endline = parse
        '\n'      { initial lexbuf}
| _               { endline lexbuf}
|       eof       { EOF }







Step 2: Parser

ONE slide summary








A parser accepts tokens and uses a context-free grammar to build and output an abstract syntax tree.



Quick Exercise!



Write a regex that will determine if a string contains balanced parentheses

e.g.

 balanced( "(())" ) #=> should return true
 balanced( "))((" ) #=> should return false
 balanced( "())"  ) #=> should return false
 balanced( "(()"  ) #=> should return false
 balanced( "(()()(())((((((()))))))())" ) #=> should return true

Bad news


You can't.

I lied.

So sorry.



We need something better, faster*, stronger than a regex.

*Maybe not faster...

Theory

Context Free Grammar

They follow the form

 A → B

where 
A is a single nonterminal symbol
B is a string of nonterminals, terminals, or the empty string ε


Balanced parentheses solution:

 S → ( S ) | S S | ε

Applied Theory

a.k.a., why do we care about this?

Easiest (and most contrived) example - an in-fix calculator:

 E → E + E | E - E | E * E | E / E | ( E ) | INTEGER

Programming languages usually follow a context free grammar*


Consider the following snippet of JavaScript
 function hello(x) { // <-- matched parentheses, matched braces
     return 3;
 } // <--

The grammar for a small, one arg JS function might look like

 Function  → function IDENTIFIER ( IDENTIFIER ) { Body }
 Body      → Statement ; Body | ε
 Statement → return Expression | ...

*Not every construct in a language is context-free - e.g., a method must be invoked with arguments of the proper type - so we really can't say all languages follow a CFG. Some are context-sensitive, while others are Turing Complete to parse; C++ is one of the latter.

Ambiguity


These sentences are grammatically correct, but ambiguous (or at least very hard to parse):

Cheryl gave Jane her notes.
James while John had had had had had had had had had had had a better effect on the teacher.
Buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo.



nonsense


Colorless green ideas sleep furiously


Noam Chomsky




Abstract syntax tree
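The Imp AST (what the parser below produces and the interpreter consumes) is not spelled out on this slide. As a sketch - the real Imp module may differ slightly - its OCaml definition, with constructor names matching the parser actions and interpreter cases below, would look like:

(* A sketch of the Imp AST module (Imp). 'loc' and 'n' follow the
 * names used by the interpreter's state type later on. *)
type loc = string                 (* variable names / memory locations *)
type n   = int                    (* integer values *)

type aexp =                       (* arithmetic expressions *)
  | Const of n
  | Var   of loc
  | Add   of aexp * aexp
  | Sub   of aexp * aexp
  | Mul   of aexp * aexp

type bexp =                       (* boolean expressions *)
  | True
  | False
  | EQ  of aexp * aexp
  | LE  of aexp * aexp
  | Not of bexp
  | And of bexp * bexp
  | Or  of bexp * bexp

type com =                        (* commands *)
  | Skip
  | Set   of loc * aexp
  | Seq   of com * com
  | If    of bexp * com * com
  | While of bexp * com
  | Let   of loc * aexp * com
  | Print of aexp

For example, let x = 5 in { print (x + 1) } parses to the tree Let ("x", Const 5, Print (Add (Var "x", Const 1))).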





Parser



Using the definitions above and some help from a yacc (yet another compiler compiler) tool, we can easily write parsers for both the calculator and that subset of JS (but we'll opt for our small language Imp).

Calculator Parser

Written using Jison, JavaScript's version of Bison - an improvement of yacc.
%start expressions
%%
expressions
    : e EOF
        {return $1;}
    ;

e
    : e '+' e
        {$$ = $1+$3;}
    | e '-' e
        {$$ = $1-$3;}
    | e '*' e
        {$$ = $1*$3;}
    | e '/' e
        {$$ = $1/$3;}
    | '(' e ')'
        {$$ = $2;}
    | INTEGER
        {$$ = Number(yytext);}
    ;
Not shown: operator precedence and associativity
e.g. Does X + Y * Z mean (X + Y) * Z or X + (Y * Z)?


Parser

Below is a parser definition (using OCamlyacc) for our super small language called Imp
%{
open Imp		    

let error msg	= failwith msg
%}

%token <string>         IDENTIFIER
%token <int>            INT

%token PLUS 
%token MINUS 
%token TIMES 
%token TRUE
%token FALSE
%token EQ_TOK
%token LE_TOK
%token NOT
%token AND
%token OR 
%token SKIP
%token SET 
%token SEMICOLON
%token IF
%token THEN
%token ELSE
%token WHILE
%token DO 
%token LET
%token IN
%token PRINT
%token LPAREN
%token RPAREN
%token LBRACE
%token RBRACE

%token EOF

%start com
%type <Imp.com> com

%left AND
%left OR
%left PLUS MINUS
%left TIMES
%left LE_TOK EQ_TOK
%nonassoc NOT

%%

aexp : INT                                   { Const($1) }
| IDENTIFIER                                 { Var($1) }
| aexp PLUS aexp                             { Add($1,$3) } 
| aexp MINUS aexp                            { Sub($1,$3) } 
| aexp TIMES aexp                            { Mul($1,$3) } 
| LPAREN aexp RPAREN                         { $2 } 
;

bexp : TRUE                                  { True }
| FALSE                                      { False }
| aexp EQ_TOK aexp                           { EQ($1,$3) }
| aexp LE_TOK aexp                           { LE($1,$3) }
| NOT bexp                                   { Not($2) }
| bexp AND bexp                              { And($1,$3) }
| bexp OR bexp                               { Or($1,$3) }
;

com : SKIP                                   { Skip }
| IDENTIFIER SET aexp                        { Set($1,$3) } 
| com SEMICOLON com                          { Seq($1,$3) }
| IF bexp THEN com ELSE com                  { If($2,$4,$6) }
| WHILE bexp DO com                          { While($2,$4) }
| LET IDENTIFIER EQ_TOK aexp IN com          { Let($2,$4,$6) }
| PRINT aexp                                 { Print($2) }
| LBRACE com RBRACE                          { $2 } 
;
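With the lexer and parser definitions in hand, a small driver can turn a source file into an Imp.com AST. This is a sketch: it assumes the generated modules are named Lex (from the ocamllex file, entry rule 'initial') and Parse (from the ocamlyacc file, start symbol 'com'), matching the 'open Parse' in the lexer above.

(* Sketch of a driver: lex and parse a file into an Imp.com AST. *)
let parse_file (filename : string) : Imp.com =
  let chan   = open_in filename in
  let lexbuf = Lexing.from_channel chan in
  let ast    = Parse.com Lex.initial lexbuf in
  close_in chan;
  ast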







Step 3: Type Checker

Optional, but useful!

one slide summary








A type checker (or semantic analyzer) accepts an abstract syntax tree and outputs an annotated abstract syntax tree, where the annotations are types.


What's a type?

 A type is a set of values coupled with a set
 of operations on those values

 A type system specifies which operations
 are valid for which types

 Type checking can be done statically (at
 compile time, à la C, Java, Go) or dynamically
 (at run time, à la Python, JavaScript)



Some language constructs are not context-free - e.g. a method must be invoked with arguments of proper type

static vs dynamic, strong vs weak*


Static typing means the types are checked at compile-time. Some languages will infer the types at compile-time based on constraints they find (e.g., Hindley-Milner type inference).

Dynamic typing means the types may be checked at run-time, but most likely the language will try to execute what you told it to, and throw a run-time exception if it fails.

Strong typing means you can't add apples and oranges together (an error will be thrown either at compile-time or run-time). Python, Rust, OCaml and Haskell are examples of strongly typed languages.

Weak typing means you can add apples and oranges together (no error will be thrown, and the result will hopefully follow some predefined standard of behavior). JavaScript, C, C++, and Java are examples of weakly typed languages.

*Note: there is no universally agreed definition of strong/weak typing
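As a concrete taste, OCaml (the language this deck uses to build Imp) is statically and strongly typed: it refuses to add apples and oranges. A minimal sketch:

(* OCaml checks types at compile time and never converts silently:
 * ints and floats even use different addition operators. *)
let ints   = 1 + 2                    (* int *)
let floats = 1.0 +. 2.0               (* float; +. is float addition *)
(* let bad = 1 + 2.0 *)               (* rejected by the type checker *)
let fixed  = float_of_int 1 +. 2.0    (* explicit conversion required *)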



Type Rules


An expression e of type τ is written as e : τ
The typing environment (or context) is written as Γ.
The turnstile, ⊢, means proves or determines.

e.g.

 Γ₁ ⊢ e₁ : τ₁   ⋯   Γₙ ⊢ eₙ : τₙ
 ─────────────────────────────────
            Γ ⊢ e : τ

This reads:
"If expression eᵢ has type τᵢ in environment Γᵢ for all i = 1..n, then the expression e has type τ in environment Γ."


Type Rules

Another, more real example:

 Γ ⊢ e₁ : real   Γ ⊢ e₂ : real        Γ ⊢ e₁ : integer   Γ ⊢ e₂ : integer
 ─────────────────────────────        ────────────────────────────────────
      Γ ⊢ e₁ + e₂ : real                     Γ ⊢ e₁ + e₂ : integer

Read these as:
"If expression e₁ has type real in environment Γ and e₂ has type real in environment Γ, then the expression e₁ + e₂ has type real in environment Γ."
"If expression e₁ has type integer in environment Γ and e₂ has type integer in environment Γ, then the expression e₁ + e₂ has type integer in environment Γ."

Usefulness?

Let's play "What's the output?"
But first... historical side note about C89!
 #include <stdio.h>

 int main(void)
 {
   int x = 0;
   int y = 5;
   {
     printf("1. x: %d, y: %d\n", x, y);
     {
       int x = 5;
       printf("2. x: %d, y: %d\n", x, y);
       {
         int y = 10;
         printf("3. x: %d, y: %d\n", x, y);
         x = 100;
       }
       printf("4. x: %d, y: %d\n", x, y);
     }
     printf("5. x: %d, y: %d\n", x, y);
   }
   return 0;
 }
 $ gcc -o scoping scoping.c --std=c89
 $ ./scoping
1. x: 0, y: 5
2. x: 5, y: 5
3. x: 5, y: 10
4. x: 100, y: 5
5. x: 0, y: 5

Usefulness?

Scoping example cont'd


Avoiding scoping ambiguity is why languages like ML (and thus its descendants - OCaml, Caml, SML) use let..in statements, e.g.,
 let x = 1 in
   let y = x + 1 in
     let z = y + 3 in
       print_int z
 ;;
This reads:

 Γ ⊢ e' : τ'    Γ, id : τ' ⊢ e : τ
 ───────────────────────────────────
     Γ ⊢ let id = e' in e : τ

"If expression e' has type τ' in environment Γ, and expression e has type τ in environment Γ extended with the new variable id having type τ', then the whole let expression has type τ."



More Usefulness?

Consider the following assembly language fragment
 add $r1, $r2, $r3    # $r1 = $r2 + $r3
What are the types of $r1, $r2, and $r3?








step 4a: Interpreter

one slide summary








An interpreter accepts an [annotated] abstract syntax tree and sequentially executes the program.





JavaScript
Python*
Ruby
OCaml**
Bash
PHP
Perl
Lisp
MATLAB

*Python is actually byte-code interpreted, like Java
**OCaml actually has an interpreter, bytecode compiler, and native machine specific compiler

Theory


Operational Semantics are a precise way of specifying how to evaluate a program

The result of evaluating an expression depends on the result of evaluating its sub-expressions.


Some notation first:

<e, σ> ⇓ n

means expression e evaluates to n in state, or "memory", σ. This is a judgment. It asserts a relation between e, σ, and n. We can view ⇓ as a function with two args (e and σ).

σ holds the current values of all variables 

operational semantics

For arithmetic and boolean expressions
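For example (a sketch in the notation above, matching the Var and Add cases of the interpreter below), variable lookup and addition would be:

 ───────────────
  <x, σ> ⇓ σ(x)

  <e₁, σ> ⇓ n₁    <e₂, σ> ⇓ n₂
 ──────────────────────────────
     <e₁ + e₂, σ> ⇓ n₁ + n₂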

Operational Semantics

For command expressions (e.g., if...else, while)
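For example, while needs two rules (a sketch; compare the While case of the interpreter below): one for when the guard is false, and one that runs the body once and then re-evaluates the whole loop in the new state.

        <b, σ> ⇓ false
 ──────────────────────────
   <while b do c, σ> ⇓ σ

  <b, σ> ⇓ true    <c, σ> ⇓ σ'    <while b do c, σ'> ⇓ σ''
 ───────────────────────────────────────────────────────────
              <while b do c, σ> ⇓ σ''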

Interpreter


(*      
 * Our operational semantics has a notion of 'state' (sigma). The type
 * 'state' is a side-effect-ful mapping from 'loc' to 'n'.
 * 
 * See http://caml.inria.fr/pub/docs/manual-ocaml/libref/Hashtbl.html
 *)
type state = (loc, n) Hashtbl.t

let initial_state () : state = Hashtbl.create 255 

(* Given a state sigma, return the current value associated with
 * 'variable'. For our purposes all uninitialized variables start at 0. *)
let lookup (sigma:state) (variable:loc) : n = 
  try
    Hashtbl.find sigma variable 
  with Not_found -> 0 

(* Evaluates an aexp given the state 'sigma'. *) 
let rec eval_aexp (a:aexp) (sigma:state) : n = match a with
  | Const(n) -> n
  | Var(loc) -> lookup sigma loc 
  | Add(a0,a1) -> 
    let n0 = eval_aexp a0 sigma in
    let n1 = eval_aexp a1 sigma in
    n0 + n1
  | Sub(a0,a1) -> 
    let n0 = eval_aexp a0 sigma in
    let n1 = eval_aexp a1 sigma in
    n0 - n1
  | Mul(a0,a1) -> 
    let n0 = eval_aexp a0 sigma in
    let n1 = eval_aexp a1 sigma in
    n0 * n1

(* Evaluates a bexp given the state 'sigma'. *) 
let rec eval_bexp (b:bexp) (sigma:state) : bool = match b with
  | True -> true
  | False -> false 
  | EQ(a0,a1) ->
    let n0 = eval_aexp a0 sigma in
    let n1 = eval_aexp a1 sigma in
    n0 = n1
  | LE(a0,a1) ->
    let n0 = eval_aexp a0 sigma in
    let n1 = eval_aexp a1 sigma in
    n0 <= n1
  | Not(b) ->
    not (eval_bexp b sigma)
  | And(b0,b1) ->
    let n0 = eval_bexp b0 sigma in
    let n1 = eval_bexp b1 sigma in
    n0 && n1
  | Or(b0,b1) ->
    let n0 = eval_bexp b0 sigma in
    let n1 = eval_bexp b1 sigma in
    n0 || n1

(* Evaluates a com given the state 'sigma'. *) 
let rec eval_com (c:com) (sigma:state) : state = match c with
  | Skip -> sigma
  | Set(l,a) ->
    let n = eval_aexp a sigma in
    Hashtbl.add sigma l n;
    sigma
  | Seq(c0,c1) ->
    let new_sigma = eval_com c0 sigma in
    eval_com c1 new_sigma
  | If(b,c0,c1) ->
    if eval_bexp b sigma then
      eval_com c0 sigma
    else
      eval_com c1 sigma
  | While(b,c) ->
    if not (eval_bexp b sigma) then
      sigma
    else begin
      let new_sigma = eval_com c sigma in
      let new_c = While(b,c) in
      eval_com new_c new_sigma
    end
  | Let(l,a,c) ->
    let original_n = lookup sigma l in
    let n = eval_aexp a sigma in
    Hashtbl.add sigma l n;
    let new_sigma = eval_com c sigma in
    Hashtbl.replace new_sigma l original_n;
    new_sigma
  | Print(a) -> 
    Printf.printf "%d" (eval_aexp a sigma);
    sigma
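
Putting it together, evaluating the AST for the earlier snippet let x = 5 in { print (x + 1) } with the interpreter above prints 6. A usage sketch (in practice the AST would come from the lexer and parser):

let () =
  (* Let ("x", Const 5, Print (Add (Var "x", Const 1)))
   * is the AST for: let x = 5 in { print (x + 1) } *)
  let prog = Let ("x", Const 5, Print (Add (Var "x", Const 1))) in
  ignore (eval_com prog (initial_state ()))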
            
        







step 4b: Compiler

one slide summary







A compiler takes an [annotated] abstract syntax tree and usually lowers it through one or more intermediate representations, ultimately generating some sort of byte-code or assembly, which is then converted to machine code by an assembler or executed directly on the language's virtual machine (e.g., the JVM).




C
C++
Fortran
Java
OCaml*
Go
Haskell**
Rust***
Swift

*OCaml actually has an interpreter, bytecode compiler, and native machine specific compiler
**Haskell has an interpreter as well (GHCi)
***Rust used to have an interpreter before v0.9

compilation process



Similar to an interpreter, except instead of evaluating the expression immediately, a compiler generates byte-code or assembly.

e.g.,
 let x = 5 in {
   print x + 1
 }
compiles to the following (pseudo-)assembly
 mov   %eax, 5  ; read this as "move 5 into %eax"
 inc   %eax     ; the same as  "add %eax, 1"
 push  %eax     ; pushing %eax to the stack so it will be available by the callee "_print"
 call  _print   ; call the internal assembly method to print the contents of %eax
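As a sketch of what a code generator looks like in OCaml (not this deck's actual back end), here is a tiny compiler from Imp arithmetic expressions to a toy stack machine instead of x86:

(* Flatten an aexp into postfix stack-machine instructions.
 * A real back end would pick registers and emit x86 instead. *)
type instr =
  | IPush of int            (* push a constant onto the stack *)
  | ILoad of string         (* push the current value of a variable *)
  | IAdd | ISub | IMul      (* pop two operands, push the result *)

let rec compile_aexp (a : aexp) : instr list = match a with
  | Const n      -> [IPush n]
  | Var x        -> [ILoad x]
  | Add (a0, a1) -> compile_aexp a0 @ compile_aexp a1 @ [IAdd]
  | Sub (a0, a1) -> compile_aexp a0 @ compile_aexp a1 @ [ISub]
  | Mul (a0, a1) -> compile_aexp a0 @ compile_aexp a1 @ [IMul]

e.g., compile_aexp (Add (Var "x", Const 1)) yields [ILoad "x"; IPush 1; IAdd].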

Step 4c: Optimizing Compiler


Compilers leverage many different techniques to optimize assembly generation, including, but not limited to:


Data-flow optimizations - e.g., Dead code elimination
Loop optimizations - e.g., Loop unrolling
Static single assignment (SSA) optimizations - e.g., constant propagation
Instruction selection
Tail-recursion elimination
Stack height reduction
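
As a taste of one of these, here is a sketch of constant folding over the Imp aexp type (a close cousin of the constant propagation mentioned above):

(* Constant folding: if both operands of an operator are already
 * constants, do the arithmetic at compile time. *)
let rec fold_aexp (a : aexp) : aexp = match a with
  | Const _ | Var _ -> a
  | Add (a0, a1) ->
    (match fold_aexp a0, fold_aexp a1 with
     | Const n0, Const n1 -> Const (n0 + n1)
     | a0', a1'           -> Add (a0', a1'))
  | Sub (a0, a1) ->
    (match fold_aexp a0, fold_aexp a1 with
     | Const n0, Const n1 -> Const (n0 - n1)
     | a0', a1'           -> Sub (a0', a1'))
  | Mul (a0, a1) ->
    (match fold_aexp a0, fold_aexp a1 with
     | Const n0, Const n1 -> Const (n0 * n1)
     | a0', a1'           -> Mul (a0', a1'))

e.g., fold_aexp (Add (Const 2, Mul (Const 3, Const 4))) returns Const 14.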

Intermediate representations






Compilers will perform these optimizations by using intermediate representations.
 // original language
 let x = 5 in {
   if x = 5 then print "Hello" else print "Bye!"
   let y = 6 in {
     print x + 1
   }
 }
# Intermediate Level 1
    x := 5
    if x = 5 goto L1    
    print "Bye"
    goto L2
L1: print "Hello"
L2: y := 6
    x := x + 1
    print x
# Intermediate Level 2 - optimized version
 x := 5
 print "Hello"
 x := x + 1
 print x
; Pseudo-Assembly
mov %eax, 5
push "Hello"
call _print
inc %eax
push %eax
call _print




The example in the previous slide could actually have been optimized further using constant propagation from
 # Intermediate Level 2 - optimized version 
 x := 5 
 print "Hello" 
 x := x + 1 
 print x
to
 # Intermediate Level 3 - further optimized version 
 print "Hello" 
 print 6
 ; Pseudo-Assembly from IL3
 push "Hello"
 call _print
 push 6
 call _print



Bootstrapping

also known as, the "rite of passage" for a language

Question: The C compiler is written in C... so how did they compile the first C compiler?
Answer: The first C compiler was actually written in B!


Turing Completeness

Which is the real Alan Turing,
and which one is Butterscotch Cabbagepatch?


Turing Completeness

If a programming language can simulate a Turing Machine, then it is Turing Complete

An imperative language (e.g., Java, C, Python) is Turing Complete if it has
  • conditional branching ("if", "goto", "branch if zero")
  • the ability to change an arbitrary number of memory locations







Why do we care?








Understanding how something is built gives us a deeper understanding of that something




axios tie-in



RedHawk has its own domain-specific language

X-Midas is its own language

Configuration files



Advanced Topics




I hope I have demystified the unicorns in programming languages

Sources

http://stackoverflow.com/questions/14589346/is-c-context-free-or-context-sensitive
http://www.cs.virginia.edu/~weimer/4610/lectures/weimer-pl-11.pdf
http://stackoverflow.com/questions/12532552/what-part-of-milner-hindley-do-you-not-understand
http://akgupta.ca/blog/2013/05/14/so-you-still-dont-understand-hindley-milner/

Programming Language Theory

By Michael Recachinas
