Tokenizaton
parsing
compilation
IN
ruby

Tomasz (warkocz) Warkocki

plan

  • Ruby
  • Tokenization
  • Parsing
  • Compilation?

ruby

ruby

  • dynamic, object-oriented programming language
  • Yukihiro “Matz” Matsumoto
  • public release 1995
  • inspired by: Perl, Smalltalk, Eiffel, Ada and Lisp
  • various versions: MacRuby, IronRuby, Topaz, JRuby
  • presentation is based on MRI (Matz’s Ruby Interpreter)

processing steps

Tokenization

10.times do |n|
    puts n
end

parse.y and parser_yylex method

ripper

require 'ripper'
require 'pp'
code = <<STR
10.times do |n|
puts n end
STR
puts code
pp Ripper.lex(code)
[[[1, 0], :on_int, "10"],
 [[1, 2], :on_period, "."],
 [[1, 3], :on_ident, "times"],
 [[1, 8], :on_sp, " "],
 [[1, 9], :on_kw, "do"],
 [[1, 11], :on_sp, " "],
 [[1, 12], :on_op, "|"],
 [[1, 13], :on_ident, "n"],
 [[1, 14], :on_op, "|"],
 [[1, 15], :on_ignored_nl, "\n"],
 [[2, 0], :on_ident, "puts"],
 [[2, 4], :on_sp, " "],
 [[2, 5], :on_ident, "n"],
 [[2, 6], :on_sp, " "],
 [[2, 7], :on_kw, "end"],
 [[2, 10], :on_nl, "\n"]]

parsing

bison

Next version of Yacc (Yet Another Compiler Compiler)

Look-Ahead Left Reversed Rightmost Derivation (LALR) Parse Algorithm

SpanishPhrase : me gusta el ruby {
  printf("I like Ruby\n");
}
  • Me gusta el Ruby.
  • I like Ruby.
  • Me gusta el Ruby.
  • Le gusta el Ruby.
SpanishPhrase: VerbAndObject el ruby {
  printf("%s Ruby\n", $1);
};
VerbAndObject: SheLikes | ILike {
$$ = $1; };
SheLikes: le gusta {
  $$ = "She likes";
}
ILike: me gusta {
  $$ = "I like";
}

$$ - returns a value from a child grammar rule to a parent 

$1 - refers to a child’s value from a parent

Shift or reduce, that is the question? ;)

Look ahead and check table of possibilities (complex state machine)

Real example

wrug$ ruby -y simple.rb 

Starting parse
Entering state 0
Reducing stack by rule 1 (line 903):
-> $$ = nterm $@1 ()
Stack now 0
Entering state 2
Reading a token: Next token is token tINTEGER ()
Shifting token tINTEGER ()
Entering state 41
Reducing stack by rule 505 (line 4411):
   $1 = token tINTEGER ()
-> $$ = nterm simple_numeric ()
Stack now 0 2
Entering state 112
Reducing stack by rule 503 (line 4399):
   $1 = nterm simple_numeric ()
-> $$ = nterm numeric ()
Stack now 0 2
Entering state 111
Reducing stack by rule 451 (line 3874):
   $1 = nterm numeric ()
-> $$ = nterm literal ()
Stack now 0 2
Entering state 99
Reducing stack by rule 276 (line 2627):
   $1 = nterm literal ()
-> $$ = nterm primary ()
Stack now 0 2
Entering state 85
Reading a token: Next token is token '.' ()
Reducing stack by rule 340 (line 3100):

Again Ripper tool, this time parsing

require 'ripper'
require 'pp'
code = <<STR
   10.times do |n|
     puts n
   end
STR
puts code
pp Ripper.sexp(code)
wrug$ ruby lex2.rb 
   10.times do |n|
     puts n 
   end
[:program,
 [[:method_add_block,
   [:call, [:@int, "10", [1, 3]], :".", [:@ident, "times", [1, 6]]],
   [:do_block,
    [:block_var,
     [:params, [[:@ident, "n", [1, 16]]], nil, nil, nil, nil, nil, nil],
     false],
    [[:command,
      [:@ident, "puts", [2, 5]],
      [:args_add_block, [[:var_ref, [:@ident, "n", [2, 10]]]], false]]]]]]]

Abstract Syntax Tree (AST)

[[:command,
  [:@ident, "puts", [2, 2]],
      [:args_add_block, [[:var_ref, [:@ident, "n", [2, 7]]]],
                        false]]]
require 'ripper'
require 'pp'
code = <<STR
  2+2*3
STR
puts code
pp Ripper.sexp(code)
wrug$ ruby lex3.rb 
  2+2*3
[:program,
 [[:binary,
   [:@int, "2", [1, 2]],
   :+,
   [:binary, [:@int, "2", [1, 4]], :*, [:@int, "3", [1, 6]]]]]]
# @ NODE_SCOPE (line: 1)
# +- nd_tbl: (empty)
# +- nd_args:
# |   (null node)
# +- nd_body:
#     @ NODE_FCALL (line: 1)
#     +- nd_mid: :puts
#     +- nd_args:
#         @ NODE_ARRAY (line: 1)
#         +- nd_alen: 1
#         +- nd_head:
#         |   @ NODE_CALL (line: 1)
#         |   +- nd_mid: :+
#         |   +- nd_recv:
#         |   |   @ NODE_LIT (line: 1)
#         |   |   +- nd_lit: 2
#         |   +- nd_args:
#         |       @ NODE_ARRAY (line: 1)
#         |       +- nd_alen: 1
#         |       +- nd_head:
#         |       |   @ NODE_CALL (line: 1)
#         |       |   +- nd_mid: :*
#         |       |   +- nd_recv:
#         |       |   |   @ NODE_LIT (line: 1)
#         |       |   |   +- nd_lit: 2
#         |       |   +- nd_args:
#         |       |       @ NODE_ARRAY (line: 1)
#         |       |       +- nd_alen: 1
#         |       |       +- nd_head:
#         |       |       |   @ NODE_LIT (line: 1)
#         |       |       |   +- nd_lit: 3
#         |       |       +- nd_next:
#         |       |           (null node)
#         |       +- nd_next:
#         |           (null node)
#         +- nd_next:
#             (null node)
wrug$ ruby --dump parsetree simple2.rb

compilation...

Ruby <= 1.8

Ruby >= 1.9

Based on:

Thank you!

Any questions? :)

Tokenization, Parsing and Compilation in Ruby

By Tomasz Warkocki

Tokenization, Parsing and Compilation in Ruby

  • 2,315