It's alive!

Machine Learning writes your code

Dominic Elm

Uri Shaked

@elmd_

@UriShaked

How Everything Started

Angular Connect 2018

How to AI in JS? - Asim Hussain

Thank You Asim!

RESEARCH QUESTION

Given a function signature, can we create a model that predicts the body of that function?

Machine Learning 101

Traditional Program

email = 'How to be a Millionaire in 4 weeks'

if (email contains 'Millionaire')
  markAsSpam(email)
else if (email contains '...')
  ...
else if (email contains '...')
  ...

ML Program

data = [
  ('How to be a Millionaire in 4 weeks', SPAM),
  ('...', NO_SPAM),
  ('...', NO_SPAM),
  ('...', SPAM),
  ...
]

for example in data:
  classify example
  optimize
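
The ML program can be made concrete with a toy learner. This is only a minimal sketch: the extra training emails, the word-weight model, and the names `classify` and `train` are assumptions made for illustration, not the approach used in the talk. "Classify" becomes scoring the words of an email, and "optimize" becomes nudging the weight of each word whenever the guess is wrong.

```python
SPAM, NO_SPAM = 1, 0

# Hypothetical training emails; only the first one appears on the slide.
data = [
    ("How to be a Millionaire in 4 weeks", SPAM),
    ("Meeting notes for Monday", NO_SPAM),
    ("Become a Millionaire today", SPAM),
    ("Lunch on Friday?", NO_SPAM),
]


def classify(weights, text):
    """Sum the learned weight of every word; a positive score means spam."""
    score = sum(weights.get(word, 0.0) for word in text.lower().split())
    return SPAM if score > 0 else NO_SPAM


def train(data, epochs=5, learning_rate=1.0):
    """For each example: classify it, then adjust the weights by the error."""
    weights = {}
    for _ in range(epochs):
        for text, label in data:
            error = label - classify(weights, text)  # 0 when the guess is right
            for word in text.lower().split():
                weights[word] = weights.get(word, 0.0) + learning_rate * error
    return weights


weights = train(data)
```

No rule about the word "Millionaire" is written by hand here; the weight for it comes out of the examples.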

Neural Networks???

Inputs: square meters = 120, #bedrooms = 4, ...
Weights: 0.2, 0.1
Output: 120 x 0.2 + 4 x 0.1 = 24.4

Inputs: square meters = 120, #bedrooms = 4, ...
Weights: 0.2, 0.1
Output: 120 x 0.2 + 4 x 0.1 = 24.4
Expected: 15
ERROR: 24.4 - 15 = 9.4

Inputs: square meters = 120, #bedrooms = 4, ...
Weights: 0.1, 0.05
Output: 120 x 0.1 + 4 x 0.05 = 12.2
Expected: 15
ERROR: 12.2 - 15 = -2.8
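
The arithmetic on these slides can be replayed in a few lines. This is a sketch of a single neuron without bias or activation; reading 15 as the expected output is an inference from the error values on the slides, and the function name `predict` is made up for the example.

```python
import math


def predict(inputs, weights):
    """Weighted sum of the inputs: one neuron, no bias, no activation."""
    return sum(x * w for x, w in zip(inputs, weights))


inputs = [120, 4]  # square meters, #bedrooms
expected = 15      # inferred from the ERROR values on the slides

first = predict(inputs, [0.2, 0.1])    # 120 x 0.2 + 4 x 0.1
second = predict(inputs, [0.1, 0.05])  # 120 x 0.1 + 4 x 0.05

error_first = first - expected
error_second = second - expected

# The second set of weights lands closer to the expected value.
improved = abs(error_second) < abs(error_first)
```

Training repeats this loop many times: predict, measure the error, adjust the weights so the error shrinks.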

Layers: Input, Hidden, Output

HOW DO WE PREDICT FUNCTION BODIES?

Input: function greet(name: string)

MODEL

Output: ?

Full function:
function greet(name: string) {
  const prefix = name.length < 10 ? 'Hi' : 'Hello';
  return prefix + name;
}

Input: function greet(name: string)

MODEL

Output: {

Full function:
function greet(name: string) {
  const prefix = name.length < 10 ? 'Hi' : 'Hello';
  return prefix + name;
}

Input: function greet(name: string)

MODEL

Output: const

Full function:
function greet(name: string) {
  const prefix = name.length < 10 ? 'Hi' : 'Hello';
  return prefix + name;
}

Input: function greet(name: string)

MODEL

Output: prefix

Full function:
function greet(name: string) {
  const prefix = name.length < 10 ? 'Hi' : 'Hello';
  return prefix + name;
}
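
Predicting one token at a time amounts to a loop: ask the model for the next token, append it, and ask again. The sketch below assumes a hypothetical `predict_next(signature, tokens_so_far)` callable and the START and END markers introduced in the data cleaning step. The `canned_model` stand-in simply replays the body of `greet`; it is not a trained model.

```python
def generate_body(signature, predict_next, max_tokens=50):
    """Ask for one token at a time until END shows up or the budget runs out."""
    tokens = ["START"]
    while len(tokens) < max_tokens:
        next_token = predict_next(signature, tokens)
        tokens.append(next_token)
        if next_token == "END":
            break
    return tokens


# Stand-in for a trained model: replays a fixed answer token by token.
CANNED = (
    "{ const prefix = name . length < 10 ? 'Hi' : 'Hello' ; "
    "return prefix + name ; } END"
).split()


def canned_model(signature, tokens_so_far):
    # tokens_so_far always begins with START, hence the offset of one.
    return CANNED[len(tokens_so_far) - 1]


result = generate_body("function greet(name: string)", canned_model)
```

The `max_tokens` limit is a design choice for the sketch: a real model may never emit END, so the loop needs a second way to stop.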

ML Approach

1. Gather Data
2. Clean Data
3. Choose Model
4. Training
5. Evaluation

Gathering Data

1

How can we quickly gather a lot of function examples?

Look at open source projects on GitHub

Using Google BigQuery, we can run an SQL query over the open source code on GitHub in under a minute!

We kept only the TypeScript files, extracted 324,280 TypeScript functions from them, and collected them in one huge JSON file.
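
A query of this kind might look like the sketch below. The slides do not show the actual query, so everything here is an assumption: it targets the public `bigquery-public-data.github_repos` dataset, and the helper name `build_typescript_query` is invented for the example. The function only builds the SQL text; running it needs a BigQuery client and credentials, which are left out.

```python
def build_typescript_query(limit=None):
    """Build an SQL string selecting files whose path ends in .ts.

    Assumes the public GitHub dataset on BigQuery, where `files` lists the
    paths in each repository and `contents` holds the file text.
    """
    query = (
        "SELECT f.repo_name, f.path, c.content\n"
        "FROM `bigquery-public-data.github_repos.files` AS f\n"
        "JOIN `bigquery-public-data.github_repos.contents` AS c\n"
        "  ON f.id = c.id\n"
        "WHERE f.path LIKE '%.ts'"
    )
    if limit is not None:
        query += f"\nLIMIT {int(limit)}"
    return query


sample_query = build_typescript_query(limit=10)
```

Filtering on the `.ts` extension is crude: it also matches declaration files and any unrelated format that happens to use the same extension, so a later cleaning pass is still needed.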

Cleaning Data

2

function greet(name: string) {
  const prefix = name.length < 10 ? 'Hi' : 'Hello';
  return prefix + name;
}

1. Preprocess raw dataset
2. Prepare model inputs

Split signature from body

function greet(name: string)

{
  const prefix = name.length < 10 ? 'Hi' : 'Hello';
  return prefix + name;
}
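
One simple way to do the split is to cut at the first opening brace. This is a sketch under a loud assumption: the first `{` in the source opens the function body, which does not hold for signatures with object types or destructured parameters. The talk does not say how the split was implemented, and `split_function` is a made-up name.

```python
def split_function(source):
    """Return (signature, body), cutting at the first opening brace.

    Assumption: the first '{' starts the body. A real pipeline would use a
    TypeScript parser instead of string search.
    """
    brace = source.index("{")
    return source[:brace].rstrip(), source[brace:]


source = (
    "function greet(name: string) {\n"
    "  const prefix = name.length < 10 ? 'Hi' : 'Hello';\n"
    "  return prefix + name;\n"
    "}"
)
signature, body = split_function(source)
```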

Rename function parameters

function greet($arg0$: string)

{
  const prefix = $arg0$.length < 10 ? 'Hi' : 'Hello';
  return prefix + $arg0$;
}
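
Renaming parameters can be sketched with regular expressions. The code assumes simple parameters of the form `name: type`, separated by commas, with no default values or destructuring; the function name `rename_parameters` is hypothetical. Replacing on word boundaries keeps `name` inside a longer word such as `username` untouched.

```python
import re


def rename_parameters(signature, body):
    """Replace each parameter name with $arg0$, $arg1$, ... in both parts.

    Assumption: parameters look like `name: type` and are comma separated.
    """
    params_text = signature[signature.index("(") + 1 : signature.rindex(")")]
    names = [p.split(":")[0].strip() for p in params_text.split(",") if p.strip()]
    for i, name in enumerate(names):
        pattern = r"\b" + re.escape(name) + r"\b"
        placeholder = f"$arg{i}$"
        # A function replacement avoids any escaping surprises in the placeholder.
        signature = re.sub(pattern, lambda m: placeholder, signature)
        body = re.sub(pattern, lambda m: placeholder, body)
    return signature, body


new_signature, new_body = rename_parameters(
    "function greet(name: string)",
    "{\n  const prefix = name.length < 10 ? 'Hi' : 'Hello';\n  return prefix + name;\n}",
)
```

The point of the renaming is that two functions doing the same thing with differently named parameters end up looking identical to the model.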

Rename identifiers and literals

function greet($arg0$: string)

{
  const id0 = $arg0$.id1 < 2 ? '3' : '4';
  return id0 + $arg0$;
}
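
The numbering on this slide suggests one shared counter: `prefix` becomes `id0`, `length` becomes `id1`, `10` becomes `2`, and the two strings become `'3'` and `'4'`. The sketch below reproduces that scheme with a regular expression. It is an illustration only: the keyword list is partial, only single-quoted strings and integer literals are handled, and `rename_identifiers_and_literals` is an invented name.

```python
import re

# Partial keyword list, enough for the example; a real tool would use a parser.
KEYWORDS = {
    "const", "let", "var", "return", "if", "else", "for", "while",
    "function", "new", "true", "false", "null", "undefined", "this",
}

# Order matters: parameter placeholders and strings are tried first.
TOKEN = re.compile(r"\$arg\d+\$|'[^']*'|[A-Za-z_]\w*|\d+")


def rename_identifiers_and_literals(body):
    """Number identifiers and literals in order of first appearance."""
    index = {}

    def replace(match):
        text = match.group(0)
        if text.startswith("$arg") or text in KEYWORDS:
            return text  # placeholders and keywords stay as they are
        n = index.setdefault(text, len(index))
        if text.startswith("'"):
            return f"'{n}'"
        if text.isdigit():
            return str(n)
        return f"id{n}"

    return TOKEN.sub(replace, body)


renamed = rename_identifiers_and_literals(
    "{\n  const prefix = $arg0$.length < 10 ? 'Hi' : 'Hello';\n  return prefix + $arg0$;\n}"
)
```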

Space tokens

function greet ( $arg0$ : string )

{
  const id0 = $arg0$ . id1 < 2 ? '3' : '4' ;
  return id0 + $arg0$ ;
}
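
Spacing tokens is a small tokenizer: find every token, then join the tokens with single spaces. The pattern below is an assumption that covers this example only. It treats every punctuation character as its own token, so multi-character operators such as `=>` or `===` would be split apart, and it collapses line breaks into spaces.

```python
import re

# Placeholders, single-quoted strings, words/numbers, then any other symbol.
TOKEN = re.compile(r"\$arg\d+\$|'[^']*'|\w+|[^\w\s]")


def space_tokens(code):
    """Put exactly one space between all tokens in the code."""
    return " ".join(TOKEN.findall(code))


spaced_signature = space_tokens("function greet($arg0$: string)")
spaced_line = space_tokens("const id0 = $arg0$.id1 < 2 ? '3' : '4';")
```

After this step, a plain split on whitespace is enough to get the token list back.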

Add START and END symbols

function greet ( $arg0$ : string )

START {
  const id0 = $arg0$ . id1 < 2 ? '3' : '4' ;
  return id0 + $arg0$ ;
} END

Create Model Inputs and Outputs

Input                                              Output
function greet ( $arg0$ : string ) START           {
function greet ( $arg0$ : string ) START {         const
function greet ( $arg0$ : string ) START { const   id0
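
Each function therefore yields many training examples: one per body token, where the input is the signature plus everything generated so far and the output is the next token. A minimal sketch, with `make_examples` as a hypothetical helper name:

```python
def make_examples(signature_tokens, body_tokens):
    """Return (signature, body so far, next token) for every position.

    body_tokens is expected to start with START and finish with END.
    """
    examples = []
    for i in range(1, len(body_tokens)):
        examples.append((signature_tokens, body_tokens[:i], body_tokens[i]))
    return examples


signature_tokens = "function greet ( $arg0$ : string )".split()
body_tokens = (
    "START { const id0 = $arg0$ . id1 < 2 ? '3' : '4' ; "
    "return id0 + $arg0$ ; } END"
).split()

examples = make_examples(signature_tokens, body_tokens)
```

The last example in the list has END as its output, which is how the model learns when to stop.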

Building a dictionary with all tokens

function greet ( $arg0$ : string )
START {
  const id0 = $arg0$ . id1 < 2 ? '3' : '4' ;
  return id0 + $arg0$ ;
} END
dict = {
  'function': 1,
  'greet': 2,
  '(': 3,
  '$arg0$': 4,
  ':': 5,
  'string': 6,
  ')': 7,
  'START': 8,
  '{': 9,
  ...
}
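
A dictionary like this can be built by walking over all tokens and handing out the next free number to each new one. The sketch assumes ids are assigned in order of first appearance and that 0 stays unused so it can serve as padding; the talk does not say which numbering scheme was actually used, and `build_vocabulary` is a made-up name.

```python
def build_vocabulary(token_lists):
    """Map every distinct token to an integer id, starting at 1."""
    vocabulary = {}
    for tokens in token_lists:
        for token in tokens:
            if token not in vocabulary:
                vocabulary[token] = len(vocabulary) + 1  # 0 is kept for padding
    return vocabulary


signature_tokens = "function greet ( $arg0$ : string )".split()
body_tokens = (
    "START { const id0 = $arg0$ . id1 < 2 ? '3' : '4' ; "
    "return id0 + $arg0$ ; } END"
).split()

vocabulary = build_vocabulary([signature_tokens, body_tokens])
```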

Text to Sequence

function greet ( $arg0$ : string )
[1, 2, 3, 4, 5, 6, 7]
function isPrime ( $arg0$ : number )
[1, 13, 3, 4, 5, 23, 7]

Add Padding

[1, 2, 3, 4, 5, 6, 7]
[0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 7]
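
Both steps are short in code. The sketch below looks tokens up in a dictionary and pads on the left with zeros, matching the slide; the ids 13 and 23 are copied from the slide, and the helper names `text_to_sequence` and `pad_left` are assumptions.

```python
def text_to_sequence(text, vocabulary):
    """Replace every space-separated token with its id from the dictionary."""
    return [vocabulary[token] for token in text.split()]


def pad_left(sequence, length):
    """Prepend zeros until the sequence has the requested length.

    Sequences that are already long enough are returned unchanged.
    """
    return [0] * (length - len(sequence)) + sequence


# Ids taken from the slides.
vocabulary = {
    "function": 1, "greet": 2, "(": 3, "$arg0$": 4, ":": 5,
    "string": 6, ")": 7, "isPrime": 13, "number": 23,
}

greet_ids = text_to_sequence("function greet ( $arg0$ : string )", vocabulary)
is_prime_ids = text_to_sequence("function isPrime ( $arg0$ : number )", vocabulary)
padded = pad_left(greet_ids, 14)
```

Padding gives every input the same length, which is what lets many examples be stacked into one batch.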

Encode Output

Next Token (Y)   Index   One Hot Encoding
{                9       [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
string           6       [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
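
One hot encoding turns a token id into a vector that is zero everywhere except at the position of that id. A minimal sketch, using a vocabulary size of 10 to match the vectors on the slide (the real vocabulary is much larger); `one_hot` and `decode` are illustrative names.

```python
def one_hot(token_id, vocabulary_size):
    """Vector of zeros with a single 1 at position token_id."""
    vector = [0] * vocabulary_size
    vector[token_id] = 1
    return vector


def decode(vector):
    """Inverse direction: the position of the largest value is the token id."""
    return max(range(len(vector)), key=lambda i: vector[i])


brace_vector = one_hot(9, 10)   # '{' has id 9
string_vector = one_hot(6, 10)  # 'string' has id 6
```

The same `decode` step applies to the model's output, which is a vector of scores with one entry per token in the dictionary.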

Choose Model

3

Look at Similar Problems

Machine Translation

Using TensorFlow

Training the Model

4

Google Colab

Google Cloud TPU (Tensor Processing Unit)

Evaluation

5

Evaluating the performance of the model

DEMO TIME

Takeaways

  • Take advantage of the cloud
  • Look for solutions to similar problems
  • Data processing makes up a big chunk of the work

Learn more

https://urish.org
