Machine Learning Basics

The essence of machine learning:

  • A pattern exists
  • We cannot pin it down mathematically
  • We have data on it

(Abu-Mostafa, 2012)


Can we learn the credit approval function?

(Abu-Mostafa, 2012)

Components of Learning

(Abu-Mostafa, 2012)


Solution Components

(Abu-Mostafa, 2012)

A simple hypothesis set - the perceptron

(Abu-Mostafa, 2012)
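A minimal sketch of this hypothesis set in code (NumPy; the weights, bias, and learning rate here are illustrative placeholders, not from the slides):

import numpy as np

def perceptron_predict(w, b, x):
    # h(x) = sign(w^T x + b)
    return 1 if np.dot(w, x) + b > 0 else -1

def perceptron_update(w, b, x, y, lr=1.0):
    # Perceptron learning rule: on a misclassified point, nudge w toward y * x
    if perceptron_predict(w, b, x) != y:
        w = w + lr * y * x
        b = b + lr * y
    return w, b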

A feedforward neural network is an artificial neural network in which the connections between nodes do not form a cycle.

Basic premise of learning: "using a set of observations to uncover an underlying process"

  • Supervised Learning
  • Unsupervised Learning
  • Reinforcement Learning

(Abu-Mostafa, 2012)


Supervised learning

(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)

( input , correct outcome )

(Abu-Mostafa, 2012)

Unsupervised learning

(x_1, ?), (x_2, ?), ..., (x_n, ?)

( input , ? )

(Abu-Mostafa, 2012)

Reinforcement learning

(x_1, y_1, P(y_1)), (x_2, y_2, P(y_2)), ..., (x_n, y_n, P(y_n))

( input , some outcome, belief for the outcome )

(Abu-Mostafa, 2012)

Is Learning Feasible?

(Abu-Mostafa, 2012)

Summing up

A Shallow Tutorial of Deep Learning

The problem of representation and why representations matter

(Goodfellow, 2016)

Computational graphs map inputs to outputs; each node performs an operation (e.g., an activation function).

(Goodfellow, 2016)

[Figure: Venn diagram of AI disciplines. Deep learning is a kind of representation (feature) learning, which is a kind of machine learning, which is one approach to AI. Example techniques: probabilistic reasoning (AI), logistic regression (machine learning), shallow autoencoders (representation learning), MLPs (deep learning).]

(Goodfellow, 2016)

Depth and Repeated Composition

(Goodfellow, 2016)

Learning Multiple Components: how the different parts of an AI system relate to each other within different AI disciplines

(Goodfellow, 2016)

[Figure: flowcharts showing how the parts of an AI system relate across disciplines. Rule-based systems: input → hand-designed program → output. Classic machine learning: input → hand-designed features → mapping from features → output. Representation learning: input → features → mapping from features → output. Deep learning: input → simple features → additional layers of more abstract features → mapping from features → output.]

The many names and changing fortune of Neural Networks

(Goodfellow, 2016)

AlexNet (Krizhevsky et al., 2012)

VGG (Simonyan & Zisserman, 2014)

GoogLeNet (Szegedy et al., 2015)

Machine Learning and Evaluation Methods

Linear Regression

(Goodfellow, 2016)

Over-fitting and Under-fitting in Polynomial Estimation? 

(Goodfellow, 2016)


[Figure: three polynomial fits illustrating underfitting, appropriate capacity, and overfitting.]

Generalization error is the difference between the out-of-sample and in-sample error; a model's capacity is its ability to fit a wide variety of functions.
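A toy numerical illustration with hypothetical error values:

train_error = 0.05                               # in-sample (training) error
test_error = 0.12                                # out-of-sample (test) error
generalization_error = test_error - train_error  # 0.07: the gap we want to keep small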

(Goodfellow, 2016)

Effect of Training Set Size

(Goodfellow, 2016)

Weight Decay and Regularization

(Goodfellow, 2016)

J(w) = MSE_{train} + \lambda w^Tw
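A minimal sketch of adding this penalty in code (assuming tf.keras and an arbitrary lambda = 0.01; not the exact setup used in the slides):

import tensorflow as tf

# Dense layer whose weights are penalized by lambda * w^T w during training
layer = tf.keras.layers.Dense(
    64, activation='relu',
    kernel_regularizer=tf.keras.regularizers.l2(0.01))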
Fun evaluation terminology for Deep Learning Approaches

  • Epoch: one forward and backward pass over all of the training examples.
  • Batch size: the number of training examples in one forward/backward pass.
  • Iterations: the number of passes (one forward plus one backward), each pass using a fixed batch size.

If you have 1000 training examples, and the batch size is 500, then it will take 2 iterations to complete 1 epoch
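The same bookkeeping in code (hypothetical values):

import math

num_examples = 1000
batch_size = 500
iterations_per_epoch = math.ceil(num_examples / batch_size)  # 2 iterations per epoch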

We might use ROC and Precision-Recall Curves to evaluate classification problems

(Davis & Goadrich, 2006)

Common machine learning evaluation metrics

(Davis & Goadrich, 2006)

Precision

  • How many selected items are relevant?
  • True Positives
  • Contrasting False Positives

Recall

  • How many relevant items are selected?
  • True Positives
  • Contrasting False Negatives
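A small illustrative sketch of these two metrics as functions of confusion-matrix counts (not from the slides):

def precision(tp, fp):
    # Of the items we selected, how many are relevant?
    return tp / (tp + fp)

def recall(tp, fn):
    # Of the relevant items, how many did we select?
    return tp / (tp + fn)
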
# TensorFlow quick tutorial: MNIST classification with tf.keras
import tensorflow as tf
mnist = tf.keras.datasets.mnist

# Dataset: supervised training and test splits
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0   # scale pixel values to [0, 1]

# Model definition: a small feedforward network
model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),     # 28x28 image -> 784-d vector
  tf.keras.layers.Dense(512, activation=tf.nn.relu),
  tf.keras.layers.Dropout(0.2),                      # regularization
  tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])

# Loss function and optimizer
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=5)   # fitting
model.evaluate(x_test, y_test)          # evaluation on the test set

TensorFlow Quick Tutorial

The example above covers: loading the dataset, the supervised training/test split, the model definition, the loss function, fitting, and evaluation.

(Abu-Mostafa, 2012)

Topics out there

Deep Learning and Similarity

Binary Code Similarity

  • Malware detection
  • Vulnerability detection
  • Bug Search
  • Cross-Platform (x86, ARM, MIPS)
  • Plagiarism Detection
  • Traceability (?)

Related?

[Figure: two binary code snippets from IoT devices (IoT binary 1 and IoT binary 2). Are they related?]

Binary Code Similarity

  • Goal: to detect similar functions directly in binary code.


The embedding world

  • Word2Vec (Mikolov et al., 2013)
  • Doc2Vec (Le & Mikolov, 2014)
  • Structure2Vec (Dai et al., 2016)

http://gear.github.io/2016-09-05-MAGE/

Neural Network-based Graph Embedding for Cross-Platform Binary Code Similarity Detection

by Xiaojun Xu, et al.

Presented by David A.N

The Problem

Drawbacks of the state-of-the-art approach, specifically graph matching:

  1. The similarity function is hard to adapt to different applications
  2. The efficiency is bounded by the complexity of the graph matching-based algorithm
Desired properties of the solution:

  • Binary Only
  • Cross-Platform Support
  • High Precision
  • High Efficiency
  • Adaptive

The purpose of the research is to design a similarity function that detects whether two binary functions are similar or not.

State-of-the-art

Presented by David A.N

Existing Techniques (the bug-search problem)

  • Pairwise Graph Matching (Pewny et al., 2015)
  • Graph Embedding (Feng et al., 2016)

Baseline 1: Bipartite Graph Matching (BGM)

Baseline 2: Codebook-based Graph Embedding (Genius)

  • Locality-Sensitive Hashing (LSH)
  • Attributed Control Flow Graph (ACFG)

Genius is a graph embedding workflow (Feng et al, 2016)

  • Block-level:
    • String Constants
    • Numeric Constants
    • # of Transfer Instructions
    • # of Calls
    • # of Instructions
    • # of Arithmetic Instructions
  • Inter-block:
    • # of Offspring
    • Betweenness (centrality)

Features or Basic-block Attributes

g = \{V, E\}, where each vertex v carries a feature vector x_v of vertex-specific features

CFG and ACFG of a Binary Function
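A hypothetical sketch of how the attribute vector x_v for one basic block could be assembled from the eight features above (the names and ordering are illustrative, not the paper's exact encoding):

import numpy as np

# One ACFG vertex: six block-level attributes plus two inter-block attributes
block_attributes = {
    'num_string_constants': 2,
    'num_numeric_constants': 5,
    'num_transfer_instructions': 3,
    'num_calls': 1,
    'num_instructions': 24,
    'num_arithmetic_instructions': 7,
    'num_offspring': 2,
    'betweenness': 0.13,
}
x_v = np.array(list(block_attributes.values()), dtype=np.float32)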

Limitations

  • Codebook generation is expensive
    • Pairwise graph matching
    • Spectral Clustering
  • Quality of the generated Codebook
  • Runtime overheads

Novelty

Presented by David A.N

  • Using a DNN-based approach to transform an ACFG into an embedding
    • Better Accuracy
    • Higher Embedding Efficiency
    • Faster offline training

Contribution

Given a query function and a set of target functions, the goal is a similarity function \pi over pairs f_1, f_2 with \pi(f_1, f_2) = +1 if the functions are similar and \pi(f_1, f_2) = -1 otherwise.

Better Accuracy: iteratively propagate embeddings throughout the CFG, instead of performing graph matching.

Embedding Efficiency: learn to minimize the distance between embeddings of similar ACFGs and to maximize the distance between embeddings of dissimilar ACFGs.

Faster Offline Training: the cost of preparing training data drops from O(n^2) (pairwise distance matrices) to O(n) (per epoch).

Gemini, the solution

Presented by David A.N

Solution (Gemini)

Structure2Vec (adapted) + Siamese Network

Code similarity is not a classification problem: the network \phi is not trained to predict a label for binary code, and is not meant to do well on a predictive task. Instead, the neural network is trained to do well at differentiating the similarity between two inputs.

Graph Embedding or Struct2Vec

Presented by David A.N

Graph embedding

Each vertex v is mapped to a p-dimensional embedding \mu_v, and the whole graph to an embedding vector \mu_g, produced by an aggregation function A over the vertex embeddings:

\mu_g = A_{v \in V}(\mu_v), \quad \text{e.g. } \mu_g = \sum_{v \in V} \mu_v

Graph embedding

[Figure: graph embedding generation over T iterations. Vertex features x_1, x_2, x_3 initialize vertex embeddings \mu_1^0, \mu_2^0, \mu_3^0, which are updated to \mu_v^1, ..., \mu_v^T across T iterations; the final vertex embeddings are aggregated and multiplied by W_2 to produce the graph embedding \mu.]

The embedding vector

\mu_v^{(t+1)} = F\left(x_v, \sum_{u \in N(v)} \mu_u^{(t)}\right), \quad \forall v \in V
F\left(x_v, \sum_{u \in N(v)} \mu_u\right) = \tanh\left(W_1 x_v + \sigma\left(\sum_{u \in N(v)} \mu_u\right)\right)
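A minimal NumPy sketch of one propagation step under these equations (for brevity, \sigma is collapsed to a single ReLU layer with weight P; the paper uses an n-layer fully-connected network):

import numpy as np

def propagate(X, A, mu, W1, P):
    # X: (|V|, d) vertex features; A: (|V|, |V|) adjacency matrix
    # mu: (|V|, p) current embeddings; W1: (d, p); P: (p, p)
    neighbor_sum = A @ mu                      # sum_{u in N(v)} mu_u for every v
    sigma = np.maximum(neighbor_sum @ P, 0.0)  # sigma(.): single ReLU layer stand-in
    return np.tanh(X @ W1 + sigma)             # mu_v^{(t+1)} = tanh(W1 x_v + sigma(...))

Repeating this step T times, aggregating the final vertex embeddings, and multiplying by W_2 yields the graph embedding \mu.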

The embedding Network

Quick Discussion

  • Did the authors use feature engineering? If so, what type of features? 
  • What is the most important representation the DNN needs to learn?

Siamese Network

Presented by David A.N

Training the model parameters with a Siamese architecture
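A rough sketch of the pairwise objective (assuming TensorFlow; the cosine similarity between the two graph embeddings is regressed toward the ±1 label, mirroring the pairs described above):

import tensorflow as tf

def siamese_loss(mu1, mu2, y):
    # mu1, mu2: graph embeddings of the two ACFGs; y: +1 (similar) or -1 (dissimilar)
    cos = tf.reduce_sum(mu1 * mu2, axis=-1) / (
        tf.norm(mu1, axis=-1) * tf.norm(mu2, axis=-1) + 1e-8)
    return tf.reduce_mean(tf.square(cos - tf.cast(y, cos.dtype)))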

Evaluation

Presented by David A.N

Task-independent Pre-Training

  • Capturing invariant features of the function across different architectures and compilers
  • Assuming a set of source code is collected
    • Compile the code

Task-specific Re-Training

Generate additional ACFG pairs from human experts to retrain the graph embedding network (e.g., 5 more epochs)

\langle g, g_1, +1 \rangle: g and g_1 compiled from the same source

\langle g, g_2, -1 \rangle: compiled from different source code

Hyperparameters

p = 64 (embedding size)
n = 2 (embedding depth)
T = 5 (propagation iterations)

Datasets

Dataset   Purpose                    Source
I         Accuracy                   OpenSSL
II        Task-specific retraining   IoT devices
III       Efficiency                 Firmware (large # of vertices)
IV        Vulnerability case study   Vulnerable functions

Accuracy

Discussion: why did the authors use ROC curves?

Hyperparameters I

Discussion: can we tell whether the model is overfitting? What about its capacity?

Hyperparameters II

Vulnerability

Similarity

Discussion: do you believe this is a good plot?

Neural Network-based Graph Embedding for Cross-Platform Binary Code Similarity Detection

By David Nader Palacio

First Software Engineering Presentation
