Machine Learning Basics

The essence of machine learning:

  • A pattern exists
  • We cannot pin it down mathematically
  • We have data on it

(Abu-Mostafa, 2012)

Can we learn the credit approval?

(Abu-Mostafa, 2012)

Components of Learning

(Abu-Mostafa, 2012)

Solution Components

(Abu-Mostafa, 2012)

A simple hypothesis set - the perceptron

(Abu-Mostafa, 2012)

A feedforward neural network is an artificial neural network (ANN) in which the connections between nodes do not form a cycle.
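As an illustration of the perceptron hypothesis set named in the slide title, here is a minimal sketch of h(x) = sign(w·x + b) trained with the perceptron learning rule. The toy data, the function names, and the epoch cap are assumptions made for this example; they do not come from the slides.

```python
def predict(w, b, x):
    """Perceptron hypothesis: sign of the weighted sum (a sum of 0 counts as +1)."""
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if s >= 0 else -1


def train_perceptron(data, max_epochs=100):
    """Perceptron learning rule: move w toward each misclassified point."""
    w = [0.0] * len(data[0][0])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in data:
            if predict(w, b, x) != y:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:  # every training point is classified correctly
            break
    return w, b


# Hypothetical, linearly separable toy data: (features, label in {-1, +1})
toy_data = [([2.0, 1.0], 1), ([1.0, 3.0], 1), ([-1.0, -2.0], -1), ([-2.0, -1.0], -1)]
w, b = train_perceptron(toy_data)
print(w, b)
```

The loop only terminates early when the data are linearly separable; the epoch cap is there for the non-separable case.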

Basic premise of learning: "using a set of observations to uncover an underlying process"

  • Supervised Learning
  • Unsupervised Learning
  • Reinforcement Learning

(Abu-Mostafa, 2012)

Supervised learning

(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)

(input, correct outcome)

(Abu-Mostafa, 2012)

Unsupervised learning

(x_1, ?), (x_2, ?), ..., (x_n, ?)

(input, ?)

(Abu-Mostafa, 2012)

Reinforcement learning

(x_1, y_1, P(y_1)), (x_2, y_2, P(y_2)), ..., (x_n, y_n, P(y_n))

(input, some outcome, belief for the outcome)

(Abu-Mostafa, 2012)

Is Learning Feasible?

(Abu-Mostafa, 2012)

Summing up

A Shallow Tutorial of Deep Learning

The problem of representation and why representations matter

(Goodfellow, 2016)

A computational graph maps inputs to outputs; each node performs an operation (for example, applying an activation function).

(Goodfellow, 2016)

Nested fields, each with an example technique (Goodfellow, 2016):

  • AI (example: probabilistic reasoning)
    • Machine Learning (example: logistic regression)
      • Representation (Feature) Learning (example: shallow autoencoders)
        • Deep Learning (example: MLPs)

Depth and Repeated Composition

(Goodfellow, 2016)

Learning Multiple Components: how the different parts of an AI system relate to each other within different AI disciplines

(Goodfellow, 2016)

  • Rule-based systems: Input → Hand-designed program → Output
  • Classic machine learning: Input → Hand-designed features → Mapping from features → Output
  • Representation learning: Input → Features → Mapping from features → Output
  • Deep learning: Input → Simple features → Additional layers of more abstract features → Mapping from features → Output

The many names and changing fortune of Neural Networks

(Goodfellow, 2016)

AlexNet (Krizhevsky, Sutskever & Hinton, 2012)

VGG (Simonyan & Zisserman, 2014)

GoogLeNet (Szegedy et al., 2015)

Machine Learning and Evaluation Methods

Linear Regression

(Goodfellow, 2016)

Over-fitting and Under-fitting in Polynomial Estimation? 

(Goodfellow, 2016)

Underfitting

Appropriate Capacity

Overfitting
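The three regimes named above can be reproduced with a small polynomial-fitting experiment. This is a sketch under assumptions: the quadratic target, the noise level, the sample sizes, and the degrees (1, 2, 9) are chosen for illustration and are not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: ten noisy samples of a quadratic target
x_train = np.linspace(-1.0, 1.0, 10)
y_train = x_train ** 2 + rng.normal(scale=0.05, size=x_train.shape)
x_test = np.linspace(-0.95, 0.95, 50)
y_test = x_test ** 2


def mse(coeffs, x, y):
    """Mean squared error of the polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))


errors = {}
for degree in (1, 2, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    errors[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(f"degree {degree}: train MSE {errors[degree][0]:.5f}, test MSE {errors[degree][1]:.5f}")
```

Training error can only go down as the degree grows, because each polynomial family contains the previous one. Test error is the quantity to watch: degree 1 cannot represent the curve (underfitting), while degree 9 passes through all ten noisy points and typically pays for that between them (overfitting).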

Generalization error is the difference between out-of-sample error and in-sample error. Model capacity is the model's ability to fit a wide variety of functions.

(Goodfellow, 2016)

Effect of Training Set Size

(Goodfellow, 2016)

Weight Decay and Regularization

(Goodfellow, 2016)

J(w) = MSE_{train} + \lambda w^Tw
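For a linear model, the objective J(w) above has a closed-form minimizer. Setting the gradient to zero gives (X^T X + m λ I) w = X^T y, where m is the number of training examples. The sketch below assumes a linear model y ≈ Xw and a made-up toy dataset; the function names are illustrative.

```python
import numpy as np


def weight_decay_objective(w, X, y, lam):
    """J(w) = MSE_train + lam * w^T w."""
    residual = X @ w - y
    return float(np.mean(residual ** 2) + lam * (w @ w))


def fit_weight_decay(X, y, lam):
    """Minimize J(w) in closed form: (X^T X + m * lam * I) w = X^T y."""
    m, n = X.shape
    return np.linalg.solve(X.T @ X + m * lam * np.eye(n), X.T @ y)


# Hypothetical toy regression problem
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=20)

for lam in (0.0, 0.1, 1.0):
    w = fit_weight_decay(X, y, lam)
    print(lam, np.round(w, 3), round(weight_decay_objective(w, X, y, lam), 4))
```

Larger values of λ shrink the weights toward zero, trading a higher training MSE for a simpler model.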

Fun evaluation terminology for Deep Learning Approaches

  • Epoch: one forward and backward pass over all the training examples.
  • Batch size: the number of training examples in one forward/backward pass.
  • Iteration: one pass (one forward + one backward) over a single batch of fixed size.

If you have 1000 training examples and the batch size is 500, it takes 2 iterations to complete 1 epoch.
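The arithmetic in this example can be written as a one-line helper. Rounding up to count a final partial batch is an assumption of this sketch; some training loops drop the incomplete batch instead.

```python
import math


def iterations_per_epoch(num_examples, batch_size):
    """Number of batches needed to see every training example once."""
    return math.ceil(num_examples / batch_size)


print(iterations_per_epoch(1000, 500))  # 2
print(iterations_per_epoch(1000, 300))  # 4 (the last batch holds only 100 examples)
```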

We might use ROC and Precision-Recall Curves to evaluate classification problems

(Davis & Goadrich, 2006)
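To make the ROC curve concrete, the sketch below sweeps a decision threshold over classifier scores and records one (false positive rate, true positive rate) point per threshold. The labels and scores are made up for illustration, and the helper names are not from any library.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points, from the strictest threshold to the loosest."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y != 1)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical labels (1 = positive) and classifier scores
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(labels, scores)
print(points)
print(auc(points))
```

A precision-recall curve can be built with the same threshold sweep by recording precision and recall instead of FPR and TPR.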

Common machine learning evaluation metrics

(Davis & Goadrich, 2006)

Precision

  • How many selected items are relevant?
  • Precision = TP / (TP + FP): true positives contrasted with false positives

Recall

  • How many relevant items are selected?
  • Recall = TP / (TP + FN): true positives contrasted with false negatives
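Both metrics follow directly from counts of true positives, false positives, and false negatives. The sketch below uses made-up binary labels; returning 0.0 when a denominator is zero is a convention chosen for this example.

```python
def precision_recall(y_true, y_pred):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical ground truth and predictions (1 = relevant / selected)
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 1, 0, 0]

print(precision_recall(y_true, y_pred))  # (0.75, 0.75)
```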

TensorFlow Quick Tutorial

import tensorflow as tf

# Dataset: MNIST handwritten digits (28x28 grayscale images, labels 0-9)
mnist = tf.keras.datasets.mnist

# Supervised: training and testing splits, with pixel values scaled to [0, 1]
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# Model definition: flatten each image, one hidden layer, dropout, softmax output
model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(),
  tf.keras.layers.Dense(512, activation=tf.nn.relu),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])

# Loss function: cross-entropy over integer class labels, optimized with Adam
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Fitting: 5 epochs over the training set
model.fit(x_train, y_train, epochs=5)

# Evaluation: loss and accuracy on the held-out test set
model.evaluate(x_test, y_test)

(Abu-Mostafa, 2012)

Topics out there

Deep Learning and Similarity

Binary Code Similarity

  • Malware detection
  • Vulnerability detection
  • Bug Search
  • Cross-Platform (x86, ARM, MIPS)
  • Plagiarism Detection
  • Traceability (?)

Related?

BinaryCode IOT 1

BinaryCode IOT 2

Binary Code Similarity

  • Goal: to detect similar functions directly in binary code.

The embedding world

  • Word2Vec (Mikolov et al., 2013)
  • Doc2Vec (Le & Mikolov, 2014)
  • Structure2Vec (Dai et al., 2016)

http://gear.github.io/2016-09-05-MAGE/

Neural Network-based Graph Embedding for Cross-Platform Binary Code Similarity Detection

by Xiaojun Xu, et al.

Presented by David A.N

The Problem

Drawbacks of state-of-the-art approaches, specifically graph matching:

  1. The similarity function is hard to adapt to different applications
  2. The efficiency is bounded by the complexity of the graph matching-based algorithm

  • Binary Only
  • Cross-Platform Support
  • High Precision
  • High Efficiency
  • Adaptive

The purpose of the research is to design a similarity function that decides whether two binary functions are similar.

State-of-the-art

Existing Techniques (the bug-search problem)

  • Pairwise Graph Matching (Pewny et al., 2015)
  • Graph Embedding (Feng et al., 2016)

Baseline 1: Bipartite Graph Matching (BGM)

Baseline 2: Codebook-based Graph Embedding (Genius)

  • Locality-Sensitive Hashing (LSH)
  • Attributed Control Flow Graph (ACFG)

Genius is a graph embedding workflow (Feng et al., 2016)

Features or Basic-block Attributes

  • Block-level:
    • String Constants
    • Numeric Constants
    • # of Transfer Instructions
    • # of Calls
    • # of Instructions
    • # of Arithmetic Instructions
  • Inter-block:
    • # of Offspring
    • Betweenness

An ACFG is a graph g = \{V, E\} in which each vertex v \in V carries a vector x_v of vertex-specific features.
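One way to picture an ACFG is as a control flow graph whose basic blocks carry the attributes listed above. The sketch below is an illustration under assumptions: the class and field names are hypothetical, constants are reduced to counts, the number of offspring is read as the number of direct successor blocks, and betweenness is left out for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class BasicBlock:
    """One ACFG vertex with block-level attributes (hypothetical field names)."""
    string_constants: list = field(default_factory=list)
    numeric_constants: list = field(default_factory=list)
    num_transfer: int = 0
    num_calls: int = 0
    num_instructions: int = 0
    num_arithmetic: int = 0


def feature_vector(block_id, blocks, successors):
    """Build x_v: six block-level attributes plus the offspring count."""
    b = blocks[block_id]
    return [
        len(b.string_constants),   # assumption: constants are reduced to counts
        len(b.numeric_constants),
        b.num_transfer,
        b.num_calls,
        b.num_instructions,
        b.num_arithmetic,
        len(successors.get(block_id, [])),  # assumption: offspring = direct successors
    ]


# Hypothetical three-block function
blocks = {
    0: BasicBlock(string_constants=["%s"], numeric_constants=[16],
                  num_transfer=1, num_calls=1, num_instructions=6, num_arithmetic=2),
    1: BasicBlock(num_transfer=1, num_instructions=3, num_arithmetic=1),
    2: BasicBlock(num_instructions=2),
}
successors = {0: [1, 2], 1: [2], 2: []}

x = {v: feature_vector(v, blocks, successors) for v in blocks}
print(x[0])  # [1, 1, 1, 1, 6, 2, 2]
```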

CFG and ACFG of a Binary Function

Limitations

  • Codebook generation is expensive
    • Pairwise graph matching
    • Spectral Clustering
  • Quality of the generated Codebook
  • Runtime overheads

Novelty

  • Using a DNN-based approach to transform an ACFG into an embedding
    • Better Accuracy
    • Higher Embedding Efficiency
    • Faster offline training

Contribution

Query function f_1 and target function f_2:

\pi(f_1, f_2) = 1 if f_1 and f_2 are similar
\pi(f_1, f_2) = -1 otherwise

Better Accuracy

Iteratively propagating embedding throughout the CFG (instead of matching) 

Embedding efficiency

Learn to minimize the distance between the embeddings of similar ACFGs and to maximize the distance between the embeddings of dissimilar ones

Faster Offline Training

O(n^2)
O(n)

Distance Matrices

Epochs

Gemini, the solution

Solution (Gemini)

Structure2Vec (adapted) + Siamese Network

Code Similarity is not a classification problem

The goal is not to predict a label for a piece of binary code, so success is not measured on a predictive task.

Instead, a neural network \phi is trained to do well at distinguishing similar inputs from dissimilar ones.

Graph Embedding or Structure2Vec

Graph embedding

  • \mu_v: p-dimensional feature of vertex v
  • \mu_g: embedding vector of graph g

\mu_g = A_{v \in V}(\mu_v), where A is the aggregation function. With summation as the aggregation:

\mu_g = \sum_{v \in V} \mu_v

Graph embedding (figure built up over several slides)

Inputs x_1, x_2, x_3 are the vertex features. Each vertex embedding starts at \mu_v^0 and is updated over T iterations (\mu_v^1, ..., \mu_v^T). The final vertex embeddings are combined and multiplied by W_2 to give the graph embedding \mu.


The embedding vector

\mu_v^{(t+1)} = F\left(x_v, \sum_{u \in N(v)} \mu_u^{(t)}\right), \quad \forall v \in V
F\left(x_v, \sum_{u \in N(v)} \mu_u\right) = \tanh\left(W_1 x_v + \sigma\left(\sum_{u \in N(v)} \mu_u\right)\right)
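The update rule can be sketched in a few lines of NumPy. Several details here are assumptions that the slides do not state: \sigma is taken to be a small fully connected ReLU network with n layers, vertex embeddings are initialized to zero, and the weights are random rather than trained. The function and variable names are illustrative.

```python
import numpy as np


def sigma(l, P):
    """Assumed form: sigma(l) = P_1 ReLU(P_2 ... ReLU(P_n l))."""
    h = l
    for P_i in reversed(P[1:]):
        h = np.maximum(P_i @ h, 0.0)
    return P[0] @ h


def embed_graph(x, neighbors, W1, P, W2, T=5):
    """x: (|V|, d) vertex features; neighbors[v]: indices of N(v)."""
    mu = np.zeros((x.shape[0], W1.shape[0]))  # assumed initialization: mu_v^(0) = 0
    for _ in range(T):
        new_mu = np.empty_like(mu)
        for v in range(x.shape[0]):
            # Sum of the neighbors' embeddings from the previous iteration
            l_v = mu[np.asarray(neighbors[v], dtype=int)].sum(axis=0)
            new_mu[v] = np.tanh(W1 @ x[v] + sigma(l_v, P))
        mu = new_mu
    return W2 @ mu.sum(axis=0)  # W_2 times the summed vertex embeddings


# Hypothetical 3-vertex graph with d = 4 features per vertex and p = 8
rng = np.random.default_rng(0)
d, p = 4, 8
x = rng.normal(size=(3, d))
neighbors = [[1], [0, 2], [1]]
W1 = rng.normal(size=(p, d))
P = [rng.normal(size=(p, p)) * 0.1 for _ in range(2)]  # n = 2 layers in sigma
W2 = rng.normal(size=(p, p))

embedding = embed_graph(x, neighbors, W1, P, W2, T=5)
print(embedding.shape)  # (8,)
```

Because every vertex is updated by the same rule and the final step sums over vertices, the result does not depend on how the vertices are numbered.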

The embedding Network

Quick Discussion

  • Did the authors use feature engineering? If so, what type of features? 
  • What is the most important representation the DNN needs to learn?

Siamese Network

Training the model parameters with a Siamese architecture
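In a Siamese setup, the same embedding network processes both inputs and the two outputs are compared. The sketch below assumes cosine similarity as the comparison and a squared-error loss against labels in {+1, -1}; both choices are assumptions here, and the embeddings are made-up vectors rather than network outputs.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def siamese_loss(pairs):
    """Sum of (similarity - label)^2 over (embedding_a, embedding_b, label) triples."""
    return sum((cosine_similarity(a, b) - y) ** 2 for a, b, y in pairs)


# Hypothetical embeddings: the first pair is labeled similar, the second dissimilar
pairs = [
    ([1.0, 0.0], [1.0, 0.0], +1),
    ([1.0, 0.0], [0.0, 1.0], -1),
]
print(siamese_loss(pairs))  # 1.0
```

Training would adjust the shared network weights to push this loss down, so that similar pairs score near +1 and dissimilar pairs near -1.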

Evaluation

Task-independent Pre-Training

  • Capturing invariant features of the function across different architectures and compilers
  • Assuming a set of source code is collected
    • Compile the code

Task-specific Re-Training

Generate additional ACFG pairs from human experts to retrain the graph embedding network (e.g., 5 more epochs)

\langle g, g_1, +1 \rangle: g and g_1 come from the same source
\langle g, g_2, -1 \rangle: g and g_2 come from different sources
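Labeled pairs of this kind can be generated automatically once each source function has been compiled several ways. The sketch below is an assumption-laden illustration: ACFGs are stood in for by (function name, build) tuples, the function and build names are invented, and one negative pair is sampled for every positive pair.

```python
import itertools
import random


def make_training_pairs(variants, rng):
    """variants: dict mapping a source function name to its compiled ACFGs."""
    names = sorted(variants)
    samples = []
    for name in names:
        for g, g1 in itertools.combinations(variants[name], 2):
            samples.append((g, g1, +1))  # same source function
            other = rng.choice([n for n in names if n != name])
            g2 = rng.choice(variants[other])
            samples.append((g, g2, -1))  # different source function
    return samples


# Hypothetical source functions, each compiled for three targets
builds = ["x86-gcc-O2", "arm-gcc-O2", "mips-clang-O1"]
variants = {name: [(name, build) for build in builds]
            for name in ["parse_header", "copy_buffer", "check_length"]}

samples = make_training_pairs(variants, random.Random(0))
print(len(samples))  # 18
```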

Hyperparameters

p = 64 (embedding size)
n = 2 (embedding depth)
T = 5 (iterations)

Datasets

Dataset  Purpose                   Source
I        Accuracy                  OpenSSL
II       Task-specific             IoT devices
III      Efficiency                Firmware (large # of vertices)
IV       Vulnerability case study  Vulnerable functions

Accuracy

Discussion: why did the authors use ROC curves?

hyperparameters I

Discussion: can we tell that the model is overfitting? about the capacity?

hyperparameters II

Vulnerability

Similarity

Discussion: do you believe this is a good plot?