Machine Learning Basics
The essence of machine learning:
- A pattern exists
- We cannot pin it down mathematically
- We have data on it
(Abu-Mostafa, 2012)
Can we learn the credit approval function from data?
(Abu-Mostafa, 2012)
Components of Learning
(Abu-Mostafa, 2012)
Solution Components
(Abu-Mostafa, 2012)
A simple hypothesis set - the perceptron
(Abu-Mostafa, 2012)
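To make the hypothesis set concrete, here is a minimal sketch of the perceptron learning algorithm (PLA) in Python; the toy data and variable names are illustrative, not from the lecture.

import numpy as np

def pla(X, y, max_iters=1000):
    # Perceptron hypothesis: h(x) = sign(w . x).
    # X includes a leading column of 1s so w[0] acts as the bias.
    w = np.zeros(X.shape[1])
    for _ in range(max_iters):
        misclassified = np.where(np.sign(X @ w) != y)[0]
        if len(misclassified) == 0:
            return w                    # all points correctly classified
        i = misclassified[0]
        w = w + y[i] * X[i]             # PLA update: nudge w toward point i
    return w

# Toy linearly separable data: bias term, then two features.
X = np.array([[1, 2, 2], [1, -1, -2], [1, 3, 1], [1, -2, -1]], dtype=float)
y = np.array([1, -1, 1, -1])
print(pla(X, y))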
A feedforward neural network is an ANN in which connections between the nodes do not form a cycle
Basic premise of learning: "using a set of observations to uncover an underlying process"
- Supervised Learning
- Unsupervised Learning
- Reinforcement Learning
(Abu-Mostafa, 2012)
Supervised learning
( input , correct output )
(Abu-Mostafa, 2012)
Unsupervised learning
( input , ? )
(Abu-Mostafa, 2012)
Reinforcement learning
( input , some output , grade for this output )
(Abu-Mostafa, 2012)
Is Learning Feasible?
(Abu-Mostafa, 2012)
Summing up
A Shallow Tutorial of Deep Learning
The problem of representation and why representations matter
(Goodfellow, 2016)
Computational graphs map inputs to outputs; each node performs an operation (e.g., an activation function)
(Goodfellow, 2016)
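As a minimal sketch (assuming TensorFlow 2), tf.function traces ordinary Python into a computational graph whose nodes are operations:

import tensorflow as tf

@tf.function                       # traces the function into a computational graph
def f(x, w, b):
    z = tf.matmul(x, w) + b        # node: affine transformation
    return tf.nn.relu(z)           # node: activation function

x = tf.constant([[1.0, 2.0]])
w = tf.constant([[0.5], [-1.0]])
b = tf.constant([0.25])
print(f(x, w, b))                  # inputs flow through the graph to the output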
Nested Venn diagram (Goodfellow, 2016): Deep Learning ⊂ Representation (Feature) Learning ⊂ Machine Learning ⊂ AI, with an example at each level: MLPs (deep learning), shallow autoencoders (representation learning), logistic regression (machine learning), and probabilistic reasoning (AI).
Depth and Repeated Composition
(Goodfellow, 2016)
Learning Multiple Components: how the different parts of an AI system relate to each other within different AI disciplines
(Goodfellow, 2016)
Flowchart of the four approaches, from input to output:
- Rule-based systems: Input → Hand-designed program → Output
- Classic machine learning: Input → Hand-designed features → Mapping from features → Output
- Representation learning: Input → Features → Mapping from features → Output
- Deep learning: Input → Simple features → Additional layers of more abstract features → Mapping from features → Output
The many names and changing fortunes of Neural Networks
(Goodfellow, 2016)
AlexNet (Krizhevsky et al., 2012)
VGG (Simonyan & Zisserman, 2014)
GoogLeNet (Szegedy et al., 2015)
Machine Learning and Evaluation Methods
Linear Regression
(Goodfellow, 2016)
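A minimal sketch of least-squares linear regression with NumPy; the toy data is made up:

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 7.1])            # roughly y = 1 + 2x plus noise
A = np.column_stack([np.ones_like(x), x])     # design matrix with a bias column
w, *_ = np.linalg.lstsq(A, y, rcond=None)     # closed-form least-squares fit
print(w)                                      # approximately [1.0, 2.0]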
Over-fitting and Under-fitting in Polynomial Estimation
(Goodfellow, 2016)
Figure panels: underfitting, appropriate capacity, overfitting.
Generalization error is the difference between out-of-sample and in-sample error; model capacity is the model's ability to fit a wide variety of functions
(Goodfellow, 2016)
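A hedged sketch of the capacity idea using NumPy polynomial fits; the degrees and noise level below are arbitrary choices:

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

for degree in (1, 3, 9):                        # low, appropriate, high capacity
    coeffs = np.polyfit(x, y, degree)           # least-squares polynomial fit
    in_sample = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(degree, in_sample)                    # in-sample error always shrinks;
                                                # out-of-sample error need not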
Effect of Training Set Size
(Goodfellow, 2016)
Weight Decay and Regularization
(Goodfellow, 2016)
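In Keras, weight decay can be expressed as an L2 penalty on a layer's weights, so the training loss becomes the task loss plus lambda * ||w||^2; the coefficient 0.01 below is an arbitrary example:

import tensorflow as tf

layer = tf.keras.layers.Dense(
    512,
    activation='relu',
    kernel_regularizer=tf.keras.regularizers.l2(0.01),  # penalizes large weights
)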
Fun evaluation terminology for Deep Learning approaches:
- Epoch: one forward & backward pass over all the training examples.
- Batch size: the number of training examples in one forward/backward pass.
- Iterations: the number of passes (one forward + one backward), each pass using a fixed batch size.
If you have 1000 training examples, and the batch size is 500, then it will take 2 iterations to complete 1 epoch
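The arithmetic as a one-liner:

import math
iterations_per_epoch = math.ceil(1000 / 500)   # -> 2 iterations per epoch
print(iterations_per_epoch)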
We might use ROC and Precision-Recall Curves to evaluate classification problems
(Davis & Goadrich, 2006)
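A minimal sketch, assuming scikit-learn is available; the labels and scores are made up:

from sklearn.metrics import auc, precision_recall_curve, roc_curve

y_true  = [0, 0, 1, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.9]        # classifier confidence scores
fpr, tpr, _ = roc_curve(y_true, y_score)    # ROC: TPR vs. FPR over thresholds
prec, rec, _ = precision_recall_curve(y_true, y_score)
print(auc(fpr, tpr))                        # area under the ROC curve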
Common machine learning evaluation metrics
(Davis & Goadrich, 2006)
Precision
- How many selected items are relevant?
- True positives vs. false positives

Recall
- How many relevant items are selected?
- True positives vs. false negatives
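The two metrics from raw counts, as a quick sketch (the counts are made up):

def precision(tp, fp):
    return tp / (tp + fp)       # penalized by false positives

def recall(tp, fn):
    return tp / (tp + fn)       # penalized by false negatives

print(precision(tp=8, fp=2))    # 0.8   -> 8 of 10 selected items are relevant
print(recall(tp=8, fn=4))       # ~0.67 -> 8 of 12 relevant items were selected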
import tensorflow as tf

# Load the MNIST handwritten-digit dataset (60k training / 10k test images).
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Scale pixel intensities from [0, 255] down to [0, 1].
x_train, x_test = x_train / 255.0, x_test / 255.0

# A simple feedforward classifier.
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),       # 28x28 image -> 784-vector
    tf.keras.layers.Dense(512, activation=tf.nn.relu),   # hidden layer
    tf.keras.layers.Dropout(0.2),                        # regularization
    tf.keras.layers.Dense(10, activation=tf.nn.softmax)  # one probability per digit
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',    # labels are integers 0-9
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=5)   # training
model.evaluate(x_test, y_test)          # held-out evaluation
TensorFlow Quick Tutorial: the snippet above covers, in order, the dataset, the supervised training/testing split, the model definition, the loss function, fitting, and evaluation.
Topics out there
Deep Learning and Similarity
Binary Code Similarity
- Malware detection
- Vulnerability detection
- Bug Search
- Cross-Platform (x86, ARM, MIPS)
- Plagiarism Detection
- Traceability (?)
Binary Code Similarity
- Goal: to detect similar functions directly in binary code.
Related? Figure: two binary code snippets, BinaryCode IOT 1 and BinaryCode IOT 2.
The embedding world
- Word2Vec (Mikolov, et al., 2013)
- Doc2Vec (Le & Mikolov, 2014)
- Structure2Vec (Dai, et al., 2016)
http://gear.github.io/2016-09-05-MAGE/
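What all of these share: they map discrete objects (words, documents, graphs) to dense vectors so that similarity becomes geometry. A tiny illustration with made-up vectors:

import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

king  = np.array([0.80, 0.30, 0.10])   # hypothetical embedding vectors
queen = np.array([0.75, 0.35, 0.15])
print(cosine(king, queen))             # close to 1.0 for similar objects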
Neural Network-based Graph Embedding for Cross-Platform Binary Code Similarity Detection
by Xiaojun Xu, et al.
Presented by David A.N
The Problem
Drawbacks of the state-of-the-art approach, specifically graph matching:
- The similarity function is hard to adapt to different applications
- The efficiency is bounded by the complexity of the graph matching-based algorithm
Desired properties of a new approach:
- Binary only
- Cross-platform support
- High precision
- High efficiency
- Adaptive

The purpose of the research is to design a similarity function that detects whether two binary functions are similar.
State-of-the-art
Existing Techniques (the bug-search problem)
- Pairwise Graph Matching (Pewny, et al., 2015)
- Graph Embedding (Feng et al., 2016)
Baseline 1: Bipartite Graph Matching (BGM)
Baseline 2: Codebook-based Graph Embedding (Genius)
- Locality-Sensitive Hashing (LSH)
- Attributed Control Flow Graph (ACFG)
Genius is a graph embedding workflow (Feng et al., 2016)
Features (basic-block attributes):
- Block-level:
  - String constants
  - Numeric constants
  - # of transfer instructions
  - # of calls
  - # of instructions
  - # of arithmetic instructions
- Inter-block:
  - # of offspring
  - Betweenness centrality

Figure: CFG and ACFG of a binary function.
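A minimal sketch of how an ACFG might be represented; the field names are illustrative, not the paper's implementation:

from dataclasses import dataclass

@dataclass
class BasicBlock:                 # one attribute vector per vertex
    string_consts: int
    numeric_consts: int
    transfer_instrs: int
    calls: int
    instrs: int
    arith_instrs: int
    offspring: int                # inter-block: # of offspring
    betweenness: float            # inter-block: betweenness centrality

@dataclass
class ACFG:
    blocks: list                  # a BasicBlock per vertex of the CFG
    edges: list                   # (src, dst) index pairs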
Limitations
- Codebook generation is expensive (pairwise graph matching and spectral clustering)
- Accuracy depends on the quality of the generated codebook
- Runtime overheads at query time
Novelty
- Using a DNN-based approach to transform an ACFG into an embedding
- Better Accuracy
- Higher Embedding Efficiency
- Faster offline training
Contribution
Figure: a query function is compared against a set of target functions.
- Better accuracy: iteratively propagating embeddings throughout the CFG (instead of matching)
- Embedding efficiency: learning to minimize the distance between embeddings of similar ACFGs and to maximize it between dissimilar ones
- Faster offline training: lower training cost (distance matrices, epochs)
Gemini, the solution
Structure2Vec (adapted) + Siamese Network
Code similarity is not a classification problem: we are not trying to predict a label for binary code or to do well on a predictive task. Instead, we train a NN to do well at differentiating the similarity between pairs of inputs.
The Graph Embedding Neural Network (Structure2Vec)
Graph embedding: each vertex of the graph carries a p-dimensional feature vector; an aggregation function combines the embedding vectors of a vertex's neighbors, and this propagation is repeated for T iterations. After T iterations, the embedding network aggregates the vertex embeddings into a single embedding vector for the whole graph.
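A minimal NumPy sketch of the Structure2Vec-style propagation described above; the parameter names and sizes are illustrative, not the paper's exact architecture:

import numpy as np

def embed_graph(X, neighbors, W1, P1, P2, W2, T=5):
    # X: (n, d) vertex attribute matrix; neighbors[v]: indices adjacent to v.
    n = X.shape[0]
    p = W1.shape[0]                      # embedding dimension
    mu = np.zeros((n, p))                # initial vertex embeddings
    for _ in range(T):                   # T iterations of propagation
        agg = np.zeros((n, p))
        for v in range(n):
            for u in neighbors[v]:
                agg[v] += mu[u]          # sum the neighbors' embeddings
        hidden = np.maximum(agg @ P1.T, 0)        # small ReLU net over neighbors
        mu = np.tanh(X @ W1.T + hidden @ P2.T)    # update every vertex embedding
    return mu.sum(axis=0) @ W2.T         # aggregate vertices into one graph vector

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))              # 4 basic blocks, 8 ACFG attributes each
neighbors = [[1], [0, 2], [1, 3], [2]]
W1, P1, P2, W2 = (rng.normal(size=s) for s in [(16, 8), (16, 16), (16, 16), (16, 16)])
print(embed_graph(X, neighbors, W1, P1, P2, W2))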
Quick Discussion
- Did the authors use feature engineering? If so, what type of features?
- What is the most important representation the DNN needs to learn?
Siamese Network
Training the model parameters with a Siamese architecture: both ACFGs of a pair are embedded by the same embedding network (shared weights), and the distance between the two embeddings drives the loss.
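A sketch of the pairwise training objective, assuming a cosine-similarity formulation with labels +1 (similar) / -1 (dissimilar); the function names are mine:

import numpy as np

def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def siamese_loss(mu1, mu2, label):
    # mu1, mu2: embeddings of the two functions, produced by the SAME
    # embedding network (shared weights); label: +1 similar, -1 dissimilar.
    return (cosine_sim(mu1, mu2) - label) ** 2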
Evaluation
Task-independent Pre-Training
- Capturing invariant features of the function across different architectures and compilers
- Assuming a set of source code is collected
- Compile the code
Task-specific Re-Training
Generate additional ACFG pairs from human experts to retrain the graph embedding network (e.g., 5 more epochs)
Ground-truth labeling: pairs compiled from the same source function are labeled similar; pairs from different source functions are labeled dissimilar.
Hyperparameters
Datasets
Dataset | Purpose | Source
---|---|---
I | Accuracy | OpenSSL
II | Task-specific | IoT devices
III | Efficiency | Firmware (large # of vertices)
IV | Vulnerability case study | Vulnerable functions
Accuracy
Discussion: why did the authors use ROC curves?
Hyperparameters I
Discussion: can we tell whether the model is overfitting? What about its capacity?
Hyperparameters II
Vulnerability Similarity
Discussion: do you believe this is a good plot?
Neural Network-based Graph Embedding for Cross-Platform Binary Code Similarity Detection
By David Nader Palacio
First Software Engineering Presentation