On the Use of an Encoder-Decoder Inference Model for Reverse Engineering Machine Code

by

David N. & Richard B.

William & Mary

Background

Artificial Neural Networks & Decompilation

Neural Machine Translation: translate from English to Spanish

English: I am a computer scientist
Spanish: Soy un científico de la computación

Neural Machine Translation: encoder-decoder architecture that converts a sentence into a "thought" vector which is used by a decoder to generate a translation

[Diagram: "I am a computer scientist" (English) → Encoder → thought vector \begin{pmatrix} 1 \\ 2 \end{pmatrix} → Decoder → "Soy un científico de la computación" (Spanish)]
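As a rough sketch of the encoder half in Keras (the framework used later in this work): an LSTM reads one-hot characters and its final states serve as the "thought" vector. All dimensions here are illustrative assumptions, not values from the talk.

from tensorflow.keras.layers import Input, LSTM
from tensorflow.keras.models import Model

vocab_size = 64    # assumed one-hot alphabet size
latent_dim = 256   # assumed size of the "thought" vector

# Encoder: variable-length sequence of one-hot characters in,
# final hidden/cell states (h, c) out -- the "thought" vector.
source = Input(shape=(None, vocab_size))
_, thought_h, thought_c = LSTM(latent_dim, return_state=True)(source)
encoder = Model(source, [thought_h, thought_c])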


Tensors: geometric objects that describe linear relations and generalize the concept of scalar, vector, and matrix

Scalar: 1
Vector: \begin{pmatrix} 1 \\ 2 \end{pmatrix}
Matrix: \begin{pmatrix} 1 & 3 \\ 2 & 4 \end{pmatrix}
Tensor: \begin{pmatrix} (1,2) & (3,2) \\ (1,7) & (5,4) \end{pmatrix}
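The same hierarchy written directly in NumPy (a sketch; the values mirror the examples above):

import numpy as np

scalar = np.array(1)                    # rank 0, shape ()
vector = np.array([1, 2])               # rank 1, shape (2,)
matrix = np.array([[1, 3],
                   [2, 4]])             # rank 2, shape (2, 2)
tensor = np.array([[[1, 2], [3, 2]],
                   [[1, 7], [5, 4]]])   # rank 3, shape (2, 2, 2)

for t in (scalar, vector, matrix, tensor):
    print(t.ndim, t.shape)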

LLVM: compiler infrastructure written in C++; we used it to translate snippets into machine code

Snippet (C source):

unsigned square_int(unsigned a) {
    return a*a;
}

Intermediate Representation (IR):

define i32 @square_unsigned(i32 %a) {
  %1 = mul i32 %a, %a
  ret i32 %1
}

Machine code used (x86-64):

square_unsigned:
        imull   %edi, %edi
        movl    %edi, %eax
        ret

Applying the optimization/configuration below:

$ llc-3.0 -O3 sample.ll -march=x86-64 -o sample-x86-64.s

Approach

Neural Machine Translation: from machine code to snippet

[Diagram: machine code → Encoder → thought vector \begin{pmatrix} 0 \\ 1 \\ 0 \\ 1 \end{pmatrix} → Decoder → snippet]

Machine code:

square_unsigned:
        imull   %edi, %edi
        movl    %edi, %eax
        ret

Snippet:

unsigned square_int(unsigned a) {
    return a*a;
}
  • Only machine code is available
  • Malware or vulnerabilities (Powlowski et al., 2017)
  • State-of-the-art: heuristics (SANER 2018)
  • Intuitive idea of language translators (SANER 2018)

[Diagram: Reverse Engineering: machine code → snippet]

Goal

To implement a DNN (seq2seq) that reverses LLVM-based machine code using an encoder-decoder inference model

Research Question

To what extent is the Neural Machine Translation model accurate for reversing machine code into C/C++ snippets?


Methodology 

  1. TensorFlow + Keras Frameworks
  2. Dataset generated using LLVM
  3. Loss function: Categorical Cross-Entropy
  4. Tensor Shape: (batch, sequence length, vocabulary size)
Dataset: OpenSSL (sampling)
Pre-processing: One-hot embedding
Eval. Metrics: Accuracy, Loss
Architecture: LSTM encoder-decoder; Inference LSTM
Type of Validation: Automatic Verification Dataset (split: 0.2)
Training: Batch: 64; Epochs: 50
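A sketch of the one-hot pre-processing that yields tensors of shape (batch, sequence length, vocabulary size); the '\t'/'\n' start/stop markers and the variable names follow the common Keras character-level seq2seq recipe and are assumptions, not the exact code behind this table:

import numpy as np

# Hypothetical <machine code, snippet> pair; '\t' starts and '\n' ends a target.
pairs = [("imull %edi, %edi", "\treturn a*a;\n")]

chars = sorted({c for src, tgt in pairs for c in src + tgt})
char_index = {c: i for i, c in enumerate(chars)}

max_src = max(len(src) for src, _ in pairs)
max_tgt = max(len(tgt) for _, tgt in pairs)

# Tensor shape: (batch, sequence length, vocabulary size)
encoder_input = np.zeros((len(pairs), max_src, len(chars)), dtype="float32")
decoder_input = np.zeros((len(pairs), max_tgt, len(chars)), dtype="float32")
decoder_target = np.zeros((len(pairs), max_tgt, len(chars)), dtype="float32")

for i, (src, tgt) in enumerate(pairs):
    for t, c in enumerate(src):
        encoder_input[i, t, char_index[c]] = 1.0
    for t, c in enumerate(tgt):
        decoder_input[i, t, char_index[c]] = 1.0
        if t > 0:                       # target = decoder input shifted one step
            decoder_target[i, t - 1, char_index[c]] = 1.0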

Architecture for Neural Machine Translation

Pairs of <source code, machine code> are the inputs 

[Diagram: snippet read char by char (u, n, s, …, \t) alongside the machine code (s, q, %, e, …)]

The embedding is created by one-hot encoding each character


Feed the character embeddings one by one into the encoder


Generate the final encoder states (h & c) and feed them into the decoder


The decoder receives 3 inputs (the encoder states h & c, plus the machine code embeddings)


Predict the character probabilities in the softmax layer


We use categorical cross-entropy as the loss function (learning to classify the correct machine-code characters)
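For one-hot targets y and softmax outputs \hat{y} over the character vocabulary V, this is the standard form:

\mathcal{L} = -\sum_{t} \sum_{c \in V} y_{t,c} \, \log \hat{y}_{t,c}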


[Training diagram, in order: 1. machine code → 2. embedding for machine → 3. encoder; 4. snippet → 5. embedding for snippet → 6. decoder → 7. softmax → 8. embedding for target → 9. labelled translation]
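Putting steps 1-9 together, a minimal training-model sketch in Keras, following the stock encoder-decoder LSTM recipe; the vocabulary sizes and latent_dim are assumptions, while the loss, batch size, epochs, and validation split come from the methodology table:

from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

num_machine_chars = 96   # assumed machine-code vocabulary size
num_snippet_chars = 96   # assumed snippet vocabulary size
latent_dim = 256         # assumed

# 1-3: one-hot machine code into the encoder; keep the final states (h, c)
encoder_inputs = Input(shape=(None, num_machine_chars))
_, state_h, state_c = LSTM(latent_dim, return_state=True)(encoder_inputs)

# 4-6: one-hot snippet plus the encoder states into the decoder
decoder_inputs = Input(shape=(None, num_snippet_chars))
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_seq, _, _ = decoder_lstm(decoder_inputs, initial_state=[state_h, state_c])

# 7: per-step softmax over the target vocabulary
decoder_dense = Dense(num_snippet_chars, activation="softmax")
decoder_outputs = decoder_dense(decoder_seq)

# 8-9: categorical cross-entropy against the one-hot labelled translation
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer="rmsprop", loss="categorical_crossentropy",
              metrics=["accuracy"])
# Trained with the settings from the methodology table, e.g.:
# model.fit([encoder_input, decoder_input], decoder_target,
#           batch_size=64, epochs=50, validation_split=0.2)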

Corpus Generation

Attributed Control Flow Graphs

  • Introduced by the authors of Genius (Qian Feng et al.)
  • Adds basic block features to the vertices of a CFG

Attributed Control Flow Graphs: tooling options

  • IDA Pro + Genius → significant cost
  • IDA Free + Genius → no plugin integration
  • Radare2 + reimplementation → substantial engineering

Collecting "Source \t Assembly" Snippets

OpenSSL v1.0.1f:
  • Parse the source files with Clang and extract functions from the AST → set of functions in C
  • Compile (Clang v6.0.0), disassemble (Radare2), and extract functions → set of functions in Assembly
  • Pair the two sets → Corpus
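A sketch of the collection step under stated assumptions: the file name and function symbol are hypothetical, the Clang/Radare2 flags are standard ones, and the real pipeline extracted individual functions from the Clang AST rather than pairing whole files:

import subprocess

SRC = "square.c"                       # hypothetical single-function source file

# Compile with Clang (v6.0.0 in this work)
subprocess.run(["clang", "-O3", "-c", SRC, "-o", "square.o"], check=True)

# Disassemble with Radare2: analyse, then print the function's disassembly
asm = subprocess.run(
    ["r2", "-q", "-c", "aa; pdf @ sym.square_unsigned", "square.o"],
    capture_output=True, text=True, check=True,
).stdout

# Emit one "source \t assembly" line (the real pipeline normalises whitespace)
src_text = open(SRC).read()
with open("corpus.txt", "a") as fh:
    fh.write(src_text.replace("\n", " ") + "\t" + asm.replace("\n", " ; ") + "\n")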

Initial Results

[Plot: model accuracy, training set vs. validation set]

[Plot: model loss, training set vs. validation set]

Encoder - Decoder Inference Model


Softmax layer: take the argmax of the character probabilities


Reverse dictionary lookup in the target machine-code vocabulary


Decoder predicts the first char


Predicted char as decoder input


Infer next char

[Inference diagram, in order: 1. machine code → 2. embedding for snippet → 3. encoder; 4. snippet → 5. embedding for machine → 6. decoder → softmax → 7. argmax → 8. Reverse Dictionary → 9. predicted machine code]
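Steps 1-9 map onto the standard Keras inference recipe: wrap the trained layers into encoder/decoder models that expose their states, then loop argmax → reverse dictionary → next decoder input until a stop symbol appears. A sketch reusing the (assumed) names from the pre-processing and training sketches above:

import numpy as np
from tensorflow.keras.layers import Input
from tensorflow.keras.models import Model

# 1-3: the encoder model returns the states for a machine-code sequence
encoder_model = Model(encoder_inputs, [state_h, state_c])

# 4-6: the decoder model consumes one character plus the previous states
in_h, in_c = Input(shape=(latent_dim,)), Input(shape=(latent_dim,))
dec_seq, out_h, out_c = decoder_lstm(decoder_inputs, initial_state=[in_h, in_c])
decoder_model = Model([decoder_inputs, in_h, in_c],
                      [decoder_dense(dec_seq), out_h, out_c])

def decode(input_seq, reverse_index, start="\t", stop="\n", max_len=400):
    states = encoder_model.predict(input_seq)
    target = np.zeros((1, 1, num_snippet_chars))
    target[0, 0, char_index[start]] = 1.0          # seed with the start symbol
    decoded = ""
    while True:
        probs, h, c = decoder_model.predict([target] + states)
        idx = int(np.argmax(probs[0, -1, :]))      # 7: argmax of the softmax
        char = reverse_index[idx]                  # 8: reverse dictionary
        if char == stop or len(decoded) >= max_len:
            return decoded                         # 9: predicted sequence
        decoded += char                            # predicted char becomes the
        target = np.zeros((1, 1, num_snippet_chars))
        target[0, 0, idx] = 1.0                    # next decoder input
        states = [h, c]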

From the following machine code, infer the following snippet.

Input Machine Sentence:

test_vprintf_stderr: 
mov rax, qword [rip] 
mov rdx, rsi 
mov rsi, rdi 
mov rdi, qword [rax] 
jmp 0x15

Decoded Snippet Sentence:

int EVP_MD_meth_get_cb(SSL *s, unsigned int flags) { 
if (ctx->seckey_ercory_teck = (const_unse_ned_canctx, corst unsigned char *), 
unsigned char *)); 
}

Conclusion

Dataset: OpenSSL (sampling)
Pre-processing: One-hot embedding
Eval. Metrics: Accuracy, Loss
Architecture: LSTM encoder-decoder; Inference LSTM
Type of Validation: Automatic Verification Dataset (split: 0.2)
Training: Batch: 64; Epochs: 50

References

  • [1] Struct2Vec, https://arxiv.org/abs/1708.06525
  • [2] Feature Engineering, http://web.cs.ucdavis.edu/~su/publications/icse08-clone.pdf
  • [3] Using Recurrent Neural Networks for Decompilation (SANER 2018)
  • [4] MARX, https://www.vvdveen.com/publications/marx.pdf