On the Use of an Encoder-Decoder Inference Model for Reverse Engineering Machine Code

by

David N. & Richard B.

[William & Mary]

Background

Artificial Neural Networks & Decompilation

Neural Machine Translation: translate from English to Spanish

English: I am a computer scientist

Spanish: Soy un científico de la computación

Neural Machine Translation: an encoder-decoder architecture in which the encoder converts a sentence into a "thought" vector, which the decoder uses to generate a translation

English: I am a computer scientist

Encoder

"Thought" vector: \begin{pmatrix} 1 \\ 2 \end{pmatrix}

Decoder

Spanish: Soy un científico de la computación

Tensors: geometric objects that describe linear relations and generalize the concept of scalar, vector, and matrix 

Scalar: 1

Vector: \begin{pmatrix} 1 \\ 2 \end{pmatrix}

Matrix: \begin{pmatrix} 1 & 3 \\ 2 & 4 \end{pmatrix}

Tensor: \begin{pmatrix} (1, 2) & (3, 2) \\ (1, 7) & (5, 4) \end{pmatrix}
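As an illustration of how the four objects generalize one another, the following minimal sketch stores them as nested Python lists and reports their rank (number of axes) and shape. The helper name `shape` is hypothetical, and the sketch assumes rectangular, non-empty lists.

```python
def shape(x):
    """Return the shape of a rectangular nested list; a bare number has shape ()."""
    dims = []
    while isinstance(x, list):
        dims.append(len(x))
        x = x[0]
    return tuple(dims)

scalar = 1
vector = [1, 2]
matrix = [[1, 3], [2, 4]]
tensor = [[[1, 2], [3, 2]], [[1, 7], [5, 4]]]

for name, value in [("scalar", scalar), ("vector", vector),
                    ("matrix", matrix), ("tensor", tensor)]:
    print(name, "rank", len(shape(value)), "shape", shape(value))
```

Each step up adds one axis: the scalar has rank 0, the vector rank 1, the matrix rank 2, and the tensor rank 3.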

LLVM: a compiler infrastructure written in C++; we used it to translate snippets into machine code

Snippet

unsigned square_unsigned(unsigned a) {
    return a*a;
}

Intermediate Representation (IR)

define i32 @square_unsigned(i32 %a) {
  %1 = mul i32 %a, %a
  ret i32 %1
}

Apply the following optimization/configuration

$ llc-3.0 -O3 sample.ll -march=x86-64 -o sample-x86-64.s

Used Machine Code

square_unsigned:
        imull   %edi, %edi
        movl    %edi, %eax
        ret
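To make the correspondence between the IR and the machine code concrete, here is a minimal sketch that emulates the three instructions on 32-bit registers. It assumes the x86-64 System V calling convention (first integer argument in %edi, result in %eax); the register dictionary is only an illustration, not part of the original tooling.

```python
MASK32 = 0xFFFFFFFF  # width of i32 and of the 32-bit registers

def square_unsigned(a):
    """Emulate the three x86-64 instructions of the listing on 32-bit registers."""
    regs = {"edi": a & MASK32, "eax": 0}
    regs["edi"] = (regs["edi"] * regs["edi"]) & MASK32  # imull %edi, %edi
    regs["eax"] = regs["edi"]                           # movl  %edi, %eax
    return regs["eax"]                                  # ret (result in %eax)

print(square_unsigned(3))      # 9
print(square_unsigned(65536))  # 0, because 2**32 wraps around in 32 bits
```

The wraparound matches the semantics of `mul i32` in the IR, which keeps only the low 32 bits of the product.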

Approach

Neural Machine Translation: from machine code to snippet

Machine Code

square_unsigned:
        imull   %edi, %edi
        movl    %edi, %eax
        ret

Encoder

\begin{pmatrix} 0 \\ 1 \\ 0 \\ 1 \end{pmatrix}

Decoder

Snippet

unsigned square_unsigned(unsigned a) {
    return a*a;
}
Reverse Engineering: from machine code to snippet

  • Only machine code is available
  • Malware or vulnerabilities (Pawlowski et al., 2017)
  • State of the art: heuristics (SANER2018)
  • Intuitive idea: language translators (SANER2018)

Goal

To implement a seq2seq DNN that reverses LLVM-generated machine code into snippets using an encoder-decoder inference model

Research Question

To what extent is the Neural Machine Translation model accurate for reversing machine code into C/C++ snippets?

Methodology 

  • Dataset: OpenSSL (sampling)
  • Pre-processing: One-hot embedding
  • Eval. Metrics: Accuracy, Loss
  • Architecture: LSTM for encoder-decoder, Inference LSTM
  • Type of Validation: Automatic Verification Dataset (split: 0.2)
  • Training: Batch: 64, Epochs: 50

Methodology 

  1. TensorFlow + Keras frameworks
  2. Dataset generated using LLVM
  3. Controlling measure (loss): categorical cross-entropy
  4. Tensor shape: (batch, sequence length, vocabulary size)
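The tensor shape (batch, sequence length, vocabulary size) can be illustrated with a minimal pure-Python sketch of character-level one-hot encoding. The function name `one_hot_batch` and the two sample strings are hypothetical; the authors' actual pipeline uses TensorFlow + Keras, which this sketch does not reproduce.

```python
def one_hot_batch(texts, vocab=None):
    """Encode strings as nested lists of shape (batch, sequence length, vocabulary size)."""
    if vocab is None:
        vocab = sorted(set("".join(texts)))
    index = {ch: i for i, ch in enumerate(vocab)}
    seq_len = max(len(t) for t in texts)
    batch = []
    for text in texts:
        # Positions past the end of a short string stay all-zero (padding).
        rows = [[0] * len(vocab) for _ in range(seq_len)]
        for t, ch in enumerate(text):
            rows[t][index[ch]] = 1
        batch.append(rows)
    return batch, vocab

batch, vocab = one_hot_batch(["ret", "imull"])
print(len(batch), len(batch[0]), len(batch[0][0]))  # 2 5 7
```

Here the batch holds 2 sequences, the longest sequence has 5 characters, and the vocabulary has 7 distinct characters.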

Architecture for Neural Machine Translation

Pairs of <source code, machine code> are the inputs 

[Figure: input characters u, n, s, \t, s, q, %, e; snippet (char by char); machine code]

Each character is embedded as a one-hot vector

[Figure: input characters; embedding for snippet; machine code]

Feed the embeddings into the encoder character by character

[Figure: input characters; embedding for snippet; machine code; encoder]

Generate the final encoder states (h & c) and feed them into the decoder

[Figure: input characters; embedding for snippet; machine code; encoder; decoder]
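To show what the two states are, here is a toy LSTM encoder in pure Python with scalar inputs and a hidden size of 1. The weight names and values are hypothetical and chosen only for illustration; a trained model learns them, and a real encoder operates on one-hot vectors rather than scalars.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h, c, w):
    """One LSTM step with scalar input and hidden size 1 (toy case)."""
    i = sigmoid(w["wi"] * x + w["ui"] * h + w["bi"])    # input gate
    f = sigmoid(w["wf"] * x + w["uf"] * h + w["bf"])    # forget gate
    o = sigmoid(w["wo"] * x + w["uo"] * h + w["bo"])    # output gate
    g = math.tanh(w["wg"] * x + w["ug"] * h + w["bg"])  # candidate cell value
    c = f * c + i * g        # new cell state
    h = o * math.tanh(c)     # new hidden state
    return h, c

def encode(sequence, w):
    """Run the toy encoder over a sequence and return its final states (h, c)."""
    h, c = 0.0, 0.0
    for x in sequence:
        h, c = lstm_step(x, h, c, w)
    return h, c

# Hypothetical weights, all set to 0.5 for illustration.
weights = {k: 0.5 for k in ("wi", "ui", "bi", "wf", "uf", "bf",
                            "wo", "uo", "bo", "wg", "ug", "bg")}
h, c = encode([1.0, 0.0, 1.0], weights)
print(h, c)
```

Only the final pair (h, c) is handed to the decoder as its initial state; the encoder's per-step outputs are discarded in this setup.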

The decoder receives 3 inputs: the two encoder states (h & c) and the machine code embeddings

[Figure: input characters; embedding for snippet; machine code; encoder; decoder; embedding for machine]

Predict character probabilities in the softmax layer

[Figure: input characters; embedding for snippet; machine code; encoder; decoder; embedding for machine; softmax]
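A minimal sketch of the softmax step for one time step follows. The 4-character vocabulary and the logit values are hypothetical; the function uses the standard max-subtraction trick for numerical stability.

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["e", "q", "s", "u"]    # hypothetical 4-character vocabulary
logits = [0.1, 0.3, 2.0, -1.0]  # hypothetical decoder outputs for one time step
probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(vocab[best], round(sum(probs), 6))
```

The output is a probability distribution over the vocabulary, so its entries are non-negative and sum to 1.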

We use categorical cross-entropy as the loss function (learning to classify the correct machine characters)

[Figure: training architecture with numbered components]

  1. machine code
  2. embedding for machine
  3. encoder
  4. snippet
  5. embedding for snippet
  6. decoder
  7. softmax
  8. embedding for target
  9. labelled translation
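For a one-hot target y and a predicted distribution p over the vocabulary, categorical cross-entropy at one time step is -sum_k y_k log(p_k). The sketch below computes it in pure Python and averages over time steps; the sample targets and predictions are hypothetical, and the small `eps` clamp is an assumption added to avoid log(0).

```python
import math

def categorical_cross_entropy(target_one_hot, predicted, eps=1e-12):
    """Loss for one time step: -sum_k y_k * log(p_k)."""
    return -sum(y * math.log(max(p, eps))
                for y, p in zip(target_one_hot, predicted))

def sequence_loss(targets, predictions):
    """Mean loss over the time steps of one sequence."""
    losses = [categorical_cross_entropy(y, p)
              for y, p in zip(targets, predictions)]
    return sum(losses) / len(losses)

targets = [[0, 0, 1, 0], [0, 1, 0, 0]]  # one-hot labels for two time steps
confident = [[0.01, 0.01, 0.97, 0.01], [0.01, 0.97, 0.01, 0.01]]
uniform = [[0.25] * 4, [0.25] * 4]
print(sequence_loss(targets, confident), sequence_loss(targets, uniform))
```

A prediction that puts most of its mass on the correct character gives a loss near 0, while a uniform guess over 4 characters gives log(4).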

Corpus Generation

Attributed Control Flow Graphs

  • Introduced by the authors of Genius (Qian Feng et al.)
  • Adds basic block features to the vertices of a CFG

Attributed Control Flow Graphs

 

  • IDA Pro + Genius: significant cost
  • IDA Free + Genius: no plugin integration
  • Radare2 + reimplementation: substantial engineering

Collecting "Source \t Assembly" Snippets

 

OpenSSL v1.0.1f

  • Parse source files with Clang, extract functions from AST: set of functions in C
  • Compile (Clang v6.0.0), disassemble (Radare2), extract functions: set of functions in Assembly
  • Both sets together form the corpus
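The final pairing step might look like the sketch below. It assumes the two sets are dictionaries keyed by function name and that pairs are matched by name; the slides do not state how functions are matched, so that is an assumption, as are the function name `build_corpus` and the escaping of embedded tabs and newlines.

```python
def build_corpus(c_functions, asm_functions):
    """Pair functions found in both sets; emit one 'source<TAB>assembly' line each."""
    lines = []
    for name in sorted(c_functions.keys() & asm_functions.keys()):
        # Assumption: flatten embedded tabs/newlines so each pair fits on one line.
        source = c_functions[name].replace("\t", " ").replace("\n", "\\n")
        assembly = asm_functions[name].replace("\t", " ").replace("\n", "\\n")
        lines.append(source + "\t" + assembly)
    return lines

c_functions = {
    "square_unsigned": "unsigned square_unsigned(unsigned a) {\n    return a*a;\n}",
}
asm_functions = {
    "square_unsigned": "imull %edi, %edi\nmovl %edi, %eax\nret",
    "orphan": "ret",  # no matching C function, so it is skipped
}
corpus = build_corpus(c_functions, asm_functions)
source, assembly = corpus[0].split("\t")
print(len(corpus), assembly)
```

Keeping exactly one tab per line means the pair can later be recovered with a single split.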

Initial Results

Model accuracy: training set vs validation set

Model loss: training set vs validation set

Encoder - Decoder Inference Model

[Figure: training architecture: embedding for snippet; machine code; encoder; decoder; embedding for machine; softmax; embedding for target; labelled translation]

Softmax layer: argmax of character probabilities

[Figure: embedding for snippet; machine code; encoder; decoder; embedding for machine; softmax; embedding for target; argmax]

Reverse dictionary lookup in target machine code

[Figure: embedding for snippet; machine code; encoder; decoder; embedding for machine; softmax; argmax; Reverse Dictionary]

Decoder predicts the first char

[Figure: embedding for snippet; machine code; encoder; decoder; embedding for machine; softmax; argmax; Reverse Dictionary; predicted machine code (first char: s)]

Predicted char as decoder input

[Figure: same components; the predicted char s is fed back as the next decoder input]

Infer next char

[Figure: inference architecture with numbered components]

  1. machine code
  2. embedding for snippet
  3. encoder
  4. snippet
  5. embedding for machine
  6. decoder
  7. argmax (over the softmax output)
  8. Reverse Dictionary
  9. predicted machine code
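The inference steps above (softmax, argmax, reverse dictionary lookup, feeding the predicted char back in) can be sketched as a greedy decoding loop. The `step` callback stands in for the trained decoder plus softmax and is hypothetical, as are the start and stop characters and the toy model that simply spells out a fixed string.

```python
def greedy_decode(step, initial_state, vocab, start_char, stop_char, max_len=50):
    """Greedy inference: pick the argmax char, feed it back, repeat."""
    reverse_dictionary = dict(enumerate(vocab))  # index -> char
    state, char, decoded = initial_state, start_char, []
    for _ in range(max_len):
        probs, state = step(char, state)                      # decoder + softmax
        best = max(range(len(probs)), key=probs.__getitem__)  # argmax
        char = reverse_dictionary[best]                       # reverse dictionary lookup
        if char == stop_char:
            break
        decoded.append(char)  # the predicted char becomes the next decoder input
    return "".join(decoded)

# Hypothetical stand-in for a trained decoder: it spells out a fixed string.
VOCAB = ["\n", "e", "r", "t"]
SCRIPT = "ret\n"

def toy_step(char, position):
    probs = [0.0] * len(VOCAB)
    probs[VOCAB.index(SCRIPT[position])] = 1.0
    return probs, position + 1

print(greedy_decode(toy_step, 0, VOCAB, start_char="\t", stop_char="\n"))  # ret
```

With a real model, `initial_state` would be the encoder states (h & c), and `max_len` guards against sequences that never emit the stop character.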

From the following machine code, infer the following snippet

Input Machine Sentence

test_vprintf_stderr:
mov rax, qword [rip]
mov rdx, rsi
mov rsi, rdi
mov rdi, qword [rax]
jmp 0x15

Decoded Snippet Sentence

int EVP_MD_meth_get_cb(SSL *s, unsigned int flags) {
if (ctx->seckey_ercory_teck = (const_unse_ned_canctx, corst unsigned char *),
unsigned char *));
}

Conclusion

  • Dataset: OpenSSL (sampling)
  • Pre-processing: One-hot embedding
  • Eval. Metrics: Accuracy, Loss
  • Architecture: LSTM for encoder-decoder, Inference LSTM
  • Type of Validation: Automatic Verification Dataset (split: 0.2)
  • Training: Batch: 64, Epochs: 50

References

  • [1] Struct2Vec https://arxiv.org/abs/1708.06525
  • [2] Feature Engineering http://web.cs.ucdavis.edu/~su/publications/icse08-clone.pdf
  • [3] Using Recurrent Neural Networks for Decompilation [SANER2018]
  • [4] MARX https://www.vvdveen.com/publications/marx.pdf

Reverse Engineering from Machine to C code

By David Nader Palacio

Proposal & Milestone
