Deep Code Search
By Xiaodong Gu et al.
Presented by David A.N
Code Search
and Reusability
Developers usually search for and reuse previously written code by performing free-text queries...
Code Search Approaches
Approach | Technique | Novelty |
---|---|---|
Sourcerer (Linstead et al., 2009) | IR | Combines the textual content of a program with structural information. |
Portfolio (McMillan et al., 2011) | PageRank | Returns chain of functions through keyword matching. |
CodeHow (Lv et al., 2015) | IR | Combines text similarity matching and API matching. |
Not Machine Learning
Deep Learning for Source Code
Approach | Novelty |
---|---|
White et al., 2016 | Predicts software tokens using an RNN language model. |
DEEPAPI (Gu et al., 2016) | Deep learning method that learns the semantics of queries and the corresponding API sequences. |
The gap
and the dilemma of matching meaning from Source Code and Natural Language
IR-Based Code Search: Fundamental Problem
- Source code and natural-language queries are heterogeneous.
- They may not share common lexical tokens, synonyms, or language structure.
- They may be related only semantically.
Semantic Mapping
Query: "read an object from an XML"
Can anybody see the problem?
Intuition
- Embedding techniques
- RNN for sequence embeddings
- Joint Embedding of Heterogeneous Data
Embedding Techniques
- CBOW
- Skip-Gram
Words with similar meanings are mapped to nearby vectors, e.g.:
execute = [0.12, -0.32]
run = [0.42, -0.52]
A sentence can also be embedded as a vector.
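A minimal sketch of training Skip-Gram embeddings with gensim; the toy corpus and hyperparameters below are illustrative, not from the paper:

```python
from gensim.models import Word2Vec

# Toy corpus of tokenized sentences (hypothetical).
sentences = [["execute", "the", "query"],
             ["run", "the", "query"],
             ["read", "object", "from", "xml"]]

# sg=1 selects Skip-Gram (sg=0 would be CBOW); sizes are illustrative.
model = Word2Vec(sentences, vector_size=100, sg=1, window=5, min_count=1)

vec = model.wv["execute"]                      # embedding vector of a word
print(model.wv.similarity("execute", "run"))   # similar words -> high cosine
```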
RNN for Sequence Embeddings
Given a sequence s = w_1, \ldots, w_T, each token w_t is represented by its embedding vector V(w_t) \in \mathbb{R}^d.
h_t = \tanh(W[h_{t-1}; V(w_t)]), \quad \forall t = 1, 2, \ldots, T
Two ways to obtain the sequence embedding:
- using the last hidden state: s = h_T
- using max pooling: s = maxpooling([h_1, \ldots, h_T])
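A minimal NumPy sketch of this recurrence, assuming a hypothetical embedding lookup V (token -> R^d) and a trained weight matrix W:

```python
import numpy as np

def rnn_embed(tokens, V, W, d, pool="max"):
    """Embed a token sequence with a vanilla RNN (illustrative sketch).
    V: dict mapping token -> embedding vector in R^d (hypothetical)
    W: weight matrix of shape (d, 2d) applied to [h_{t-1}; V(w_t)]
    """
    h = np.zeros(d)
    states = []
    for tok in tokens:
        w = V[tok]                               # V(w_t) in R^d
        h = np.tanh(W @ np.concatenate([h, w]))  # h_t = tanh(W[h_{t-1}; w_t])
        states.append(h)
    if pool == "max":
        return np.max(np.stack(states), axis=0)  # s = maxpooling([h_1, ..., h_T])
    return h                                     # s = h_T (last hidden state)
```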
Joint Embeddings
f: x \to y
x \xrightarrow{\phi} V_x \rightarrow J(V_x, V_y) \leftarrow V_y \xleftarrow{\psi} y
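A conceptual sketch of joint embedding with two stand-in encoders phi (code side) and psi (description side); here they are mean-pooled random lookups purely for illustration, whereas in practice both are neural networks trained jointly:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 100  # embedding size

# Hypothetical stand-in encoders; in a real model phi and psi are
# trained so that related x and y land close together.
E_code = {w: rng.standard_normal(d) for w in ["read", "object", "xml"]}
E_desc = {w: rng.standard_normal(d) for w in ["read", "an", "object", "from", "xml"]}

def phi(tokens):  # V_x = phi(x)
    return np.mean([E_code[t] for t in tokens], axis=0)

def psi(tokens):  # V_y = psi(y)
    return np.mean([E_desc[t] for t in tokens], axis=0)

def J(v_x, v_y):  # similarity in the joint space (cosine)
    return v_x @ v_y / (np.linalg.norm(v_x) * np.linalg.norm(v_y) + 1e-8)

score = J(phi(["read", "object", "xml"]),
          psi(["read", "an", "object", "from", "xml"]))
```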
Solution/Approach
COde Description Embedding Neural Network (CODEnn)
Code and Queries in the same space
- CoNN: the Code Embedding Network embeds source code (method name, API sequence, tokens)
- DeNN: the Description Embedding Network embeds natural-language descriptions
Huge Multi-Layer Perceptron
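A condensed sketch of a CODEnn-style architecture in present-day Keras (the paper used Keras with a Theano backend); vocabulary sizes and layer details are placeholders, not the authors' implementation:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

d, vocab = 100, 10000  # d = 100 as in the paper; vocab size is a placeholder

def rnn_encoder(name):
    """Bi-LSTM over a token sequence, max-pooled into one vector."""
    inp = layers.Input(shape=(None,), dtype="int32", name=name)
    x = layers.Embedding(vocab, d)(inp)
    x = layers.Bidirectional(layers.LSTM(d, return_sequences=True))(x)
    return inp, layers.GlobalMaxPooling1D()(x)

# CoNN: RNNs for the method name and API sequence; tokens are unordered,
# so they are embedded and max-pooled without an RNN.
name_in, name_vec = rnn_encoder("method_name")
api_in, api_vec = rnn_encoder("api_sequence")
tok_in = layers.Input(shape=(None,), dtype="int32", name="tokens")
tok_vec = layers.GlobalMaxPooling1D()(layers.Embedding(vocab, d)(tok_in))
code_vec = layers.Dense(2 * d, activation="tanh", name="fusion")(
    layers.concatenate([name_vec, api_vec, tok_vec]))  # fuse into one code vector

# DeNN: the description encoder, landing in the same vector space.
desc_in, desc_vec = rnn_encoder("description")

codenn = Model([name_in, api_in, tok_in, desc_in], [code_vec, desc_vec])
```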
Model Training
The model is trained on triples ⟨C, D+, D−⟩: a code snippet C, its correct description D+, and a randomly chosen incorrect description D−. The hinge (ranking) loss is minimized:
L(\theta) = \sum_{\langle C, D^+, D^- \rangle} \max(0, \epsilon - \cos(c, d^+) + \cos(c, d^-))
- \theta: the parameters of the neural networks
- \langle C, D^+, D^- \rangle: a training triple
- \epsilon: a constant margin
- c, d^+, d^-: the embedded vectors of C, D+, and D−
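A small sketch of this loss for a single triple; the margin value here is illustrative (ε is a tuned constant in the paper):

```python
import numpy as np

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def ranking_loss(c, d_pos, d_neg, margin=0.05):
    """Hinge loss for one training triple <C, D+, D->.
    Pushes cos(c, d+) above cos(c, d-) by at least the margin."""
    return max(0.0, margin - cos(c, d_pos) + cos(c, d_neg))
```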
DEEPCS
as a code search tool
Workflow
Offline, DeepCS embeds every code snippet in the codebase into a vector; online, the natural-language query is embedded into the same space and the snippets whose vectors are nearest (by cosine similarity) are returned.
The magic extraction
Preprocessing such as camel-case splitting and AST parsing.
Source: GitHub and JavaDoc
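A minimal sketch of the camel-case splitting step (a hypothetical helper, not the authors' code):

```python
import re

def split_camel(identifier):
    """Split a camelCase identifier into lowercase tokens."""
    parts = re.findall(r"[A-Z]?[a-z0-9]+|[A-Z]+(?![a-z])", identifier)
    return [p.lower() for p in parts]

split_camel("readObjectFromXML")  # ['read', 'object', 'from', 'xml']
```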
Evaluation
- 18,233,872 commented Java methods
- Bi-directional LSTM
- d = 100
- Batch = 128
- Keras & Theano
Performance Measures
- SuccessRate@K: the percentage of queries for which at least one correct result appears in the top k ranked results.
- FRank: the rank of the first correct result in the result list; the smaller, the lower the inspection effort.
- Precision@K: the percentage of relevant results among the top k returned results.
- MRR: the mean of the reciprocal ranks of the first correct result over a set of queries.
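A sketch of these measures, assuming per-query FRank values (rank of the first hit, None when nothing relevant is retrieved):

```python
def success_rate_at_k(franks, k):
    """Fraction of queries whose first hit lands within the top k."""
    return sum(r is not None and r <= k for r in franks) / len(franks)

def mrr(franks):
    """Mean reciprocal rank; a query with no hit contributes 0."""
    return sum(1.0 / r for r in franks if r is not None) / len(franks)

def precision_at_k(relevant_flags, k):
    """relevant_flags: relevance booleans for one query's ranked results."""
    return sum(relevant_flags[:k]) / k

franks = [1, 3, None, 2]  # illustrative FRank values for four queries
print(success_rate_at_k(franks, 2), mrr(franks))
```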
Baseline vs DeepCS in terms of accuracy
Discussion
Why does DeepCS work?
Does DeepCS work?
:|
Discussion
- What are the advantages of using a Recurrent Neural Network instead of a typical multilayer perceptron?
- How can the Joint Embedding be improved?
- Are there ways of transferring information across datasets other than common error training?
Deep Code Search
By David Nader Palacio
Deep Code Search