Martin White, Michele Tufano, Christopher Vendome, and Denys Poshyvanyk
A code fragment is a contiguous segment of source code. Code clones are two or more fragments that are similar with respect to a clone type.
Identical up to variations in comments, whitespace, or layout [Roy'07]
if (a >= b) {
c = d + b; // Comment1
d = d + 1;}
else
c = d - a; //Comment2
if (a>=b) {
// Comment1'
c=d+b;
d=d+1;}
else // Comment2'
c=d-a;
if (a>=b)
{ // Comment1''
c=d+b;
d=d+1;
}
else // Comment2''
c=d-a;
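The Type I definition can be made concrete with a small check: strip comments and whitespace, then compare what is left. The sketch below is illustrative only; the class and method names are hypothetical, and it assumes the fragments contain no string literals that hold comment markers.

```java
class TypeOneNormalizer {
    // Removes /* block */ comments and // line comments, then all whitespace.
    // Assumption: no string literal in the fragment contains "//" or "/*".
    static String normalize(String fragment) {
        String noBlock = fragment.replaceAll("(?s)/\\*.*?\\*/", "");
        String noLine = noBlock.replaceAll("//[^\\n]*", "");
        return noLine.replaceAll("\\s+", "");
    }

    // Two fragments are treated as Type I clones when their normalized text is equal.
    static boolean isTypeOneClone(String a, String b) {
        return normalize(a).equals(normalize(b));
    }

    public static void main(String[] args) {
        String first = "if (a >= b) {\nc = d + b; // Comment1\nd = d + 1;}\nelse\nc = d - a; //Comment2";
        String third = "if (a>=b)\n{ // Comment1''\nc=d+b;\nd=d+1;\n}\nelse // Comment2''\nc=d-a;";
        System.out.println(isTypeOneClone(first, third));
    }
}
```

Deleting all whitespace is a deliberate simplification: it also merges tokens such as a type and a variable name, so a practical detector would compare token streams from a lexer instead.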
Identical up to variations in names and values, comments, etc. [Roy'07]
if (a >= b) {
c = d + b; // Comment1
d = d + 1;}
else
c = d - a; //Comment2
A parameterized clone for this fragment is:
if (m >= n)
{ // Comment1'
y = x + n;
x = x + 5; //Comment3
}
else
y = x - m; //Comment2'
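To illustrate the Type II definition, the sketch below replaces every identifier with a placeholder `ID` and every integer literal with `LIT` before comparing. The class name, the placeholder names, and the tiny keyword list are assumptions made for this example; a real tool would rely on the language's lexer.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TypeTwoNormalizer {
    // Deliberately small keyword set, enough for the fragments shown here.
    private static final Set<String> KEYWORDS =
            new HashSet<>(Arrays.asList("if", "else", "for", "while", "return", "int"));
    // Identifier, integer literal, or any single non-whitespace character.
    private static final Pattern TOKEN =
            Pattern.compile("[A-Za-z_][A-Za-z_0-9]*|\\d+|\\S");

    // Strips // comments, then abstracts identifiers to ID and literals to LIT.
    static String abstractTokens(String fragment) {
        String code = fragment.replaceAll("//[^\\n]*", "");
        StringBuilder out = new StringBuilder();
        Matcher m = TOKEN.matcher(code);
        while (m.find()) {
            String t = m.group();
            char c = t.charAt(0);
            if (Character.isDigit(c)) {
                out.append("LIT");
            } else if (Character.isLetter(c) || c == '_') {
                out.append(KEYWORDS.contains(t) ? t : "ID");
            } else {
                out.append(t);
            }
            out.append(' ');
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        String f1 = "if (a >= b) {\nc = d + b; // Comment1\nd = d + 1;}\nelse\nc = d - a; //Comment2";
        String f2 = "if (m >= n)\n{ // Comment1'\ny = x + n;\nx = x + 5; //Comment3\n}\nelse\ny = x - m; //Comment2'";
        System.out.println(abstractTokens(f1).equals(abstractTokens(f2)));
    }
}
```

This sketch does not check that names are renamed consistently (for example, that every `a` maps to `m`), which a stricter parameterized match would require.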
Modifications include statement(s) changed, added, or deleted [Roy'07]
public int getSoLinger() throws SocketException {
Object o = impl.getOption(SocketOptions.SO_LINGER);
if (o instanceof Integer) {
return((Integer) o).intValue();
}
else return -1;
}
public synchronized int getSoTimeout() // This statement is changed
throws SocketException {
Object o = impl.getOption(SocketOptions.SO_TIMEOUT);
if (o instanceof Integer) {
return((Integer) o).intValue();
}
else return 0;
}
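One simple way to quantify how far two fragments have drifted apart at the statement level is a line-based longest common subsequence. The sketch below is an illustration under that assumption, not the detection technique presented in this talk; all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class TypeThreeSimilarity {
    // Splits into lines, drops // comments and whitespace, skips empty lines.
    static List<String> normalizedLines(String fragment) {
        List<String> lines = new ArrayList<>();
        for (String raw : fragment.split("\n")) {
            String line = raw.replaceAll("//.*", "").replaceAll("\\s+", "");
            if (!line.isEmpty()) {
                lines.add(line);
            }
        }
        return lines;
    }

    // Length of the longest common subsequence of lines (classic dynamic program).
    static int lcs(List<String> a, List<String> b) {
        int[][] dp = new int[a.size() + 1][b.size() + 1];
        for (int i = 1; i <= a.size(); i++) {
            for (int j = 1; j <= b.size(); j++) {
                dp[i][j] = a.get(i - 1).equals(b.get(j - 1))
                        ? dp[i - 1][j - 1] + 1
                        : Math.max(dp[i - 1][j], dp[i][j - 1]);
            }
        }
        return dp[a.size()][b.size()];
    }

    // Similarity in [0, 1]: shared lines divided by the longer fragment's line count.
    static double similarity(String f1, String f2) {
        List<String> a = normalizedLines(f1);
        List<String> b = normalizedLines(f2);
        int longer = Math.max(a.size(), b.size());
        return longer == 0 ? 1.0 : (double) lcs(a, b) / longer;
    }

    public static void main(String[] args) {
        // One of four statements changed.
        System.out.println(similarity("a = 1;\nb = 2;\nc = 3;\nd = 4;",
                                      "a = 1;\nx = 9;\nc = 3;\nd = 4;"));
    }
}
```

A threshold on this score (an arbitrary choice, left to the user) would then decide whether two fragments count as a Type III candidate.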
Syntactically dissimilar fragments with similar functionality [Roy'07]
int i, j=1;
for (i=1; i<=VALUE; i++)
j=j*i;
Now consider a recursive function that calculates the factorial:
int factorial(int n) {
if (n == 0) return 1 ;
else return n * factorial(n-1) ;
}
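The two factorial fragments share almost no syntax, yet they compute the same values. The sketch below wraps the loop in a method (an adaptation for this example, with `VALUE` turned into a parameter) and compares both versions on sample inputs. Agreement on samples is only evidence of similar functionality, not a proof.

```java
class TypeFourCheck {
    // The iterative fragment, wrapped in a method; long avoids overflow up to 20!.
    static long iterativeFactorial(int value) {
        long j = 1;
        for (int i = 1; i <= value; i++) {
            j = j * i;
        }
        return j;
    }

    // The recursive fragment.
    static long recursiveFactorial(int n) {
        if (n == 0) {
            return 1;
        }
        return n * recursiveFactorial(n - 1);
    }

    // True when both versions return the same result for every input in 0..maxInput.
    static boolean agreeOn(int maxInput) {
        for (int n = 0; n <= maxInput; n++) {
            if (iterativeFactorial(n) != recursiveFactorial(n)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(agreeOn(20));
    }
}
```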
Techniques can be classified by their source code representation
public int getSoLinger() throws SocketException {
Object o = impl.getOption(SocketOptions.SO_LINGER);
if (o instanceof Integer) {
return((Integer) o).intValue();
}
else return -1;
}
Our new set of techniques fuse and use
Our approach couples deep learners to front end compiler stages
ASTs can have any number of levels comprising nodes with arbitrary degree. ast2bin fixes the size of the input, and recursion models different levels.
Case I
Case II
Use a grammar to handle nodes with degree greater than two
Establish a precedence to handle nodes with degree one
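The two cases can be pictured with a toy binarization. The sketch below is not ast2bin itself: the `Node` class, the right-folding rule that stands in for the grammar, and the "child wins" precedence for degree-one nodes are all assumptions chosen to keep the example short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class AstBinarizer {
    static final class Node {
        final String label;
        final List<Node> children;

        Node(String label, Node... children) {
            this.label = label;
            this.children = new ArrayList<>(Arrays.asList(children));
        }
    }

    // Returns a tree in which every internal node has exactly two children.
    static Node binarize(Node n) {
        if (n.children.isEmpty()) {
            return new Node(n.label);
        }
        if (n.children.size() == 1) {
            // Degree one: collapse the node; the assumed precedence keeps the child.
            return binarize(n.children.get(0));
        }
        // Degree two or more: fold the children to the right under synthetic nodes.
        Node right = binarize(n.children.get(n.children.size() - 1));
        for (int i = n.children.size() - 2; i >= 1; i--) {
            right = new Node(n.label + "'", binarize(n.children.get(i)), right);
        }
        return new Node(n.label, binarize(n.children.get(0)), right);
    }

    static boolean isFullBinary(Node n) {
        if (n.children.isEmpty()) {
            return true;
        }
        return n.children.size() == 2
                && isFullBinary(n.children.get(0))
                && isFullBinary(n.children.get(1));
    }

    static int leafCount(Node n) {
        if (n.children.isEmpty()) {
            return 1;
        }
        int total = 0;
        for (Node child : n.children) {
            total += leafCount(child);
        }
        return total;
    }

    public static void main(String[] args) {
        Node tree = new Node("Block",
                new Node("a"), new Node("Wrapper", new Node("b")), new Node("c"), new Node("d"));
        Node bin = binarize(tree);
        System.out.println(isFullBinary(bin) + " " + leafCount(bin));
    }
}
```

The point of a full binary tree is that each internal node then combines exactly two inputs, which matches the fixed input size mentioned above.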
Effectively model sequences of terms in a source code corpus
We use a recurrent neural network to map terms to embeddings
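As a reminder of what such a model computes, a simple (Elman-style) recurrent network can be written as follows. The notation is assumed for illustration and does not come from the slides: $x_t$ is the one-hot vector for the term at position $t$, $h_t$ is the hidden state, $W$, $U$, and $V$ are weight matrices, and $\sigma$ is the logistic sigmoid.

```latex
\begin{align}
h_t &= \sigma\left(W x_t + U h_{t-1}\right) \\
y_t &= \operatorname{softmax}\left(V h_t\right)
\end{align}
```

Because $x_t$ is one-hot, the product $W x_t$ selects a single column of $W$, and that column can serve as the embedding of the term.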
What we would like to have is not only embeddings for terms but also embeddings for fragments
Generalize recurrent neural networks by modeling structures
We use a recursive autoencoder to encode sequences of embeddings
AST-based encoding
Greedy encoding
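Greedy encoding can be sketched as follows: encode every adjacent pair of vectors, merge the pair whose reconstruction error is lowest, and repeat until a single vector remains. The code below is a minimal illustration with random, untrained weights and no bias terms; the class name, the weight shapes, and the squared-error measure are assumptions, not the configuration used in this work.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

class GreedyRae {
    final double[][] wEnc; // n x 2n encoder weights
    final double[][] wDec; // 2n x n decoder weights

    GreedyRae(int n, long seed) {
        Random rnd = new Random(seed);
        wEnc = randomMatrix(n, 2 * n, rnd);
        wDec = randomMatrix(2 * n, n, rnd);
    }

    static double[][] randomMatrix(int rows, int cols, Random rnd) {
        double[][] m = new double[rows][cols];
        for (double[] row : m) {
            for (int j = 0; j < cols; j++) {
                row[j] = rnd.nextGaussian() * 0.1;
            }
        }
        return m;
    }

    // Computes tanh(w * x) element-wise.
    static double[] linearTanh(double[][] w, double[] x) {
        double[] y = new double[w.length];
        for (int i = 0; i < w.length; i++) {
            double s = 0;
            for (int j = 0; j < x.length; j++) {
                s += w[i][j] * x[j];
            }
            y[i] = Math.tanh(s);
        }
        return y;
    }

    static double[] concat(double[] a, double[] b) {
        double[] c = new double[a.length + b.length];
        System.arraycopy(a, 0, c, 0, a.length);
        System.arraycopy(b, 0, c, a.length, b.length);
        return c;
    }

    // Maps two n-dimensional children to one n-dimensional parent.
    double[] encode(double[] left, double[] right) {
        return linearTanh(wEnc, concat(left, right));
    }

    // Squared error between the children and their reconstruction from the parent.
    double reconstructionError(double[] left, double[] right) {
        double[] input = concat(left, right);
        double[] rebuilt = linearTanh(wDec, encode(left, right));
        double err = 0;
        for (int i = 0; i < input.length; i++) {
            double d = input[i] - rebuilt[i];
            err += d * d;
        }
        return err;
    }

    // Repeatedly merges the adjacent pair with the lowest reconstruction error.
    // Assumption: the input list is non-empty.
    double[] encodeGreedily(List<double[]> embeddings) {
        List<double[]> seq = new ArrayList<>(embeddings);
        while (seq.size() > 1) {
            int best = 0;
            double bestErr = Double.MAX_VALUE;
            for (int i = 0; i + 1 < seq.size(); i++) {
                double e = reconstructionError(seq.get(i), seq.get(i + 1));
                if (e < bestErr) {
                    bestErr = e;
                    best = i;
                }
            }
            double[] parent = encode(seq.get(best), seq.get(best + 1));
            seq.remove(best + 1);
            seq.set(best, parent);
        }
        return seq.get(0);
    }

    public static void main(String[] args) {
        GreedyRae rae = new GreedyRae(3, 42L);
        List<double[]> terms = new ArrayList<>();
        terms.add(new double[] {0.1, 0.2, 0.3});
        terms.add(new double[] {0.0, -0.4, 0.5});
        terms.add(new double[] {0.7, 0.1, -0.2});
        System.out.println(rae.encodeGreedily(terms).length);
    }
}
```

AST-based encoding differs only in where the merge order comes from: the tree structure dictates which pairs are combined, so no search over adjacent pairs is needed.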
System | Files | LOC | Tokens | Vocabulary size (|V|) |
ANTLR 4 | 1,514 | 104,225 | 1,701,807 | 15,826 |
Apache Ant 1.9.6 | 1,218 | 136,352 | 1,888,424 | 16,029 |
ArgoUML 0.34 | 1,908 | 177,493 | 1,172,058 | 17,205 |
CAROL 2.0.5 | 1,184 | 112,022 | 1,180,947 | 12,210 |
dnsjava 2.0.0 | 1,196 | 124,660 | 1,169,219 | 13,012 |
Hibernate 2 | 1,555 | 151,499 | 1,365,256 | 15,850 |
JDK 1.4.2 | 4,129 | 562,120 | 3,512,807 | 45,107 |
JHotDraw 6 | 1,984 | 158,130 | 1,377,652 | 14,803 |
Precision Results (%)
System | File-level AST-based | File-level Greedy | Method-level AST-based | Method-level Greedy |
ANTLR | 197 | 100 | 100 | 100 |
Apache Ant | 192 | 193 | 100 | 100 |
ArgoUML | 190 | 100 | 100 | 100 |
CAROL | 100 | 100 | 100 | 100 |
dnsjava | 147 | 100 | 173 | 187 |
Hibernate | 100 | 100 | 153 | 170 |
JDK | 190 | 100 | 100 | 100 |
JHotDraw | 100 | 100 | 100 | 100 |
Recurrent Neural Network
AST-based
Greedy