Using Natural Language Processing techniques for automated code refactoring
Alan Barzilay
{summary}
Introduction
Goals
Background
DataSet
Model
Results
Naturalness Hypothesis [Allamanis et al. 2018]
Software is a form of human communication; software corpora have similar statistical properties to natural language corpora; and these properties can be exploited to build better software engineering tools.
Introduction
Literate Programming [Donald E. Knuth 1984]
I believe that the time is ripe for significantly better documentation of
programs, and that we can best achieve this by considering programs to be works of literature. Hence, my title: ‘Literate Programming.’
Let us change our traditional attitude to the construction of programs:
Instead of imagining that our main task is to instruct a computer what to do,
let us concentrate rather on explaining to human beings what we want a computer to do.
The practitioner of literate programming can be regarded as an essayist,
whose main concern is with exposition and excellence of style. Such an
author, with thesaurus in hand, chooses the names of variables carefully and
explains what each variable means. He or she strives for a program that is
comprehensible because its concepts have been introduced in an order that
is best for human understanding, using a mixture of formal and informal
methods that reïnforce each other.
On the Naturalness of Software [Hindle et al. 2012]
We begin with the conjecture that most software is also natural, in the sense that it is created by humans at work, with all the attendant constraints and limitations [...]
Programming languages, in theory, are complex, flexible and powerful, but the programs that real people actually write are mostly simple and rather repetitive, and thus they have usefully predictable statistical properties that can be captured in statistical language models and leveraged for software engineering tasks.
Introduction
(=BA#9"=<;:3y7x54-21q/p-,+*)"!h%B0/.
~P<
<:(8&
66#"!~}|{zyxwvu
gJ%
+[-->-[>>+>-----<<]<--<---]>-.>>>+.>>..+++[.>]<<<<.+++.------.<<-.>>>>+.

Esoteric languages
Goals
- Understand if deep learning models are capable of predicting fine-grained refactorings
- Create a model for automated function extraction
function printOwing(invoice) {
  printBanner();
  let outstanding = calculateOutstanding();
  // print details
  console.log(`name: ${invoice.customer}`);
  console.log(`amount: ${outstanding}`);
}
Model
function printOwing(invoice) {
  printBanner();
  let outstanding = calculateOutstanding();
  printDetails(invoice, outstanding);
}
function printDetails(invoice, outstanding) {
  console.log(`name: ${invoice.customer}`);
  console.log(`amount: ${outstanding}`);
}
LSP
{dataset}
Why function extraction?
DataSet
{
"type":"Extract Method",
"description":"Extract Method private extractMijCommand(rulePos int, contents String) : List<String> extracted from private extractedRuleMij(contents String) : List<String> in class com.reason.bs.Ninja",
"leftSideLocations":[
{
"filePath":"src/com/reason/bs/Ninja.java",
"startLine":106,
"endLine":120,
"startColumn":5,
"endColumn":6,
"codeElementType":"METHOD_DECLARATION",
"description":"source method declaration before extraction",
"codeElement":"private extractedRuleMij(contents String) : List<String>"
},
{
"filePath":"src/com/reason/bs/Ninja.java",
"startLine":109,
"endLine":109,
"startColumn":13,
"endColumn":70,
"codeElementType":"VARIABLE_DECLARATION_STATEMENT",
"description":"extracted code from source method declaration",
"codeElement":"None"
},
{
"filePath":"src/com/reason/bs/Ninja.java",
"startLine":111,
"endLine":111,
"startColumn":17,
"endColumn":72,
"codeElementType":"VARIABLE_DECLARATION_STATEMENT",
"description":"extracted code from source method declaration",
"codeElement":"None"
},
{
"filePath":"src/com/reason/bs/Ninja.java",
"startLine":112,
"endLine":112,
"startColumn":17,
"endColumn":91,
"codeElementType":"VARIABLE_DECLARATION_STATEMENT",
"description":"extracted code from source method declaration",
"codeElement":"None"
},
{
"filePath":"src/com/reason/bs/Ninja.java",
"startLine":113,
"endLine":113,
"startColumn":17,
"endColumn":54,
"codeElementType":"VARIABLE_DECLARATION_STATEMENT",
"description":"extracted code from source method declaration",
"codeElement":"None"
},
{
"filePath":"src/com/reason/bs/Ninja.java",
"startLine":114,
"endLine":114,
"startColumn":17,
"endColumn":251,
"codeElementType":"VARIABLE_DECLARATION_STATEMENT",
"description":"extracted code from source method declaration",
"codeElement":"None"
},
{
"filePath":"src/com/reason/bs/Ninja.java",
"startLine":115,
"endLine":115,
"startColumn":17,
"endColumn":42,
"codeElementType":"EXPRESSION_STATEMENT",
"description":"extracted code from source method declaration",
"codeElement":"None"
},
{
"filePath":"src/com/reason/bs/Ninja.java",
"startLine":116,
"endLine":116,
"startColumn":17,
"endColumn":39,
"codeElementType":"RETURN_STATEMENT",
"description":"extracted code from source method declaration",
"codeElement":"None"
},
{
"filePath":"src/com/reason/bs/Ninja.java",
"startLine":110,
"endLine":117,
"startColumn":33,
"endColumn":14,
"codeElementType":"BLOCK",
"description":"extracted code from source method declaration",
"codeElement":"None"
},
{
"filePath":"src/com/reason/bs/Ninja.java",
"startLine":110,
"endLine":117,
"startColumn":13,
"endColumn":14,
"codeElementType":"IF_STATEMENT",
"description":"extracted code from source method declaration",
"codeElement":"None"
}
],
"rightSideLocations":[
{
"filePath":"src/com/reason/bs/Ninja.java",
"startLine":110,
"endLine":121,
"startColumn":5,
"endColumn":6,
"codeElementType":"METHOD_DECLARATION",
"description":"extracted method declaration",
"codeElement":"private extractMijCommand(rulePos int, contents String) : List<String>"
},
{
"filePath":"src/com/reason/bs/Ninja.java",
"startLine":111,
"endLine":111,
"startColumn":9,
"endColumn":63,
"codeElementType":"VARIABLE_DECLARATION_STATEMENT",
"description":"extracted code to extracted method declaration",
"codeElement":"None"
},
{
"filePath":"src/com/reason/bs/Ninja.java",
"startLine":113,
"endLine":113,
"startColumn":13,
"endColumn":68,
"codeElementType":"VARIABLE_DECLARATION_STATEMENT",
"description":"extracted code to extracted method declaration",
"codeElement":"None"
},
{
"filePath":"src/com/reason/bs/Ninja.java",
"startLine":114,
"endLine":114,
"startColumn":13,
"endColumn":87,
"codeElementType":"VARIABLE_DECLARATION_STATEMENT",
"description":"extracted code to extracted method declaration",
"codeElement":"None"
},
{
"filePath":"src/com/reason/bs/Ninja.java",
"startLine":115,
"endLine":115,
"startColumn":13,
"endColumn":50,
"codeElementType":"VARIABLE_DECLARATION_STATEMENT",
"description":"extracted code to extracted method declaration",
"codeElement":"None"
},
{
"filePath":"src/com/reason/bs/Ninja.java",
"startLine":116,
"endLine":116,
"startColumn":13,
"endColumn":247,
"codeElementType":"VARIABLE_DECLARATION_STATEMENT",
"description":"extracted code to extracted method declaration",
"codeElement":"None"
},
{
"filePath":"src/com/reason/bs/Ninja.java",
"startLine":117,
"endLine":117,
"startColumn":13,
"endColumn":38,
"codeElementType":"EXPRESSION_STATEMENT",
"description":"extracted code to extracted method declaration",
"codeElement":"None"
},
{
"filePath":"src/com/reason/bs/Ninja.java",
"startLine":118,
"endLine":118,
"startColumn":13,
"endColumn":35,
"codeElementType":"RETURN_STATEMENT",
"description":"extracted code to extracted method declaration",
"codeElement":"None"
},
{
"filePath":"src/com/reason/bs/Ninja.java",
"startLine":112,
"endLine":119,
"startColumn":29,
"endColumn":10,
"codeElementType":"BLOCK",
"description":"extracted code to extracted method declaration",
"codeElement":"None"
},
{
"filePath":"src/com/reason/bs/Ninja.java",
"startLine":112,
"endLine":119,
"startColumn":9,
"endColumn":10,
"codeElementType":"IF_STATEMENT",
"description":"extracted code to extracted method declaration",
"codeElement":"None"
},
{
"filePath":"src/com/reason/bs/Ninja.java",
"startLine":140,
"endLine":146,
"startColumn":5,
"endColumn":6,
"codeElementType":"METHOD_DECLARATION",
"description":"source method declaration after extraction",
"codeElement":"private extractRuleMijDev(contents String) : List<String>"
},
{
"filePath":"src/com/reason/bs/Ninja.java",
"startLine":143,
"endLine":143,
"startColumn":20,
"endColumn":59,
"codeElementType":"METHOD_INVOCATION",
"description":"extracted method invocation",
"codeElement":"extractMijCommand(ruleMijPos,contents)"
},
{
"filePath":"src/com/reason/bs/Ninja.java",
"startLine":120,
"endLine":120,
"startColumn":9,
"endColumn":28,
"codeElementType":"RETURN_STATEMENT",
"description":"added statement in extracted method declaration",
"codeElement":"None"
}
]
}
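A record like the one above can be reduced to the span of source lines that was moved into the new method. The following is a minimal sketch under stated assumptions, not the processing code used in this work: the helper name `extracted_span` is hypothetical, and the record is an abbreviated copy that keeps only three of the left-side entries.

```python
import json

# Abbreviated copy of the record above (only three left-side entries kept).
RECORD = json.loads("""
{
  "type": "Extract Method",
  "leftSideLocations": [
    {"startLine": 106, "endLine": 120,
     "codeElementType": "METHOD_DECLARATION",
     "description": "source method declaration before extraction"},
    {"startLine": 109, "endLine": 109,
     "codeElementType": "VARIABLE_DECLARATION_STATEMENT",
     "description": "extracted code from source method declaration"},
    {"startLine": 110, "endLine": 117,
     "codeElementType": "IF_STATEMENT",
     "description": "extracted code from source method declaration"}
  ]
}
""")

# Description string that marks the statements moved out of the source method.
EXTRACTED = "extracted code from source method declaration"


def extracted_span(record):
    """Return (first_line, last_line) of the extracted code, or None if absent."""
    locations = [loc for loc in record["leftSideLocations"]
                 if loc["description"] == EXTRACTED]
    if not locations:
        return None
    return (min(loc["startLine"] for loc in locations),
            max(loc["endLine"] for loc in locations))


print(extracted_span(RECORD))  # (109, 117)
```

The pair of line numbers is the kind of target a start-line/end-line predictor would need; how the labels were actually built is not stated on the slide.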
DataSet


Pipeline
Single refactoring?
Continuous?
Function extraction?
RefactoringMiner
Git cloning
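The three questions in the pipeline can be read as filters over the records RefactoringMiner reports for each commit. The sketch below is an assumption about what those filters might look like, not the actual pipeline: "continuous" is taken to mean that the extracted statements cover one unbroken run of lines, "single refactoring" is taken to mean that the commit holds exactly one record, and the names `is_continuous`, `keep_commit` and `loc` are hypothetical.

```python
def is_function_extraction(record):
    """RefactoringMiner labels function extraction as "Extract Method"."""
    return record.get("type") == "Extract Method"


def is_continuous(record):
    """True when the extracted lines form one unbroken block (assumed definition)."""
    lines = set()
    for location in record["leftSideLocations"]:
        if location["description"].startswith("extracted code"):
            lines.update(range(location["startLine"], location["endLine"] + 1))
    return bool(lines) and max(lines) - min(lines) + 1 == len(lines)


def keep_commit(records):
    """Keep a commit holding a single, continuous function extraction."""
    if len(records) != 1:
        return False
    record = records[0]
    return is_function_extraction(record) and is_continuous(record)


def loc(start, end, description="extracted code from source method declaration"):
    """Build a toy location entry for the usage example below."""
    return {"startLine": start, "endLine": end, "description": description}


# Toy commits keyed by made-up identifiers.
commits = {
    "a1": [{"type": "Extract Method", "leftSideLocations": [loc(3, 4), loc(5, 7)]}],
    "b2": [{"type": "Extract Method", "leftSideLocations": [loc(3, 4), loc(9, 9)]}],
    "c3": [{"type": "Rename Method",
            "leftSideLocations": [loc(1, 2, "original method declaration")]}],
}
print([sha for sha, records in commits.items() if keep_commit(records)])  # ['a1']
```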



DataSet
- 49,982 repositories
- Function extractions found in 19,936 of them
- 523,667 different instances of function extraction (60% more than [Aniche et al. 2020])
- Over 80% of them are continuous
The dataset in numbers
DataSet
{model}
2 main architectures
- Simple RNN
- Ptr-Net
Model
[Diagram: Simple RNN. Function → LSTM → two Dense layers → Start Line, End Line]

Ptr-Net
Model
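The core of a Ptr-Net [Vinyals et al. 2015] is an attention step whose output is a probability distribution over input positions: u_j = v^T tanh(W1 e_j + W2 d), followed by a softmax over j. The sketch below implements only that step in plain Python. The toy weights and states are hypothetical, and treating each input position as one line of the function is an assumption about how the mechanism would apply here, not a description of the trained model.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pointer_distribution(encoder_states, decoder_state, W1, W2, v):
    """Return p(j) over input positions: softmax_j(v . tanh(W1 e_j + W2 d))."""
    projected_decoder = matvec(W2, decoder_state)
    scores = []
    for e_j in encoder_states:
        hidden = [math.tanh(a + b)
                  for a, b in zip(matvec(W1, e_j), projected_decoder)]
        scores.append(sum(v_k * h_k for v_k, h_k in zip(v, hidden)))
    return softmax(scores)


# Toy example: three input positions with 2-dimensional states.
identity = [[1.0, 0.0], [0.0, 1.0]]
encoder_states = [[0.0, 0.0], [1.0, 1.0], [-1.0, -1.0]]
decoder_state = [0.5, 0.5]

probs = pointer_distribution(encoder_states, decoder_state,
                             identity, identity, [1.0, 1.0])
print(probs.index(max(probs)))  # 1
```

Because the output is a distribution over the input itself, the same network can handle functions of any length, which is the usual argument for pointing rather than predicting from a fixed output vocabulary.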
Betas
Jaccard
Model
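The slide names Jaccard without further detail. One plausible reading, and only an assumption here, is the Jaccard index between the predicted and the true range of extracted lines. The function name and the inclusive `(start, end)` representation are hypothetical.

```python
def jaccard(predicted, actual):
    """Jaccard index of two inclusive line ranges given as (start, end)."""
    predicted_lines = set(range(predicted[0], predicted[1] + 1))
    actual_lines = set(range(actual[0], actual[1] + 1))
    union = predicted_lines | actual_lines
    if not union:
        return 1.0  # two empty ranges are treated as identical
    return len(predicted_lines & actual_lines) / len(union)


print(jaccard((4, 6), (4, 6)))  # 1.0
print(jaccard((3, 6), (4, 6)))  # 0.75
print(jaccard((1, 2), (4, 6)))  # 0.0
```

Unlike exact-match accuracy, this score gives partial credit when a prediction is off by a line or two.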
Optimization with Optuna

Model


- batch size = 32
- hidden size = 32
- learning rate = 0.00231519996
- weight decay = 0.0001155681898
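A search over these four hyperparameters could be declared along the following lines. This is a sketch, not the study configuration used in this work: the search ranges and candidate sizes are hypothetical, and `train_and_evaluate` stands in for whatever training routine returns the validation score. Only `suggest_categorical` and `suggest_float` are called on the trial object, so the functions run without Optuna installed as long as a compatible object is passed in.

```python
# Values reported on the slide above.
BEST_PARAMS = {
    "batch_size": 32,
    "hidden_size": 32,
    "learning_rate": 0.00231519996,
    "weight_decay": 0.0001155681898,
}


def suggest_hyperparameters(trial):
    """Sample one configuration; `trial` is an optuna.trial.Trial or compatible."""
    return {
        # Candidate sizes and ranges below are assumptions, not the real search space.
        "batch_size": trial.suggest_categorical("batch_size", [16, 32, 64, 128]),
        "hidden_size": trial.suggest_categorical("hidden_size", [16, 32, 64, 128]),
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 1e-1, log=True),
        "weight_decay": trial.suggest_float("weight_decay", 1e-6, 1e-2, log=True),
    }


def objective(trial, train_and_evaluate):
    """Train with the sampled configuration and return the score to optimize."""
    params = suggest_hyperparameters(trial)
    return train_and_evaluate(**params)


# With Optuna installed, the search would be driven roughly like this:
#   import optuna
#   study = optuna.create_study(direction="maximize")
#   study.optimize(lambda t: objective(t, train_and_evaluate), n_trials=50)
#   print(study.best_params)
```

Sampling the learning rate and weight decay on a log scale is a common choice when plausible values span several orders of magnitude, as the reported values suggest.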

{results}
Comparing Transformer based embeddings
Results
Comparing multilingual versions
Comparing with GloVe
Results
Comparing Architectures
Results
Best Model
- Optuna
- dbmc1
- Ptr-Net
Results
- Understand if deep learning models are capable of predicting fine-grained refactorings
- Create a model for automated function extraction
Did we meet our goals?
Results
Thank you!
{background}
...And other additional unused slides
[Bahdanau et al. 2014]
Background

Background
[Vinyals et al. 2015]
Background
[Vinyals et al. 2015]
Background
word2vec
You shall know a word by the company it keeps.
Firth (1957)
Background