Using Natural Language Processing techniques for automated code refactoring

 

Alan Barzilay

{summary}

Introduction

Goals

Background

DataSet

Model

Results

Naturalness Hypothesis [Allamanis et al. 2018]

     Software is a form of human communication; software corpora have similar statistical properties to natural language corpora; and these properties can be exploited to build better software engineering tools. 

Introduction

Literate Programming [Donald E. Knuth 1984]

     I believe that the time is ripe for significantly better documentation of programs, and that we can best achieve this by considering programs to be works of literature. Hence, my title: ‘Literate Programming.’ Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.

     The practitioner of literate programming can be regarded as an essayist, whose main concern is with exposition and excellence of style. Such an author, with thesaurus in hand, chooses the names of variables carefully and explains what each variable means. He or she strives for a program that is comprehensible because its concepts have been introduced in an order that is best for human understanding, using a mixture of formal and informal methods that reïnforce each other.

     We begin with the conjecture that most software is also natural, in the sense that it is created by humans at work, with all the attendant constraints and limitations [...]


Programming languages, in theory, are complex, flexible and powerful, but the programs that real people actually write are mostly simple and rather repetitive, and thus they have usefully predictable statistical properties that can be captured in statistical language models and leveraged for software engineering tasks.
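To make "usefully predictable statistical properties" concrete, here is a minimal sketch of a bigram language model over code tokens. The tokenizer, the add-one smoothing and the sample snippets are illustrative assumptions and are not taken from the cited work. The sketch only shows that a familiar loop scores a lower cross-entropy (is more predictable) than the same tokens in an unnatural order.

```javascript
// Minimal bigram language model over code tokens (illustrative sketch).
// Tokenizer, smoothing and sample data are assumptions for demonstration.

// Split source text into identifiers, numbers and single punctuation marks.
function tokenize(source) {
  return source.match(/[A-Za-z_$][\w$]*|\d+|[^\s\w]/g) || [];
}

// Count how often each token follows each other token.
function trainBigram(tokens) {
  const counts = new Map(); // previous token -> Map(next token -> count)
  const vocab = new Set(tokens);
  for (let i = 1; i < tokens.length; i++) {
    const prev = tokens[i - 1];
    if (!counts.has(prev)) counts.set(prev, new Map());
    const row = counts.get(prev);
    row.set(tokens[i], (row.get(tokens[i]) || 0) + 1);
  }
  return { counts, vocab };
}

// Average negative log2-probability per token, with add-one smoothing.
// Lower values mean the sequence is more predictable under the model.
function crossEntropy(model, tokens) {
  const v = model.vocab.size + 1; // +1 reserves mass for unseen tokens
  let bits = 0;
  for (let i = 1; i < tokens.length; i++) {
    const row = model.counts.get(tokens[i - 1]) || new Map();
    let total = 0;
    for (const c of row.values()) total += c;
    const p = ((row.get(tokens[i]) || 0) + 1) / (total + v);
    bits -= Math.log2(p);
  }
  return bits / Math.max(1, tokens.length - 1);
}

const training = tokenize(`
  for (let i = 0; i < n; i++) { total += a[i]; }
  for (let j = 0; j < m; j++) { count += b[j]; }
`);
const model = trainBigram(training);

const familiar = tokenize("for (let k = 0; k < n; k++) { total += a[k]; }");
const shuffled = [...familiar].reverse(); // same tokens, unnatural order

console.log(crossEntropy(model, familiar).toFixed(2)); // bits per token, familiar loop
console.log(crossEntropy(model, shuffled).toFixed(2)); // bits per token, reversed tokens
```

Real studies use far larger corpora and stronger models; the point here is only the direction of the comparison.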


(=BA#9"=<;:3y7x54-21q/p-,+*)"!h%B0/.
~P<
<:(8&
66#"!~}|{zyxwvu
gJ%
+[-->-[>>+>-----<<]<--<---]>-.>>>+.>>..+++[.>]<<<<.+++.------.<<-.>>>>+.

Esoteric languages

Goals

  • Understand if deep learning models are capable of predicting fine-grained refactorings
  • Create a model for automated function extraction

function printOwing(invoice) {
  printBanner();
  const outstanding = calculateOutstanding();

  // print details
  console.log(`name: ${invoice.customer}`);
  console.log(`amount: ${outstanding}`);
}

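// Before: the "print details" lines live inside printOwing
function printOwing(invoice) {
  printBanner();
  const outstanding = calculateOutstanding();

  // print details
  console.log(`name: ${invoice.customer}`);
  console.log(`amount: ${outstanding}`);
}

// After: the same lines extracted into printDetails
function printOwing(invoice) {
  printBanner();
  const outstanding = calculateOutstanding();
  printDetails(invoice, outstanding);
}

// invoice is passed explicitly because it is not in scope at top level
function printDetails(invoice, outstanding) {
  console.log(`name: ${invoice.customer}`);
  console.log(`amount: ${outstanding}`);
}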

LSP


{dataset}

Why function extraction?


{
   "type":"Extract Method",
   "description":"Extract Method private extractMijCommand(rulePos int, contents String) : List<String> extracted from private extractedRuleMij(contents String) : List<String> in class com.reason.bs.Ninja",
   "leftSideLocations":[
      {
         "filePath":"src/com/reason/bs/Ninja.java",
         "startLine":106,
         "endLine":120,
         "startColumn":5,
         "endColumn":6,
         "codeElementType":"METHOD_DECLARATION",
         "description":"source method declaration before extraction",
         "codeElement":"private extractedRuleMij(contents String) : List<String>"
      },
      {
         "filePath":"src/com/reason/bs/Ninja.java",
         "startLine":109,
         "endLine":109,
         "startColumn":13,
         "endColumn":70,
         "codeElementType":"VARIABLE_DECLARATION_STATEMENT",
         "description":"extracted code from source method declaration",
         "codeElement":"None"
      },
      {
         "filePath":"src/com/reason/bs/Ninja.java",
         "startLine":111,
         "endLine":111,
         "startColumn":17,
         "endColumn":72,
         "codeElementType":"VARIABLE_DECLARATION_STATEMENT",
         "description":"extracted code from source method declaration",
         "codeElement":"None"
      },
      {
         "filePath":"src/com/reason/bs/Ninja.java",
         "startLine":112,
         "endLine":112,
         "startColumn":17,
         "endColumn":91,
         "codeElementType":"VARIABLE_DECLARATION_STATEMENT",
         "description":"extracted code from source method declaration",
         "codeElement":"None"
      },
      {
         "filePath":"src/com/reason/bs/Ninja.java",
         "startLine":113,
         "endLine":113,
         "startColumn":17,
         "endColumn":54,
         "codeElementType":"VARIABLE_DECLARATION_STATEMENT",
         "description":"extracted code from source method declaration",
         "codeElement":"None"
      },
      {
         "filePath":"src/com/reason/bs/Ninja.java",
         "startLine":114,
         "endLine":114,
         "startColumn":17,
         "endColumn":251,
         "codeElementType":"VARIABLE_DECLARATION_STATEMENT",
         "description":"extracted code from source method declaration",
         "codeElement":"None"
      },
      {
         "filePath":"src/com/reason/bs/Ninja.java",
         "startLine":115,
         "endLine":115,
         "startColumn":17,
         "endColumn":42,
         "codeElementType":"EXPRESSION_STATEMENT",
         "description":"extracted code from source method declaration",
         "codeElement":"None"
      },
      {
         "filePath":"src/com/reason/bs/Ninja.java",
         "startLine":116,
         "endLine":116,
         "startColumn":17,
         "endColumn":39,
         "codeElementType":"RETURN_STATEMENT",
         "description":"extracted code from source method declaration",
         "codeElement":"None"
      },
      {
         "filePath":"src/com/reason/bs/Ninja.java",
         "startLine":110,
         "endLine":117,
         "startColumn":33,
         "endColumn":14,
         "codeElementType":"BLOCK",
         "description":"extracted code from source method declaration",
         "codeElement":"None"
      },
      {
         "filePath":"src/com/reason/bs/Ninja.java",
         "startLine":110,
         "endLine":117,
         "startColumn":13,
         "endColumn":14,
         "codeElementType":"IF_STATEMENT",
         "description":"extracted code from source method declaration",
         "codeElement":"None"
      }
   ],
   "rightSideLocations":[
      {
         "filePath":"src/com/reason/bs/Ninja.java",
         "startLine":110,
         "endLine":121,
         "startColumn":5,
         "endColumn":6,
         "codeElementType":"METHOD_DECLARATION",
         "description":"extracted method declaration",
         "codeElement":"private extractMijCommand(rulePos int, contents String) : List<String>"
      },
      {
         "filePath":"src/com/reason/bs/Ninja.java",
         "startLine":111,
         "endLine":111,
         "startColumn":9,
         "endColumn":63,
         "codeElementType":"VARIABLE_DECLARATION_STATEMENT",
         "description":"extracted code to extracted method declaration",
         "codeElement":"None"
      },
      {
         "filePath":"src/com/reason/bs/Ninja.java",
         "startLine":113,
         "endLine":113,
         "startColumn":13,
         "endColumn":68,
         "codeElementType":"VARIABLE_DECLARATION_STATEMENT",
         "description":"extracted code to extracted method declaration",
         "codeElement":"None"
      },
      {
         "filePath":"src/com/reason/bs/Ninja.java",
         "startLine":114,
         "endLine":114,
         "startColumn":13,
         "endColumn":87,
         "codeElementType":"VARIABLE_DECLARATION_STATEMENT",
         "description":"extracted code to extracted method declaration",
         "codeElement":"None"
      },
      {
         "filePath":"src/com/reason/bs/Ninja.java",
         "startLine":115,
         "endLine":115,
         "startColumn":13,
         "endColumn":50,
         "codeElementType":"VARIABLE_DECLARATION_STATEMENT",
         "description":"extracted code to extracted method declaration",
         "codeElement":"None"
      },
      {
         "filePath":"src/com/reason/bs/Ninja.java",
         "startLine":116,
         "endLine":116,
         "startColumn":13,
         "endColumn":247,
         "codeElementType":"VARIABLE_DECLARATION_STATEMENT",
         "description":"extracted code to extracted method declaration",
         "codeElement":"None"
      },
      {
         "filePath":"src/com/reason/bs/Ninja.java",
         "startLine":117,
         "endLine":117,
         "startColumn":13,
         "endColumn":38,
         "codeElementType":"EXPRESSION_STATEMENT",
         "description":"extracted code to extracted method declaration",
         "codeElement":"None"
      },
      {
         "filePath":"src/com/reason/bs/Ninja.java",
         "startLine":118,
         "endLine":118,
         "startColumn":13,
         "endColumn":35,
         "codeElementType":"RETURN_STATEMENT",
         "description":"extracted code to extracted method declaration",
         "codeElement":"None"
      },
      {
         "filePath":"src/com/reason/bs/Ninja.java",
         "startLine":112,
         "endLine":119,
         "startColumn":29,
         "endColumn":10,
         "codeElementType":"BLOCK",
         "description":"extracted code to extracted method declaration",
         "codeElement":"None"
      },
      {
         "filePath":"src/com/reason/bs/Ninja.java",
         "startLine":112,
         "endLine":119,
         "startColumn":9,
         "endColumn":10,
         "codeElementType":"IF_STATEMENT",
         "description":"extracted code to extracted method declaration",
         "codeElement":"None"
      },
      {
         "filePath":"src/com/reason/bs/Ninja.java",
         "startLine":140,
         "endLine":146,
         "startColumn":5,
         "endColumn":6,
         "codeElementType":"METHOD_DECLARATION",
         "description":"source method declaration after extraction",
         "codeElement":"private extractRuleMijDev(contents String) : List<String>"
      },
      {
         "filePath":"src/com/reason/bs/Ninja.java",
         "startLine":143,
         "endLine":143,
         "startColumn":20,
         "endColumn":59,
         "codeElementType":"METHOD_INVOCATION",
         "description":"extracted method invocation",
         "codeElement":"extractMijCommand(ruleMijPos,contents)"
      },
      {
         "filePath":"src/com/reason/bs/Ninja.java",
         "startLine":120,
         "endLine":120,
         "startColumn":9,
         "endColumn":28,
         "codeElementType":"RETURN_STATEMENT",
         "description":"added statement in extracted method declaration",
         "codeElement":"None"
      }
   ]
}
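A record like the one above can be reduced to the set of source lines that were moved out of the original method. The following is a minimal sketch: the field names (`leftSideLocations`, `startLine`, `endLine`, `description`) come from the record itself, while the helper name `extractedSourceLines` and the trimmed sample object are hypothetical.

```javascript
// Sketch: recover the extracted source lines from one RefactoringMiner record.
// Field names follow the record above; the helper name is hypothetical.
function extractedSourceLines(record) {
  const lines = new Set();
  for (const loc of record.leftSideLocations) {
    // Only fragments flagged as extracted code, not the enclosing method.
    if (loc.description !== "extracted code from source method declaration") continue;
    for (let n = loc.startLine; n <= loc.endLine; n++) lines.add(n);
  }
  return [...lines].sort((a, b) => a - b);
}

// Trimmed-down record keeping three of the locations listed above.
const record = {
  type: "Extract Method",
  leftSideLocations: [
    { startLine: 106, endLine: 120, codeElementType: "METHOD_DECLARATION",
      description: "source method declaration before extraction" },
    { startLine: 109, endLine: 109, codeElementType: "VARIABLE_DECLARATION_STATEMENT",
      description: "extracted code from source method declaration" },
    { startLine: 110, endLine: 117, codeElementType: "IF_STATEMENT",
      description: "extracted code from source method declaration" },
  ],
};

const lines = extractedSourceLines(record);
console.log(lines[0], lines[lines.length - 1]); // first and last extracted line
```

For this record the extracted code spans lines 109 through 117 of the source method, which is the kind of start/end target a line-prediction model could be trained on.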


Pipeline

Single refactoring?

Continuous?

Function extraction?

RefactoringMiner

Git cloning
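The filtering steps of the pipeline above can be sketched as plain predicates. This is an assumption-laden sketch: "continuous" is read here as "the extracted lines form one unbroken run", `extractedLines` is a hypothetical pre-computed field, and the "Single refactoring?" step is left out because the slides do not state its criterion.

```javascript
// Sketch of the filtering stage. The meaning of "continuous" used here
// (extracted lines form one unbroken run) is an assumption.
function isContinuous(lineNumbers) {
  const sorted = [...new Set(lineNumbers)].sort((a, b) => a - b);
  if (sorted.length === 0) return false;
  for (let i = 1; i < sorted.length; i++) {
    if (sorted[i] !== sorted[i - 1] + 1) return false;
  }
  return true;
}

// Keep only function extractions whose extracted lines are continuous.
// `extractedLines` is a hypothetical field computed from each record.
function filterCandidates(candidates) {
  return candidates
    .filter((c) => c.type === "Extract Method")
    .filter((c) => isContinuous(c.extractedLines));
}

const candidates = [
  { type: "Extract Method", extractedLines: [109, 110, 111, 112] },
  { type: "Extract Method", extractedLines: [40, 41, 45] }, // gap after line 41
  { type: "Rename Method", extractedLines: [] },            // not an extraction
];
const kept = filterCandidates(candidates);
console.log(kept.length); // number of candidates that pass both filters
```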


  • 49,982 repositories

  • Function extractions found in 19,936 of them

  • 523,667 different instances of function extraction (60% more than [Aniche et al. 2020])

  • Over 80% of them are continuous

The dataset in numbers


{model}

Two main architectures

  • Simple RNN

  • Ptr-Net


Simple RNN (diagram): Function → LSTM → Dense, Dense → Start Line, End Line

Ptr-Net
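For readers unfamiliar with pointer networks, the core step is an attention layer whose output is a probability distribution over input positions, following the scoring form u_j = vᵀ tanh(W1 e_j + W2 d) from [Vinyals et al. 2015]. The sketch below uses toy two-dimensional states and identity weights; those values, and the idea that each encoder state stands for one line of a function, are assumptions for illustration and not the trained model.

```javascript
// Pointer-style attention: score every input position, then normalize with
// softmax so the result is a distribution over positions. Toy weights only.
function matVec(m, v) {
  return m.map((row) => row.reduce((s, w, k) => s + w * v[k], 0));
}

function softmax(scores) {
  const max = Math.max(...scores); // subtract max for numerical stability
  const exps = scores.map((s) => Math.exp(s - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// u_j = v^T tanh(W1 e_j + W2 d), p = softmax(u)
function pointerAttention(encoderStates, decoderState, W1, W2, v) {
  const projD = matVec(W2, decoderState);
  const scores = encoderStates.map((e) => {
    const projE = matVec(W1, e);
    return projE.reduce((s, x, k) => s + v[k] * Math.tanh(x + projD[k]), 0);
  });
  return softmax(scores);
}

// Hypothetical setup: one encoder state per input line.
const encoderStates = [[0.1, 0.0], [0.9, 0.8], [0.2, 0.1]];
const decoderState = [1.0, 1.0];
const identity = [[1, 0], [0, 1]];
const probs = pointerAttention(encoderStates, decoderState, identity, identity, [1, 1]);
console.log(probs); // one probability per input position
```

Because the output ranges over input positions rather than a fixed vocabulary, the same layer works for functions of any length.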

\operatorname{softargmax}(x) = \sum_i i \, \frac{e^{\beta x_i}}{\sum_j e^{\beta x_j}}
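The formula above is the expected index under a softmax with temperature parameter β, which gives a differentiable stand-in for argmax. A minimal sketch follows, assuming 0-based positions and β ≥ 0 (neither is stated on the slide):

```javascript
// Softargmax: expected index under softmax(beta * x).
// Assumes 0-based positions and beta >= 0.
function softargmax(x, beta = 1) {
  const max = Math.max(...x); // shift by the max so exp() cannot overflow
  const weights = x.map((xi) => Math.exp(beta * (xi - max)));
  const total = weights.reduce((a, b) => a + b, 0);
  return weights.reduce((acc, w, i) => acc + i * (w / total), 0);
}

const scores = [0.1, 0.2, 3.0, 0.4];
console.log(softargmax(scores, 1));  // soft estimate, pulled toward the middle
console.log(softargmax(scores, 50)); // large beta approaches the hard argmax (index 2)
```

Small β values blur the estimate toward the mean index, while large β values sharpen it toward the true argmax at the cost of smaller gradients elsewhere.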


Betas

Jaccard
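The Jaccard index measures overlap as intersection over union. A minimal sketch follows, assuming it is applied to a predicted and a true line range, each given as an inclusive `{ start, end }` pair; that representation is an assumption, since the slide only names the metric.

```javascript
// Jaccard index of two inclusive line ranges { start, end }.
function jaccard(a, b) {
  const len = (r) => Math.max(0, r.end - r.start + 1);
  const overlap = Math.max(0, Math.min(a.end, b.end) - Math.max(a.start, b.start) + 1);
  const union = len(a) + len(b) - overlap;
  return union === 0 ? 0 : overlap / union;
}

const truth = { start: 109, end: 117 };     // lines actually extracted
const predicted = { start: 111, end: 119 }; // hypothetical model output
console.log(jaccard(truth, predicted));     // 7 shared lines out of 11 in the union
```

A score of 1 means the predicted range matches exactly, and 0 means the ranges do not overlap at all.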


Optimization with Optuna


  • batch size = 32
  • hidden size = 32
  • learning rate = 0.00231519996
  • weight decay = 0.0001155681898

{results}

Comparing Transformer-based embeddings


Comparing multilingual versions

Comparing with GloVe


Comparing Architectures


Best Model

  • Optuna
  • dbmc1
  • Ptr-Net


  • Understand if deep learning models are capable of predicting fine-grained refactorings
  • Create a model for automated function extraction

Did we meet our goals?


Thank you!


{background}

...and other unused slides

[Bahdanau et al. 2014]


 [Vinyals et al. 2015]


word2vec

You shall know a word by the company it keeps.

Firth (1957)

