Deep Software Engineering for Artificial Code Generation

Talking points (09.16.19)

  • Why does understanding the abstract representations of source code matter? What is the motivation behind it? How shall we introduce the idea?

  • Uniqueness of source code:

    • Granularity concept

    • Automatic programming applicability

  • Any studies in the Computer Science community?

    • Automatic Programming (Program Synthesis)

    • Structured Generative Models of Natural Source Code (Maddison & Tarlow, 2014)

Deep Software Engineering operates on tensors to model high-level representations of software data; such operations are employed to automate SE tasks

[Diagram: Neural Network → SE]

Artificial Code Generation is the automatic construction of source code by means of Generative Models; algorithms that are able to generate source code might be substantially better at intrinsically understanding SE tasks

[Diagram: Neural Network → Program]

Holistic View

[Diagram: Source Code Generative Agent relating Real Source Code and Synthetic Source Code]

A (deep) NN model is able to learn the abstract features of source code and generate code with the same structure

[Diagram: Source Code Generative Agent → SE Discriminative Agent]

A generative agent can be converted into a discriminative one to solve a supervised task 

 

[Diagram: Transfer Learning — Source Code → SE Task → Supervised SE Task]

A fine-tuned discriminative agent is able to solve and automate SE tasks in an enhanced fashion

 

SE Discriminative Agent

OpenAI has shown that Generative Agents can be trained for specific discriminative tasks

 

[Diagram: Source Code Generative Agent → SE Discriminative Agent]

Purpose: to synthesize and understand the intrinsic properties of the source code

Purpose: to perform classification (e.g., bug fixing, security-related identification, traceability)  or regression (prediction of metrics: bugs, source code size, error proneness)

Multi-taskers

Are machines able to learn how to generate "unique" source code?

Main Research Question

We know from Hoeffding's inequality that "learning" is feasible. However, obtaining such generalization would require more than training a single agent. Goodfellow et al. demonstrate that competition between a generator and a discriminator improves both agents' knowledge.

[Diagram: SC Generator maps Gaussian Noise to Synthetic Source Code; an SC Discriminator contrasts it with Real Source Code]

"unique" source code occurs as a collective behavior from the interaction of several (generative/discriminative) agents in a controlled environment  

Hypothesis

Agents might be able to produce "unique" source code as an emergent behavior from their non-linear interactions with each other. Such emergent behavior is supported by observations from complexity science

 

Enhanced Software 2.0 Programs? Or Perhaps Brand-New Programs?

 

Enhanced synthesized source code?

This research comprises four views

 First view: The generative Agent

 Second view: The discriminative Agent

 Third view: Generative Agents by Competition

 Fourth view: Emergent "uniqueness" by complex interactions


SE-Based Benchmarks for Understanding and Comparing Deep Generative Source Code Models

Why benchmarking for deep generative source code models?

  • Fair comparison against other models
  • Understanding of learning deep models 
  • Standardization of datasets and statistical analysis
  • Enhancing reproducibility and replicability
  • Identifying DL-oriented errors like:
    • Long-range dependencies
    • Rare tokens
    • Semantic (e.g., returning the correct type)
 

A Benchmark is composed of:

  • Test Case (goal): Identifying Long-Range Interaction
  • Testbed: 100 Java methods with different token sizes
  • Procedure:
    1. Train a generative model G(x)
    2. Compute the cross-entropy loss at least 40 times for the testbed
  • Performance metric (see the sketch below):
    • Mean cross-entropy across methods
    • Standard deviation of the cross-entropy
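
A minimal sketch of the benchmark procedure above, assuming the trained generative model is exposed through a hypothetical `model_cross_entropy(method_tokens)` callable; the reported statistics follow the performance metrics in the list.

```python
import numpy as np

def run_benchmark(testbed, model_cross_entropy, repetitions=40):
    """testbed: list of tokenized Java methods (each a list of tokens)."""
    per_method = []
    for method in testbed:
        # Repeat the measurement to account for stochastic evaluation.
        losses = [model_cross_entropy(method) for _ in range(repetitions)]
        per_method.append(np.mean(losses))
    per_method = np.array(per_method)
    # Performance metrics: mean and standard deviation of cross-entropy across methods.
    return {"mean_xent": per_method.mean(), "std_xent": per_method.std(ddof=1)}

# Example with a dummy model that scores longer methods as harder:
dummy_model = lambda tokens: np.log(len(tokens) + 1) + np.random.normal(0, 0.01)
testbed = [["public", "void", "m", "(", ")", "{", "}"]] * 5
print(run_benchmark(testbed, dummy_model))
```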

Structure of the paper

Introduction

  • Motivating how important it is to build proper benchmarks for generative models, and the advantages of having tailored statistical procedures to evaluate DL models
  • Introducing generative models in a gentle (guided) way, together with transfer learning
Train Code Data \sim p_{data}(x)
Generated Code Samples \sim p_{model}(x)

The goal of generative research is to answer the question: How can we learn p_model similar to p_data?

p_{model}(x) \approx p_{data}(x)

Deep Generative Obs. Code Models (Observational Code Model)

  • Observational Joint: p_{data}(x,y,z,\theta)
  • Generative model relationships: (x: source code) (z or y: labels) (\theta: parameters)
    • Unconditioned Generator: p(x,\theta)
    • Conditioned Generator: p(x|z,\theta)
    • Discriminator: p(y|x,\theta)

Deep Generative Interventional Code Models (Interventional Code Model)

  • Interventional Joint: p_{model}(x,y,z,\theta)
  • Generative model relationships: (x: source code) (z or y: labels) (\theta: parameters)
    • Unconditioned Generator: p_{do(\Theta = \theta)}(x,\theta)
    • Conditioned Generator: p(x|z,do(\theta))
    • Discriminator: p(y|x,do(\theta))


Triangle of Observational/Interventional Sampling

  • Learning regimes: Self-Supervised, Unsupervised, Supervised
  • Unconditional: p(x|w=w_0,\theta) (observational) vs. p(x|w=w_0,do(\theta)) (interventional)
  • Conditional: p(x|w,\theta) (observational) vs. p(x|w,do(\theta)) (interventional)
  • Classification: p(y|x,\theta) (observational) vs. p(y|x,do(\theta)) (interventional)
  • Observational sampling: p_{data}(x,w,y,\theta) \to G^O(x,\theta) and p_{data}(x,y,z,\theta) \to D_{\theta}^O(y|x)
  • Interventional sampling: p_{model}(x,w,y,\theta) \to G_{do(\Theta=\theta)}^I(x,\theta)

Deep Code Generator: Autoregressive Transfer Learning

  • Self-Supervised: AWD-LSTM, Transformer
  • Supervised (via Fine-Tuning): Transferred AWD-LSTM, Transferred Transformer
  • Unsupervised: ?
  • Model roles: Unconditioned Generator p(x,\theta), Conditioned Generator p(x|z,\theta), Discriminator p(y|x,\theta)

Sampling

  • Sampling methods: temperature (others: top-k, nucleus); see the sketch below
  • Unconditioned training and sampling: p_{do(\Theta = \theta)}(x,\theta)
  • Conditioned sampling: p(x|z,do(\theta))
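
The sampling methods named above can be sketched over a raw logit vector: the function below applies temperature scaling, then optional top-k and nucleus (top-p) filtering before drawing one token. The toy `logits` vector in the example is illustrative only.

```python
import numpy as np

def sample_token(logits, temperature=1.0, top_k=None, top_p=None, rng=np.random.default_rng()):
    logits = np.asarray(logits, dtype=float) / max(temperature, 1e-8)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]          # tokens sorted by probability
    keep = np.ones_like(probs, dtype=bool)
    if top_k is not None:                    # keep only the k most likely tokens
        keep[:] = False
        keep[order[:top_k]] = True
    if top_p is not None:                    # nucleus: smallest set with cumulative mass >= p
        cum = np.cumsum(probs[order])
        nucleus = order[: np.searchsorted(cum, top_p) + 1]
        mask = np.zeros_like(probs, dtype=bool)
        mask[nucleus] = True
        keep &= mask
    probs = np.where(keep, probs, 0.0)
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

# Example: sample one token id from a toy 5-token vocabulary.
print(sample_token([2.0, 1.0, 0.5, 0.1, -1.0], temperature=0.8, top_k=3, top_p=0.9))
```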

Empirical Evaluation

  • Unconditioned Model
  • Conditioned Model

Unconditioned Evaluation (interventional and observational)

Unconditioned Models: Manifold Analysis

 
  • Manifold Visualization and Vectorization: code2vec

  • Unconditioned Interventional Sampling: p_{do(\Theta = \theta)}(x,\theta)
  • Unconditioned Observational Sampling: p(x,\theta)

Unconditioned Models: Semantic Manifold

 

  • Unconditioned Interventional Sampling: p_{do(\Theta = \theta)}(x,\theta)
  • Unconditioned Observational Sampling: p(x,\theta)
  • KL( p_{do(\Theta = \theta)}(x,\theta) || p(x,\theta) ) and KL( p(x,\theta) || p_{do(\Theta = \theta)}(x,\theta) )
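
A hedged sketch of estimating the two KL terms above, assuming samples from the observational and interventional models have already been vectorized (e.g., with code2vec) and reduced to a one-dimensional summary; the KL divergence is then estimated from smoothed histograms. Both directions are computed because KL is asymmetric.

```python
import numpy as np
from scipy.stats import entropy

def kl_between_sample_sets(obs_values, int_values, bins=30):
    obs_values, int_values = np.asarray(obs_values), np.asarray(int_values)
    lo = min(obs_values.min(), int_values.min())
    hi = max(obs_values.max(), int_values.max())
    p, _ = np.histogram(obs_values, bins=bins, range=(lo, hi))
    q, _ = np.histogram(int_values, bins=bins, range=(lo, hi))
    p = (p + 1e-9) / (p + 1e-9).sum()   # smooth to avoid zero bins
    q = (q + 1e-9) / (q + 1e-9).sum()
    return entropy(p, q), entropy(q, p)  # KL(obs || int), KL(int || obs)

# Toy example with synthetic 1-D projections of the two manifolds:
obs = np.random.normal(0.0, 1.0, 1000)    # stands in for p(x, theta) samples
inter = np.random.normal(0.5, 1.2, 1000)  # stands in for p_do(x, theta) samples
print(kl_between_sample_sets(obs, inter))
```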

Unconditioned Models: Syntactic Manifold

 

  • Unconditioned Interventional Sampling: \sigma_{syx}( p_{do(\Theta = \theta)}(x,\theta) )
  • Unconditioned Observational Sampling: \sigma_{syx}( p(x,\theta) )

Unconditioned Models: SE Structure Manifold

 

  • Unconditioned Interventional Sampling: \sigma_{str}( p_{do(\Theta = \theta)}(x,\theta) )
  • Unconditioned Observational Sampling: \sigma_{str}( p(x,\theta) )

In conclusion, Manifold Analysis is threefold:

  • Semantic
  • Syntactic
  • SE Structure

Conditioned Evaluation (potential outcomes and counterfactuals)

Conditioned Models: Randomized Experiment

 
Y is a code assessment property and A is the "treatment", i.e., the model employed.

  • A \in \{1: lstm, 0: txl\}
  • Y \to (syx, str, loss)
  • E \in \{1: noisy, 0: feasible\}

Conditioned Models: Randomized Experiment

 
  • A \in \{1: lstm, 0: human\}, Y \to syx
  • Y^{a=1}: syntax correctness across all the individuals (or samples) under the lstm treatment
  • Y^{a=0}: syntax correctness across all the individuals (or samples) under the human treatment

Causal Inference Evaluation

 
  • A \in \{1: lstm, 0: human\}, Y \to syx
  • We say that the generative agent (or autoregressive model) has a causal effect on Syntax Correctness if Y^{a=1} \neq Y^{a=0}

Causal Inference Evaluation: Null Causality

 
A \in \{1: lstm, 0: human\}, \; Y \to syx

We are interested in null causality. We don't want to observe a causal effect on outcomes.

\mathbb{E}_\theta[P^{a=1}] - \mathbb{E}_\theta[P^{a=0}] = 0
\mathbb{E}_\theta[\sigma_{syx}(\hat{x})] - \mathbb{E}_\theta[\sigma_{syx}(x)] = 0
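
A rough sketch of testing null causality under the setup above: estimate \mathbb{E}[P^{a=1}] - \mathbb{E}[P^{a=0}] as a difference in mean syntax correctness and bootstrap a confidence interval; null causality is plausible when the interval contains zero. The 0/1 syntax-correctness outcomes below are illustrative.

```python
import numpy as np

def average_treatment_effect(y_lstm, y_human, n_boot=2000, rng=np.random.default_rng(0)):
    y_lstm, y_human = np.asarray(y_lstm, float), np.asarray(y_human, float)
    ate = y_lstm.mean() - y_human.mean()
    boots = []
    for _ in range(n_boot):
        b1 = rng.choice(y_lstm, size=len(y_lstm), replace=True)
        b0 = rng.choice(y_human, size=len(y_human), replace=True)
        boots.append(b1.mean() - b0.mean())
    lo, hi = np.percentile(boots, [2.5, 97.5])
    return ate, (lo, hi)   # null causality is plausible if the CI contains 0

# Toy example with simulated syntax-correctness indicators:
ate, ci = average_treatment_effect(np.random.binomial(1, 0.85, 200),
                                   np.random.binomial(1, 0.90, 200))
print(f"estimated effect = {ate:.3f}, 95% CI = {ci}")
```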

Interactions

 
A \in \{1: lstm, 0: human\}, \; Y \to syx

We are interested in null causality. We don't want to observe a causal effect on outcomes.

\mathbb{E}_\theta[P^{a=1}] - \mathbb{E}_\theta[P^{a=0}] = 0
\mathbb{E}_\theta[\sigma_{syx}(\hat{x})] - \mathbb{E}_\theta[\sigma_{syx}(x)] = 0

Deep Generative Source Code Models

 
  • It introduces what p(x), p(x|y), and p(y|x) are
  • Autoregressive Models as Generative Models
  • Transfer-Learned Autoregressive Models
  • Sampling Methods: top-k or nucleus
  • Manifold Visualization and Vectorization: code2vec

Data Collection and Analysis

  • Datasets: raw data and structured data
    • CodeSearchChallenge (~6M-method g)
    • TitanGenCode (~1M-language-file g)
    • TufanoBuggy/NonBuggy (~1M)

Data Collection and Analysis

  • Distribution of the dataset:
    • Training (0.8)
    • Validation (0.1)
    • Test (0.09)
    • BPE (0.01)

[Diagram: Testbed Generation and Sampling pipeline — [feasible-java-methods] and [feasible-py-methods] pass through a Transformation step into [noisy-java-methods] and [noisy-py-methods]; Deep Generators ([TransformerXL], [AWD-LSTM]) sample [transXL-java-samples], [lstm-java-samples], [transXL-py-samples], and [lstm-py-samples]; evaluation involves \mathbb{E}_\theta[P^{a=1}] - \mathbb{E}_\theta[P^{a=0}], p(x|w=w_0,do(\theta)), G^O(x,\theta), and G_{do(\Theta=\theta)}^I(x,\theta)]

Unconditional and Conditional Interpretability

[Diagram: Unconditional Interpretability compares Artificial Code Data and Human Code Data on a Manifold; Conditional Interpretability relies on Causal Inference; numbered steps relate vectors \vec\alpha, \vec\beta, \vec\gamma, a transformation T_i(x) = \hat{x}, and a distance d_{k}(\vec\alpha,\vec\beta)]

Manifolds

  • SE Structure Distance: each sample is summarized by a vector of SE metrics (LOC, CYCLO, FORs, ...), giving \vec\alpha and \vec\beta; d_{str}(\vec\alpha,\vec\beta)=\|\vec\alpha - \vec\beta\|
  • Compilation Error Distance: \vec\gamma collects compilation/syntax-error indicators (e.g., a missing ';'); d_{cpl}(\vec\gamma,\vec\beta)=\|\vec\gamma - \vec\beta\|
  • Information Content & Semantic Distance between samples \sim p_{human}(x,\theta) and \sim p_{machine}(x,\theta): d_{sem}(human,artificial) = KL( G^O(x,\theta) || p(x|w=w_0,do(\theta)) ), with variants d_{sem}^{KL}(p_{human},p_{machine}) and d_{sem}^{s}(p_{human},p_{machine})
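
A small sketch of the SE Structure Distance above. The metric extractor below is a crude line/keyword count used only for illustration; a real pipeline would compute LOC, CYCLO, and loop counts with proper static analysis.

```python
import numpy as np

def se_metric_vector(source: str) -> np.ndarray:
    lines = [l for l in source.splitlines() if l.strip()]
    loc = len(lines)
    # Very rough cyclomatic-complexity proxy: 1 + number of branching keywords/operators.
    cyclo = 1 + sum(source.count(k) for k in ("if", "for", "while", "case", "&&", "||"))
    fors = source.count("for")
    return np.array([loc, cyclo, fors], dtype=float)

def d_str(alpha: np.ndarray, beta: np.ndarray) -> float:
    return float(np.linalg.norm(alpha - beta))   # d_str = ||alpha - beta||

human = se_metric_vector("int f(int n){\n int s=0;\n for(int i=0;i<n;i++) s+=i;\n return s;\n}")
machine = se_metric_vector("int g(int n){\n return n;\n}")
print(d_str(human, machine))
```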

Data Collection and Analysis

  • Pre-processing with Byte-Pair Encoding (BPE), applied across the Training / Validation / Test / BPE splits
    • Control of the vocabulary
    • Compressing information with minimum loss of information
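
An illustrative BPE pre-processing sketch, assuming the HuggingFace `tokenizers` package and a hypothetical file `bpe_split.txt` holding the raw methods reserved for learning the merges; vocabulary size and special tokens are placeholders.

```python
import os
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
# Learn merges on the split reserved for BPE; vocab_size bounds the vocabulary.
tokenizer.train(files=["bpe_split.txt"], vocab_size=16000, min_frequency=2,
                special_tokens=["<pad>", "<unk>", "<s>", "</s>"])

os.makedirs("bpe-java", exist_ok=True)
tokenizer.save_model("bpe-java")   # writes vocab.json and merges.txt

encoded = tokenizer.encode("public static int add(int a, int b) { return a + b; }")
print(encoded.tokens)
```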

Data Collection and Analysis

  • SE-oriented Exploratory Analysis (not just descriptive statistics)
    • Finding unbalanced and biased data (KL divergence and cross-entropy)
    • Entropy levels and information gain per method
    • Structure of the data
    • Quality of the data (SE metrics) and syntax correctness
    • Distribution of the data
    • Data snooping (clone detection)

Data Collection and Analysis

  • Datasets
  • Testbeds
  • Pre-processing with Byte-Pair Encoding
  • SE-oriented Exploratory Analysis 

Benchmarking and Performance Metric Design

  • For Generative Unconditioned Models P(x)
  • For Generative Conditioned Models P(x|y)
  • For Discriminative Model P(y|x):
    • Just one "special" SE-task
    • Transfer Learnt Discriminative Model

Benchmarking and Performance Metric Design

  • For Generative Conditioned Models P(x|y)
  • Test Case: Long-range interactions. Statistical analysis (probability on closing tokens). For example, the distance between '{' and '}'
  • Testbeds: [brace-end-x]:
    • 0-20 granularity
    • 20-40 granularity
    • 40-60 granularity
  • Procedure:
    • Retrieve the predicted probability for the ending token
    • Make several inferences (around 35) to create confidence intervals
  • Performance Metric: Mean P("}") (see the sketch below)
 
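A sketch of this benchmark, with the model hidden behind a hypothetical `next_token_probs(prefix)` interface that returns next-token probabilities; the mean probability of the closing token and a normal-approximation confidence interval follow the procedure above.

```python
import numpy as np

def closing_brace_probability(testbed, next_token_probs, inferences=35):
    """testbed: list of token lists whose last expected token is '}'."""
    means, cis = [], []
    for method in testbed:
        prefix = method[:-1]                 # everything before the closing brace
        runs = np.array([next_token_probs(prefix).get("}", 0.0) for _ in range(inferences)])
        mean = runs.mean()
        half = 1.96 * runs.std(ddof=1) / np.sqrt(len(runs))   # normal-approx 95% CI
        means.append(mean)
        cis.append((mean - half, mean + half))
    return means, cis

# Toy example with a dummy model that assigns a fixed probability to '}':
dummy = lambda prefix: {"}": 0.7, ";": 0.2, "x": 0.1}
print(closing_brace_probability([["{", "int", "x", ";", "}"]], dummy, inferences=5))
```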

Benchmarking and Performance Metric Design

  • For Generative Unconditioned Models P(x)
  • Test Case: Alien Meaningfulness Test Case. To identify unique/new methods or files that are not contained in the original training set. This analysis will provide insights into how different the generated code is from the code used for training
  • Testbeds: [gen-code-x]
  • Procedure: Code Vectorization, K-medoids, overlapping, compute distances (see the sketch below)
  • Performance Metric: Uniqueness or distance between medoids
 
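A sketch of the uniqueness metric above: pick a medoid for the human set and for the generated set (vectors are assumed to come from a code embedder such as code2vec) and report the distance between the two medoids.

```python
import numpy as np
from scipy.spatial.distance import cdist

def medoid(vectors: np.ndarray) -> np.ndarray:
    d = cdist(vectors, vectors)              # pairwise distances within the set
    return vectors[d.sum(axis=1).argmin()]   # point minimizing total distance

def uniqueness(human_vecs: np.ndarray, generated_vecs: np.ndarray) -> float:
    c_h, c_g = medoid(human_vecs), medoid(generated_vecs)
    return float(np.linalg.norm(c_h - c_g))  # u = |d(c_i, c_j)|

# Toy example with random embeddings standing in for code vectors:
rng = np.random.default_rng(1)
print(uniqueness(rng.normal(0, 1, (50, 8)), rng.normal(0.3, 1, (40, 8))))
```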

[Diagram: Original Training Set overlapped with Alien Prototypes and an Alien Criticism, ranked by confidence or entropy]

Case Study 1

  • Benchmarking for Unconditioned/Conditioned Autoregressive Models
    • Models: AWD-LSTM, Transformer, and n-gram
    • Dataset: [SearchCodeChallenge] and [TitanGenCode]
    • Run Previous Benchmarks

Case Study 2

  • Benchmarking for Transfer Autoregressive Model
    • Models: AWD-LSTM (transferred)
    • Dataset: [Buggy-NonBuggy-Tufano] 
    • Run Previous Discriminative Benchmark

The generative agent

First View

p(x,y)

 Source Code Generative Agent

The generative model is autoregressive. That is, it is trained on sequential data by predicting the next token 

p(x,y)

Conditioned Sampling

Unconditioned Sampling

P(w_0, \dots, w_m) = \prod_{i=0}^{m} P(w_i|w_0,\dots,w_{i-1})
P(w_0|token)
P(w_i|w_1,\dots,w_{i-1})
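
A tiny, self-contained illustration of the autoregressive factorization above, using a bigram model estimated from a toy token corpus; the bigram model truncates the history to the previous token purely to keep the sketch small.

```python
from collections import Counter, defaultdict
import math

corpus = [["for", "(", "int", "i", "=", "0", ";", "i", "<", "n", ";", "i", "++", ")"],
          ["int", "x", "=", "0", ";"]]

bigrams, unigrams = defaultdict(Counter), Counter()
for seq in corpus:
    for prev, cur in zip(["<s>"] + seq, seq):
        bigrams[prev][cur] += 1
        unigrams[prev] += 1

def log_prob(sequence, alpha=0.1, vocab_size=50):
    """Chain rule: log P(w_0..w_m) = sum_i log P(w_i | history)."""
    lp = 0.0
    for prev, cur in zip(["<s>"] + sequence, sequence):
        # Additive smoothing so unseen continuations get non-zero probability.
        p = (bigrams[prev][cur] + alpha) / (unigrams[prev] + alpha * vocab_size)
        lp += math.log(p)
    return lp

print(log_prob(["int", "x", "=", "0", ";"]))
```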

Conditioned Sampling Analysis

  • Cross-Entropy measurement
    • Noisy datasets
    • Optimal testbeds
  • Long-term dependencies with special characters
  • Typification of errors
P(w_i|w_1,\dots,w_{i-1})

Are machines able to produce correct source code? Which types of errors are generated?

Research Question

Unconditioned Sampling Analysis

  • Manifold analysis
  • KL-Divergence comparison
  • Alien clustering
P(w_i|w_1,\dots,w_{i-1})

Are machines able to produce unseen source code? Which type of code is generated?

Research Question

Study Design

Autoregressive Models Under Study

  • n-grams (for Source-Code)
  • LSTM (many-to-one, many-to-many)
  • GRU (many-to-one, many-to-many)
  • Bi-(LSTM/GRU)
  • Transformer (for Source-Code)

Output Space Analysis

Feature Space Analysis

Conditioned Sampling

Unconditioned Sampling

P(w_0|token)
P(w_i|w_1,\dots,w_{i-1})

Feature Clustering Representation

Cell Activation


Pipeline (on a given 'g' granularity)

Unconditioned Sampling

P(w_0|token)

Pipeline (on a given 'g' granularity)

  1. Open-Ended Sampling (Beam, Top k & Nucleus)
  2. Data Vectorization (skip-grams or autoencoders)
  3. Identifying clusters on synthesized and human source code:
    • Convex - Concave
    • Computing centroids
    • Uniqueness (criticisms and prototypes: separation of centroids) 

Unconditioned Sampling

P(w_0|token)

A math representation of Source Code "Uniqueness"

Unconditioned Sampling

P(w_0|token)
\mathbb{R}^n \to \mathbb{R}^3

Uniqueness: distance from centroids 

u = |d(c_i, c_j)|

Pipeline (on a given 'g' granularity)

  1. Open-Ended Sampling (Beam, Top k & Nucleus)
  2. Data Vectorization (skip-grams or autoencoders)
  3. Identifying clusters on synthesized and human source code:
    • Convex - Concave
    • Computing centroids
    • Uniqueness (criticisms and prototypes: separation of centroids) 
  4. Run Syntax Checker on Medoids (a measure of meaningfulness)

Unconditioned Sampling

P(w_0|token)

A math representation of Source Code "Meaningfulness"

Unconditioned Sampling

P(w_0|token)

Static: Syntax Checkers

Syntax Error Rate

Static and Dynamic Meaningfulness
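
A sketch of the static meaningfulness check (syntax error rate). Python's `ast` module is used here, which assumes the snippets are Python methods (e.g., the [feasible-py-methods] testbed); Java samples would require a Java parser instead.

```python
import ast

def syntax_error_rate(snippets):
    failures = 0
    for code in snippets:
        try:
            ast.parse(code)          # static check: does the snippet parse?
        except SyntaxError:
            failures += 1
    return failures / max(len(snippets), 1)

samples = ["def add(a, b):\n    return a + b\n",   # parses
           "def broken(a:\n    return a\n"]         # syntax error
print(syntax_error_rate(samples))                   # -> 0.5
```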

Pipeline (on a given 'g' granularity)

  1. Open-Ended Sampling (Beam, Top k & Nucleus)
  2. Data Vectorization (skip-grams or autoencoders)
  3. Identifying clusters on synthesized and human source code:
    • Convex - Concave
    • Computing centroids
    • Uniqueness (criticisms and prototypes: separation of centroids) 
  4. Run Syntax Checker on Medoids (a measure of meaningfulness)
  5. Identify and describe "aliens" by overlapping human and synthetic datasets

Unconditioned Sampling

P(w_0|token)

Aliens by Overlapping

Unconditioned Sampling

P(w_0|token)

[Diagram: overlapping human and synthetic sets revealing an Alien Cluster and an Alien Sample, ranked by confidence or entropy]

Pipeline (on a given 'g' granularity)

  1. Open-Ended Sampling (Beam, Top k & Nucleus)
  2. Data Vectorization (skip-grams or autoencoders)
  3. Identifying clusters on synthesized and human source code:
    • Convex - Concave
    • Computing centroids
    • Uniqueness (criticisms and prototypes: separation of centroids) 
  4. Run Syntax Checker on Medoids (a measure of meaningfulness)
  5. Identify and describe "aliens" by overlapping human and synthetic datasets
  6. Compute KL-Divergence (distance from synthetic and human sets)

Unconditioned Sampling

P(w_0|token)

Pipeline (granularity: method level)

Conditioned Sampling

P(w_i|w_1,\dots,w_{i-1})

Pipeline (granularity: method level)

  1. Long-range interactions: statistical analysis (probability on closing tokens). For example, the distance between '{' and '}'
    • Generate testbeds (if-else, {-}, (-), return, ';'):
      • 0-20 granularity
      • 20-40 granularity
      • 40-60 granularity
    • Retrieve the predicted probability for the ending token 
    • Make several inferences (around 35) to create confidence intervals
 

Conditioned Sampling

P(w_i|w_1,\dots,w_{i-1})


Pipeline (granularity: method level)

  1. ..
  2. Error Analysis: to gain deeper insight into the errors that are unique to the generative models 
    • "... we define a character to be an error if the probability assigned to it by a model on the previous time step is below 0.5 ..."
    • Build test-set
    • Compute [avg+-std] probability assigned to the correct (target) token
 

Conditioned Sampling

P(w_i|w_1,\dots,w_{i-1})
[Figure: per-token probabilities for the prefix "for (int i = 0" — e.g., candidate tokens ';', '=', 'for', 'int' with probabilities such as 0.3, 0.4, 0.01, 0.01]
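
A sketch of this error-analysis step: following the quoted definition, a position counts as an error when the probability assigned to the correct (target) token is below 0.5. The probabilities below are illustrative placeholders for the model's conditioned predictions.

```python
import numpy as np

def error_analysis(target_probs, threshold=0.5):
    probs = np.asarray(target_probs, dtype=float)
    return {
        "avg_prob": probs.mean(),
        "std_prob": probs.std(ddof=1),
        "error_rate": float((probs < threshold).mean()),  # fraction of "errors"
    }

# Probabilities assigned to the true next tokens of "for (int i = 0 ...":
print(error_analysis([0.3, 0.4, 0.01, 0.01, 0.9, 0.8]))
```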


Pipeline (granularity: method level)

  1. ...
  2. ...
  3. Failure cases: identify the limitations of the generative models and the relative severity of each error, and suggest areas for further study [probability]
    • Categorize errors made in previous steps and create the procedures to remove them:
      • "Return" oracle (e.g., scaling up neurons)
      • "Repetitive tokens" oracle (e.g., augmenting n-gram window)
      • "Bad Smells / Antipatterns" oracle 
      • "Bugs" oracle
      • "Syntax error" oracle
 

Conditioned Sampling

P(w_i|w_1,\dots,w_{i-1})


[Diagram: Transformer (20 layers) vs. Transformer (50 layers), with syntax error removal and token repetition removal]


Pipeline (granularity: method level)

  1. ...
  2. ...
  3. ...
  4. Entropy Analysis: (counterfactual analysis)
    1. Well-written code testbed
    2. Noisy code testbed
      • Use mutations on well-written code (see the sketch after this slide)
 

Conditioned Sampling

P(w_i|w_1,\dots,w_{i-1})

Well-written testbed

Noisy testbed 1

mutations

Noisy testbed 2

Which generative model is a good predictor?

Correlation

Causation
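
A sketch of building the noisy testbeds by mutating well-written code, as referenced in the entropy-analysis step above; the token-level deletion/swap mutations are simple stand-ins for real mutation operators.

```python
import random

def mutate(tokens, n_mutations=1, rng=random.Random(0)):
    tokens = list(tokens)
    for _ in range(n_mutations):
        if len(tokens) < 2:
            break
        if rng.random() < 0.5:                       # delete a random token
            del tokens[rng.randrange(len(tokens))]
        else:                                        # swap two adjacent tokens
            i = rng.randrange(len(tokens) - 1)
            tokens[i], tokens[i + 1] = tokens[i + 1], tokens[i]
    return tokens

well_written = ["int", "x", "=", "0", ";"]
noisy_testbed = [mutate(well_written, n_mutations=k) for k in (1, 2, 3)]
print(noisy_testbed)
# A model that is a good predictor should report noticeably higher cross-entropy
# on the noisy testbed than on the well-written testbed.
```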

Projects

  • Subproject 1: "Visualizing and Understanding Deep Autoregressive Generators for Source Code"

The discriminative agent

Second View

p(y|x)

 Source Code Discriminative Agent

The generative model can be adapted by transfer learning strategies to become a discriminative one 

p(y|x)

Classification

Regression

Classification Analysis

  • Detecting Bugs
  • Security-related identification
  • Summarizing Source Code

Are pre-trained machines able to enhance the performance of supervised approaches? To what extent do Source Code Generators optimize the classification error? 

Research Question

 Source Code Generative Agent + Discriminative

SE Multitask Agent

Unsupervised multi-task learners are employed in Language Models; in the same way, we can employ SE Multitask learners!

  • Code Summarization
  • Code Completion 
  • Code Translation (bug fixing)

Projects

  • Subproject 2: "Towards Enhancing Deep Software Classifiers via Pre-train (Generative) Models"
  • Subproject 3: "Unsupervised Software Maintenance Multi-tasker "

Subproject #2

Universal Language Model FineTuning (ULMFiT)

  • Pretraining a LM for better performance on downstream tasks
  • Avoids catastrophic forgetting through specialized learning rates and gradual unfreezing of model weights

Counterfactual (Ablation) Study

  • Compare frozen LM vs ULMFiT
  • Trace back performance on downstream tasks to different error oracles
  • Allows for analysis of which error type most impacts performance on downstream tasks

Downstream Tasks

  • Classification:
    • Vulnerable / Non Vulnerable
    • Design Pattern
    • Code Smell
    • Clone Detection
  • Regression:
    • Bug localization
  • Sequence to Sequence:
    • Bug Repair
    • Comment Generation
    • Code Migration
    • Test Case Generation

Subproject #3

Broad Research Goals

  • Analyze different schemes of training an Auto Regressive Language Model for performing supervised tasks
  • Compare approach against other SOTA models on the different supervised tasks
  • Compare other transfer learning approaches against our transfer learning approach
  • Evaluate the ability of the trained Language Model as a single-task learner as well as a multi-task learner

GPT-2

Unsupervised Multi-Task Learner

  • Able to perform multiple tasks (e.g., summarization, QA, translation, etc.)
  • Uses zero shot learning on target task
  • Produces very convincing and coherent text
  • Trained only on producing the next word given some context (millions of examples)

Limitations

  • Only applicable to tasks that resemble those found in the training data (e.g. blog posts that contain TLDRs for text summarization)
  • Requires a significant amount of data to get sub-par results across multiple tasks

Applying Auto Regressive Language Models to Supervised Tasks

Supervised Tasks as AR LM Tasks

  • Supervised learning trains models on x, y pairs
  • Can convert supervised tasks into AR LM tasks
    • Treat y as a part of the vocabulary the AR LM is trying to predict given the context x
    • To make multi-tasking easier, add a special token to each supervised task

Classification Example

  • Supervised task: Given some method x, predict if the method has some code smell y
  • AR LM Task: Convert code smell y into some text term such as "Long method" and append this term to x.
    • E.g. "public static void ... } <code_smell> Long method"
  • This gives the AR LM a bunch of training data that teaches it to produce the term "Long method" if it is given a method that is long.
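
A sketch of the conversion described above: each (method, label) pair becomes a single training string with the label appended after the <code_smell> token, so the AR LM only ever predicts the next token. Labels and method bodies here are illustrative.

```python
def to_lm_example(method_source: str, label: str, task_token: str = "<code_smell>") -> str:
    # The label y is treated as ordinary vocabulary appended after the task token.
    return f"{method_source} {task_token} {label}"

pairs = [
    ("public static void process() { /* ... hundreds of lines ... */ }", "Long method"),
    ("public int getX() { return x; }", "No smell"),
]
lm_training_lines = [to_lm_example(x, y) for x, y in pairs]
print(lm_training_lines[0])
# At inference time, feed "method <code_smell>" as the prefix and let the LM
# generate the label tokens that follow.
```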

Types of Supervised Tasks (SME)

Classification

  • Given some class or method x, generate corresponding code smell y
  • Given some class or method x, generate corresponding design pattern y
  • Given some class or method x, generate the corresponding ransomware classification y
  • Given two classes or methods, x and x', generate the corresponding clone classification y

Sequence to Sequence

  • Given some question about some class or method x, generate corresponding answer y
  • Given some class or method x, generate corresponding comments y
  • Given some class or method x, generate corresponding test cases y
  • Given some class or method x in some language p, generate corresponding method y in some language q

Counterfactual (Ablation) Study

  • Trace back performance on downstream tasks to different error oracles
  • Allows for analysis of which error type most impacts performance on downstream tasks

Empirical Evaluation of Supervised Tasks

  • Compare training LM on multiple tasks vs single tasks
  • Compare LM against SOTA supervised models
  • Compare LM SOTA against transfer learning approaches
  • Compare a pretrained LM vs. a non-pretrained LM for multi-task and single-task performance

Generative Agents by Competition

Third View

[Diagram: SC Generator maps Gaussian Noise to Synthetic Source Code; SC Discriminator contrasts it with Real Source Code]

The generator learns how to create source code in such a way that the discriminator is not able to distinguish synthetic source code

[Diagram: SC Generator maps Gaussian Noise to Synthetic Source Code; SC Discriminator contrasts it with Real Source Code]

The generator is an "Intelligent Agent" that is able to enhance its source code by competition (game theory)

[Diagram: Agent interacting with an Environment]

To what extent do machines generate (human-level) Source Code? Is agent competition producing "unseen" source code?

Research Question
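
A highly simplified sketch of the generator/discriminator competition, using PyTorch on continuous vectors that stand in for code embeddings; GANs over discrete source code tokens need additional machinery (e.g., Gumbel-softmax or reinforcement learning), so this is only the conceptual training loop.

```python
import torch
import torch.nn as nn

dim, noise_dim = 16, 8
G = nn.Sequential(nn.Linear(noise_dim, 32), nn.ReLU(), nn.Linear(32, dim))
D = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

real_code_vectors = torch.randn(256, dim) + 2.0   # stand-in for embedded real code

for step in range(200):
    real = real_code_vectors[torch.randint(0, 256, (32,))]
    fake = G(torch.randn(32, noise_dim))

    # Discriminator: label real as 1, synthetic as 0.
    opt_d.zero_grad()
    loss_d = bce(D(real), torch.ones(32, 1)) + bce(D(fake.detach()), torch.zeros(32, 1))
    loss_d.backward()
    opt_d.step()

    # Generator: try to make the discriminator label synthetic code as real.
    opt_g.zero_grad()
    loss_g = bce(D(fake), torch.ones(32, 1))
    loss_g.backward()
    opt_g.step()
```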

Projects

  • Subproject 5: "Can Machines Generate (human-level) Source Code?"

Generating "human-level" Source Code with NeuroEvolution

Fourth View

Large scale agent interactions might produce "emergent" (human-level) source code

Close-ended Evolution

Open-Ended Evolution

f(x) = min(E)

Evolutionary Computation to let SC agents communicate

Are Neural Networks Turing Complete?

[Diagram: Neural Network ≈ Algorithm ≈ Program]

"On the Turing Completeness of Modern Neural Network Architectures" (Pérez et al., ICLR'19)

Close-ended Evolution

  • Define a fitness function to reduce the entropy
  • An individual (genotype) is a Neural Network
  • An individual (phenotype) is Source Code
  • Genetic Operators are based on "Transfer Learning" strategies
f(x) = min(E)
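
A toy sketch of the close-ended loop above, under the assumption that a genotype is the weight matrix of a tiny bigram language model, the phenotype is the code it scores/generates, and the fitness f(x) = min(E) is the cross-entropy E on a small token corpus; mutation here stands in for the transfer-learning genetic operators.

```python
import numpy as np

vocab = ["int", "x", "=", "0", ";", "<s>"]
idx = {t: i for i, t in enumerate(vocab)}
corpus = [["int", "x", "=", "0", ";"]]
rng = np.random.default_rng(0)

def cross_entropy(weights):
    """E: average negative log-likelihood of the corpus under a bigram softmax model."""
    E, n = 0.0, 0
    for seq in corpus:
        prev = idx["<s>"]
        for tok in seq:
            logits = weights[prev]
            probs = np.exp(logits - logits.max())
            probs /= probs.sum()
            E -= np.log(probs[idx[tok]])
            n += 1
            prev = idx[tok]
    return E / n

# Population of genotypes (weight matrices); fitness = cross-entropy E, minimized.
population = [rng.normal(0, 1, (len(vocab), len(vocab))) for _ in range(20)]
for generation in range(50):
    population.sort(key=cross_entropy)                      # fittest (lowest E) first
    parents = population[:5]
    offspring = [p + rng.normal(0, 0.1, p.shape) for p in parents for _ in range(3)]
    population = parents + offspring                        # (mu + lambda) style
print("best E:", min(cross_entropy(p) for p in population))
```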

Are machines able to learn how to generate "human-level" source code?

Main Research Question

Open-ended Evolution

  • "a process in which there is the possibility for an indefinite increase in complexity" (Corominas, et al. 2018)
  • John von Neumann: self-replication, genotype-phenotype mappings, special classes of material substrates, and physico-chemical processes
  • Alan Turing: Morphogenesis
[1] https://royalsocietypublishing.org/doi/10.1098/rsif.2018.0395

  • Property 1: Simple Components or agents (simple relative to the whole system)

Auto-regressive, adversarial, or autoencoder architectures

Trained Generative Agent

Trained Fine-Tuned Agent

Deep Neural Classifier

  • Property 2: Nonlinearity and Complex Interactions (synergy)

Generative Agents are sensitive to initial conditions (hyper-parameters) and inputs

[Diagram: Fine-Tune Strategy 1 and Fine-Tune Strategy 2 leading to Better vs. Worse Performance]

  • Property 3: Decentralization or no central control

No leading agent or "deep neural net" controlling for interactions

  • Property 4: Emergence

Case study 1 [self-replication]: Are self-replicated "programs" (or NNs, or Software 2.0) somewhat better? What types of properties do they have? Can the multi-tasker agents perform a brand-new task?

  • Property 4: Emergence

Case study 2 [self-organization]: Are the generative agents reporting enhanced accuracy after transfer-learning interactions?

[Diagram: Simple and Local Transfer Rules plus Fine-Tuning produce an Assembled Agent (by transfer learning strategies), leading to Complex Programs or Advanced Software Systems — enhanced synthesized code? better accuracy? what types of tasks emerged?]

Projects

  • Subproject 6: "Emerging Human-Level Source Code by Complex Interactions of Deep Software Engineering Agents"

Interpretability

Fifth View


  • [semantic|conditioned] Learned Representation Analysis for SE Metrics: to determine whether the representation learned by the generators captures the concept of SE metrics such as cyclomatic complexity, lines of code, etc.

    • Probing classifier: an MLP that is fed the model's representation (i.e., hidden state) of a given method and is asked to predict some SE metric, e.g., cyclomatic complexity

    • Performance Metric: we are attempting to measure how well the generator captures SE-related metrics, so we will mostly measure precision, recall, and accuracy for each SE metric (see the sketch below).
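
A hedged sketch of the probing classifier described above: a small MLP reads a frozen hidden-state vector of a method (assumed to be precomputed from the generator) and predicts a binned SE metric such as cyclomatic complexity. Dimensions and data below are placeholders.

```python
import torch
import torch.nn as nn

hidden_dim, n_bins = 256, 4   # e.g., cyclomatic complexity binned into 4 classes
probe = nn.Sequential(nn.Linear(hidden_dim, 64), nn.ReLU(), nn.Linear(64, n_bins))
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Stand-ins for (hidden state, binned metric) pairs extracted from the generator:
X = torch.randn(512, hidden_dim)
y = torch.randint(0, n_bins, (512,))

for epoch in range(20):
    optimizer.zero_grad()
    loss = loss_fn(probe(X), y)
    loss.backward()
    optimizer.step()

with torch.no_grad():
    accuracy = (probe(X).argmax(dim=1) == y).float().mean()
print(f"probe accuracy: {accuracy:.2f}")   # report alongside precision/recall per SE metric
```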

Testbeds

  • [code-smell-01] Testbed of smelly methods such as long method, feature envy, etc.
  • [design-pattern-01] Testbed of classes with different design patterns such as factory method, singleton, decorator, etc.

  • [anti-pattern-01] Testbed of classes with different anti patterns such as anemic domain model, call super, circular dependency, etc.

  • [ast-01] Testbed of methods with different types of AST nodes and relations.

  • [cfg-01] Testbed of methods with different types of CFGs.

  • [type-01] Testbed for classifying the type based on the variable name

Summary

 First view: The generative Agent

 Second view: The discriminative Agent

 Third view: Generative Agents by Competition

 Fourth view: Emergent "uniqueness" by complex interactions

Deep Software Engineering for Artificial Code Generation

By David Nader Palacio
