A Survey of Source Code Summarization

Devjeet Roy

QE Oral Presentation

Introduction

Software developers spend a significant portion of their time comprehending source code

Proper documentation and inline comments in source code are vital to the comprehension process.

Maintaining documentation is resource intensive

Automatic Source Code Summarization

The automatic generation of source code documentation without human intervention

Approaches exist that apply to a wide range of source code artifacts:

* methods, classes, source code segments

* version control commits

* release notes

This survey

+ Abstractive Summarization only

+ Source code intrinsic documentation only (excludes generation of release notes, commit messages and other extrinsic documentation)

- Excludes crowd-sourcing based approaches (which are sometimes included in literature reviews)

Methodology

Template-Based Approaches

Rely on pre-defined templates for summary generation.

Templating systems range from simple, text templates to complex Natural Language Generation (NLG) systems.

Typically use information retrieval (IR) and/or program analysis to fill templates
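As a toy illustration (not drawn from any specific surveyed system), a template-based summarizer might pair a lightweight analysis of a method signature with a fixed sentence template. The regex, template, and function name below are illustrative assumptions:

```python
import re

# Illustrative template: the slot is filled with a fact extracted
# from the method name by a (very) lightweight analysis.
TEMPLATE = "Returns the {theme} of this object."

def summarize_getter(signature: str):
    """Summarize a simple Java-style getter such as
    'public String getName()' by filling the template."""
    match = re.match(r"\w+\s+\w+\s+get(\w+)\(\)", signature)
    if match is None:
        return None  # signature doesn't fit this template
    return TEMPLATE.format(theme=match.group(1).lower())
```

Real template systems cover many method shapes and rely on deeper program analysis (data flow, call relationships) or IR rather than a single pattern match.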

Deep Learning

Machine learning technique

Generates full sentences directly without the use of templates

Typically trained on large amounts of data for state-of-the-art performance

Templates vs Deep Learning

Template-based approaches require significant human effort to build and maintain templates

Deep learning-based approaches are computationally expensive and require large amounts of data

However, deep learning generalizes better and often produces more natural output.

Templates vs Deep Learning

The first deep learning based summarization approaches appeared in 2016.

Since 2018, most summarization approaches have been based on deep learning

Evaluation Methods

Manual Verification

Researchers manually verify correctness/quality of their summarization approach.

Used mostly in earlier summarization work; no longer common.

Large potential for bias

Difficult to replicate results

Human Study

The summarization approach is evaluated by external human participants

Gold standard for evaluation

Resource intensive

Automatic Evaluation

The summarization approach is evaluated using automatic metrics that serve as a proxy for human judgement.

Commonly used metrics: BLEU, ROUGE, METEOR.

Can be easily applied to large datasets.

It is unclear how effectively they approximate human judgement
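To make the metric family concrete, here is a minimal single-reference, sentence-level BLEU (clipped n-gram precision combined with a brevity penalty) in plain Python; it omits the smoothing and multi-reference handling of standard implementations, so treat it as a sketch:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=2):
    """Simplified sentence-level BLEU: geometric mean of clipped
    n-gram precisions (n = 1..max_n) times a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        # Clip each candidate n-gram count by its reference count.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0  # unsmoothed BLEU is zero if any precision is zero
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Penalize candidates shorter than the reference.
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * geo_mean
```

For example, `bleu("returns the customer name".split(), "returns the name of the customer".split())` is penalized both for the missed bigrams and for brevity, while an exact match scores 1.0.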

Change in Evaluation Methods

Gradual shift from manual verification/human studies to automatic evaluation metrics

  • largely due to the adoption of deep learning techniques

Only 4/14 deep learning-based papers from the last 4 years include a human study to augment automatic evaluation

Datasets in Summarization

Datasets

Easily available from publicly hosted repositories on services like GitHub, Bitbucket, etc.

The landscape is fairly fragmented; there are no standard benchmark datasets, which makes comparing performance across approaches difficult

Changed significantly since the early days of summarization:

from small (1-10 projects) to extremely large (> 1,000 projects)

Limitations of Existing Literature

Evaluation Problem

The Evaluation Problem

Currently, the majority of summarization papers use automatic metrics as their evaluation method

The metrics used are adapted from Neural Machine Translation (NMT)

However, the validity of these metrics in the context of code summarization remains unclear.

There are significant differences between datasets used in natural language processing and datasets in code summarization (Gros et al.)

On comprehension tasks, developers perform significantly better with human-written summaries than with automatically generated summaries (Stapleton et al.)

The Evaluation Problem

Do metrics effectively approximate human judgement in summarization?

Do improvements in metric scores translate to tangible improvements in real world performance?

Proposed Solution

Investigate the validity of the use of automatic evaluation metrics in code summarization

There is a rich body of literature in Neural Machine Translation (NMT) on how to conduct these studies and analyze the data collected

  • the machine translation community has a yearly conference (WMT) dedicated to this

Proposed Solution

Establish the efficacy of automatic evaluation metrics in their ability to serve as approximations of human judgement

  • conduct human studies comparing summarization approaches, and use human judgement as gold standard to evaluate metrics
  • using the human studies, establish guidelines for interpretation of these automatic evaluation metrics
  • identify shortcomings of these metrics and draft measures to mitigate them

Dataset Problem

Heterogeneity

There are no standard datasets that are used as benchmarks to evaluate summarization techniques.

This makes it difficult to compare the performance of different approaches when they don't use the same datasets for evaluation

Compounded by the fact that automatic evaluation metrics are highly sensitive to data preprocessing steps such as tokenization (the same dataset preprocessed differently can yield widely different scores)
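A small sketch of that sensitivity: the same candidate/reference pair scores very differently under plain whitespace tokenization versus splitting identifiers on camelCase. The scoring function here is a simplified unigram precision, not a full metric, and the tokenizers are illustrative:

```python
import re

def unigram_precision(candidate, reference):
    """Fraction of candidate tokens that appear in the reference."""
    ref = set(reference)
    return sum(1 for t in candidate if t in ref) / len(candidate)

def split_camel(tokens):
    """Split each token on camelCase boundaries and lowercase it."""
    out = []
    for t in tokens:
        out.extend(p.lower() for p in re.findall(r"[A-Z]?[a-z]+|[A-Z]+", t))
    return out

cand = "returns the userName".split()
ref = "returns the user name".split()

raw_score = unigram_precision(cand, ref)  # 'userName' never matches
split_score = unigram_precision(split_camel(cand), split_camel(ref))
```

Under whitespace tokenization the candidate is penalized for `userName`, while camelCase splitting scores the identical summary as a perfect unigram match, so results computed with different preprocessing pipelines are not directly comparable.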

The Dataset Problem: Heterogeneity

The lack of standard benchmark datasets impedes research

  • Deep learning based summarization approaches routinely take > 50 hours to train on specialized hardware
  • this computational cost is often prohibitive when it comes to replication, reproduction and comparison

The Dataset Problem: Quality

It is unclear whether summarization datasets are representative of real world scenarios.

Papers generally report high-level statistics such as number of methods, number of projects, average LOC, etc.

  • but don't take into account what proportion of the dataset is actually useful in the context of summarization

The Dataset Problem: Quality

There is no literature on what should and shouldn't be summarized, e.g. short methods might not need to be summarized.

Without this knowledge, it is unclear whether improvements in performance on these datasets would translate to real world performance improvements

Outside of basic quality control measures, there is no explicit filtering of poor quality projects that are not appropriate for summarization

Proposed Solution

The research community must build high-quality, easy-to-use benchmark datasets, such as those used in natural language processing (GLUE, bAbI, etc.)

The first step is to establish desiderata for such a dataset using human studies, focusing on the needs of software developers

  • what artifacts do they want automatically summarized, and what documentation would they rather maintain themselves?

Conclusion

Source code summarization has seen major changes and significant progress over the years

However, we need to step back and address some fundamental questions regarding evaluation and datasets.

Questions?