Devjeet Roy
QE Oral Presentation
Introduction
Software developers spend a significant portion of their time comprehending source code
Proper documentation and inline comments in source code are vital to the comprehension process.
Maintaining documentation is resource intensive
Automatic Source Code Summarization
The automatic generation of source code documentation without human intervention
Approaches exist that apply to a wide range of source code artifacts:
* methods, classes, source code segments
* version control commits
* release notes
This survey
+ Abstractive Summarization only
+ Source code intrinsic documentation only (excludes generation of release notes, commit messages and other extrinsic documentation)
- Excludes crowd-sourcing based approaches (which are sometimes included in literature reviews)
Template Based Approaches
Relies on pre-defined templates for summary generation.
Templating systems range from simple text templates to complex Natural Language Generation (NLG) systems.
Typically use information retrieval (IR) and/or program analysis to fill templates
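As a rough illustration of the simplest end of that range, the sketch below fills a fixed text template from the words in a method name. It is not taken from any surveyed approach: the template wording, the function names `split_identifier` and `summarize_method`, and the assumption that the first word of the identifier is a verb are all hypothetical simplifications. Real systems draw on program analysis and IR rather than the name alone.

```python
import re

# Hypothetical template; real template systems use many templates
# selected by program analysis.
TEMPLATE = "This method {verb}s the {obj}."


def split_identifier(name):
    """Split a camelCase or snake_case identifier into lowercase words."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [p.lower() for p in parts]


def summarize_method(method_name):
    """Fill the template, assuming the first word of the name is a verb."""
    words = split_identifier(method_name)
    if not words:
        raise ValueError("identifier contains no words")
    verb, obj_words = words[0], words[1:]
    # Naive inflection: appending "s" is wrong for verbs such as "fetch".
    obj = " ".join(obj_words) if obj_words else "receiver object"
    return TEMPLATE.format(verb=verb, obj=obj)


print(summarize_method("getUserName"))
```

Even this toy version shows why templates are costly to maintain: every new naming convention or verb form needs another hand-written rule.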
Deep Learning
Machine learning technique
Generates full sentences directly without the use of templates
Typically trained on large amounts of data for state-of-the-art performance
Templates vs Deep Learning
Template based approaches require significant amounts of human effort to build/maintain templates
Deep learning based approaches are expensive computationally and require large amounts of data
However, deep learning generalizes better and often produces more natural-sounding summaries.
Templates vs Deep Learning
The first deep learning based summarization approaches appeared in 2016.
Since 2018, most summarization approaches have been based on deep learning
Manual Verification
Researchers manually verify correctness/quality of their summarization approach.
Used more in earlier summarization work; no longer common.
Large potential for bias
Difficult to replicate results
Human Study
The summarization approach is evaluated by external human participants
Gold standard for evaluation
Resource intensive
Automatic Evaluation
The summarization approach is evaluated using automatic metrics that serve as a proxy for human judgement.
Commonly used metrics: BLEU, ROUGE, METEOR.
Can be easily applied to large datasets.
Unclear as to how effective they are at approximating human judgement
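To make concrete what such a metric measures, here is a minimal sketch of a BLEU-style score for a single candidate/reference pair. It is a simplified illustration, not the exact formulation used in any surveyed paper: it uses one reference, n-grams up to length 2 by default, and no smoothing, and the function names are assumptions made for this example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams found in the reference, with clipped counts."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


def simple_bleu(candidate, reference, max_n=2):
    """Geometric mean of clipped n-gram precisions times a brevity penalty."""
    precisions = [clipped_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # no smoothing in this sketch
    log_mean = sum(math.log(p) for p in precisions) / max_n
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_mean)


reference = "returns the name of the user".split()
candidate = "returns the user name".split()
print(round(simple_bleu(candidate, reference), 4))
```

The score depends only on surface n-gram overlap, which is why it is cheap to compute at scale and also why its relationship to human judgement is open to question.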
Change in Evaluation Methods
Gradual shift from manual verification/human studies to automatic evaluation metrics
Only 4/14 deep learning based papers from the last 4 years include a human study to augment automatic evaluation
Datasets
Easily available from publicly hosted repositories on services like GitHub, Bitbucket, etc.
The landscape is fairly fragmented; there are no standard benchmark datasets, which makes comparing performance across approaches difficult
Changed significantly since the early days of summarization:
from small (1-10 projects) to extremely large (> 1000 projects)
The Evaluation Problem
Currently, the majority of summarization papers use automatic metrics as their evaluation method
The metrics used are adapted from Neural Machine Translation (NMT)
However, the validity of these metrics in the context of code summarization remains unclear.
There are significant differences between datasets used in natural language processing and datasets used in code summarization (Gros et al.)
On comprehension tasks, developers perform significantly better when using human-written summaries than when using automatically generated summaries (Stapleton et al.)
The Evaluation Problem
Do metrics effectively approximate human judgement in summarization?
Do improvements in metric scores translate to tangible improvements in real world performance?
Proposed Solution
Investigate the validity of the use of automatic evaluation metrics in code summarization
There is a rich body of literature in Neural Machine Translation (NMT) on how to conduct these studies and analyze the data collected
Proposed Solution
Establish the efficacy of automatic evaluation metrics in their ability to serve as approximations of human judgement
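One common way to quantify how well a metric approximates human judgement is the rank correlation between metric scores and human ratings for the same summaries. The sketch below computes Spearman's rank correlation using only the standard library. It is an illustrative assumption about how such an analysis could be set up, not the proposed study design, and the sample values are made up purely to show the call.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(xs), average_ranks(ys))


# Made-up illustrative values, not results from any study.
metric_scores = [0.21, 0.35, 0.10, 0.42, 0.28]
human_ratings = [3, 4, 2, 3, 5]
print(spearman(metric_scores, human_ratings))
```

A correlation near 1 would suggest the metric orders summaries the way human raters do; a value near 0 would suggest it does not.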
Heterogeneity
There are no standard datasets that are used as benchmarks to evaluate summarization techniques.
Makes it difficult to compare performance of different approaches if they don't use the same datasets for evaluation
Compounded by the fact that automatic evaluation metrics are highly sensitive to data preprocessing steps, such as tokenization (same dataset preprocessed differently can lead to high variance in results)
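The tokenization effect can be seen on a single sentence pair. The sketch below scores the same candidate against the same reference under two tokenizers, using unigram precision as a stand-in for a full metric. Both tokenizers and the example sentences are assumptions chosen for illustration and do not correspond to the preprocessing of any particular paper.

```python
import re
from collections import Counter


def whitespace_tokens(text):
    """Lowercase and split on whitespace; punctuation stays attached to words."""
    return text.lower().split()


def punctuation_aware_tokens(text):
    """Lowercase and split alphanumeric runs from individual punctuation marks."""
    return re.findall(r"[a-z0-9]+|[^\sa-z0-9]", text.lower())


def unigram_precision(candidate, reference):
    """Fraction of candidate tokens that also occur in the reference (clipped)."""
    cand, ref = Counter(candidate), Counter(reference)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    return sum(min(count, ref[tok]) for tok, count in cand.items()) / total


def precision_under_both(candidate_text, reference_text):
    """Score one sentence pair under both tokenizers."""
    return tuple(
        unigram_precision(tokenize(candidate_text), tokenize(reference_text))
        for tokenize in (whitespace_tokens, punctuation_aware_tokens)
    )


reference = "Returns the user's name."
candidate = "Returns the name of the user."
print(precision_under_both(candidate, reference))
```

With whitespace splitting, "name" and "name." count as different tokens, so the same pair of summaries receives a noticeably lower score than under the punctuation-aware tokenizer.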
The Dataset Problem: Heterogeneity
The lack of standard benchmark datasets impedes research
The Dataset Problem: Quality
It is unclear whether summarization datasets are representative of real world scenarios.
Papers generally report high-level metrics such as the number of methods and projects, average lines of code (LOC), etc.
The Dataset Problem: Quality
There is no literature on what should and shouldn't be summarized, e.g., short methods might not need to be summarized.
Without this knowledge, it is unclear whether improvements in performance on these datasets would translate to real world performance improvements
Outside of basic quality control measures, there is no explicit filtering of poor quality projects that are not appropriate for summarization
Proposed Solution
The research community must build high quality, easy to use benchmark datasets, such as the ones used in natural language processing (GLUE, bAbI, etc.)
The first step is to establish desiderata for such a dataset using human studies, focusing on the needs of software developers
Conclusion
Source code summarization has seen major changes and significant progress over the years
However, we need to step back and address some fundamental questions regarding evaluation and datasets.