dissertation

Deep Learning Software Repositories

Martin White advised by Denys Poshyvanyk

Dissertation

Toward deep learning software repositories
Deep learning code fragments for code clone detection
Sorting and transforming program repair ingredients via deep learning

p(s)=p(w_1,w_2,\ldots,w_m)\approx\prod_{i=1}^{m}p(w_i|w_{i-1})\approx\prod_{i=1}^{m}\frac{c(w_{i-1}w_i)}{c(w_{i-1})}

\text{John read Moby Dick.}

\text{Mary read a different book.}

\text{She read a book by Cher.}

p(\text{John},\text{read},\text{a},\text{book})=\frac{1}{3}\times\frac{1}{1}\times\frac{2}{3}\times\frac{1}{2}\times\frac{1}{2}\approx 0.06

p(\text{Cher},\text{read},\text{a},\text{book})=\frac{0}{3}\times\frac{0}{1}\times\frac{2}{3}\times\frac{1}{2}\times\frac{1}{2}=0

Example

Each factor is a bigram count divided by the count of the bigram's first word; the first and last factors condition on the sentence start and end

p(\color{red}{\text{John}},\text{read},\text{a},\text{book})=\color{red}{\frac{1}{3}}\times\frac{1}{1}\times\frac{2}{3}\times\frac{1}{2}\times\frac{1}{2}\approx 0.06

\text{\color{red}{John} read Moby Dick.}

p(\color{red}{\text{John}},\color{red}{\text{read}},\text{a},\text{book})=\frac{1}{3}\times\color{red}{\frac{1}{1}}\times\frac{2}{3}\times\frac{1}{2}\times\frac{1}{2}\approx 0.06

\text{\color{red}{John read} Moby Dick.}

p(\text{John},\color{red}{\text{read}},\color{red}{\text{a}},\text{book})=\frac{1}{3}\times\frac{1}{1}\times\color{red}{\frac{2}{3}}\times\frac{1}{2}\times\frac{1}{2}\approx 0.06

\text{John \color{red}{read Moby} Dick.}

\text{Mary \color{red}{read a} different book.}

\text{She \color{red}{read a} book by Cher.}

p(\text{John},\text{read},\color{red}{\text{a}},\color{red}{\text{book}})=\frac{1}{3}\times\frac{1}{1}\times\frac{2}{3}\times\color{red}{\frac{1}{2}}\times\frac{1}{2}\approx 0.06

\text{Mary read \color{red}{a different} book.}

\text{She read \color{red}{a book} by Cher.}

p(\text{John},\text{read},\text{a},\color{red}{\text{book}})=\frac{1}{3}\times\frac{1}{1}\times\frac{2}{3}\times\frac{1}{2}\times\color{red}{\frac{1}{2}}\approx 0.06

\text{Mary read a different \color{red}{book.}}

\text{She read a \color{red}{book by} Cher.}

p(\color{red}{\text{Cher}},\text{read},\text{a},\text{book})=\color{red}{\frac{0}{3}}\times\frac{0}{1}\times\frac{2}{3}\times\frac{1}{2}\times\frac{1}{2}=0

\text{She read a book by \color{red}{Cher}.}
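The counting in the example can be checked with a short script. This is a minimal sketch, not code from the dissertation: the boundary markers `<s>` and `</s>` and the function names are assumptions introduced here for illustration.

```python
from collections import Counter

# Toy corpus from the example; <s> and </s> are assumed boundary markers.
CORPUS = [
    "John read Moby Dick",
    "Mary read a different book",
    "She read a book by Cher",
]

def count_ngrams(sentences):
    """Count bigram histories and bigrams, with boundary markers added."""
    history_counts, bigram_counts = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            history_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return history_counts, bigram_counts

def bigram_mle(sentence, history_counts, bigram_counts):
    """Product of c(prev cur) / c(prev) over the sentence, boundaries included."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        if history_counts[prev] == 0:
            return 0.0  # unseen history: the MLE is undefined, treated as 0 here
        prob *= bigram_counts[(prev, cur)] / history_counts[prev]
    return prob

history_counts, bigram_counts = count_ngrams(CORPUS)
p_john = bigram_mle("John read a book", history_counts, bigram_counts)
p_cher = bigram_mle("Cher read a book", history_counts, bigram_counts)
print(round(p_john, 2), p_cher)
```

Under these assumptions the script reproduces the two values above: 1/18, which rounds to 0.06, for the John sentence and 0 for the Cher sentence, since no training sentence starts with Cher.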

Top-k Accuracy (%)
Model	Top-1	Top-5	Top-10
Interpolated 8-gram	49.7	71.3	78.1
Interpolated 8-gram 100-cache	04.8	69.5	78.5
Static (400, 5)	61.1	78.4	81.4
Dynamic (300, 20)	72.2	88.4	92.0
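For readers unfamiliar with the metric, top-k accuracy is usually computed as the share of prediction points where the true token appears among a model's k highest-ranked suggestions. The sketch below assumes that definition; the function name and the sample suggestions are hypothetical and are not data from the study.

```python
def top_k_accuracy(ranked_predictions, targets, k):
    """Percentage of targets found among the first k ranked suggestions."""
    hits = sum(
        1
        for ranked, target in zip(ranked_predictions, targets)
        if target in ranked[:k]
    )
    return 100.0 * hits / len(targets)

# Hypothetical ranked suggestions for three prediction points (not from the study).
ranked = [["(", ")", ";"], ["int", "for", "if"], ["x", "i", "j"]]
truth = ["(", "if", "k"]

top1 = top_k_accuracy(ranked, truth, 1)
top3 = top_k_accuracy(ranked, truth, 3)
print(top1, top3)
```

Top-k accuracy cannot decrease as k grows, which matches the left-to-right increase in the table rows for the n-gram, static, and dynamic models.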

Deep Learning Code at the Lexical Level

We use a recurrent neural network to map terms to embeddings

y(i)=g(\gamma f({\color{red}\alpha} t(i)+\beta z(i-1)))
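One way to read the equation is as two steps: a state update z(i)=f(\alpha t(i)+\beta z(i-1)) followed by an output y(i)=g(\gamma z(i)). The sketch below follows that reading under several assumptions that the slide does not state: t(i) is a one-hot vector, f is the logistic sigmoid, g is the softmax, and the weights are small random numbers rather than trained values. All function names and sizes are illustrative.

```python
import math
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def sigmoid(vector):
    return [1.0 / (1.0 + math.exp(-x)) for x in vector]

def softmax(vector):
    peak = max(vector)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in vector]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_step(alpha, beta, gamma, t_i, z_prev):
    """One step: z(i) = f(alpha t(i) + beta z(i-1)), y(i) = g(gamma z(i))."""
    pre = [a + b for a, b in zip(matvec(alpha, t_i), matvec(beta, z_prev))]
    z_i = sigmoid(pre)
    y_i = softmax(matvec(gamma, z_i))
    return y_i, z_i

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
vocab_size, hidden_size = 5, 3  # illustrative sizes only
alpha = random_matrix(hidden_size, vocab_size, rng)
beta = random_matrix(hidden_size, hidden_size, rng)
gamma = random_matrix(vocab_size, hidden_size, rng)

term_index = 2
t_i = [1.0 if j == term_index else 0.0 for j in range(vocab_size)]
z_prev = [0.0] * hidden_size
y_i, z_i = rnn_step(alpha, beta, gamma, t_i, z_prev)

# With a one-hot t(i), alpha t(i) selects column term_index of alpha.
embedding = [row[term_index] for row in alpha]
```

With a one-hot input, the product \alpha t(i) picks out a single column of \alpha, which is why the highlighted \alpha can be read as a table of term embeddings.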

What we would like to have is not only embeddings for terms but also embeddings for fragments

Deep Learning Code at the Syntax Level

We use a recursive autoencoder to encode sequences of embeddings

y=[\hat{x}_{\ell};\hat{x}_{r}]=g([\delta_{\ell};\delta_r] f([\color{red}{\varepsilon_{\ell}},\color{red}{\varepsilon_{r}}] [x_{\ell};x_r]))

AST-based encoding

Greedy encoding
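The autoencoder equation and the greedy strategy can be sketched together. The code below is a minimal illustration under assumptions not stated on the slide: f is tanh, g is the identity, biases are omitted as in the equation, the weights are random rather than trained, and "greedy" means repeatedly merging the adjacent pair with the lowest reconstruction error until one vector remains. Function names and sizes are hypothetical.

```python
import math
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def encode(epsilon, x_left, x_right):
    """z = f(epsilon [x_l; x_r]) with f = tanh (assumed)."""
    return [math.tanh(v) for v in matvec(epsilon, x_left + x_right)]

def reconstruction_error(epsilon, delta, x_left, x_right):
    """Squared error between [x_l; x_r] and g(delta z), with g = identity (assumed)."""
    z = encode(epsilon, x_left, x_right)
    rebuilt = matvec(delta, z)
    return sum((a - b) ** 2 for a, b in zip(x_left + x_right, rebuilt))

def greedy_encode(epsilon, delta, embeddings):
    """Merge the adjacent pair with the lowest error until one vector is left."""
    vectors = [list(v) for v in embeddings]
    while len(vectors) > 1:
        errors = [
            reconstruction_error(epsilon, delta, vectors[j], vectors[j + 1])
            for j in range(len(vectors) - 1)
        ]
        best = errors.index(min(errors))
        merged = encode(epsilon, vectors[best], vectors[best + 1])
        vectors[best:best + 2] = [merged]  # replace the pair with its encoding
    return vectors[0]

rng = random.Random(0)
n = 4  # embedding size, illustrative only
epsilon = [[rng.uniform(-0.5, 0.5) for _ in range(2 * n)] for _ in range(n)]
delta = [[rng.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(2 * n)]
terms = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(5)]  # five term embeddings
fragment_vector = greedy_encode(epsilon, delta, terms)
```

An AST-based encoding would differ only in how pairs are chosen: instead of searching for the cheapest adjacent pair, the merge order would follow the tree structure of the fragment.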

Empirical Study

Subject Systems' Statistics
System	Files	LOC	Tokens	Vocab.
ANTLR 4	1,514	104,225	1,701,807	15,826
Apache Ant 1.9.6	1,218	136,352	1,888,424	16,029
ArgoUML 0.34	1,908	177,493	1,172,058	17,205
CAROL 2.0.5	1,184	112,022	1,180,947	12,210
dnsjava 2.0.0	1,196	124,660	1,169,219	13,012
Hibernate 2	1,555	151,499	1,365,256	15,850
JDK 1.4.2	4,129	562,120	3,512,807	45,107
JHotDraw 6	1,984	158,130	1,377,652	14,803

C. Corley, K. Damevski, N. Kraft, Exploring the use of deep learning for feature location, ICSME'15
A. Lam, A. Nguyen, H. Nguyen, T. Nguyen, Combining deep learning with information retrieval to localize buggy files for bug reports, ASE'15
S. Wang, T. Liu, L. Tan, Automatically learning semantic features for defect prediction, ICSE'16
C. Alexandru, Guided code synthesis using deep neural networks, FSE'16
H. Dam, T. Tran, J. Grundy, A. Ghose, DeepSoft: A vision for a deep model of software, FSE'16
X. Gu, H. Zhang, D. Zhang, S. Kim, Deep API learning, FSE'16
C. Chen, Z. Xing, SimilarTech: Automatically recommend analogical libraries across different programming languages, ASE'16
G. Chen, C. Chen, Z. Xing, B. Xu, Learning a dual-language vector space for domain-specific cross-lingual question retrieval, ASE'16
M. White, M. Tufano, C. Vendome, D. Poshyvanyk, Deep learning code fragments for code clone detection, ASE'16
B. Xu, D. Ye, Z. Xing, X. Xia, G. Chen, S. Li, Predicting semantically linkable knowledge in developer online forums via convolutional neural network, ASE'16
C. Alexandru, S. Panichella, H. Gall, Replicating parser behavior using neural machine translation, ICPC'17
A. Lam, A. Nguyen, H. Nguyen, T. Nguyen, Bug localization with combination of deep learning and information retrieval, ICPC'17
J. Guo, J. Cheng, J. Cleland-Huang, Semantically enhanced software traceability using deep learning techniques, ICSE'17
P. Liu, X. Zhang, M. Pistoia, Y. Zheng, M. Marques, L. Zeng, Automatic text input generation for mobile testing, ICSE'17
A. Sankaran, R. Aralikatte, S. Mani, S. Khare, N. Panwar, N. Gantayat, DARVIZ: Deep abstract representation, visualization, and verification of deep learning models, ICSE-NIER'17
J. Wang, Q. Cui, S. Wang, Q. Wang, Domain adaptation for test report classification in crowdsourced testing, ICSE-SEIP'17
X. Liu, X. Lu, H. Li, T. Xie, Q. Mei, H. Mei, F. Feng, Understanding diverse usage patterns from large-scale appstore-service profiles, TSE'17
M. Choetkiertikul, H. Dam, T. Tran, A. Ghose, Predicting the delay of issues with due dates in software projects, EMSE'17

p(\text{John read a book})

p(\text{John},\text{read},\text{a},\text{book})=p(\text{John})p(\text{read}|\text{John})\cdots p(\text{book}|\text{John},\text{read},\text{a})

p(w_1,w_2,\ldots,w_m)=\prod_{i=1}^{m}p(w_i|w_1,\ldots,w_{i-1})\approx \prod_{i=1}^{m}p(w_i|w_{i-n+1},\ldots,w_{i-1})

p(w_1,w_2,\ldots,w_m)\approx \prod_{i=1}^{m}p(w_i|w_{i-1})\approx \prod_{i=1}^{m}\frac{c(w_{i-1}w_i)}{c(w_{i-1})}

Smoothing

Smoothing adjusts MLEs, e.g., by hallucinating data: add-one smoothing adds one to every bigram count

p(w_i|w_{i-1})=\frac{c(w_{i-1}w_i)+1}{c(w_{i-1})+|\mathcal{V}|}

Reconsider the example using this new distribution, with |\mathcal{V}|=11 distinct words in the corpus

p(\text{John},\text{read},\text{a},\text{book})=\frac{1+1}{3+11}\times\frac{1+1}{1+11}\times\frac{2+1}{3+11}\times\frac{1+1}{2+11}\times\frac{1+1}{2+11}\approx 0.0001

p(\text{Cher},\text{read},\text{a},\text{book})=\frac{0+1}{3+11}\times\frac{0+1}{1+11}\times\frac{2+1}{3+11}\times\frac{1+1}{2+11}\times\frac{1+1}{2+11}\approx 0.00003
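The add-one figures can be recomputed with a few lines of code. This is a sketch under the same assumptions as the worked numbers: `<s>` and `</s>` are assumed boundary markers, and the vocabulary counts only the 11 distinct words, not the markers. The function name is hypothetical.

```python
from collections import Counter

CORPUS = [
    "John read Moby Dick",
    "Mary read a different book",
    "She read a book by Cher",
]

def add_one_probability(sentence, sentences):
    """Add-one smoothed bigram probability of a sentence, boundaries included."""
    history_counts, bigram_counts, vocabulary = Counter(), Counter(), set()
    for line in sentences:
        words = line.split()
        vocabulary.update(words)
        tokens = ["<s>"] + words + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            history_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    v = len(vocabulary)  # distinct words only; boundary markers are not counted
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        prob *= (bigram_counts[(prev, cur)] + 1) / (history_counts[prev] + v)
    return prob

p_john_smoothed = add_one_probability("John read a book", CORPUS)
p_cher_smoothed = add_one_probability("Cher read a book", CORPUS)
print(p_john_smoothed, p_cher_smoothed)
```

Both sentences now receive nonzero probability, and the sentence with unseen bigrams remains the less likely of the two.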

Back-off and interpolation are two methods for redistributing mass

p_B(w_i|w_{i-n+1}^{i-1})=\begin{cases}\delta(w_i|w_{i-n+1}^{i-1}) & c(w_{i-n+1}^{i-1})>k\\ \beta(w_{i-n+1}^{i-1})p_B(w_i|w_{i-n+2}^{i-1}) & \text{otherwise}\end{cases}

p_I(w_i|w_{i-n+1}^{i-1})=\delta(w_i|w_{i-n+1}^{i-1})+\beta(w_{i-n+1}^{i-1})p_I(w_i|w_{i-n+2}^{i-1})
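A simple instance of interpolation is to mix the bigram MLE with the unigram MLE using a fixed weight. In the notation of the general interpolated form this corresponds to choosing \delta=\lambda\,p_{ML} and \beta=1-\lambda, which is an assumption made here for illustration; the slides do not specify \delta or \beta. The weight `lam=0.7`, the boundary markers, and the function names are likewise hypothetical.

```python
from collections import Counter

CORPUS = [
    "John read Moby Dick",
    "Mary read a different book",
    "She read a book by Cher",
]

def train(sentences):
    """Collect unigram, history, and bigram counts with boundary markers."""
    unigrams, histories, bigrams = Counter(), Counter(), Counter()
    for line in sentences:
        tokens = ["<s>"] + line.split() + ["</s>"]
        unigrams.update(tokens[1:])  # <s> is never predicted, so it is skipped
        for prev, cur in zip(tokens, tokens[1:]):
            histories[prev] += 1
            bigrams[(prev, cur)] += 1
    return unigrams, histories, bigrams

def interpolated(cur, prev, model, lam=0.7):
    """lam * p_ML(cur | prev) + (1 - lam) * p_ML(cur)."""
    unigrams, histories, bigrams = model
    p_unigram = unigrams[cur] / sum(unigrams.values())
    if histories[prev] == 0:
        return p_unigram  # unseen history: all mass goes to the lower-order model
    p_bigram = bigrams[(prev, cur)] / histories[prev]
    return lam * p_bigram + (1.0 - lam) * p_unigram

model = train(CORPUS)
p_seen = interpolated("a", "read", model)       # bigram "read a" occurs twice
p_unseen = interpolated("read", "Cher", model)  # bigram "Cher read" never occurs
print(p_seen, p_unseen)
```

The unseen bigram still receives probability mass from the unigram model, and for a seen history the interpolated probabilities over all predictable tokens sum to one. A back-off model would instead consult the lower-order estimate only when the count condition fails.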

p(w_i|w_{i-1})=\frac{c(w_{i-1}w_i)}{c(w_{i-1})}

Empirical Results - RQ3

RQ3. Are our source code representations suitable for detecting fragments that are similar with respect to a clone type?
Sampled 398 from 1,500+ file pairs, 480 from 60,000+ method pairs

Precision Results (%)
System	File-level AST-based	File-level Greedy	Method-level AST-based	Method-level Greedy
ANTLR	197	100	100	100
Apache Ant	192	193	100	100
ArgoUML	190	100	100	100
CAROL	100	100	100	100
dnsjava	147	100	173	187
Hibernate	100	100	153	170
JDK	190	100	100	100
JHotDraw	100	100	100	100
