Learning to Identify Security-Related Issues Using Convolutional Neural Networks

 

David N. Palacio, Daniel McCrystal, Kevin Moran, Carlos Bernal-Cárdenas, Denys Poshyvanyk, Chris Shenefiel

Secure Development Lifecycle

Issue Tracker

Security Related

non-Security Related

We need automated approaches that identify whether an issue describes security-related content.

SecureReqNet

Components of Learning

  1. Inputs: Issues or Requirements (Natural Language)
  2. Outputs: Security or Non-Security Related Issues
  3. Target Function: To identify security critical issues
  4. Data: Large number of issues 
  5. Learning Model: Shallow/Deep Convolutional Nets

(Shallow) SecureReqNet

(Deep) SecureReqNet

Alex-SecureReqNet

α-SecureReqNet

[Figure: α-SecureReqNet architecture. Diagram labels: Input Layer (words); Sentence Embedding (skip-gram); 1-Conv through 5-Conv Layers with 7-gram, 5-gram, and 3-gram filters producing 32, 64, 128, and 64 features; max pooling 2D; flatten; Fully Connected; Binary Softmax.]
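The n-gram convolution named in the architecture diagram can be illustrated with a minimal sketch in plain Python. It is not the SecureReqNet implementation: the embedding size, sentence length, and random weights are hypothetical, and a global maximum over window positions stands in for the 2D max pooling named on the slide.

```python
import random

def conv_ngram(embedded, filters, n):
    """Slide n-word windows over an embedded sentence.

    embedded: list of word vectors (one list of floats per word)
    filters:  list of filters, each a flat list of n * dim weights
    returns:  one feature map (list of ReLU activations) per filter
    """
    # Each window concatenates the vectors of n consecutive words.
    windows = [
        [x for vec in embedded[i:i + n] for x in vec]
        for i in range(len(embedded) - n + 1)
    ]
    return [
        [max(0.0, sum(w * x for w, x in zip(f, win))) for win in windows]
        for f in filters
    ]

def max_pool(feature_maps):
    """Keep the strongest activation of each feature map (global max)."""
    return [max(fm) for fm in feature_maps]

rng = random.Random(0)
dim, n_words = 20, 12  # hypothetical embedding size and sentence length
sentence = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_words)]
# 32 filters spanning 7 words each, mirroring the "7-gram, 32 features" label.
filters_7gram = [[rng.uniform(-1, 1) for _ in range(7 * dim)] for _ in range(32)]

maps = conv_ngram(sentence, filters_7gram, n=7)
pooled = max_pool(maps)
```

A 12-word sentence yields six 7-word windows, so each of the 32 filters produces a feature map of length 6 before pooling.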

Unsupervised Pre-Training: Skip-Gram Model

Supervised Binary Classifier: Tailored Convolutional Neural Networks

[Figure: Skip-gram model used for the unsupervised embedding. Diagram labels: Input Layer (word, context); Embedding Layer; Reshaping; Merge (dot); Sigmoid. Shape annotations: samples: 1; samples: 1, 20/100; samples: 20/100.]
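The "Merge (dot)" and "Sigmoid" steps of the skip-gram diagram can be sketched as follows. This is a minimal illustration, not the authors' training code: it assumes separate word and context vectors, a hypothetical embedding size of 20, and a single positive (word, context) pair.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def skipgram_score(word_vec, context_vec):
    """Dot product followed by a sigmoid: estimated probability that the pair co-occurs."""
    return sigmoid(sum(w * c for w, c in zip(word_vec, context_vec)))

def sgd_step(word_vec, context_vec, label, lr=0.1):
    """One gradient step on binary cross-entropy for a (word, context) pair."""
    error = skipgram_score(word_vec, context_vec) - label  # dLoss/dLogit
    new_word = [w - lr * error * c for w, c in zip(word_vec, context_vec)]
    new_ctx = [c - lr * error * w for w, c in zip(word_vec, context_vec)]
    return new_word, new_ctx

rng = random.Random(0)
dim = 20  # hypothetical embedding size
word = [rng.uniform(-0.5, 0.5) for _ in range(dim)]
context = [rng.uniform(-0.5, 0.5) for _ in range(dim)]

before = skipgram_score(word, context)
for _ in range(50):
    word, context = sgd_step(word, context, label=1)  # label 1: true co-occurrence
after = skipgram_score(word, context)
```

Training on a positive pair pulls the two vectors together, so `after` is larger than `before`. Randomly sampled pairs with label 0 would push vectors apart in the same way.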

Unsupervised Pre-Training

{'attack': ['network', 'exploit', 'unauthor'],
 'code': ['execut', 'inform', 'special'],
 'exploit': ['success', 'network', 'attack']}
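A neighbor listing like the one above is typically obtained by ranking terms by vector similarity. The sketch below assumes cosine similarity, which the slides do not confirm, and uses toy 2-dimensional vectors invented purely for illustration.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest(term, embeddings, k=3):
    """Return the k terms whose vectors are most similar to `term`."""
    scores = [
        (cosine(embeddings[term], vec), other)
        for other, vec in embeddings.items()
        if other != term
    ]
    return [other for _, other in sorted(scores, reverse=True)[:k]]

# Toy vectors, invented for illustration; real embeddings come from pre-training.
toy = {
    "attack": [1.0, 0.1],
    "exploit": [0.9, 0.2],
    "network": [0.8, 0.3],
    "button": [0.0, 1.0],
}
neighbours = nearest("attack", toy, k=2)  # ['exploit', 'network']
```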

Empirical Evaluation

Supervised Binary Classifier

GitLab Issues (SR:578 | non-SR:578)

GitHub Issues (SR:4,515 | non-SR:47,483)

CVE entries (SR:52,908)

Industry Data-set (SR:24 | non-SR:45)

Unsupervised Pre-Training

Wikipedia Corpus of 10,000 articles

Common Vulnerabilities and Exposures (CVE) database of 52,908 instances

Results

Cross-Entropy and Accuracy Analysis for the Shallow SecureReqNet:

80% Training / 10% Validation

Optimal Capacity at 101 epochs
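The split and the capacity analysis above can be sketched as follows. This is a minimal sketch under stated assumptions: the shuffling seed is hypothetical, the loss curve is invented, and "optimal capacity" is read here as the epoch with the lowest validation loss.

```python
import random

def split_80_10_10(items, seed=0):
    """Shuffle and split into 80% training, 10% validation, 10% testing."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = int(0.8 * len(items))
    n_val = int(0.1 * len(items))
    return (
        items[:n_train],
        items[n_train:n_train + n_val],
        items[n_train + n_val:],
    )

def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=val_losses.__getitem__) + 1

train, val, test = split_80_10_10(range(1000))
# Invented loss curve: it falls, then rises once the model starts to overfit.
epoch = best_epoch([0.70, 0.52, 0.41, 0.38, 0.44, 0.51])  # 4
```

Training past the selected epoch lowers the training loss further but no longer improves the validation loss, which is the usual sign of overfitting.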

Deep

Performance Metrics for Alex-SecureReqNet:

Industry Dataset / 10% Testing

Deep

Performance Metrics for α-SecureReqNet:

Industry Dataset / 10% Testing

Deep

Performance Metrics for Shallow-SecureReqNet:

Industry Dataset / 10% Testing
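The slides do not list which performance metrics were reported. Assuming the usual ones for a binary classifier (accuracy, precision, recall, F1), they can be computed from predictions as in this sketch. The labels below are invented and are not results from the evaluation.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1, with 1 = security-related (SR)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Invented labels for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
```

With a class balance as skewed as the GitHub data (SR:4,515 | non-SR:47,483), precision and recall on the SR class are more informative than accuracy alone.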

Shallow

Summary

Issue Tracker

(Shallow) SecureReqNet

α-SecureReqNet

Thank You!

SecureReqNet GitHub