(Code Similarity Check)

GROUP MEMBERS:

Ali Ghulam (P17-6009)

Faizan Ahmad (P17-6020)

Muhammad Hafeez Ullah (P17-6144)

SUPERVISOR:

Shoaib Muhammad Khan

Assistant Professor

FAST NUCES, Peshawar Campus

slides.com/faizanf33/code-similarity-check-02

Distance Metric for Source Code
Abstract Syntax Trees

Introduction

  • Replicating or altering someone else's code (unethical).
  • Identifying the original creator of source code.
  • Students' coding ability drops.
  • Finding similarities within one specific language.

Problem Statement

  • Generate similarity reports for student code submissions by computing a distance metric over their abstract syntax trees.

Literature Review

Reference | Basic Idea | Method | Results | Limitations
[1] [BASE PAPER] Winnowing: Local Algorithms for Document Fingerprinting (MOSS) | Uses winnowing algorithm to detect shortest match | Records fingerprints and position of the fingerprints in the document | Sequence of hashes generated by hashing k-grams is independent and uniformly random
[2] Comparing Python Programs Using Abstract Syntax Trees | Produces reports on the basis of similarity index | The model can detect code similarity using sub-tree (partial) indexing | Works on Python language only
[3] Design pattern detection based on the graph theory | Detecting design patterns using a semantic graph | The model can detect similar patterns with high accuracy and efficiency

[1] Schleimer, Saul, Daniel S. Wilkerson, and Alex Aiken. "Winnowing: local algorithms for document fingerprinting." Proceedings of the 2003 ACM SIGMOD international conference on Management of data. 2003.

[2] Salazar Paredes, Pedro. Comparing python programs using abstract syntax trees. BS thesis. Uniandes, 2020.

[3] Bahareh Bafandeh Mayvan, Abbas Rasoolzadegan, Design pattern detection based on the graph theory, Knowledge-Based Systems (2017)

Literature Review cont.

Reference | Basic Idea | Method | Results | Limitations
[4] Using Latent Semantic Analysis to Identify Similarities in Source Code to Support Program Understanding | Singular Value Decomposition (SVD) of a matrix derived from a corpus of natural text | Captures significant portions of the meaning not only of individual words
[5] Euclidean Distance Matrices: Essential Theory, Algorithms and Applications | Design algorithms for completing and denoising distance data | Position calibration, room reconstruction from echoes and phase retrieval.

[4] Maletic, Jonathan I., and Andrian Marcus. "Using latent semantic analysis to identify similarities in source code to support program understanding." Proceedings 12th IEEE International Conference on Tools with Artificial Intelligence. ICTAI 2000. IEEE, 2000.

[5] Dokmanic, Ivan, et al. "Euclidean distance matrices: essential theory, algorithms, and applications." IEEE Signal Processing Magazine 32.6 (2015): 12-30.

Use Case Diagram

Use Case Scenario

Actor

  • Upload high-level source code
  • Analyse report

System

  • Generate AST during syntax analysis
  • Create distance metric
  • Find similarity index using metric
  • Generate report

Sequence Diagram

Abstract Syntax Tree (JS)

var area = PI * (radius ** 2);
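
The slide's example is JavaScript. As a rough analogue only, the same assignment can be parsed with Python's standard `ast` module to see what such a tree contains. The names `source` and `node_types` are illustrative and not part of the project.

```python
import ast

# Python equivalent of the JavaScript statement on the slide (an assumption
# made for illustration; the project's own parser is not shown in the deck).
source = "area = PI * (radius ** 2)"


def node_types(code):
    """Return the type name of every node in the AST of `code`."""
    tree = ast.parse(code)
    return [type(node).__name__ for node in ast.walk(tree)]


if __name__ == "__main__":
    # Full tree: an Assign whose value is a BinOp (Mult) that contains
    # a nested BinOp (Pow).
    print(ast.dump(ast.parse(source)))
    print(node_types(source))
```

The nested `BinOp` nodes mirror the grouping `PI * (radius ** 2)`: the parentheses disappear from the tree because the nesting itself encodes precedence.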

Proposed Methodology

  • Disassemble code
  • Generate abstract syntax tree
  • Create adjacency matrix
  • Calculate distance
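
The tree-to-matrix step above can be sketched as follows. This is a minimal illustration that assumes Python input and the standard `ast` module; the function name `ast_adjacency` and the preorder node numbering are hypothetical choices, not the project's actual design.

```python
import ast


def ast_adjacency(source):
    """Build (labels, matrix) for the AST of `source`.

    labels[i] is the node type of node i (preorder numbering);
    matrix[p][c] == 1 when node c is a direct child of node p.
    """
    labels = []
    edges = []

    def visit(node, parent_index):
        index = len(labels)
        labels.append(type(node).__name__)
        if parent_index is not None:
            edges.append((parent_index, index))
        for child in ast.iter_child_nodes(node):
            visit(child, index)

    visit(ast.parse(source), None)

    size = len(labels)
    matrix = [[0] * size for _ in range(size)]
    for parent, child in edges:
        matrix[parent][child] = 1
    return labels, matrix


if __name__ == "__main__":
    labels, matrix = ast_adjacency("x = 1")
    for label, row in zip(labels, matrix):
        print(f"{label:10s} {row}")
```

Numbering nodes at visit time treats every occurrence as its own node, so the matrix always describes a tree with exactly one incoming edge per non-root node.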

Key Equation

  • Levenshtein distance (min. edit distance)
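
For reference, the standard recurrence for the Levenshtein distance between the first i symbols of a sequence a and the first j symbols of a sequence b is given below. The symbols a, b, i and j are notation chosen here, not taken from the slide.

```latex
\[
\operatorname{lev}_{a,b}(i,j) =
\begin{cases}
  \max(i,j) & \text{if } \min(i,j) = 0,\\[4pt]
  \min
  \begin{cases}
    \operatorname{lev}_{a,b}(i-1,j) + 1 & \text{(deletion)}\\
    \operatorname{lev}_{a,b}(i,j-1) + 1 & \text{(insertion)}\\
    \operatorname{lev}_{a,b}(i-1,j-1) + [a_i \neq b_j] & \text{(substitution)}
  \end{cases}
  & \text{otherwise,}
\end{cases}
\]
```

Here $[a_i \neq b_j]$ is 1 when the two symbols differ and 0 when they match.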

Benefit: Insert, delete, substitute operations are allowed
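
A minimal sketch of how this distance could be applied to two programs, assuming each Python AST is flattened into a sequence of node type names. The flattening is a simplification introduced here (it is not a true tree edit distance), and `node_sequence`, `levenshtein` and `similarity` are hypothetical names.

```python
import ast


def node_sequence(source):
    """Flatten the AST of `source` into a list of node type names."""
    return [type(node).__name__ for node in ast.walk(ast.parse(source))]


def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute x with y (free if equal)
            ))
        previous = current
    return previous[-1]


def similarity(source_a, source_b):
    """Similarity index in [0, 1]; 1.0 means identical node sequences."""
    seq_a, seq_b = node_sequence(source_a), node_sequence(source_b)
    longest = max(len(seq_a), len(seq_b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(seq_a, seq_b) / longest


if __name__ == "__main__":
    print(similarity("a = b + c", "x = y + z"))  # renamed variables only
    print(similarity("a = b + c", "a = b * c"))  # one operator changed
```

Because only node types are compared, renaming variables does not change the score, while swapping an operator costs a single substitution.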

References

[1] Schleimer, Saul, Daniel S. Wilkerson, and Alex Aiken. "Winnowing: local algorithms for document fingerprinting." Proceedings of the 2003 ACM SIGMOD international conference on Management of data. 2003.

[2] Salazar Paredes, Pedro. Comparing python programs using abstract syntax trees. BS thesis. Uniandes, 2020.

[3] Bahareh Bafandeh Mayvan, Abbas Rasoolzadegan, Design pattern detection based on the graph theory, Knowledge-Based Systems (2017)

[4] Maletic, Jonathan I., and Andrian Marcus. "Using latent semantic analysis to identify similarities in source code to support program understanding." Proceedings 12th IEEE International Conference on Tools with Artificial Intelligence. ICTAI 2000. IEEE, 2000.

[5] Dokmanic, Ivan, et al. "Euclidean distance matrices: essential theory, algorithms, and applications." IEEE Signal Processing Magazine 32.6 (2015): 12-30.

Thank you for your precious time.

Any Suggestions?

Code Similarity Check - II

By Faizan Ahmad

This project uses graph theory and program disassembly to create abstract syntax trees from code. The trees are used to generate similarity reports for student code submissions in different languages, including Python, Java, and C++.
