Git-a-rec

Github Data Analysis and Recommender System

Project Mentor:

Asst. Prof. Anuj Mahajan

FCSE, SMVDU

Presented By:

Akshay Pratap (2011ECS01)

Rishabh Shukla (2011ECS13)

 

Github

Github is a web-based version control and open-source collaboration platform where developers from around the world contribute to open-source projects (repositories).

A typical Github Repository

 

Analysis of a large volume of open Github data, looking for patterns in the popularity and spatial distribution of programming languages and users on Github.

 

 

"Git-a"-rec

The project also uses content-based filtering, coupled with Apache Spark, to build a recommender system for Github users.

Git-a-"rec"

 

Data Analysis Pipeline

Github Data Analysis 

  • Data Acquisition
    • Downloaded MongoDB dumps from Github Open Data Platform
  • Data cleaning/wrangling
    • Created MongoDB collections for respective datasets
    • Removed redundant data fields
    • Removed duplicate documents from collections
    • Exported MongoDB collections to CSV files
  • Pre-processing Data
    • Created Data Frames for R Visualizations
    • Fixed inconsistent Documents
    • Derived coordinates from "location" strings using geopy
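
The cleaning steps above (dropping redundant fields, removing duplicate documents, exporting to CSV) can be sketched in plain Python. This is a minimal illustration, not the project's actual code: the sample documents, the field names and the helper names `clean_documents` and `to_csv` are all hypothetical, and an in-memory list stands in for a MongoDB collection.

```python
import csv
import io


def clean_documents(docs, keep_fields, key_field):
    """Keep only keep_fields and drop documents that repeat key_field."""
    seen = set()
    cleaned = []
    for doc in docs:
        key = doc.get(key_field)
        if key is None or key in seen:
            continue  # skip documents with a missing or duplicate key
        seen.add(key)
        cleaned.append({field: doc.get(field, "") for field in keep_fields})
    return cleaned


def to_csv(rows, fields):
    """Serialize cleaned rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample documents; the field names are illustrative only.
users = [
    {"login": "alice", "location": "Berlin", "gravatar_id": "x1"},
    {"login": "bob", "location": "Pune", "gravatar_id": "x2"},
    {"login": "alice", "location": "Berlin", "gravatar_id": "x1"},
]
fields = ["login", "location"]
cleaned_users = clean_documents(users, fields, key_field="login")
csv_text = to_csv(cleaned_users, fields)
print(csv_text)
```

Deduplicating on a single key field is an assumption made for brevity; a real collection may need a compound key.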

Github Data Analysis (continued)

  • Inferential Statistics
    • Programming languages trends for last 4 years
    • Statistics of users from various Companies on Github
    • Programming languages used in various companies
    • Programming Language Demographics  
    • User Demographics
  • Code Optimization
    • Used R Big Data packages like data.table

Data Analysis Visualizations

Proportion of Users from various Companies

What languages are being used in various companies?

Ruby Spatial Density - USA

User Demographics - Europe

Data Analysis - Challenges and Limitations

  • Acquiring and cleaning data for about 1.4 million Github users and more than 4 million repositories
  • Transferring huge MongoDB dumps to a local MongoDB database
  • Converting MongoDB documents to R data frames
  • Big data manipulation - merge, group by, row subsetting
  • Plotting a large number of spatial coordinates
  • Primary focus was Open Source Technologies 

Languages and Libraries Used

  • R - ggplot2, data.table, leaflet, dplyr
  • Python - pymongo, geopy, flask
  • MongoDB - NoSQL database

Recommender System

  • Uses existing dataset and content-based filtering to create similarity matrices
  • Maps users working with a language to similar repositories
  • Rates and recommends repositories-to-work-on to users
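
One way to picture the three steps above is an inverted index from language to repositories, followed by a simple score per repository. The sketch below is an assumption-laden illustration rather than the project's method: the scoring rule (user language share times repository language share), the sample data and the names `normalize`, `build_language_index` and `recommend` are all hypothetical.

```python
from collections import defaultdict


def normalize(counts):
    """Turn raw language counts into shares that sum to 1."""
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()} if total else {}


def build_language_index(repos):
    """Map each language to the repositories that use it, with their share."""
    index = defaultdict(list)
    for name, langs in repos.items():
        for lang, share in normalize(langs).items():
            index[lang].append((name, share))
    return index


def recommend(user_langs, index, top_n=3):
    """Rate repositories by language overlap and return the top names."""
    scores = defaultdict(float)
    for lang, user_share in normalize(user_langs).items():
        for name, repo_share in index.get(lang, []):
            scores[name] += user_share * repo_share
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:top_n]]


# Hypothetical data: language usage counts per repository and for one user.
repos = {
    "spark-jobs": {"Scala": 8, "Python": 2},
    "flask-app": {"Python": 9, "HTML": 1},
    "rails-blog": {"Ruby": 10},
}
user = {"Python": 5, "Scala": 1}
recommendations = recommend(user, build_language_index(repos))
print(recommendations)
```

Repositories that share no language with the user never enter the score table, so they are never recommended.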

Overview of Recommender System

High-Level Architecture

cosine-similarity

Returns a value in [0, 1] when the vectors are non-negative (as with language-usage counts); the more similar two users are, the higher the value.
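
A minimal sketch of cosine similarity in plain Python, assuming each user is represented as a list of non-negative language-usage counts. The function name `cosine_similarity` and the sample vectors are illustrative; returning 0.0 for an all-zero vector is a convention chosen here, not something the slides specify.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention: an empty profile is similar to nothing
    return dot / (norm_u * norm_v)


# Hypothetical usage counts over the languages (Python, Scala, Ruby).
alice = [5, 1, 0]
bob = [4, 2, 0]
carol = [0, 0, 7]
print(cosine_similarity(alice, bob))
print(cosine_similarity(alice, carol))
```

Because the counts are never negative, the dot product is never negative either, which is why the result stays between 0 and 1 in this setting.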

User - Language Matrix

Mapper - Reducer Code
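
The code from the original slide is not reproduced here. As a rough idea of the pattern, the sketch below counts repositories per (user, language) pair in map-reduce style, which is one way the entries of a user-language matrix could be prepared. The record layout (`owner`, `language`) and the function names are assumptions, and everything runs in a single process rather than on Hadoop.

```python
from collections import defaultdict
from itertools import chain


def mapper(repo):
    """Emit ((owner, language), 1) for a repository that has both fields."""
    if repo.get("owner") and repo.get("language"):
        yield (repo["owner"], repo["language"]), 1


def reducer(pairs):
    """Sum the emitted values for each (owner, language) key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Hypothetical repository records; the last one has no language and is skipped.
repos = [
    {"owner": "alice", "language": "Python"},
    {"owner": "alice", "language": "Python"},
    {"owner": "alice", "language": "R"},
    {"owner": "bob", "language": "Scala"},
    {"owner": "bob", "language": None},
]
user_language_counts = reducer(chain.from_iterable(mapper(r) for r in repos))
print(user_language_counts)
```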

Apache Spark Algorithms 

Sparse - Matrix Representation for Similarity
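
A sparse similarity matrix can be stored as a list of (row, column, value) triples that skips the zero entries, which is the same idea as the entries of Spark's CoordinateMatrix. The plain-Python sketch below only illustrates the representation. The helper names and the small sample matrix are hypothetical, and no Spark API is used.

```python
def to_coordinate_entries(dense, threshold=0.0):
    """Return (row, col, value) triples for entries greater than threshold."""
    return [
        (i, j, value)
        for i, row in enumerate(dense)
        for j, value in enumerate(row)
        if value > threshold
    ]


def to_dense(entries, n_rows, n_cols):
    """Rebuild a dense matrix, filling missing entries with 0.0."""
    dense = [[0.0] * n_cols for _ in range(n_rows)]
    for i, j, value in entries:
        dense[i][j] = value
    return dense


# Hypothetical 3x3 user-user similarity matrix with mostly zero entries.
similarity = [
    [1.0, 0.0, 0.8],
    [0.0, 1.0, 0.0],
    [0.8, 0.0, 1.0],
]
entries = to_coordinate_entries(similarity)
print(entries)
```

Only five of the nine cells are stored here; on a matrix with millions of users the saving is what makes the similarity data manageable.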

Scaling and Bias formula

Challenges and Limitations of Recommender System

  • Manipulating a huge amount of data; used MapReduce for initial data preparation
  • High-dimensional matrix multiplication; used Apache Spark
  • Sparse dataset; recommendations could be more accurate with more data
  • Limited Computational Power on local machines 

Technologies Used

  • Scala
  • Python
  • MapReduce
  • Hadoop Distributed File System (HDFS)
  • Apache Spark - MLlib, CoordinateMatrix

Future Work

  • Running the whole cluster on Amazon EC2 (distributed computing)
  • Further data analysis of organizations and hireable users
  • Creating a web app for the recommendation system
  • Using DIMSUM for optimized matrix multiplication
  • Interfacing with the Github API for more dynamic recommendations

Conclusions

It is imperative for programmers to keep up with the latest technologies in computer science. This analysis of Github data provides an overview of the technologies in use around the globe and of the spatial distribution of both those technologies and their users.

 

Furthermore, the Github recommendation system is an attempt to bring more contributions to the open-source world by providing personalized recommendations of Github repositories.

Thank you.

