Git-a-rec

Github Data Analysis and Recommender System

Project Mentor:

Asst. Prof. Anuj Mahajan

FCSE, SMVDU

Presented By:

Akshay Pratap (2011ECS01)

Rishabh Shukla (2011ECS13)

Github

Github is a web-based open-source contribution and version control platform, where developers from all around the world contribute to open-source projects (repositories).

A typical Github Repository

Data analysis of a large volume of open Github data, in which we looked for patterns in the popularity and spatial distribution of programming languages and users on Github.

"Git-a"-rec

It further employs a content-based filtering approach, coupled with Apache Spark, to build a recommender system for Github users.

Git-a-"rec"

Data Analysis Pipeline

Github Data Analysis

Data Acquisition
- Downloaded MongoDB dumps from the Github Open Data Platform
Data cleaning/wrangling
- Created MongoDB collections for respective datasets
- Removed redundant data fields
- Removed duplicate documents from collections
- Exported MongoDB collections to CSV files
Pre-processing Data
- Created Data Frames for R Visualizations
- Fixed inconsistent Documents
- Derived coordinates from "location" strings using geopy
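
The cleaning and pre-processing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the field names (`login`, `company`, `location`) are assumed, and `geocode` is a hypothetical stand-in for any callable that turns a location string into an object with `.latitude` and `.longitude` (for example, the `geocode` method of a geopy geocoder).

```python
import csv
import io

# Hypothetical field names; the real dump's schema may differ.
KEEP_FIELDS = ["login", "company", "location"]


def clean_users(documents):
    """Keep only the fields of interest and drop duplicate users by login."""
    seen = set()
    cleaned = []
    for doc in documents:
        login = doc.get("login")
        if login is None or login in seen:
            continue  # skip inconsistent or duplicate documents
        seen.add(login)
        cleaned.append({field: doc.get(field) or "" for field in KEEP_FIELDS})
    return cleaned


def to_csv(rows):
    """Serialise cleaned rows to CSV text, ready to be read into R."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=KEEP_FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def add_coordinates(rows, geocode):
    """Attach lat/lon to each row, geocoding each distinct location only once."""
    cache = {}
    for row in rows:
        location = row["location"]
        if location and location not in cache:
            cache[location] = geocode(location)
        hit = cache.get(location)
        row["lat"] = hit.latitude if hit else None
        row["lon"] = hit.longitude if hit else None
    return rows
```

Caching by location string matters here because many users share the same free-text location, so each distinct string only needs one geocoding request.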

...Github Data Analysis

Inferential Statistics
- Programming language trends over the last 4 years
- Statistics of users from various Companies on Github
- Programming languages used in various companies
- Programming Language Demographics
- User Demographics
Code Optimization
- Used R Big Data packages like data.table
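
As an illustration of the kind of aggregation behind a language trend chart, the sketch below computes each language's share of repositories per year. The analysis itself was done in R; this Python version only shows the idea, and the input shape (pairs of creation year and language) is an assumption rather than the dataset's real schema.

```python
from collections import Counter, defaultdict


def language_share_by_year(repos):
    """Return {year: {language: share of that year's repositories}}.

    `repos` is an iterable of (created_year, language) pairs.
    """
    counts = defaultdict(Counter)
    for year, language in repos:
        if language:  # skip repositories with no language recorded
            counts[year][language] += 1

    shares = {}
    for year, counter in counts.items():
        total = sum(counter.values())
        shares[year] = {lang: n / total for lang, n in counter.items()}
    return shares
```

Using shares rather than raw counts keeps the years comparable even though the total number of repositories grows over time.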

Data Analysis Visualizations

Proportion of Users from various Companies

What languages are being used in various companies?

Ruby Spatial Density - USA

User Demographics - Europe

Data Analysis - Challenges and Limitations

Acquiring and cleaning data on about 1.4 million Github users and more than 4 million repositories
Transferring huge MongoDB dumps to a local MongoDB database
Converting MongoDB documents to R data frames
Big data manipulation: merge, group by, row subsetting
Plotting a large number of spatial coordinates
Primary focus was on open-source technologies

Languages and Libraries Used

R - ggplot2, data.table, leaflet, dplyr
Python - pymongo, geopy, flask
MongoDB - NoSQL database

Recommender System

Uses existing dataset and content-based filtering to create similarity matrices
Maps users working with a language to similar repositories
Rates and recommends repositories-to-work-on to users
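
A minimal sketch of the rate-and-recommend step, under assumptions of our own: each repository is described only by its primary language, and a repository's rating is simply the user's share of activity in that language. The function names and the scoring rule are hypothetical and are not taken from the project.

```python
def language_profile(language_counts):
    """Normalise a user's per-language activity counts into shares summing to 1."""
    total = sum(language_counts.values())
    return {lang: n / total for lang, n in language_counts.items()} if total else {}


def recommend(profile, repo_languages, already_seen=(), top_n=3):
    """Rank repositories by how well their language matches the user's profile.

    `repo_languages` maps a repository name to its primary language.
    The rating is the user's share for that language (an assumed scoring rule).
    """
    ratings = {
        repo: profile.get(lang, 0.0)
        for repo, lang in repo_languages.items()
        if repo not in already_seen
    }
    # Highest rating first; ties broken alphabetically for stable output.
    ranked = sorted(ratings.items(), key=lambda item: (-item[1], item[0]))
    return [(repo, score) for repo, score in ranked[:top_n] if score > 0]
```

For a user with three Python contributions and one Scala contribution, a Python repository they have not yet worked on would rank above a Scala one, and a Ruby repository would not be recommended at all.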

Overview of Recommender System

High-Level Architecture

cosine-similarity

Returns a value bounded in [0, 1] for non-negative vectors such as language usage counts; the more similar two users are, the higher the value.
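
Cosine similarity between two vectors u and v is (u · v) / (‖u‖ ‖v‖). A minimal sketch, assuming each user is represented as a dictionary from language to a non-negative weight (that representation is our assumption):

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors given as {key: weight} dicts."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for an empty profile
    return dot / (norm_u * norm_v)
```

Two users with the same language mix score 1 regardless of how active each one is, and users with no language in common score 0.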

User - Language Matrix

Mapper - Reducer Code
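
As a rough illustration of how a mapper and reducer could prepare user-language counts, the sketch below assumes tab-separated input lines of the form `user<TAB>language`. The input format and function names are hypothetical; this is not the project's original code.

```python
from itertools import groupby


def mapper(lines):
    """Emit ((user, language), 1) for every well-formed 'user<TAB>language' line."""
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 2 and all(parts):
            yield (parts[0], parts[1]), 1


def reducer(pairs):
    """Sum the counts for each (user, language) key.

    Sorting here stands in for the shuffle/sort phase that Hadoop performs
    between the map and reduce steps.
    """
    for key, group in groupby(sorted(pairs), key=lambda pair: pair[0]):
        yield key, sum(count for _, count in group)
```

The reducer's output, one count per (user, language) pair, is exactly the set of non-zero cells of a user-language matrix.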

Apache Spark Algorithms

Sparse - Matrix Representation for Similarity
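
A coordinate (sparse) representation stores only the non-zero cells as (row, column, value) triples, which is also how the entries of a Spark CoordinateMatrix are described. The pure-Python sketch below shows the idea on a small scale and computes user-user cosine similarities only for pairs that share at least one language; it is an illustration under our own assumptions, not the Spark job itself.

```python
import math
from collections import defaultdict


def to_coordinate_entries(counts):
    """Turn {(user, language): count} into (row, col, value) triples plus index lists."""
    users = sorted({user for user, _ in counts})
    languages = sorted({lang for _, lang in counts})
    row_of = {user: i for i, user in enumerate(users)}
    col_of = {lang: j for j, lang in enumerate(languages)}
    entries = [(row_of[u], col_of[l], float(n)) for (u, l), n in sorted(counts.items())]
    return entries, users, languages


def sparse_user_similarity(entries):
    """Return {(i, j): cosine} for user pairs i < j with a non-zero similarity."""
    by_col = defaultdict(list)
    norms = defaultdict(float)
    for i, j, value in entries:
        by_col[j].append((i, value))
        norms[i] += value * value  # accumulate squared row norms

    # Only users appearing in the same column can have a non-zero dot product.
    dots = defaultdict(float)
    for column in by_col.values():
        for a, (i, x) in enumerate(column):
            for k, y in column[a + 1:]:
                pair = (i, k) if i < k else (k, i)
                dots[pair] += x * y

    return {pair: dot / math.sqrt(norms[pair[0]] * norms[pair[1]]) for pair, dot in dots.items()}
```

Pairs of users with no language in common never appear in the result, so the similarity matrix stays sparse as well.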

Scaling and Bias formula

Challenges and Limitations of Recommender System

Manipulating a huge amount of data; used MapReduce for initial data preparation
High-dimensional matrix multiplication; used Apache Spark
Sparse dataset; recommendations could be more accurate with more data
Limited computational power on local machines

Technologies Used

Scala
Python
MapReduce
Hadoop Distributed File System (HDFS)
Apache Spark - MLlib, CoordinateMatrix

Future Work

Running the whole cluster on Amazon EC2 for distributed computing
Further data analysis of organizations and hireable users
Creating a web app for the recommender system
Using DIMSUM for optimized matrix multiplication
Interfacing with the Github API for more dynamic recommendations

Conclusions

It is imperative for programmers to keep up with the latest technologies in computer science. This analysis of Github data provides an overview of the technologies being used around the globe, along with the spatial distribution of those technologies and of their users.

Furthermore, the Github recommender system is an attempt to bring more contributions to the open-source world by providing personalized recommendations of Github repositories.

Thank you.
