Git-a-rec
Github Data Analysis and Recommender System
Project Mentor:
Asst. Prof. Anuj Mahajan
FCSE, SMVDU
Presented By:
Akshay Pratap (2011ECS01)
Rishabh Shukla (2011ECS13)
Github
Github is a web-based version control and open-source collaboration platform, where developers from all around the world contribute to open-source projects (repositories).
A typical Github Repository
Analysis of a large volume of open Github data, looking for patterns in the popularity and spatial distribution of programming languages and users on Github.
"Git-a"-rec
It further employs a content-based filtering approach, coupled with Apache Spark, to build a recommender system for Github users.
Git-a-"rec"
Data Analysis Pipeline
Github Data Analysis
- Data Acquisition
  - Downloaded MongoDB dumps from the Github Open Data Platform
- Data Cleaning/Wrangling
  - Created MongoDB collections for the respective datasets
  - Removed redundant data fields
  - Removed duplicate documents from collections
  - Exported MongoDB collections to CSV files
- Pre-processing Data
  - Created data frames for R visualizations
  - Fixed inconsistent documents
  - Obtained coordinates from "location" strings using geopy
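The cleaning steps above (dropping redundant fields, removing duplicates, exporting to CSV) can be sketched as follows. This is a minimal illustration using only the Python standard library, not the project's actual scripts: it assumes the documents have already been read from MongoDB into plain dicts, and the field names (`login`, `location`, `gravatar_id`) and the `clean` / `to_csv` helpers are hypothetical.

```python
import csv
import io

# Hypothetical sample documents; the field names are assumptions,
# not the real schema of the Github dumps.
raw_docs = [
    {"_id": 1, "login": "alice", "location": "Berlin", "gravatar_id": "x1"},
    {"_id": 2, "login": "bob", "location": "Pune", "gravatar_id": "x2"},
    {"_id": 3, "login": "alice", "location": "Berlin", "gravatar_id": "x1"},  # duplicate user
]

KEEP_FIELDS = ["login", "location"]  # assumed subset of fields worth keeping


def clean(docs, keep_fields, key="login"):
    """Drop unneeded fields and keep the first document seen for each key."""
    seen = set()
    cleaned = []
    for doc in docs:
        if doc[key] in seen:
            continue  # skip duplicate documents
        seen.add(doc[key])
        cleaned.append({field: doc.get(field, "") for field in keep_fields})
    return cleaned


def to_csv(rows, fields):
    """Serialize cleaned rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


cleaned_docs = clean(raw_docs, KEEP_FIELDS)
csv_text = to_csv(cleaned_docs, KEEP_FIELDS)
print(csv_text)
```

Deduplicating on a single key keeps the sketch short; a real collection might need a compound key or a comparison of whole documents.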
...Github Data Analysis
- Inferential Statistics
  - Programming language trends over the last 4 years
  - Statistics of users from various companies on Github
  - Programming languages used in various companies
  - Programming language demographics
  - User demographics
- Code Optimization
  - Used R big-data packages such as data.table
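As an illustration of the language-trend statistic listed above, the sketch below computes each language's share of repositories per year. The analysis itself was done in R; this Python version only shows the grouping logic, and the `(year, language)` records and the `language_share_by_year` helper are made up for the example.

```python
from collections import Counter, defaultdict

# Hypothetical (creation year, primary language) pairs, one per repository.
repos = [
    (2012, "Ruby"), (2012, "Ruby"), (2012, "Python"),
    (2013, "Python"), (2013, "JavaScript"), (2013, "Python"),
]


def language_share_by_year(records):
    """Return {year: {language: fraction of that year's repositories}}."""
    counts = defaultdict(Counter)
    for year, language in records:
        counts[year][language] += 1
    shares = {}
    for year, counter in counts.items():
        total = sum(counter.values())
        shares[year] = {lang: n / total for lang, n in counter.items()}
    return shares


shares = language_share_by_year(repos)
for year in sorted(shares):
    print(year, shares[year])
```

Working with shares rather than raw counts keeps years comparable even when the total number of repositories grows.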
Data Analysis Visualizations
Proportion of Users from various Companies
What languages are being used in various companies?
Ruby Spatial Density - USA
User Demographics - Europe
Data Analysis - Challenges and Limitations
- Acquiring and cleaning data for about 1.4 million Github users and more than 4 million repositories
- Transferring huge MongoDB dumps to a local MongoDB database
- Converting MongoDB documents to R data frames
- Big Data Manipulation - merge, group by, row subsetting
- Plotting large number of spatial coordinates
- Primary focus was Open Source Technologies
Languages and Libraries Used
- R - ggplot2, data.table, leaflet, dplyr
- Python - pymongo, geopy, flask
- MongoDB - NoSQL database
Recommender System
- Uses existing dataset and content-based filtering to create similarity matrices
- Maps users working with a language to similar repositories
- Rates and recommends repositories-to-work-on to users
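The three points above can be sketched end to end: build a language profile per user and per repository, score each repository against the user, and return the top matches. This is a toy illustration under stated assumptions, not the project's Spark implementation; the profiles, the repository names, and the `cosine` / `recommend` helpers are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(user_profile, repo_profiles, top_n=2):
    """Rank repositories by similarity to the user's language profile."""
    scored = [(cosine(user_profile, profile), name)
              for name, profile in repo_profiles.items()]
    # Highest score first; ties broken alphabetically by repository name.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(name, score) for score, name in scored[:top_n] if score > 0]


# Hypothetical data: language usage counts for one user,
# and language proportions for three repositories.
user = {"Python": 8, "Scala": 2}
repo_profiles = {
    "repo-a": {"Python": 1.0},
    "repo-b": {"Scala": 0.7, "Java": 0.3},
    "repo-c": {"Ruby": 1.0},
}

recommendations = recommend(user, repo_profiles)
for name, score in recommendations:
    print(name, round(score, 3))
```

Repositories that share no language with the user score zero and are left out, which is why `repo-c` never appears here.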
Overview of Recommender System
High-Level Architecture
Cosine Similarity
Returns a value bounded in [0, 1] when the vectors are non-negative (as language counts are); more similar users get a higher value, less similar users a lower one.
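For reference, the standard definition of cosine similarity between two vectors is given below. The symbols $u$ and $v$ (for example, two rows of a user-language matrix) are notation introduced here, not taken from the slides.

```latex
\mathrm{sim}(u, v)
  = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}
  = \frac{\sum_i u_i v_i}{\sqrt{\sum_i u_i^2}\,\sqrt{\sum_i v_i^2}}
```

When every component is non-negative the numerator is at least 0, and by the Cauchy-Schwarz inequality it never exceeds the denominator, which gives the [0, 1] range. For general vectors the range is [-1, 1].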
User - Language Matrix
Mapper - Reducer Code
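The original mapper and reducer code is not reproduced in this text. As a rough sketch of the idea only, the following pure-Python example mimics the map, shuffle/sort, and reduce phases to build a user-language count matrix. The input format (`user,repo,language` lines) and all function names are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

# Hypothetical input lines in the assumed form "user,repo,language".
lines = [
    "alice,repo1,Python",
    "alice,repo2,Python",
    "alice,repo3,Scala",
    "bob,repo4,Ruby",
]


def mapper(line):
    """Emit ((user, language), 1) for every repository record."""
    user, _repo, language = line.strip().split(",")
    yield (user, language), 1


def reducer(key, values):
    """Sum the counts emitted for one (user, language) key."""
    return key, sum(values)


def run(input_lines):
    # Map phase, followed by a sort that stands in for the shuffle step.
    mapped = sorted(pair for line in input_lines for pair in mapper(line))
    # Reduce phase: one reducer call per distinct key.
    reduced = [
        reducer(key, (count for _, count in group))
        for key, group in groupby(mapped, key=itemgetter(0))
    ]
    # Assemble the user-language matrix as nested dicts.
    matrix = defaultdict(dict)
    for (user, language), count in reduced:
        matrix[user][language] = count
    return dict(matrix)


user_language = run(lines)
print(user_language)
```

On a real cluster the mapper and reducer would run as separate tasks with the framework handling the shuffle; the in-process sort here only imitates that step.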
Apache Spark Algorithms
Sparse - Matrix Representation for Similarity
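A sparse representation stores only the non-zero cells as (row, column, value) entries, similar in spirit to the entries of Spark MLlib's CoordinateMatrix. The sketch below is plain Python, not Spark, and computes cosine similarity only for pairs of rows that share at least one non-zero column. The sample entries and the `sparse_row_similarities` helper are hypothetical, and the code assumes at most one entry per (row, column) pair.

```python
import math
from collections import defaultdict
from itertools import combinations

# Hypothetical coordinate entries: (user index, language index, weight).
entries = [
    (0, 0, 3.0), (0, 1, 4.0),
    (1, 0, 3.0), (1, 1, 4.0),
    (2, 2, 5.0),
]


def sparse_row_similarities(coords):
    """Return {(row_a, row_b): cosine similarity} for rows sharing a column."""
    squared_norms = defaultdict(float)
    by_column = defaultdict(list)
    for row, col, value in coords:
        squared_norms[row] += value * value
        by_column[col].append((row, value))

    # Accumulate dot products column by column, touching only non-zero cells.
    dots = defaultdict(float)
    for column_entries in by_column.values():
        for (r1, v1), (r2, v2) in combinations(sorted(column_entries), 2):
            dots[(r1, r2)] += v1 * v2

    return {
        pair: dot / math.sqrt(squared_norms[pair[0]] * squared_norms[pair[1]])
        for pair, dot in dots.items()
    }


similarities = sparse_row_similarities(entries)
print(similarities)
```

Pairs with no language in common never appear in the result, so the similarity matrix stays sparse as well.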
Scaling and Bias formula
Challenges and Limitations of Recommender System
- Manipulating a huge amount of data; MapReduce was used for initial data preparation
- High-dimensional matrix multiplication; Apache Spark was used
- Sparse dataset; recommendations could be more accurate with more data
- Limited computational power on local machines
Technologies Used
- Scala
- Python
- MapReduce
- Hadoop Distributed File System (HDFS)
- Apache Spark - MLlib, CoordinateMatrix
Future Work
- Running the whole cluster on Amazon EC2 - Distributed Computing Systems
- Further data analysis of organizations and hireable users
- Creating a web app for the recommender system
- Using DIMSUM for optimized matrix multiplication
- Interfacing with Github API for more dynamic recommendations
Conclusions
It is imperative for programmers to keep up with the latest technologies in computer science. This analysis of Github data gives an overview of the technologies in use around the globe, along with the spatial distribution of those technologies and of their users.
Furthermore, the Github recommender system is an attempt to bring more contributions to the open-source world by providing personalized recommendations of Github repositories.
Thank you.