Creation of a dataset from GitHub
25th of August, 2014
Wroclaw, Poland
Objective
Creation of a dataset containing data and information about software development
What is GitHub?
- Social coding
- Social network
- Development of software components
Data in GitHub?
- ~Static
  - The social network (users & repositories)
- Dynamic
  - Events of every user & repository
GitHub Data Model
More specifically...
- Users
- Repositories
- Issues
- Milestones
- Commits
- Collaborators
- Downloads
- Pull-requests
- Labels
- Comments
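A minimal sketch of how two of the entities listed above could be represented in memory before serialization. The field names and types are assumptions chosen for illustration; they are not taken from the GitHub API or from the dataset described in these slides.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Repository:
    # Hypothetical fields; only the entity names come from the list above.
    owner: str
    name: str
    commits: List[str] = field(default_factory=list)        # commit SHAs
    collaborators: List[str] = field(default_factory=list)  # user logins
    issues: List[str] = field(default_factory=list)         # issue titles
    labels: List[str] = field(default_factory=list)
    milestones: List[str] = field(default_factory=list)

    @property
    def full_name(self) -> str:
        """Return the 'owner/name' form commonly used to identify a repository."""
        return f"{self.owner}/{self.name}"


@dataclass
class User:
    login: str
    repositories: List[Repository] = field(default_factory=list)


repo = Repository(owner="octocat", name="hello-world")
user = User(login="octocat", repositories=[repo])
print(user.repositories[0].full_name)
```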
GitHub Events (18 types)
- IssuesEvent
- MemberEvent
- PageBuildEvent
- PublicEvent
- PullRequestEvent
- PullRequestReviewCommentEvent
- PushEvent
- ReleaseEvent
- StatusEvent
- TeamAddEvent
- WatchEvent
- CommitCommentEvent
- CreateEvent
- DeleteEvent
- DeploymentEvent
- DeploymentStatusEvent
- DownloadEvent
- FollowEvent
- ForkEvent
- ForkApplyEvent
- GistEvent
- GollumEvent
- IssueCommentEvent
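To get a feel for the dynamic side of the data, the events can be tallied by type. The sketch below assumes each event is available as one JSON object per line with a "type" field holding one of the names above. The sample records are hand-written and simplified, not real GitHub data.

```python
import json
from collections import Counter


def count_event_types(lines):
    """Count events per type from an iterable of JSON strings, one event per line."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        event = json.loads(line)
        counts[event.get("type", "Unknown")] += 1
    return counts


# Hand-written, simplified sample records (assumption: not real payloads).
sample = [
    '{"type": "PushEvent", "actor": "alice"}',
    '{"type": "WatchEvent", "actor": "bob"}',
    '{"type": "PushEvent", "actor": "carol"}',
]
print(count_event_types(sample).most_common())
```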
How to access data?
- GitHub API, through client libraries for different programming languages
- Direct JSON requests
- GitHub Archive (daily events)
- Google BigQuery
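As an illustration of the "direct JSON requests" option, the sketch below builds the URL of the public users endpoint and reduces the response to a few fields. The helper names (`user_url`, `fetch_json`, `summarize_user`) and the choice of fields are assumptions made for this example. `fetch_json` needs network access and is only defined here, not called.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def user_url(login):
    """Build the URL of the public profile endpoint for a login."""
    return f"{API_ROOT}/users/{login}"


def fetch_json(url, token=None):
    """Perform a GET request and decode the JSON body (requires network access)."""
    request = urllib.request.Request(url, headers={"User-Agent": "dataset-sketch"})
    if token:
        # Authenticated requests get a higher rate limit.
        request.add_header("Authorization", f"token {token}")
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def summarize_user(payload):
    """Keep only a few profile fields; missing fields become None."""
    return {key: payload.get(key) for key in ("login", "followers", "public_repos")}


# Offline usage with a hand-written payload standing in for a real response.
print(user_url("octocat"))
print(summarize_user({"login": "octocat", "followers": 3, "public_repos": 2}))
```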
Creation of the GitHub dataset
- Using the GitHub API library for Java
  - Users and repositories
  - Repositories: commits, collaborators, downloads, issues, labels and milestones
- Serialization:
  - CSV files
  - Neo4j as a graph
- Also supports querying Google BigQuery from Java
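The CSV serialization step can be sketched as one file of nodes and one file of edges, a layout that is also convenient for loading into a graph database. The column names and the two helper functions below are hypothetical. The original work used Java, so this Python version only illustrates the idea.

```python
import csv
import io


def write_users_csv(users, out):
    """Write one row per user (node file)."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["login", "public_repos"])
    for user in users:
        writer.writerow([user["login"], user["public_repos"]])


def write_follows_csv(edges, out):
    """Write one row per (follower, followed) pair (edge file)."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["follower", "followed"])
    writer.writerows(edges)


# In-memory buffers stand in for real files in this example.
users_buffer = io.StringIO()
write_users_csv([{"login": "alice", "public_repos": 3}], users_buffer)

edges_buffer = io.StringIO()
write_follows_csv([("alice", "bob")], edges_buffer)

print(users_buffer.getvalue())
print(edges_buffer.getvalue())
```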
My user
...more nodes...
Some stats & limitations
- 8,526,145 potential users (an upper bound; the actual number is lower)
- 400,000 users (logins have been gathered)
- 135 users are fully described
- Some memory problems (because of commits)...
- Neo4j in standalone mode (no server)
- 5,000 requests/hour per authenticated user
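The rate limit explains why full creation is slow. The back-of-the-envelope sketch below estimates crawl time from the figures above. The number of requests needed per user is an assumption, since it depends on how many repositories, commits and issues each user has.

```python
REQUESTS_PER_HOUR = 5000  # limit per authenticated user, as quoted above


def hours_needed(num_users, requests_per_user, limit=REQUESTS_PER_HOUR):
    """Return the whole number of hours needed, rounding up."""
    total_requests = num_users * requests_per_user
    return (total_requests + limit - 1) // limit


# One request per user (profile only) versus a hypothetical ten requests per user.
print(hours_needed(400_000, 1))
print(hours_needed(400_000, 10))
```

With a single token, 400,000 users at one request each already take 80 hours, which is why describing users fully (with commits) is much more expensive than gathering logins.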
What's next?
- Establish which dataset is more relevant...
  - Static
  - Dynamic (events)
  - Both
- ...full creation! (it takes time...)
- Relational machine learning model
  - Multimode
  - Timestamps in most events and entities
- Heuristics to determine the "best projects"
  - Followers
  - Downloads
  - Pull requests
  - Commits
  - ...
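One simple form such a heuristic could take is a weighted sum of the indicators listed above. The weights below are hypothetical placeholders, not values proposed in this work. Choosing or learning them is exactly the open question.

```python
# Hypothetical weights, for illustration only.
WEIGHTS = {"followers": 3, "downloads": 1, "pull_requests": 5, "commits": 1}


def project_score(metrics, weights=WEIGHTS):
    """Weighted sum of the available indicators; missing ones count as zero."""
    return sum(weight * metrics.get(name, 0) for name, weight in weights.items())


def rank_projects(projects):
    """Return project names ordered from highest to lowest score."""
    return sorted(projects, key=lambda name: project_score(projects[name]), reverse=True)


# Made-up numbers for two imaginary projects.
projects = {
    "a": {"followers": 10, "commits": 100},
    "b": {"followers": 50, "pull_requests": 5},
}
print(rank_projects(projects))
```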
Related Works
- SAGH: A Social Analysis tool for GitHub
  - A developer recommendation system (and other metrics)
  - http://snap.stanford.edu/class/cs224w-2012/projects/cs224w-015-final.pdf
- An Analysis of GitHub’s Collaborative Software Network
  - Link Prediction
  - http://snap.stanford.edu/class/cs224w-2012/projects/cs224w-050-final.v01.pdf
- Network Structure of Social Coding in GitHub
  - http://www.mysmu.edu/faculty/lxjiang/papers/csmr13github.pdf
  - http://ink.library.smu.edu.sg/cgi/viewcontent.cgi?article=2686&context=sis_research
- Coding Together at Scale: GitHub as a Collaborative Social Network
  - Analysis of the geographical activity
  - http://arxiv.org/abs/1407.2535
  - http://www.cs.bham.ac.uk/~musolesm/papers/icwsm14_github.pdf
Smargit Dataset creation
By Jose María Alvarez