Creation of a dataset from GitHub
Jose María Alvarez-Rodríguez
25th of August, 2014
Wroclaw, Poland
Objective
Creation of a dataset containing data and information about software development
What is GitHub?
Social coding
Social network
Development of software components
Data in GitHub?
~Static
The social network (users & repositories)
Dynamic
Events of every user & repository
GitHub Data Model
http://www.dmst.aueb.gr/dds/pubs/conf/2012-MSR-GitHub/html/github-mirror.html
More specifically...
Users
Repositories
Issues
Milestones
Commits
Collaborators
Downloads
Pull-requests
Labels
Comments
GitHub Events (18 types)
IssuesEvent
MemberEvent
PageBuildEvent
PublicEvent
PullRequestEvent
PullRequestReviewCommentEvent
PushEvent
ReleaseEvent
StatusEvent
TeamAddEvent
WatchEvent
CommitCommentEvent
CreateEvent
DeleteEvent
DeploymentEvent
DeploymentStatusEvent
DownloadEvent
FollowEvent
ForkEvent
ForkApplyEvent
GistEvent
GollumEvent
IssueCommentEvent
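A minimal sketch of how events such as those listed above could be tallied by type. It assumes each event is a JSON object on its own line with a "type" field; the function name count_event_types and the sample records are made up for illustration and are not real GitHub data.

```python
import json
from collections import Counter


def count_event_types(lines):
    """Tally events by their "type" field; blank lines are skipped."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        # Events without a "type" field are grouped under "Unknown".
        counts[event.get("type", "Unknown")] += 1
    return counts


# Hypothetical sample records, not real GitHub data.
SAMPLE_EVENTS = [
    '{"type": "PushEvent", "created_at": "2014-08-25T10:00:00Z"}',
    '{"type": "WatchEvent", "created_at": "2014-08-25T10:01:00Z"}',
    '{"type": "PushEvent", "created_at": "2014-08-25T10:02:00Z"}',
]

if __name__ == "__main__":
    for event_type, n in count_event_types(SAMPLE_EVENTS).most_common():
        print(event_type, n)
```

The same function could be fed the lines of a decompressed event archive, provided the file really holds one JSON object per line.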
How to access data?
GitHub API in different programming languages
Direct JSON requests
https://api.github.com/
GitHub Archive (daily events)
http://www.githubarchive.org/
Google BigQuery
https://bigquery.cloud.google.com/table/githubarchive:github.timeline
https://bigquery.cloud.google.com/table/publicdata:samples.github_nested
https://bigquery.cloud.google.com/table/publicdata:samples.github_timeline
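As an illustration of the "direct JSON requests" option, here is a minimal sketch using only the Python standard library. It assumes the public /users/{login} endpoint of https://api.github.com/ and keeps just three fields of the response; the helper names, the example login "octocat" and the choice of fields are illustrative assumptions, not part of the original tooling (which used Java).

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def build_user_request(login, token=None):
    """Build (but do not send) a GET request for a user's public profile."""
    request = urllib.request.Request(f"{API_ROOT}/users/{login}")
    request.add_header("Accept", "application/vnd.github.v3+json")
    if token:
        # Assumption: a personal access token is sent for authenticated access.
        request.add_header("Authorization", f"token {token}")
    return request


def parse_user(body):
    """Keep only a few fields from a user JSON document."""
    data = json.loads(body)
    return {key: data.get(key) for key in ("login", "id", "followers")}


def fetch_user(login, token=None):
    """Send the request and parse the reply; needs network access."""
    with urllib.request.urlopen(build_user_request(login, token)) as response:
        return parse_user(response.read().decode("utf-8"))


if __name__ == "__main__":
    # "octocat" is only an example login.
    print(build_user_request("octocat").full_url)
```

Building the request separately from sending it makes the URL and headers easy to inspect without spending any of the hourly request quota.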
Creation of the GitHub dataset
Using the GitHub API library for Java
Users and repositories
Repositories: commits, collaborators, downloads, issues, labels and milestones
Serialization:
CSV files
Neo4j as a graph
Also supports querying Google BigQuery from Java
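The CSV serialization step could look roughly like the sketch below, which writes one table of nodes and one table of edges, a layout that graph tools can usually import. The column names, the "FOLLOWS" relation label and the sample logins are assumptions for illustration; the original implementation was written in Java and its actual file layout is not shown in the slides.

```python
import csv
import io


def write_nodes_and_edges(users, follows, nodes_out, edges_out):
    """Write users as a node table and follow relations as an edge table."""
    node_writer = csv.writer(nodes_out)
    node_writer.writerow(["login", "type"])
    for login in users:
        node_writer.writerow([login, "User"])

    edge_writer = csv.writer(edges_out)
    edge_writer.writerow(["source", "target", "relation"])
    for follower, followed in follows:
        edge_writer.writerow([follower, followed, "FOLLOWS"])


if __name__ == "__main__":
    # In-memory buffers stand in for real files; the logins are made up.
    nodes, edges = io.StringIO(), io.StringIO()
    write_nodes_and_edges(["alice", "bob"], [("alice", "bob")], nodes, edges)
    print(nodes.getvalue())
    print(edges.getvalue())
```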
My user
...more nodes...
Some stats & limitations
8,526,145 potential users (the actual number is lower)
400,000 user logins gathered
135 users are fully described
Some memory problems (because of commits)...
Neo4j in standalone mode (no server)
5,000 requests/hour per authenticated user
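To get a feel for what the rate limit means for a full crawl, the arithmetic can be sketched as follows. The two constants come from the figures above; the assumption of one request per user is a deliberate simplification (describing a user fully, with repositories and commits, takes many more), so the result is a lower bound.

```python
import math

REQUESTS_PER_HOUR = 5000      # limit per authenticated user (from the slide)
POTENTIAL_USERS = 8_526_145   # upper bound on the number of users (from the slide)


def hours_needed(total_requests, per_hour=REQUESTS_PER_HOUR):
    """Whole hours needed to issue total_requests under an hourly quota."""
    return math.ceil(total_requests / per_hour)


def crawl_estimate(users, requests_per_user=1):
    """Return (hours, days) for a crawl; requests_per_user is an assumption."""
    hours = hours_needed(users * requests_per_user)
    return hours, hours / 24


if __name__ == "__main__":
    hours, days = crawl_estimate(POTENTIAL_USERS)
    print(f"{hours} hours, about {days:.0f} days with a single account")
```

With one request per user and a single authenticated account, this comes to 1,706 hours, or roughly 71 days.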
What's next?
Establish which dataset is more relevant...
Static
Dynamic (events)
Both
...full creation! (it takes time...)
Relational machine learning model
Multimode
Timestamps in most events and entities
Heuristics to determine the "best projects"
Followers
Downloads
Pull-requests
Commits
...
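One possible shape for such a heuristic is a weighted sum of the indicators listed above, each scaled by the largest value seen in the dataset. This is only a sketch: the weights, the RepoStats class and the sample figures are hypothetical, since the slides do not specify how the indicators should be combined.

```python
from dataclasses import dataclass


@dataclass
class RepoStats:
    name: str
    followers: int
    downloads: int
    pull_requests: int
    commits: int


# Hypothetical weights; the slides do not specify any.
WEIGHTS = {"followers": 0.4, "downloads": 0.2, "pull_requests": 0.2, "commits": 0.2}


def score(repo, maxima):
    """Weighted sum of indicators, each scaled by the maximum in the set."""
    total = 0.0
    for field, weight in WEIGHTS.items():
        top = maxima[field]
        if top > 0:  # skip indicators that are zero for every repository
            total += weight * getattr(repo, field) / top
    return total


def rank(repos):
    """Return repositories sorted from highest to lowest score."""
    maxima = {f: max((getattr(r, f) for r in repos), default=0) for f in WEIGHTS}
    return sorted(repos, key=lambda r: score(r, maxima), reverse=True)


if __name__ == "__main__":
    # Made-up figures for two imaginary repositories.
    repos = [RepoStats("b", 10, 1, 1, 5), RepoStats("a", 100, 10, 5, 50)]
    print([r.name for r in rank(repos)])
```

Scaling by the maximum keeps a single large indicator, such as commit counts, from dominating the others purely because of its magnitude.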
Related Work
SAGH: A Social Analysis tool for GitHub
A developer recommendation system (and other metrics)
http://snap.stanford.edu/class/cs224w-2012/projects/cs224w-015-final.pdf
An Analysis of GitHub’s Collaborative Software Network
Link Prediction
http://snap.stanford.edu/class/cs224w-2012/projects/cs224w-050-final.v01.pdf
Network Structure of Social Coding in GitHub
http://www.mysmu.edu/faculty/lxjiang/papers/csmr13github.pdf
http://ink.library.smu.edu.sg/cgi/viewcontent.cgi?article=2686&context=sis_research
Coding Together at Scale: GitHub as a Collaborative Social Network
Analysis of the geographical activity
http://arxiv.org/abs/1407.2535
http://www.cs.bham.ac.uk/~musolesm/papers/icwsm14_github.pdf