Creation of a dataset from GitHub

 
25th of August, 2014
Wroclaw, Poland
 

Objective

Create a dataset containing information about software development

What is GitHub?

  • Social coding
  • Social network
  • Development of software components

Data in GitHub?

  • ~Static
    • The social network (users & repositories)
  • Dynamic
    • Events of every user & repository

GitHub Data Model

More specifically...

  • Users
    • Repositories
      • Issues
      • Milestones
      • Commits
      • Collaborators
      • Downloads
      • Pull-requests
      • Labels
      • Comments
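The hierarchy above could be held in memory roughly as follows. This is a minimal sketch, not the project's actual classes: the names `DataModelSketch`, `User`, `Repository` and `addRepository` are hypothetical, and every child entity is reduced to a plain string identifier for brevity.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory mirror of the Users > Repositories > ... hierarchy.
class DataModelSketch {

    // One repository and the eight kinds of child entities listed on the slide.
    static class Repository {
        final String name;
        final List<String> issues = new ArrayList<>();
        final List<String> milestones = new ArrayList<>();
        final List<String> commits = new ArrayList<>();        // e.g. commit SHAs
        final List<String> collaborators = new ArrayList<>();  // e.g. user logins
        final List<String> downloads = new ArrayList<>();
        final List<String> pullRequests = new ArrayList<>();
        final List<String> labels = new ArrayList<>();
        final List<String> comments = new ArrayList<>();

        Repository(String name) {
            this.name = name;
        }
    }

    // One user and the repositories attached to that user.
    static class User {
        final String login;
        final List<Repository> repositories = new ArrayList<>();

        User(String login) {
            this.login = login;
        }

        // Creates a repository, attaches it to this user and returns it.
        Repository addRepository(String name) {
            Repository repository = new Repository(name);
            repositories.add(repository);
            return repository;
        }
    }
}
```

Keeping commits as one list per repository also hints at why commit-heavy repositories can become a memory concern.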

GitHub Events (23 types listed)

  • IssuesEvent
  • MemberEvent
  • PageBuildEvent
  • PublicEvent
  • PullRequestEvent
  • PullRequestReviewCommentEvent
  • PushEvent
  • ReleaseEvent
  • StatusEvent
  • TeamAddEvent
  • WatchEvent
  • CommitCommentEvent
  • CreateEvent
  • DeleteEvent
  • DeploymentEvent
  • DeploymentStatusEvent
  • DownloadEvent
  • FollowEvent
  • ForkEvent
  • ForkApplyEvent
  • GistEvent
  • GollumEvent
  • IssueCommentEvent

How to access data?

Creation of the GitHub dataset

  • Using the GitHub API library for Java
    • Users and repositories
    • Repositories: commits, collaborators, downloads, issues, labels and milestones
    • Serialization:
      • CSV files
      • Neo4j as a graph
  • Also supports querying Google BigQuery from Java
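As an illustration of the CSV serialization step, here is a minimal sketch using only the Java standard library. It is an assumption about how rows might be written, not the project's code: `CsvSketch`, `escape`, `toCsvLine` and the column names are hypothetical. Fields are quoted in the RFC 4180 style, which matters for free-text values such as names or commit messages.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper that turns a list of field values into one CSV line.
class CsvSketch {

    // Quote a field if it contains a comma, a double quote or a line break;
    // embedded double quotes are doubled, as in RFC 4180.
    static String escape(String field) {
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        if (!needsQuotes) {
            return field;
        }
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Join the escaped fields with commas (no trailing line break).
    static String toCsvLine(List<String> fields) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) {
                line.append(',');
            }
            line.append(escape(fields.get(i)));
        }
        return line.toString();
    }

    public static void main(String[] args) {
        // Hypothetical header and row for a users file.
        System.out.println(toCsvLine(Arrays.asList("login", "name", "public_repos")));
        System.out.println(toCsvLine(Arrays.asList("octocat", "Cat, Octo", "8")));
    }
}
```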

My user

...more nodes...

Some stats & limitations

  • 8,526,145 potential users (an upper bound; the actual number is lower)
    • 400,000 users (logins have been gathered)
  • 135 users are fully described
    • Some memory problems (because of commits)...
    • Neo4j in standalone mode (no server)
  • 5,000 requests/hour per authenticated user
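To see why the rate limit dominates, a back-of-the-envelope estimate helps. The sketch below assumes just one API request per user, which is a lower bound since fully describing a user takes many more. The class `CrawlTimeEstimate` and the `tokens` parameter (number of authenticated users crawling in parallel) are hypothetical.

```java
// Rough crawl-time estimate under the API rate limit.
class CrawlTimeEstimate {

    // Figures taken from the slide.
    static final long POTENTIAL_USERS = 8_526_145L;
    static final long REQUESTS_PER_HOUR = 5_000L;

    // Hours needed to issue users * requestsPerUser requests when `tokens`
    // authenticated users each get REQUESTS_PER_HOUR requests per hour.
    static double hoursNeeded(long users, long requestsPerUser, long tokens) {
        long totalRequests = users * requestsPerUser;
        return (double) totalRequests / (REQUESTS_PER_HOUR * tokens);
    }

    public static void main(String[] args) {
        double hours = hoursNeeded(POTENTIAL_USERS, 1, 1);
        System.out.printf("%.0f hours (about %.0f days)%n", hours, hours / 24);
    }
}
```

With a single authenticated user and one request per user, this gives roughly 1,705 hours, or about 71 days, before any repository, commit or issue data is fetched.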

What's next?

  • Establish which dataset is more relevant...
    • Static
    • Dynamic (events)
    • Both
    • ...full creation! (it takes time...)
  • Relational machine learning model
    • Multimode
    • Timestamps on most events and entities
  • Heuristics to determine the "best projects"
    • Followers
    • Downloads
    • Pull-requests
    • Commits
    • ...
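One possible shape for such a heuristic is a weighted sum of log-scaled counts. The sketch below is only an illustration: the class `ProjectScore`, the weights and the log scaling are assumptions, not a method proposed in these slides.

```java
// Hypothetical "best project" score built from the four signals listed above.
class ProjectScore {

    // Hypothetical weights; they would need tuning against real data.
    static final double W_FOLLOWERS = 0.40;
    static final double W_DOWNLOADS = 0.20;
    static final double W_PULL_REQUESTS = 0.25;
    static final double W_COMMITS = 0.15;

    // log1p(x) = ln(1 + x) dampens very large counts, so a single huge
    // signal does not dominate the score, and maps a count of 0 to 0.
    static double score(long followers, long downloads, long pullRequests, long commits) {
        return W_FOLLOWERS * Math.log1p(followers)
                + W_DOWNLOADS * Math.log1p(downloads)
                + W_PULL_REQUESTS * Math.log1p(pullRequests)
                + W_COMMITS * Math.log1p(commits);
    }

    public static void main(String[] args) {
        // Hypothetical counts for two projects.
        System.out.println(score(1200, 300, 45, 980));
        System.out.println(score(15, 0, 2, 60));
    }
}
```

Ranking repositories by this score in descending order yields a candidate list of "best projects"; further signals can be added as extra weighted terms.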

Related Work

  • SAGH: A Social Analysis tool for GitHub
    • A developer recommendation system (and other metrics)
      • http://snap.stanford.edu/class/cs224w-2012/projects/cs224w-015-final.pdf
  • An Analysis of GitHub’s Collaborative Software Network
    • Link Prediction
      • http://snap.stanford.edu/class/cs224w-2012/projects/cs224w-050-final.v01.pdf
  • Network Structure of Social Coding in GitHub
    • http://www.mysmu.edu/faculty/lxjiang/papers/csmr13github.pdf
    • http://ink.library.smu.edu.sg/cgi/viewcontent.cgi?article=2686&context=sis_research
  • Coding Together at Scale: GitHub as a Collaborative Social Network
    • Analysis of the geographical activity
      • http://arxiv.org/abs/1407.2535
      • http://www.cs.bham.ac.uk/~musolesm/papers/icwsm14_github.pdf

Smargit Dataset creation

By Jose María Alvarez
