by Amir Saboury
The Promises and Perils of Mining GitHub
Eirini Kalliamvakou
University of Victoria
ikaliam@uvic.ca
Leif Singer
University of Victoria
lsinger@uvic.ca
Georgios Gousios
Delft University of Technology
G.Gousios@tudelft.nl
Daniel M. German
University of Victoria
dmg@uvic.ca
Kelly Blincoe
University of Victoria
kblincoe@acm.org
Daniela Damian
University of Victoria
danielad@cs.uvic.ca
Integrated social features
Issues
Pull Requests
Code Reviews
...
Accessible API
Quantitative & Qualitative analysis
GHtorrent
In-depth analysis of 434 projects
A repository is not necessarily a project.
4,111 base repos have been forked at least 100 times.
Consider the activity in both the base & all forked repositories.
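A minimal sketch of treating a base repository and its forks as one project. The input shapes (`forked_from` mapping each repository to its parent, `commits_by_repo` mapping each repository to a set of commit SHAs) and the function names are hypothetical, chosen for illustration only.

```python
from collections import defaultdict

def root_of(repo_id, forked_from):
    """Follow fork links upward until a repository with no parent is reached."""
    seen = set()
    while forked_from.get(repo_id) is not None and repo_id not in seen:
        seen.add(repo_id)  # guards against malformed, cyclic fork data
        repo_id = forked_from[repo_id]
    return repo_id

def project_commits(commits_by_repo, forked_from):
    """Union the commit SHAs of each base repository and all of its forks."""
    projects = defaultdict(set)
    for repo_id, shas in commits_by_repo.items():
        projects[root_of(repo_id, forked_from)].update(shas)
    return dict(projects)

forked_from = {"base": None, "fork1": "base", "fork2": "fork1"}
commits_by_repo = {
    "base": {"a", "b"},
    "fork1": {"a", "b", "c"},  # forks copy the base history
    "fork2": {"d"},
}
merged = project_commits(commits_by_repo, forked_from)
```

Using sets means that history copied into a fork is counted once per project rather than once per repository.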
Most projects have very few commits.
Most projects are inactive.
Cumulative ratio of active projects during the last n months since Jan 9, 2014.
46% of projects have been inactive in the last 6 months.
Only 13% of projects were active in the last month &
1/3 of them were created during that period.
The median number of days a project is active
25% of projects are active for 100 or more days; 32% have less than one day of activity.
Consider the number of recent commits & pull requests.
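One way to apply this advice is to count commits and pull requests inside a recent time window. The sketch below assumes the event timestamps are already available as `datetime` objects; the function names, the 180-day window, and the `min_events` threshold are illustrative assumptions, not values from the study.

```python
from datetime import datetime, timedelta

def recent_activity(commit_dates, pr_dates, as_of, window_days=180):
    """Count commits and pull requests in the window that ends at as_of."""
    cutoff = as_of - timedelta(days=window_days)
    commits = sum(1 for d in commit_dates if cutoff <= d <= as_of)
    prs = sum(1 for d in pr_dates if cutoff <= d <= as_of)
    return commits, prs

def is_active(commit_dates, pr_dates, as_of, window_days=180, min_events=1):
    """Treat a project as active if it has enough recent events (assumed rule)."""
    commits, prs = recent_activity(commit_dates, pr_dates, as_of, window_days)
    return commits + prs >= min_events

as_of = datetime(2014, 1, 9)
commit_dates = [datetime(2013, 12, 1), datetime(2012, 1, 1)]
pr_dates = []
counts = recent_activity(commit_dates, pr_dates, as_of)
```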
A large portion of repositories are not for software development.
experimentation, hosting websites, academic/class projects
specifically for storage.
Review the description & README file.
2/3 of projects are personal.
primarily for their own projects
commits have an author who is not its committer.
The number of committers should be considered.
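A small sketch of the committer check, assuming each commit is available as a dictionary with hypothetical `author` and `committer` keys. The rule that a single committer signals a personal project is an illustrative assumption.

```python
def committer_stats(commits):
    """Return (distinct committers, commits whose author differs from the committer)."""
    committers = {c["committer"] for c in commits}
    differing = sum(1 for c in commits if c["author"] != c["committer"])
    return len(committers), differing

def looks_personal(commits):
    """Assumed rule of thumb: at most one distinct committer."""
    n_committers, _ = committer_stats(commits)
    return n_committers <= 1

commits = [
    {"author": "alice", "committer": "alice"},
    {"author": "bob", "committer": "alice"},  # patch by bob, applied by alice
]
stats = committer_stats(commits)
```

Counting committers rather than authors matters here: the second commit has an outside author, yet only one person ever writes to the repository.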
Valuable source of data
for the study of code reviews.
Only a fraction of projects use pull requests.
collaborative projects
used the pull request model at least once
2.4 million use the shared repository model
The number of pull requests must be considered when researching the code review process.
GitHub records only the commits that are the result of the peer review.
Although GitHub does record the intermediate commits, it does not report them through the API as part of the pull request.
Do not rely on the commits reported by GitHub.
Most pull requests appear as non-merged.
even if they are actually merged.
44%
of all the PRs are reported as merged.
At least one of the commits in the PR appears in the project's master branch.
A commit closes the PR through its log message, and that commit appears in the master branch.
One of the last 3 discussion comments contains a commit's unique ID that appears in the project's master branch, and the comment text matches (?:merg|appl|pull|push|integrat)(?:ing|i?ed)
The latest comment prior to closing the pull request matches the regular expression above.
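The four heuristics above can be sketched as follows. The data shapes are hypothetical: `pr` is a dictionary with `number`, `commits` (SHAs), and `comments` (texts in chronological order), and `master_log` maps each master-branch SHA to its log message. The closing-keyword pattern, the abbreviated-SHA matching, and the case-insensitive flag are assumptions; only the merge-word regular expression comes from the slides.

```python
import re

# Regular expression from the slides; case-insensitive matching is an assumption.
MERGE_WORDS = re.compile(r"(?:merg|appl|pull|push|integrat)(?:ing|i?ed)", re.IGNORECASE)
# Assumed shape of a full or abbreviated commit SHA inside a comment.
SHA_TOKEN = re.compile(r"\b[0-9a-f]{7,40}\b")
# Assumed closing keywords in a commit log, e.g. "closes #7".
CLOSES = re.compile(r"\b(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?)\s+#(\d+)\b", re.IGNORECASE)

def in_master(token, master_log):
    """True if token is a prefix of some SHA on the master branch."""
    return any(sha.startswith(token) for sha in master_log)

def merge_heuristic(pr, master_log):
    """Return the number (1 to 4) of the first heuristic that fires, else None."""
    # 1. At least one commit of the PR appears in master.
    if any(sha in master_log for sha in pr["commits"]):
        return 1
    # 2. A commit in master closes the PR through its log message.
    for message in master_log.values():
        if any(int(n) == pr["number"] for n in CLOSES.findall(message)):
            return 2
    # 3. One of the last 3 comments names a master commit and uses a merge word.
    for comment in pr["comments"][-3:]:
        tokens = SHA_TOKEN.findall(comment)
        if MERGE_WORDS.search(comment) and any(in_master(t, master_log) for t in tokens):
            return 3
    # 4. The latest comment before closing uses a merge word.
    if pr["comments"] and MERGE_WORDS.search(pr["comments"][-1]):
        return 4
    return None

master_log = {
    "a1b2c3d4e5f6a7b8c9d0a1b2c3d4e5f6a7b8c9d0": "Add parser",
    "ffffffffffffffffffffffffffffffffffffffff": "Tidy up, closes #7",
}
pr = {"number": 3, "commits": ["1111111"], "comments": ["Thanks!", "Merged in a1b2c3d"]}
result = merge_heuristic(pr, master_log)
```

The heuristics are tried from strongest evidence to weakest, so the returned number also indicates how much to trust the classification.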
79%
37%
of all the PRs in the sample projects are reported as merged.
Do not rely on GitHub's merge status.
Interlinking of developers, pull requests, issues & commits.
GitHub automatically extracts references and presents them as part of the discussion flow.
Many active projects do not conduct all their software development in GitHub.
Avoid projects that have a high number of non-registered committers & mirror projects.
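A rough sketch of this filter, assuming each committer is a dictionary whose hypothetical `login` key is `None` when no GitHub account is linked. The 0.5 threshold and the check for the word "mirror" in the description are illustrative assumptions, not criteria from the study.

```python
def unregistered_ratio(committers):
    """Fraction of committers that have no linked GitHub account."""
    if not committers:
        return 0.0
    unregistered = sum(1 for c in committers if c.get("login") is None)
    return unregistered / len(committers)

def developed_elsewhere(committers, description, max_ratio=0.5):
    """Flag projects that are likely mirrors or mostly developed outside GitHub."""
    is_mirror = "mirror" in (description or "").lower()
    return is_mirror or unregistered_ratio(committers) > max_ratio

committers = [
    {"email": "a@example.org", "login": "alice"},
    {"email": "b@example.org", "login": None},
    {"email": "c@example.org", "login": None},
]
ratio = unregistered_ratio(committers)
```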
Thank you