Promises & Perils of Mining GitHub
by Amir Saboury
The Promises and Perils of Mining GitHub
Eirini Kalliamvakou
University of Victoria
ikaliam@uvic.ca
Leif Singer
University of Victoria
lsinger@uvic.ca
Georgios Gousios
Delft University of Technology
G.Gousios@tudelft.nl
Daniel M. German
University of Victoria
dmg@uvic.ca
Kelly Blincoe
University of Victoria
kblincoe@acm.org
Daniela Damian
University of Victoria
danielad@cs.uvic.ca
Why do we care?
- Over 10 million git repositories
- Integrated social features
  - Issues
  - Pull Requests
  - Code Reviews
  - ...
- Accessible API
How?
- Surveyed GitHub users, conducted interviews
- Quantitative & qualitative analysis
- GHTorrent
- In-depth analysis of 434 projects
Peril I
A repository is not necessarily a project.
4,111 base repos have been forked at least 100 times.
Avoidance Strategy
Consider the activity in both the base & all forked repositories.
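A minimal sketch of this strategy, assuming fork relationships and per-repository commit counts have already been extracted (for example from a GHTorrent dump). The function name, the input dictionaries, and the sample repositories are hypothetical and only illustrate the idea of rolling fork activity up to the base repository.

```python
from collections import defaultdict

def commits_per_project(forked_from, commit_counts):
    """Sum commit counts over each base repository and all of its forks.

    forked_from: dict mapping repo name -> parent repo name (None for a base repo).
    commit_counts: dict mapping repo name -> commits made in that repo.
    Both structures are assumptions for this sketch, not a GitHub API format.
    """
    def base_of(name):
        # Follow fork links upward until a base repository is reached.
        seen = set()
        while forked_from.get(name) is not None and name not in seen:
            seen.add(name)
            name = forked_from[name]
        return name

    totals = defaultdict(int)
    for name in forked_from:
        totals[base_of(name)] += commit_counts.get(name, 0)
    return dict(totals)

# Hypothetical sample data: one project with two forks, one standalone repo.
forked_from = {
    "alice/tool": None,
    "bob/tool": "alice/tool",
    "carol/tool": "bob/tool",
    "dave/site": None,
}
commit_counts = {"alice/tool": 40, "bob/tool": 5, "carol/tool": 2, "dave/site": 3}
project_totals = commits_per_project(forked_from, commit_counts)
```

Treating the base repository plus its forks as one project keeps fork activity from being counted as separate projects.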
Peril II
Most projects have very few commits.
- Median commits per project: 6
- 90% have fewer than 50 commits.
Peril III
Most projects are inactive.
Cumulative ratio of active projects during the last n months since Jan 9, 2014.
46% of projects have been inactive in the last 6 months.
Only 13% of projects were active in the last month, and 1/3 of them were created during that period.
The median number of days a project is active: 9.9 days
25% of projects were active for 100 or more days; 32% were active for less than one day.
Avoidance Strategy
Consider the number of recent commits & pull requests.
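One way to apply this check is sketched below, assuming the commit and pull request timestamps of a project are already available as datetime objects. The function name, the 180-day default window, and the minimum event count are illustrative assumptions, not thresholds taken from the study.

```python
from datetime import datetime, timedelta

def is_active(event_dates, as_of, window_days=180, min_events=1):
    """Return True if at least min_events commits or pull requests
    fall inside the window of window_days ending at as_of."""
    cutoff = as_of - timedelta(days=window_days)
    recent = [d for d in event_dates if cutoff <= d <= as_of]
    return len(recent) >= min_events

# Hypothetical project with two events, checked as of Jan 9, 2014.
as_of = datetime(2014, 1, 9)
events = [datetime(2013, 3, 1), datetime(2013, 12, 20)]

active_last_6_months = is_active(events, as_of)
active_last_10_days = is_active(events, as_of, window_days=10)
```

The window and threshold should be chosen to fit the research question; a stricter `min_events` filters out projects with only a single recent commit.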
Peril IV
A large portion of repositories are not for software development.
14% for experimentation, hosting websites, academic/class projects
10% specifically for storage
Avoidance Strategy
Review the description & README file.
Peril V
2/3 of projects are personal.
38% of survey respondents use GitHub primarily for their own projects.
2.9% of commits have an author who is not the committer.
Avoidance Strategy
The number of committers should be considered.
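A rough sketch of how the committer count could be computed, assuming each commit has been reduced to an (author, committer) pair of identifiers. The function name and the sample data are hypothetical.

```python
def committer_stats(commits):
    """Return (number of distinct committers, share of commits whose
    author differs from the committer) for (author, committer) pairs."""
    commits = list(commits)
    committers = {committer for _, committer in commits}
    differing = sum(1 for author, committer in commits if author != committer)
    share = differing / len(commits) if commits else 0.0
    return len(committers), share

# Hypothetical history: four commits, one authored by someone other than the committer.
history = [("ann", "ann"), ("ann", "ann"), ("bob", "ann"), ("cat", "cat")]
n_committers, differing_share = committer_stats(history)
```

A project with a single distinct committer is a candidate personal project and can be filtered out or analyzed separately.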
Promise I
Valuable source of data for the study of code reviews.
Peril VI
Only a fraction of projects use pull requests.
2.6M collaborative projects
10% used the pull request model at least once
2.4M use the shared repository model
Avoidance Strategy
The number of pull requests must be considered when researching the code review process.
Peril VII
GitHub records only the commits that are the result of the peer review.
Although GitHub does record the intermediate commits, it does not report them through the API as part of the pull request.
Avoidance Strategy
Do not rely on the commits reported by GitHub.
Peril VIII
Most pull requests appear as non-merged, even if they are actually merged.
A pull request can be merged in three ways:
- Through GitHub facilities, using the merge button.
- Using git, by merging the main repository branch and the pull request branch.
- By creating a textual patch between the pull request and main repository branches and applying it to the master branch.
44% of all the PRs are reported as merged.
H1: At least one of the commits in the PR appears in the project's master branch.
H2: A commit closes the PR through its log message, and that commit appears in the master branch.
H3: One of the last 3 discussion comments contains a commit unique ID that appears in the project's master branch, and the comment text matches the regular expression
(?:merg|appl|pull|push|integrat)(?:ing|i?ed)
H4: The latest comment prior to closing the pull request matches the regular expression above.
H1 (32%), H2 (1%), H3 (5%), H4 (4%)
37% of all the PRs in the sample projects are reported as merged.
79% appear to be merged once the 42% identified by the heuristics are added.
Avoidance Strategy
Do not rely on GitHub's merge status.
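Heuristics H3 and H4 can be sketched as follows. This is an illustration only, not the authors' implementation: the function names, the case-insensitive matching, and the pattern used to spot commit IDs (7 to 40 hexadecimal characters) are assumptions. Only the merge-keyword regular expression comes from the slides.

```python
import re

# Merge-keyword pattern from the heuristics; IGNORECASE is an assumption.
MERGE_RE = re.compile(r"(?:merg|appl|pull|push|integrat)(?:ing|i?ed)", re.IGNORECASE)
# Assumed shape of a (possibly abbreviated) commit ID inside a comment.
SHA_RE = re.compile(r"\b[0-9a-f]{7,40}\b")

def matches_h3(comments, master_shas):
    """H3: one of the last 3 comments matches the merge pattern and
    mentions a commit ID that appears in the master branch."""
    for text in comments[-3:]:
        if not MERGE_RE.search(text):
            continue
        for sha in SHA_RE.findall(text.lower()):
            if any(full.startswith(sha) for full in master_shas):
                return True
    return False

def matches_h4(comments):
    """H4: the latest comment before closing matches the merge pattern."""
    return bool(comments) and bool(MERGE_RE.search(comments[-1]))

# Hypothetical discussion thread and master-branch commit IDs.
thread = ["Looks good", "Could you rebase?", "Merged in 1a2b3c4d, thanks!"]
master = {"1a2b3c4d" + "0" * 32}

h3_result = matches_h3(thread, master)
h4_result = matches_h4(thread)
```

Because H4 relies on comment text alone, it can produce false positives (for example, a comment that says a change was not merged), which is why the accuracy of the heuristics is listed as a threat to validity.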
Promise II
Interlinking of developers, pull requests, issues & commits.
GitHub automatically extracts references and presents them as part of the discussion flow.
Peril IX
Many active projects do not conduct all their software development in GitHub.
Avoidance Strategy
Avoid mirror projects and projects that have a high number of committers who are not registered GitHub users.
Threats to Validity
- Small number of self-selected participants in the survey.
- Manual exploration of 434 projects.
- Reliability of the GHTorrent dataset.
- Accuracy of the heuristics.
Thank you