Promises & Perils of Mining GitHub

by Amir Saboury

The Promises and Perils of Mining GitHub

Eirini Kalliamvakou

 University of Victoria

ikaliam@uvic.ca

Leif Singer

 University of Victoria

lsinger@uvic.ca

Georgios Gousios

 Delft University of Technology

G.Gousios@tudelft.nl

Daniel M. German

 University of Victoria

dmg@uvic.ca

Kelly Blincoe

 University of Victoria

kblincoe@acm.org

Daniela Damian

 University of Victoria

danielad@cs.uvic.ca

Why do we care?

  • Over 10 million git repositories
  • Integrated social features

    • Issues

    • Pull Requests

    • Code Reviews

    • ...

  • Accessible API

How?

  • Surveyed GitHub users, conducted interviews
  • Quantitative & Qualitative analysis

    • GHtorrent

    • In-depth analysis of 434 projects

Peril I

A repository is not necessarily a project.

4,111 base repos have been forked at least 100 times.

Avoidance Strategy

Consider the activity in both the base & all forked repositories.
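A minimal sketch of this strategy, assuming each repository record carries a hypothetical `forked_from` field (None for a base repository) and a `commits` count of commits unique to that repository. These field names are illustrative and not taken from any particular dataset; forks of forks are not resolved here.

```python
from collections import defaultdict

def project_activity(repos):
    """Sum commit counts per project, where a project is a base repo plus its forks.

    `repos` is a list of dicts with hypothetical keys:
    'id', 'forked_from' (None for a base repo), and 'commits'.
    """
    totals = defaultdict(int)
    for repo in repos:
        # A fork is attributed to its base repository; a base repo to itself.
        base = repo["forked_from"] if repo["forked_from"] is not None else repo["id"]
        totals[base] += repo["commits"]
    return dict(totals)

# Made-up example data: repo 1 has two forks, repo 4 has none.
repos = [
    {"id": 1, "forked_from": None, "commits": 120},
    {"id": 2, "forked_from": 1, "commits": 15},
    {"id": 3, "forked_from": 1, "commits": 4},
    {"id": 4, "forked_from": None, "commits": 7},
]
activity = project_activity(repos)
```

Grouping by base repository treats the whole fork network as one unit of analysis, which is the point of the strategy.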

Peril II

Most projects have very few commits.

  • Median commits per project: 6
  • 90% have fewer than 50 commits.

Peril III

Most projects are inactive.

Cumulative ratio of active projects during the last n months since Jan 9, 2014.

46% of projects have been inactive in the last 6 months.

Only 13% of projects were active in the last month, and 1/3 of those were created during that period.

The median number of days a project is active

9.9 days

25% of projects were active for 100 or more days; 32% were active for less than one day.

Avoidance Strategy

Consider the number of recent commits & pull requests.
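One way such a filter might look, as a sketch: a project counts as active if it has at least `min_events` commits or pull requests within a recent window. The 180-day window, the `min_events` threshold, and the function name are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta

def is_active(event_dates, now, window_days=180, min_events=1):
    """Return True if enough commits/pull requests fall inside the recent window.

    `event_dates` holds the timestamps of a project's commits and pull requests.
    The window length and the threshold are illustrative assumptions.
    """
    cutoff = now - timedelta(days=window_days)
    recent = [d for d in event_dates if cutoff <= d <= now]
    return len(recent) >= min_events

# Made-up timestamps, evaluated relative to the Jan 9, 2014 reference date.
now = datetime(2014, 1, 9)
stale = [datetime(2012, 5, 1), datetime(2012, 5, 2)]
fresh = [datetime(2013, 11, 20), datetime(2013, 12, 30)]
```

Here `is_active(fresh, now)` holds while `is_active(stale, now)` does not; a study would tune both parameters to its own research question.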

Peril IV

A large portion of repositories are not for software development.

14%

experimentation, hosting websites, academic/class projects

10%

specifically for storage.

Avoidance Strategy

Review the description & README file.

Peril V

2/3 of projects are personal.

38%

primarily for their own projects

2.9%

of commits have an author who is not the committer.

Avoidance Strategy

The number of committers should be considered.
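As a rough sketch, the number of distinct committers can be counted from commit records and used to separate personal from collaborative projects. The `committer` key and the threshold of two committers are assumptions made for this example.

```python
def committer_count(commits):
    """Count distinct committers in a list of commit records.

    Each record is a dict with a hypothetical 'committer' key.
    """
    return len({c["committer"] for c in commits})

def is_collaborative(commits, min_committers=2):
    """Treat a project as collaborative if it has at least `min_committers` committers.

    The default threshold of 2 is an illustrative assumption.
    """
    return committer_count(commits) >= min_committers

# Made-up commit records.
personal = [{"committer": "alice"}, {"committer": "alice"}]
shared = [{"committer": "alice"}, {"committer": "bob"}, {"committer": "alice"}]
```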

Promise I

Valuable source of data

for the study of code reviews.

Peril VI

Only a fraction of projects use pull requests.

2.6m

collaborative projects

10%

used the pull request model at least once

2.4m use the shared repository model

Avoidance Strategy

The number of pull requests must be considered when researching the code review process.

Peril VII

GitHub records only the commits that are the result of the peer-review.

Although GitHub does record the intermediate commits, it does not report them through the API as part of the pull request.

Avoidance Strategy

Do not rely on the commits reported by GitHub.

Peril VIII

Most pull requests appear as non-merged, even if they are actually merged.

A pull request can be merged in three ways:

  1. Through GitHub facilities, using the merge button
  2. Using git, by merging the main repository branch and the pull request branch.
  3. By creating a textual patch between the pull request and main repository branches and applying it to the master branch.

44%

of all the PRs are reported as merged.

H1

At least one of the commits in the PR appears in the project's master branch.

H2

A commit closes the PR through its log message, and that commit appears in the master branch.

H3

One of the last 3 discussion comments contains a commit unique ID that appears in the project's master branch, and the comment matches the regular expression (?:merg|appl|pull|push|integrat)(?:ing|i?ed)

H4

The latest comment prior to closing the pull request matches the regular expression above.

H1 (32%), H2 (1%), H3 (5%), H4 (4%)
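The comment-based heuristics H3 and H4 could be sketched as follows. The merge regular expression is the one given above; the case-insensitive flag, the pattern used to spot commit IDs (7 to 40 lowercase hex characters), and the function names are assumptions for illustration, not details taken from the study.

```python
import re

# Regular expression from the heuristics; IGNORECASE is an assumption.
MERGE_RE = re.compile(r"(?:merg|appl|pull|push|integrat)(?:ing|i?ed)", re.IGNORECASE)
# Assumed pattern for full or abbreviated commit IDs.
SHA_RE = re.compile(r"\b[0-9a-f]{7,40}\b")

def h3(comments, master_shas):
    """H3: one of the last 3 comments matches MERGE_RE and names a commit on master."""
    for text in comments[-3:]:
        if not MERGE_RE.search(text):
            continue
        for sha in SHA_RE.findall(text):
            # Abbreviated IDs are matched as prefixes of full IDs on master.
            if any(full.startswith(sha) for full in master_shas):
                return True
    return False

def h4(comments):
    """H4: the latest comment before closing matches MERGE_RE."""
    return bool(comments) and MERGE_RE.search(comments[-1]) is not None

# Made-up example data.
master = {"3f2a9c1e5b7d4a6f8e0c1b2d3a4f5e6d7c8b9a0f"}
comments = ["Looks good", "Could you rebase?", "Merged in 3f2a9c1, thanks!"]
```

With this data both `h3(comments, master)` and `h4(comments)` flag the pull request as merged, while a closing comment such as "Closing, no longer needed" matches neither.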

79%

37%

of all the PRs in the sample projects are reported as merged.

Avoidance Strategy

Do not rely on GitHub's merge status.

Promise II

Interlinking of developers, pull requests, issues & commits.

GitHub automatically extracts references and presents them as part of the discussion flow.

Peril IX

Many active projects do not conduct all their software development in GitHub.

Avoidance Strategy

Avoid projects that have a high number of non-registered committers & mirror projects.
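A sketch of how the share of non-registered committers might be measured, assuming a list of committer identities per project and a set of identities known to be registered GitHub users. Both inputs, and any cut-off applied to the resulting ratio, are hypothetical.

```python
def non_registered_ratio(committers, registered_users):
    """Fraction of a project's distinct committers that are not registered users.

    `committers` is an iterable of committer identities (repeats allowed);
    `registered_users` is the set of identities known to be GitHub accounts.
    """
    unique = set(committers)
    if not unique:
        return 0.0
    return len(unique - set(registered_users)) / len(unique)

# Made-up example: two of four distinct committers have no GitHub account.
committers = ["alice", "bob", "svn-bot", "carol", "alice"]
registered = {"alice", "bob"}
ratio = non_registered_ratio(committers, registered)
```

A high ratio suggests that much of the development happens outside GitHub, for instance in a mirrored repository.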

Threats to Validity

  • Low number & self-selected participants in the survey.
  • Manual exploration of 434 projects.
  • Reliability of the GHtorrent dataset.
  • Accuracy of the heuristics.

Thank you
