Opensource
&
Tunisia
badreddine@ankaboot.fr
First, let me introduce myself
Are we contributors or consumers?
Study based on GitHub &
Stack Overflow
Key figures
- 26 million users (March 2017)
- 57 million repositories
- #61 on Alexa (October 2017)
- Yet, not open source itself
Key figures
- 8 million users (March 2017)
- #49 on Alexa (October 2017)
- CC-BY-SA 3.0
Tunisia on GitHub & Stack Overflow
- 1,931 users on GitHub
- 552 users on Stack Overflow
- Very few, yet constantly growing (it's up to you)
Top countries by number of pushes Aug 2016
Top countries by number of pushes / population
Tunisia
#1 in Africa
#4 in Arab World
(pop. ratio based ranking)
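The population-ratio-based ranking above can be sketched as follows; note that the push counts and populations below are hypothetical placeholders for illustration, not the actual August 2016 figures:

```python
# Rank countries by GitHub pushes per capita.
# NOTE: these numbers are made-up placeholders, not real 2016 data.
pushes = {"Tunisia": 60_000, "Egypt": 90_000, "Morocco": 50_000}
population = {"Tunisia": 11_000_000, "Egypt": 95_000_000, "Morocco": 35_000_000}

# Normalize each country's push count by its population.
per_capita = {country: pushes[country] / population[country] for country in pushes}

# Sort countries from highest to lowest pushes-per-capita.
ranking = sorted(per_capita, key=per_capita.get, reverse=True)
print(ranking)
```

Normalizing by population is what lets a small country like Tunisia rank ahead of much larger ones in absolute push counts.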
Let's have a deeper look inside and crunch the data...
Assumptions
- every country has the same ratio of open-source to non-free software (an "all things being equal" mindset), so we don't check licences (because we can't)
- GitHub users are relatively representative of the whole open-source community, except that they're more web-focused
- Tunisians on GitHub are identified by the location field on their GitHub profile
- a contribution = a merged pull request
- documentation, translation, and bug-fixing contributions are as valuable as code
- a repository with >1,000 stars = an important product
- value of a contribution = the number of the repository's stars
- score = sum of contribution values + Stack Overflow reputation
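The scoring model defined by these assumptions can be sketched as a small function; the data shapes here are simplified stand-ins for the real GitHub and Stack Overflow records:

```python
def contribution_score(merged_pr_repo_stars, so_reputation):
    """score = sum of the star counts of repositories where the user's
    pull requests were merged, plus Stack Overflow reputation."""
    return sum(merged_pr_repo_stars) + so_reputation

# A user with merged PRs in repos of 1,500 and 300 stars,
# plus 250 Stack Overflow reputation:
print(contribution_score([1500, 300], 250))  # → 2050
```

Because the score weights each merged pull request by the target repository's stars, a single contribution to a popular project outweighs many contributions to unknown ones.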
The Code
- GPLv3
- Work in progress
- python3
- GitHub: GraphQL API
- Stack Overflow: Stack Exchange query results scraping
- Storage: flat file CSV (>Cassandra?)
- Goal: gamify open-source contributions on GitHub & Stack Overflow using statistics
#!/usr/bin/env python3
"""
Contributors ranking based on their Github merged pull requests and Stackoverflow reputation
Goal: gamify contributions to opensource projects
"""
__author__ = "Bader LEJMI"
__version__ = "0.1.0"
__license__ = "GPLv3"
import requests
import string
import json
import atexit
import datetime
import csv
from collections import OrderedDict
DEFAULT_LOCATION = "Tunisia"
GITHUB_TOKEN = "XXXXXXXXXX"
GITHUB_GRAPHQL_FILENAME = "tunisians.graphql"
GITHUB_GRAPHQL_ENDPOINT = "https://api.github.com/graphql"
GITHUB_TOP_STARS = 1000
GITHUB_RANKING_FILENAME = "github_ranking.json"
GITHUB_FIRST_YEAR = 2007
STACKOVERFLOW_CSV_URL = "http://data.stackexchange.com/stackoverflow/csv/893492?CountryName=$location"
STACKOVERFLOW_CSV_FILENAME = "so.csv"
STACKOVERFLOW_CSV_HEADER = ['name', 'id', 'avatarUrl', 'url', 'reputation']
GITHUB_STACKOVERFLOW_KEYS = OrderedDict([('url','url'), ('websiteUrl','url'), ('name','name'), ('login','name')])
def _atexit_save(oRanking):
"""
agnostic autosave hooker
"""
oRanking.save()
"""class platformRanking:
pass
"""
class Ranking:
"""
union
"""
def __init__(self, github, stackoverflow):
self.so = stackoverflow
self.gh = github
self.users = self.gh.users
#self.join()
"""
#old costly algorithm
def join(self, keys=GITHUB_STACKOVERFLOW_KEYS):
joined = []
for u in range(len(self.users)):
for kgh,kso in keys.items():
#look for same value key by key in all stackoverflow users
for uso in self.so.users:
try:
if uso[kso] and self.users[u][kgh].lower() == uso[kso].lower():
self.users[u].update({k:v for k,v in uso.items() if not self.users[u].get(k)})
joined.append(self.users[u]['login'])
#print("kgh: %s, value: %s, uso: %s, user: %s" % (kgh, self.users[u][kgh],uso,self.users[u]))
#input()
break
except AttributeError:
pass
#exit loop if user already joined
if joined and joined[-1]==self.users[u]['login']:
break
return joined
"""
def join(self, keys=GITHUB_STACKOVERFLOW_KEYS):
"""
join github and stackoverflow users
"""
joined = []
joined_keys = {kgh:{} for kgh in keys}
"""
first index all GH users,
secondly test if SO user is joinable
then update
"""
i=0
len_users=len(self.users)
for u in self.users+self.so.users:
for kgh,kso in keys.items():
#all gh unindexed users with a value for key must be indexed
#try:
if i<len_users:
try:
joined_keys[u[kgh].lower()] = u
except AttributeError:
pass #sometimes u[kgh] is "null" in JSON so None in Python
except KeyError:
print(i,u)
input()
continue
ukso = u[kso].lower()
#then, only so users found in gh users are updated (only once)
try:
if ukso and joined_keys[ukso][kgh].lower()==ukso:
"""
HACK
SO name is sometimes a real name, sometimes a firstname, and sometimes a GH login due to OAuth SSO
therefore:
1) GH/SO name can't be used as a join id when it's just a firstname
2) SO name should be joined to GH login when it's relevant
Hence, my hypothesis:
I) GH name is usable as a join id when
a) SO name is a real name (>=2 words) OR
b) GH name is GH login (equal to GH login minus spaces forbidden in GH login)
too restrictive> II) GH login is usable as a join id when GH login is GH name (minus spaces & case)
"""
if kgh=='name' and (len(ukso.split(' '))<2 or ukso.replace(' ','')!=joined_keys[ukso]["login"].lower().replace(' ','')):
pass #do not join cause it's neither a complete name nor a login
#elif kgh=="login" and u["name"].lower().replace(' ','')!=joined_keys[ukso]["name"].lower().replace(' ',''):
# pass #do not join SO name with GH login if SO name different from GH name
else:#here we go^W join
update = {k:v for k,v in u.items() if v and not joined_keys[ukso].get(k)}
joined_keys[ukso].update(update)
joined.append(joined_keys[ukso])
break
except KeyError:
pass
i+=1
return joined
def build_stats(self):
scores = {}
for u in self.users:
score = u.get('reputation',0)+self.gh.score[u["login"]]
u["score"] = score
try:
scores[score].append(u["login"])
except KeyError:
scores[score] = [u["login"]]
return scores
class stackoverflowRanking:
"""
Stackoverflow ranking
"""
#Stackoverflow ranking
def __init__(self, location=DEFAULT_LOCATION, autoload=True, autosave=True):
self.users = []
self.location = location
if autoload:
try:
self.load()
except FileNotFoundError:
self.fetch()
else:
self.fetch()
if autosave:
atexit.register(_atexit_save, self)
def fetch(self, csv_url=STACKOVERFLOW_CSV_URL, header=STACKOVERFLOW_CSV_HEADER):
csv_url = csv_url.replace("$location", self.location)
data_csv = requests.get(csv_url).text.splitlines()
#data_csv = data_csv.splitlines()
self._fill(data_csv, header)
def load(self, csv_filename=STACKOVERFLOW_CSV_FILENAME, header=STACKOVERFLOW_CSV_HEADER):
self._fill(open(csv_filename), header)
def _fill(self, iterable, header):
reader = csv.DictReader(iterable, header)
next(reader) #skip headers
for r in reader:
self.users.append({
"id" : r["id"],
"name" : r["name"],
"avatarUrl": r["avatarUrl"],
"url": r["url"],
"reputation": int(r["reputation"])
})
def save(self, filename=STACKOVERFLOW_CSV_FILENAME, header=STACKOVERFLOW_CSV_HEADER):
"""
save in a proper CSV file
"""
so_csv = csv.DictWriter(open(filename, "w"), header)
so_csv.writerow({h:h for h in header})
for user in self.users:
so_csv.writerow(user)
def build_stats(self):
self.scores = {}
for u in self.users:
try:
self.scores[u['reputation']].add(u['id'])
except KeyError:
self.scores[u['reputation']] = set([u['id']])
class githubRanking:
"""
Github ranking
"""
def __init__(self, token=GITHUB_TOKEN, location=DEFAULT_LOCATION, autoload=True, autosave=True, api_endpoint=GITHUB_GRAPHQL_ENDPOINT, request=GITHUB_GRAPHQL_FILENAME):
self.users = []
self.token = token
self.graphql_request_file = request
self.graphql_api_endpoint = api_endpoint
#issue: search query is limited to 1000 results
#solution: paging year by year to limit results
#limit: if user creation is >1000 a year
#solution for the future: paging month by month
self.created_date_range = range(GITHUB_FIRST_YEAR, datetime.datetime.now().year+1)
self.graphql_variables = {
"location" : location,
"cursor" : '',
"created_date": self.created_date_range[0],
}
if autoload:
try:
self.load()
except FileNotFoundError:
self.fetch()
else:
self.fetch()
if autosave:
atexit.register(_atexit_save, self)
def _graphql_query(self, created_date, cursor):
"""
build an HTTP ready graphql query
"""
self.graphql_variables['cursor'] = cursor
self.graphql_variables['created_date'] = created_date
try:
graphql_query = string.Template(self.graphql_query_content)
except AttributeError:
self.graphql_query_content = open(self.graphql_request_file).read()
self.graphql_query_content = '{"query": "%s"}' % self.graphql_query_content.replace('"',"\\\"").replace('\n','\\n')
graphql_query = string.Template(self.graphql_query_content)
graphql_query_json = graphql_query.substitute(self.graphql_variables)
return graphql_query_json
def fetch(self):
"""
fetch all users data
"""
for created_date in self.created_date_range:
hasNextPage = True
cursor = ''
while hasNextPage:
#currently fetch only first 100 pull requests
data_json = requests.post(self.graphql_api_endpoint,
headers={'Authorization': "bearer %s " % self.token},
data=self._graphql_query(created_date, cursor))
try:
data = data_json.json()["data"]["location_users"]
except TypeError:
raise TypeError("GraphQL query: %s\nAnswer: %s" % (self._graphql_query(created_date, cursor), data_json.text))
self.users.extend([u['user'] for u in data["users"] if u['user']])
hasNextPage = data["pageInfo"]["hasNextPage"]
#print("created_date: %s %s, hasNextPage: %s, # users: %s" % (created_date, cursor, hasNextPage, len(data["users"])))
cursor = ', after: \\"%s\\"' % data["pageInfo"]["endCursor"]
return self.users
def build_stats(self):
"""
stats on repositories, users, languages
"""
self.repositories = {}
self.score = {}
self.languages = {}
for user in self.users:
try:
id = user["login"]
self.score[id] = 0
for pull in user["pullRequests"]["pullMerged"]:
repo = pull["repository"]
self.score[id] += repo["stargazers"]["totalCount"]
primaryLanguage = repo["primaryLanguage"]["name"] if repo["primaryLanguage"] else None
try:
self.languages[primaryLanguage] += repo["stargazers"]["totalCount"]
except KeyError:
self.languages[primaryLanguage] = repo["stargazers"]["totalCount"]
if not repo["name"] in self.repositories:
self.repositories[repo["name"]] = {
"stargazers": repo["stargazers"]["totalCount"],
"primaryLanguage": primaryLanguage,
"contributors": set([id])
}
user["repositories"] = set([repo["name"]])
else:
self.repositories[repo["name"]]["contributors"].add(id)
user["repositories"].add(repo["name"])
except KeyError:
pass
return self.score
def load(self, method="filesystem", filename=GITHUB_RANKING_FILENAME):
"""
load data from storage (current default and only method filesystem) in json format
"""
file = open(filename, "r")
self.users = json.load(file)
def save(self, filename=GITHUB_RANKING_FILENAME):
"""
save data in a storage (current default and only method: filesystem) in json format
"""
if not self.users:
return None
try:
file = open(filename, "x")
except FileExistsError:
file = open(filename, "w")
json.dump(self.users, file, indent="\t", ensure_ascii=False)
if __name__ == "__main__":
""" This is executed when run from the command line """
gh = githubRanking(autoload=True, autosave=False)
gh.build_stats()
so = stackoverflowRanking(autoload=True, autosave=False)
so.build_stats()
r = Ranking(github=gh, stackoverflow=so)
r.join()
scores_users = r.build_stats()
scores = list(scores_users.keys())
scores.sort(reverse=True)
top_users_repo = {}
for i in range(min(100,len(scores))):
users = scores_users[scores[i]]
#print("%s: %s" % (scores[i], users))
for user in users:
top_users_repo[user] = [repo for repo,v in gh.repositories.items() if user in v["contributors"] and v["stargazers"]>=GITHUB_TOP_STARS]
header = ["user", "repositories"]
top_users_repo_csv = csv.DictWriter(open("top_users_repo.csv", "w"), header)
top_users_repo_csv.writerow({h:h for h in header})
for user,v in top_users_repo.items():
top_users_repo_csv.writerow({'user':user, 'repositories':';'.join(v)})
Wait a minute...
GraphQL?
REST
- every URL is an object
- HTTP verbs are methods (PUT, GET, POST, DELETE)
- the Web as an API
GraphQL
- Only one endpoint
- Query only what you want, get only your results
- Pagination included
- Less verbose
- Designed by Facebook, implemented by GitHub
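The single-endpoint model can be sketched in a few lines: the query travels in a JSON body POSTed to one URL. This is a minimal sketch; `GITHUB_TOKEN` is a placeholder, and actually sending the request requires a real personal access token:

```python
import json

# Single GraphQL endpoint: every query goes to the same URL.
ENDPOINT = "https://api.github.com/graphql"
GITHUB_TOKEN = "XXXXXXXXXX"  # placeholder: a real token is needed to send this

# The query string is wrapped in a JSON object under the "query" key.
query = "{ viewer { login } }"
payload = json.dumps({"query": query})
headers = {"Authorization": "bearer %s" % GITHUB_TOKEN}

# With a valid token, the call would be:
# import requests
# response = requests.post(ENDPOINT, data=payload, headers=headers)
# print(response.json())
print(payload)
```

Contrast this with REST, where the same data might require several GETs against different resource URLs.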
GraphQL Query
{
location_users: search(type: USER, query: "location:$location created:${created_date}", first: 100 $cursor) {
users: edges {
user: node {
... on User {
login
name
email
avatarUrl
url
websiteUrl
pullRequests(first: 100, orderBy: {field: CREATED_AT, direction: DESC})
{
totalCount
pullMerged: nodes{
... on PullRequest
{
url
createdAt
repository
{
name
primaryLanguage {
name
}
stargazers {
totalCount
}
}
}
}
}
}
}
}
pageInfo {
endCursor
hasNextPage
}
}
}
Now, the results...
Contribution by language
100 top contributors
Key figures
- Less than 1 contribution per user on average
- the top 100 users represent 77.2% of contributions (1,482 / 1,920) and almost 100% of stars (major projects)
- the top 10 users represent more than a third of all contributions (617)
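Concentration figures like these come from a cumulative share over the score-sorted user list. A sketch with hypothetical per-user contribution counts (not the real dataset):

```python
# Per-user contribution counts, sorted descending.
# NOTE: hypothetical numbers for illustration only.
contributions = [300, 150, 80, 40, 20, 10]

def top_share(counts, n):
    """Fraction of all contributions made by the top n users."""
    return sum(counts[:n]) / sum(counts)

print(round(top_share(contributions, 2), 3))  # share of the top 2 users
```

Applying the same computation to the real table is what yields figures such as "the top 100 users represent 77.2% of contributions".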
Some thoughts
- the fabulous Minio project explains the Go enigma
- other Tunisian-made open-source products (e.g. BigBlueButton...)
- contributions to famous or promising products (Invoice Ninja, Odoo, PrestaShop) and frameworks (Django, Symfony, Drupal, Apache Spark, Apache Kafka)
- why don't those major Tunisian open-source contributors give more tech talks to encourage others?
Roadmap
- mapping the network of relationships (follows)
- auto-discovery of latent relationships and projects to foster local collaboration
- ...
Next steps
release tunisians.py, engage the community.tn, improve, contribute more, iterate
Thank you for your attention
Bibliography
- Wikimedia / Wikipedia
- Felipe Hoffa, https://medium.com/@hoffa/the-top-github-projects-per-country-92c275e19409 https://medium.com/@hoffa/github-top-countries-201608-13f642493773
- GitHub
- Stack Overflow
- M. D.
Opensource business models
By Badreddine Ladjemi