Opensource
&
Tunisia
badreddine@ankaboot.fr
First, let me introduce myself
Are we contributors or consumers?
Study based on GitHub &
Stack Overflow
Key figures
- 26 million users (March 2017)
- 57 million repositories
- #61 on Alexa (October 2017)
- Yet, not open source itself
Key figures
- 8 million users (March 2017)
- #49 on Alexa (October 2017)
- CC-BY-SA 3.0
Tunisia on GitHub & Stack Overflow
- 1,931 users on GitHub
- 552 users on Stack Overflow
- Very few, yet constantly growing (it's up to you)
Top countries by number of pushes Aug 2016
Top countries by number of pushes / population
Tunisia
#1 in Africa
#4 in Arab World
(pop. ratio based ranking)
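The population-ratio-based ranking above can be sketched as follows; note that the push counts and populations below are hypothetical placeholders for illustration, not the actual August 2016 figures:

```python
# Rank countries by GitHub pushes per capita.
# NOTE: these numbers are made-up placeholders, not real 2016 data.
pushes = {"Tunisia": 60_000, "Egypt": 90_000, "Morocco": 50_000}
population = {"Tunisia": 11_000_000, "Egypt": 95_000_000, "Morocco": 35_000_000}

# Normalize each country's push count by its population.
per_capita = {country: pushes[country] / population[country] for country in pushes}

# Sort countries from highest to lowest pushes-per-capita.
ranking = sorted(per_capita, key=per_capita.get, reverse=True)
print(ranking)
```

Normalizing by population is what lets a small country like Tunisia rank ahead of much larger ones in absolute push counts.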
Let's have a deeper look inside and crunch the data...
Assumptions
- every country has the same ratio of open-source to non-free software (an "all things being equal" mindset), so we don't check licences (because we can't)
- GitHub users are relatively representative of the whole open-source community, except that they're more web-focused
- Tunisians on GitHub are identified by the location field on their GitHub profile
- a contribution = a merged pull request
- documentation, translation, and bug-fixing contributions are as valuable as code
- a repository with >1,000 stars = an important product
- value of a contribution = the number of the repository's stars
- score = sum of contribution values + Stack Overflow reputation
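The scoring model defined by these assumptions can be sketched as a small function; the data shapes here are simplified stand-ins for the real GitHub and Stack Overflow records:

```python
def contribution_score(merged_pr_repo_stars, so_reputation):
    """score = sum of the star counts of repositories where the user's
    pull requests were merged, plus Stack Overflow reputation."""
    return sum(merged_pr_repo_stars) + so_reputation

# A user with merged PRs in repos of 1,500 and 300 stars,
# plus 250 Stack Overflow reputation:
print(contribution_score([1500, 300], 250))  # → 2050
```

Because the score weights each merged pull request by the target repository's stars, a single contribution to a popular project outweighs many contributions to unknown ones.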
The Code
- GPLv3
- Work in progress
- python3
- GitHub: GraphQL API
- Stack Overflow: Stack Exchange query results scraping
- Storage: flat file CSV (>Cassandra?)
- Goal: gamify open-source contributions on GitHub & Stack Overflow using statistics
#!/usr/bin/env python3
"""
Contributors ranking based on their Github merged pull requests and Stackoverflow reputation
Goal: gamify contributions to opensource projects
"""
__author__ = "Bader LEJMI"
__version__ = "0.1.0"
__license__ = "GPLv3"
import requests
import string
import json
import atexit
import datetime
import csv
from collections import OrderedDict
DEFAULT_LOCATION = "Tunisia"
GITHUB_TOKEN = "XXXXXXXXXX"
GITHUB_GRAPHQL_FILENAME = "tunisians.graphql"
GITHUB_GRAPHQL_ENDPOINT = "https://api.github.com/graphql"
GITHUB_TOP_STARS = 1000
GITHUB_RANKING_FILENAME = "github_ranking.json"
GITHUB_FIRST_YEAR = 2007
STACKOVERFLOW_CSV_URL = "http://data.stackexchange.com/stackoverflow/csv/893492?CountryName=$location"
STACKOVERFLOW_CSV_FILENAME = "so.csv"
STACKOVERFLOW_CSV_HEADER = ['name', 'id', 'avatarUrl', 'url', 'reputation']
GITHUB_STACKOVERFLOW_KEYS = OrderedDict([('url','url'), ('websiteUrl','url'), ('name','name'), ('login','name')])
def _atexit_save(oRanking):
"""
agnostic autosave hooker
"""
oRanking.save()
"""class platformRanking:
pass
"""
class Ranking:
"""
union
"""
def __init__(self, github, stackoverflow):
self.so = stackoverflow
self.gh = github
self.users = self.gh.users
#self.join()
"""
#old costly algorithm
def join(self, keys=GITHUB_STACKOVERFLOW_KEYS):
joined = []
for u in range(len(self.users)):
for kgh,kso in keys.items():
#look for same value key by key in all stackoverflow users
for uso in self.so.users:
try:
if uso[kso] and self.users[u][kgh].lower() == uso[kso].lower():
self.users[u].update({k:v for k,v in uso.items() if not self.users[u].get(k)})
joined.append(self.users[u]['login'])
#print("kgh: %s, value: %s, uso: %s, user: %s" % (kgh, self.users[u][kgh],uso,self.users[u]))
#input()
break
except AttributeError:
pass
#exit loop if user already joined
if joined and joined[-1]==self.users[u]['login']:
break
return joined
"""
def join(self, keys=GITHUB_STACKOVERFLOW_KEYS):
"""
join github and stackoverflow users
"""
joined = []
joined_keys = {kgh:{} for kgh in keys}
"""
first index all GH users,
secondly test if SO user is joinable
then update
"""
i=0
len_users=len(self.users)
for u in self.users+self.so.users:
for kgh,kso in keys.items():
#all gh unindexed users with a value for key must be indexed
#try:
if i<len_users:
try:
joined_keys[u[kgh].lower()] = u
except AttributeError:
pass #sometimes u[kgh] is "null" in JSON so None in Python
except KeyError:
print(i,u)
input()
continue
ukso = u[kso].lower()
#then, only so users found in gh users are updated (only once)
try:
if ukso and joined_keys[ukso][kgh].lower()==ukso:
"""
HACK
SO name is sometimes a real name, sometimes a firstname, and sometimes a GH login due to OAuth SSO
therefore:
1) GH/SO name can't be used as a join id when it's just a firstname
2) SO name should be joined to GH login when it's relevant
Hence, my hypothesis:
I) GH name is usable as a join id when
a) SO name is a real name (>=2 words) OR
b) GH name is GH login (equal to GH login minus spaces forbidden in GH login)
too restrictive> II) GH login is usable as a join id when GH login is GH name (minus spaces & case)
"""
if kgh=='name' and (len(ukso.split(' '))<2 or ukso.replace(' ','')!=joined_keys[ukso]["login"].lower().replace(' ','')):
pass #do not join cause it's neither a complete name nor a login
#elif kgh=="login" and u["name"].lower().replace(' ','')!=joined_keys[ukso]["name"].lower().replace(' ',''):
# pass #do not join SO name with GH login if SO name different from GH name
else:#here we go^W join
update = {k:v for k,v in u.items() if v and not joined_keys[ukso].get(k)}
joined_keys[ukso].update(update)
joined.append(joined_keys[ukso])
break
except KeyError:
pass
i+=1
return joined
def build_stats(self):
scores = {}
for u in self.users:
score = u.get('reputation',0)+self.gh.score[u["login"]]
u["score"] = score
try:
scores[score].append(u["login"])
except KeyError:
scores[score] = [u["login"]]
return scores
class stackoverflowRanking:
"""
Stackoverflow ranking
"""
#Stackoverflow ranking
def __init__(self, location=DEFAULT_LOCATION, autoload=True, autosave=True):
self.users = []
self.location = location
if autoload:
try:
self.load()
except FileNotFoundError:
self.fetch()
else:
self.fetch()
if autosave:
atexit.register(_atexit_save, self)
def fetch(self, csv_url=STACKOVERFLOW_CSV_URL, header=STACKOVERFLOW_CSV_HEADER):
csv_url = csv_url.replace("$location", self.location)
data_csv = requests.get(csv_url).text.splitlines()
#data_csv = data_csv.splitlines()
self._fill(data_csv, header)
def load(self, csv_filename=STACKOVERFLOW_CSV_FILENAME, header=STACKOVERFLOW_CSV_HEADER):
self._fill(open(csv_filename), header)
def _fill(self, iterable, header):
reader = csv.DictReader(iterable, header)
next(reader) #skip headers
for r in reader:
self.users.append({
"id" : r["id"],
"name" : r["name"],
"avatarUrl": r["avatarUrl"],
"url": r["url"],
"reputation": int(r["reputation"])
})
def save(self, filename=STACKOVERFLOW_CSV_FILENAME, header=STACKOVERFLOW_CSV_HEADER):
"""
save in a proper CSV file
"""
so_csv = csv.DictWriter(open(filename, "w"), header)
so_csv.writerow({h:h for h in header})
for user in self.users:
so_csv.writerow(user)
def build_stats(self):
self.scores = {}
for u in self.users:
try:
self.scores[u['reputation']].add(u['id'])
except KeyError:
self.scores[u['reputation']] = set([u['id']])
class githubRanking:
"""
Github ranking
"""
def __init__(self, token=GITHUB_TOKEN, location=DEFAULT_LOCATION, autoload=True, autosave=True, api_endpoint=GITHUB_GRAPHQL_ENDPOINT, request=GITHUB_GRAPHQL_FILENAME):
self.users = []
self.token = token
self.graphql_request_file = request
self.graphql_api_endpoint = api_endpoint
#issue: search query is limited to 1000 results
#solution: paging year by year to limit results
#limit: if user creation is >1000 a year
#solution for the future: paging month by month
self.created_date_range = range(GITHUB_FIRST_YEAR, datetime.datetime.now().year+1)
self.graphql_variables = {
"location" : location,
"cursor" : '',
"created_date": self.created_date_range[0],
}
if autoload:
try:
self.load()
except FileNotFoundError:
self.fetch()
else:
self.fetch()
if autosave:
atexit.register(_atexit_save, self)
def _graphql_query(self, created_date, cursor):
"""
build an HTTP ready graphql query
"""
self.graphql_variables['cursor'] = cursor
self.graphql_variables['created_date'] = created_date
try:
graphql_query = string.Template(self.graphql_query_content)
except AttributeError:
self.graphql_query_content = open(self.graphql_request_file).read()
self.graphql_query_content = '{"query": "%s"}' % self.graphql_query_content.replace('"',"\\\"").replace('\n','\\n')
graphql_query = string.Template(self.graphql_query_content)
graphql_query_json = graphql_query.substitute(self.graphql_variables)
return graphql_query_json
def fetch(self):
"""
fetch all users data
"""
for created_date in self.created_date_range:
hasNextPage = True
cursor = ''
while hasNextPage:
#currently fetch only first 100 pull requests
data_json = requests.post(self.graphql_api_endpoint,
headers={'Authorization': "bearer %s " % self.token},
data=self._graphql_query(created_date, cursor))
try:
data = data_json.json()["data"]["location_users"]
except TypeError:
raise TypeError("GraphQL query: %s\nAnswer: %s" % (self._graphql_query(created_date, cursor), data_json.text))
self.users.extend([u['user'] for u in data["users"] if u['user']])
hasNextPage = data["pageInfo"]["hasNextPage"]
#print("created_date: %s %s, hasNextPage: %s, # users: %s" % (created_date, cursor, hasNextPage, len(data["users"])))
cursor = ', after: \\"%s\\"' % data["pageInfo"]["endCursor"]
return self.users
def build_stats(self):
"""
stats on repositories, users, languages
"""
self.repositories = {}
self.score = {}
self.languages = {}
for user in self.users:
try:
id = user["login"]
self.score[id] = 0
for pull in user["pullRequests"]["pullMerged"]:
repo = pull["repository"]
self.score[id] += repo["stargazers"]["totalCount"]
primaryLanguage = repo["primaryLanguage"]["name"] if repo["primaryLanguage"] else None
try:
self.languages[primaryLanguage] += repo["stargazers"]["totalCount"]
except KeyError:
self.languages[primaryLanguage] = repo["stargazers"]["totalCount"]
if not repo["name"] in self.repositories:
self.repositories[repo["name"]] = {
"stargazers": repo["stargazers"]["totalCount"],
"primaryLanguage": primaryLanguage,
"contributors": set([id])
}
user["repositories"] = set([repo["name"]])
else:
self.repositories[repo["name"]]["contributors"].add(id)
user["repositories"].add(repo["name"])
except KeyError:
pass
return self.score
def load(self, method="filesystem", filename=GITHUB_RANKING_FILENAME):
"""
load data from storage (current default and only method filesystem) in json format
"""
file = open(filename, "r")
self.users = json.load(file)
def save(self, filename=GITHUB_RANKING_FILENAME):
"""
save data in a storage (current default and only method: filesystem) in json format
"""
if not self.users:
return None
try:
file = open(filename, "x")
except FileExistsError:
file = open(filename, "w")
json.dump(self.users, file, indent="\t", ensure_ascii=False)
if __name__ == "__main__":
""" This is executed when run from the command line """
gh = githubRanking(autoload=True, autosave=False)
gh.build_stats()
so = stackoverflowRanking(autoload=True, autosave=False)
so.build_stats()
r = Ranking(github=gh, stackoverflow=so)
r.join()
scores_users = r.build_stats()
scores = list(scores_users.keys())
scores.sort(reverse=True)
top_users_repo = {}
for i in range(min(100,len(scores))):
users = scores_users[scores[i]]
#print("%s: %s" % (scores[i], users))
for user in users:
top_users_repo[user] = [repo for repo,v in gh.repositories.items() if user in v["contributors"] and v["stargazers"]>=GITHUB_TOP_STARS]
header = ["user", "repositories"]
top_users_repo_csv = csv.DictWriter(open("top_users_repo.csv", "w"), header)
top_users_repo_csv.writerow({h:h for h in header})
for user,v in top_users_repo.items():
top_users_repo_csv.writerow({'user':user, 'repositories':';'.join(v)})
Wait a minute...
GraphQL?
REST
- every URL is an object
- HTTP verbs are methods (PUT, GET, POST, DELETE)
- the Web as an API
GraphQL
- Only one endpoint
- Query only what you want, get only your results
- Pagination included
- Less verbose
- Designed by Facebook, implemented by GitHub
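The single-endpoint model can be sketched in a few lines: the query travels in a JSON body POSTed to one URL. This is a minimal sketch; `GITHUB_TOKEN` is a placeholder, and actually sending the request requires a real personal access token:

```python
import json

# Single GraphQL endpoint: every query goes to the same URL.
ENDPOINT = "https://api.github.com/graphql"
GITHUB_TOKEN = "XXXXXXXXXX"  # placeholder: a real token is needed to send this

# The query string is wrapped in a JSON object under the "query" key.
query = "{ viewer { login } }"
payload = json.dumps({"query": query})
headers = {"Authorization": "bearer %s" % GITHUB_TOKEN}

# With a valid token, the call would be:
# import requests
# response = requests.post(ENDPOINT, data=payload, headers=headers)
# print(response.json())
print(payload)
```

Contrast this with REST, where the same data might require several GETs against different resource URLs.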
GraphQL Query
{
location_users: search(type: USER, query: "location:$location created:${created_date}", first: 100 $cursor) {
users: edges {
user: node {
... on User {
login
name
email
avatarUrl
url
websiteUrl
pullRequests(first: 100, orderBy: {field: CREATED_AT, direction: DESC})
{
totalCount
pullMerged: nodes{
... on PullRequest
{
url
createdAt
repository
{
name
primaryLanguage {
name
}
stargazers {
totalCount
}
}
}
}
}
}
}
}
pageInfo {
endCursor
hasNextPage
}
}
}
Now, the results...
Contribution by language
100 top contributors
Key figures
- Less than 1 contribution per user on average
- the top 100 users represent 77.2% of contributions (1,482 / 1,920) and almost 100% of stars (major projects)
- the top 10 users represent more than a third of all contributions (617)
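Concentration figures like these come from a cumulative share over the score-sorted user list. A sketch with hypothetical per-user contribution counts (not the real dataset):

```python
# Per-user contribution counts, sorted descending.
# NOTE: hypothetical numbers for illustration only.
contributions = [300, 150, 80, 40, 20, 10]

def top_share(counts, n):
    """Fraction of all contributions made by the top n users."""
    return sum(counts[:n]) / sum(counts)

print(round(top_share(contributions, 2), 3))  # share of the top 2 users
```

Applying the same computation to the real table is what yields figures such as "the top 100 users represent 77.2% of contributions".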
Some thoughts
- the fabulous Minio project explains the Go enigma
- other Tunisian-made open-source products (e.g. BigBlueButton...)
- contributions to famous or promising products (Invoice Ninja, Odoo, PrestaShop) and frameworks (Django, Symfony, Drupal, Apache Spark, Apache Kafka)
- why don't those major Tunisian open-source contributors give more tech talks to encourage others?
Roadmap
- mapping the network of relationships (follows)
- auto-discovery of latent relationships and projects to foster local collaboration
- ...
Next steps
release tunisians.py, engage the community.tn, improve, contribute more, iterate
Thank you for your attention
Bibliography
- Wikimedia / Wikipedia
- Felipe Hoffa, https://medium.com/@hoffa/the-top-github-projects-per-country-92c275e19409 https://medium.com/@hoffa/github-top-countries-201608-13f642493773
- GitHub
- Stack Overflow
- M. D.
Opensource business models
By Badreddine Ladjemi