Opensource

&

Tunisia

badreddine@ankaboot.fr

First, let me introduce myself

Are we contributors or consumers?

Study based on GitHub &

Stack Overflow

Key figures

  • 26 million users (March 2017)
  • 57 million repositories
  • #61 @ Alexa (October 2017)
  • Yet, GitHub itself is not opensource


Key figures

  • 8 million users (March 2017)
  • #49 @ Alexa (October 2017)
  • CC-BY-SA 3.0

Tunisia on GitHub & Stack Overflow

  • 1,931 users @ GitHub
  • 552 users @ Stack Overflow
  • Very few, yet constantly growing (it's up to you)

Top countries by number of pushes Aug 2016

Top countries by number of pushes / population


Tunisia

#1 in Africa

#4 in Arab World

(pop. ratio based ranking)

Let's have a deeper look inside and crunch the data...

Assumptions

  • every country has the same ratio of opensource to non-free software (an "all things being equal" mindset), so we don't check licences (because we can't)
  • GitHub users are relatively representative of the whole opensource community, except that they are more web-focused
  • Tunisians on GitHub are identified by the location on their GitHub profile
  • a contribution = a merged pull request
  • documentation, translation, and bug-fixing contributions are as valuable as code
  • repository with >1000 stars = important product
  • value of a contribution = number of stars of the repository
  • score = sum of contribution values + Stack Overflow reputation
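Under these assumptions, the scoring rule amounts to a few lines of Python. This is a minimal sketch; the user record and its field names (`merged_pull_requests`, `repository_stars`, `so_reputation`) are hypothetical, for illustration only:

```python
# Sketch of the scoring rule stated above:
# score = sum over merged PRs of the target repository's stars,
#         plus the user's Stack Overflow reputation.

def score(user):
    """Score a contributor from merged pull requests and SO reputation."""
    contribution_value = sum(pr["repository_stars"]
                             for pr in user["merged_pull_requests"])
    return contribution_value + user.get("so_reputation", 0)

# hypothetical contributor: two merged PRs, one in a >1000-star "important product"
user = {
    "merged_pull_requests": [
        {"repository_stars": 1500},
        {"repository_stars": 40},
    ],
    "so_reputation": 250,
}

print(score(user))  # 1500 + 40 + 250 = 1790
```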

The Code

  • GPLv3
  • Work in progress
  • python3
  • GitHub: GraphQL API
  • Stack Overflow: Stack Exchange query results scraping
  • Storage: flat-file CSV (>Cassandra?)
  • Goal: gamify opensource contributions on github & stackoverflow using statistics
#!/usr/bin/env python3
"""
Contributors ranking based on their Github merged pull requests and Stackoverflow reputation
Goal: gamify contributions to opensource projects
"""

__author__ = "Bader LEJMI"
__version__ = "0.1.0"
__license__ = "GPLv3"

import requests
import string
import json
import atexit
import datetime
import csv
from collections import OrderedDict

DEFAULT_LOCATION = "Tunisia"
GITHUB_TOKEN = "XXXXXXXXXX"
GITHUB_GRAPHQL_FILENAME = "tunisians.graphql"
GITHUB_GRAPHQL_ENDPOINT = "https://api.github.com/graphql" 
GITHUB_TOP_STARS = 1000
GITHUB_RANKING_FILENAME = "github_ranking.json"
GITHUB_FIRST_YEAR = 2007


STACKOVERFLOW_CSV_URL = "http://data.stackexchange.com/stackoverflow/csv/893492?CountryName=$location"
STACKOVERFLOW_CSV_FILENAME = "so.csv"
STACKOVERFLOW_CSV_HEADER = ['name', 'id', 'avatarUrl', 'url', 'reputation']

GITHUB_STACKOVERFLOW_KEYS = OrderedDict([('url','url'), ('websiteUrl','url'), ('name','name'), ('login','name')])

def _atexit_save(oRanking):
    """
    platform-agnostic autosave hook
    """
    oRanking.save()
    
"""class platformRanking:
    pass
"""

class Ranking:
    """
    union
    """
    
    def __init__(self, github, stackoverflow):
        self.so = stackoverflow
        self.gh = github
        self.users = self.gh.users
        #self.join()
        
        """ 
        #old costly algorithm
        def join(self, keys=GITHUB_STACKOVERFLOW_KEYS):
        joined = []
        
        for u in range(len(self.users)):
            for kgh,kso in keys.items():
                #look for same value key by key in all stackoverflow users
                for uso in self.so.users:
                    try:
                        if uso[kso] and self.users[u][kgh].lower() == uso[kso].lower():
                            self.users[u].update({k:v for k,v in uso.items() if not self.users[u].get(k)})
                            joined.append(self.users[u]['login'])
                            #print("kgh: %s, value: %s, uso: %s, user: %s" % (kgh, self.users[u][kgh],uso,self.users[u]))
                            #input()
                            break
                    except AttributeError:
                        pass
                #exit loop if user already joined
                if joined and joined[-1]==self.users[u]['login']:
                    break
        
        return joined
        """  
        
    def join(self, keys=GITHUB_STACKOVERFLOW_KEYS):
        """
        join github and stackoverflow users
        """
        joined = []
        joined_keys = {kgh:{} for kgh in keys}
        
        """
        first index all GH users,
        secondly test if SO user is joinable
        then update
        """
        i=0
        len_users=len(self.users)
        for u in self.users+self.so.users:
            for kgh,kso in keys.items():
                #all gh unindexed users with a value for key must be indexed
                #try:
                if i<len_users:
                    try:
                        joined_keys[u[kgh].lower()] = u
                    except AttributeError:
                        pass #sometimes u[kgh] is "null" in JSON so None in Python
                    except KeyError:
                        print(i,u)
                        input()
                    continue
                    
                try:
                    ukso = u[kso].lower()
                except AttributeError:
                    continue #value can be None ("null" in JSON); try the next key
                #then, only so users found in gh users are updated (only once)
                try:
                    if ukso and joined_keys[ukso][kgh].lower()==ukso:
                        """
                        HACK
                        SO name is sometimes a real name, sometimes a firstname, and sometimes a GH login due to OAuth SSO
                        therefore:
                        1) GH/SO name can't be used as a join id when it's just a firstname
                        2) SO name should be joined to GH login when it's relevant
                        Hence, my hypothesis:
                        I) GH name is usable as a join id when
                        a) SO name is a real name (>=2 words) OR
                        b) GH name is GH login (equal to GH login minus spaces forbidden in GH login)
                        too restrictive> II) GH login is usable as a join id when GH login is GH name (minus spaces & case)
                        """
                        if kgh=='name' and (len(ukso.split(' '))<2 or ukso.replace(' ','')!=joined_keys[ukso]["login"].lower().replace(' ','')):
                            pass #do not join cause it's neither a complete name nor a login
                        #elif kgh=="login" and u["name"].lower().replace(' ','')!=joined_keys[ukso]["name"].lower().replace(' ',''):
                        #    pass #do not join SO name with GH login if SO name different from GH name
                        else:#here we go^W join
                            update = {k:v for k,v in u.items() if v and not joined_keys[ukso].get(k)}
                            joined_keys[ukso].update(update)
                            joined.append(joined_keys[ukso])
                            break
         
                except KeyError:
                    pass            
            i+=1
            
        return joined
    
    def build_stats(self):
        scores = {}
        for u in self.users:
            score = u.get('reputation',0)+self.gh.score[u["login"]]
            u["score"] = score
            try:
                scores[score].append(u["login"])
            except KeyError:
                scores[score] = [u["login"]]
                
        return scores
    
    
class stackoverflowRanking:
    """
    Stackoverflow ranking
    """
    
    def __init__(self, location=DEFAULT_LOCATION, autoload=True, autosave=True):
        self.users = []
        self.location = location
        
        if autoload:
            try:
                self.load()
            except FileNotFoundError:
                self.fetch()
        else:
            self.fetch()
        if autosave:
            atexit.register(_atexit_save, self)
    
    def fetch(self, csv_url=STACKOVERFLOW_CSV_URL, header=STACKOVERFLOW_CSV_HEADER):
        csv_url = csv_url.replace("$location", self.location)
        data_csv = requests.get(csv_url).text.splitlines()
        #data_csv = data_csv.splitlines()
        self._fill(data_csv, header)
    
    def load(self, csv_filename=STACKOVERFLOW_CSV_FILENAME, header=STACKOVERFLOW_CSV_HEADER):
        self._fill(open(csv_filename), header)
                
    def _fill(self, iterable, header):
        reader = csv.DictReader(iterable, header)
        next(reader) #skip headers
        for r in reader:
            self.users.append({
                "id" : r["id"],
                "name" : r["name"],
                "avatarUrl": r["avatarUrl"],
                "url": r["url"],
                "reputation": int(r["reputation"])
            })
    
    def save(self, filename=STACKOVERFLOW_CSV_FILENAME, header=STACKOVERFLOW_CSV_HEADER):
        """
        save in a proper CSV file
        """
        with open(filename, "w") as f:
            so_csv = csv.DictWriter(f, header)
            so_csv.writeheader()
            for user in self.users:
                so_csv.writerow(user)
            
    
    def build_stats(self):
        self.scores = {}
        for u in self.users:
            try:
                self.scores[u['reputation']].add(u['id'])
            except KeyError:
                self.scores[u['reputation']] = set([u['id']])

class githubRanking:
    """
    Github ranking
    """

    def __init__(self, token=GITHUB_TOKEN, location=DEFAULT_LOCATION, autoload=True, autosave=True, api_endpoint=GITHUB_GRAPHQL_ENDPOINT, request=GITHUB_GRAPHQL_FILENAME):
        self.users = []
        self.token = token
        self.graphql_request_file = request
        self.graphql_api_endpoint = api_endpoint
        
        #issue: search query is limited to 1000 results
        #solution: paging year by year to limit results
        #limit: if user creation is >1000 a year
        #solution for the future: paging month by month
        self.created_date_range = range(GITHUB_FIRST_YEAR, datetime.datetime.now().year+1)

        self.graphql_variables = {
         "location" : location,   
         "cursor" : '',
         "created_date": self.created_date_range[0],
        }
                    
        if autoload:
            try:
                self.load()
            except FileNotFoundError:
                self.fetch()
        else:
            self.fetch()
        if autosave:
            atexit.register(_atexit_save, self)
            
    def _graphql_query(self, created_date, cursor):
        """
        build an HTTP ready graphql query
        """
        self.graphql_variables['cursor'] = cursor
        self.graphql_variables['created_date'] = created_date
        
        try:
            graphql_query = string.Template(self.graphql_query_content)
        except AttributeError:
            self.graphql_query_content = open(self.graphql_request_file).read()
            self.graphql_query_content = '{"query": "%s"}' % self.graphql_query_content.replace('"',"\\\"").replace('\n','\\n')
            graphql_query = string.Template(self.graphql_query_content)
        
        graphql_query_json = graphql_query.substitute(self.graphql_variables)
        
        return graphql_query_json

    def fetch(self):
        """
        fetch all users data
        """        
        for created_date in self.created_date_range:
            hasNextPage = True
            cursor = ''
            while hasNextPage:
                #currently fetch only first 100 pull requests
                data_json = requests.post(self.graphql_api_endpoint,
                                          headers={'Authorization': "bearer %s " % self.token},
                                          data=self._graphql_query(created_date, cursor))
                try:
                    data = data_json.json()["data"]["location_users"]
                except TypeError:
                    raise TypeError("GraphQL query: %s\nAnswer: %s" % (self._graphql_query(created_date, cursor), data_json.text))
                
                self.users.extend([u['user'] for u in data["users"] if u['user']])
    
                hasNextPage = data["pageInfo"]["hasNextPage"]
                #print("created_date: %s %s, hasNextPage: %s, # users: %s" % (created_date, cursor, hasNextPage, len(data["users"])))

                cursor = ', after: \\"%s\\"' % data["pageInfo"]["endCursor"]
                
   
        return self.users
    
    def build_stats(self):
        """
        stats on repositories, users, languages
        """
        self.repositories = {}
        self.score = {}
        self.languages = {}
         
        for user in self.users:
            try:
                id = user["login"]
                self.score[id] = 0
                #initialize up front so the first .add() below never raises KeyError
                user["repositories"] = set()

                for pull in user["pullRequests"]["pullMerged"]:
                    repo = pull["repository"]
                    stars = repo["stargazers"]["totalCount"]

                    self.score[id] += stars
                    primaryLanguage = repo["primaryLanguage"]["name"] if repo["primaryLanguage"] else None

                    try:
                        self.languages[primaryLanguage] += stars
                    except KeyError:
                        self.languages[primaryLanguage] = stars

                    if repo["name"] not in self.repositories:
                        self.repositories[repo["name"]] = {
                            "stargazers": stars,
                            "primaryLanguage": primaryLanguage,
                            "contributors": {id}
                        }
                    else:
                        self.repositories[repo["name"]]["contributors"].add(id)
                    user["repositories"].add(repo["name"])

            except KeyError:
                pass
        
        return self.score
    
    def load(self, method="filesystem", filename=GITHUB_RANKING_FILENAME):
        """
        load data from storage (current default and only method filesystem) in json format
        """
        file = open(filename, "r")
        self.users = json.load(file)
        
    def save(self, filename=GITHUB_RANKING_FILENAME):
        """
        save data in a storage (current default and only method: filesystem) in json format
        """
        if not self.users:
            return None
        with open(filename, "w") as file:
            json.dump(self.users, file, indent="\t", ensure_ascii=False)
   
if __name__ == "__main__":
    """ This is executed when run from the command line """
    gh = githubRanking(autoload=True, autosave=False)
    gh.build_stats()
    so = stackoverflowRanking(autoload=True, autosave=False)
    so.build_stats()
    r = Ranking(github=gh, stackoverflow=so)
    r.join()
    scores_users = r.build_stats()
    scores = list(scores_users.keys())
    scores.sort(reverse=True)
    top_users_repo = {}
    for i in range(min(100,len(scores))):
        users = scores_users[scores[i]]
        #print("%s: %s" % (scores[i], users))
        for user in users:
            top_users_repo[user] = [repo for repo,v in gh.repositories.items() if user in v["contributors"] and v["stargazers"]>=GITHUB_TOP_STARS]
    
    header = ["user", "repositories"]
    with open("top_users_repo.csv", "w") as f:
        top_users_repo_csv = csv.DictWriter(f, header)
        top_users_repo_csv.writeheader()
        for user,v in top_users_repo.items():
            top_users_repo_csv.writerow({'user':user, 'repositories':';'.join(v)})

Wait a minute...

GraphQL?

REST

  • every URL is an object
  • HTTP verbs are methods (GET, PUT, POST, DELETE)
  • Web as an API
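With REST, assembling one user's profile, repositories, and per-repo stars takes a separate round trip per resource, which is exactly what GraphQL collapses into a single POST. A small sketch under assumptions: the endpoint paths follow GitHub's REST API (`/users/{login}`, `/users/{login}/repos`, `/repos/{owner}/{repo}/stargazers`), while the username and repository names are placeholders:

```python
# With REST, each resource is its own URL, so fetching one user's profile,
# repositories, and per-repo stargazers means several round trips.
# GraphQL fetches the same selection with one POST to one endpoint.

BASE = "https://api.github.com"  # GitHub's REST API root

def rest_requests_for(login, repos):
    """List the REST calls needed for one user's profile, repos, and stars."""
    urls = [
        f"{BASE}/users/{login}",        # profile (all fields, wanted or not)
        f"{BASE}/users/{login}/repos",  # repository list
    ]
    # one more call per repository for stargazer details
    urls += [f"{BASE}/repos/{login}/{r}/stargazers" for r in repos]
    return urls

# hypothetical user with two repositories
calls = rest_requests_for("octocat", ["hello-world", "spoon-knife"])
print(len(calls))  # 4 round trips vs a single GraphQL POST
```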


GraphQL

  • Only one endpoint
  • Query only what you want, get only your results
  • Pagination included
  • Less verbose
  • Designed by Facebook, implemented by GitHub

GraphQL Query

{
  location_users: search(type: USER, query: "location:$location created:${created_date}", first: 100 $cursor) {
    users: edges {
      user: node {
        ... on User {
          login
          name
          email
          avatarUrl
          url
          websiteUrl
  
          pullRequests(first: 100, orderBy: {field: CREATED_AT, direction: DESC})
          {
            totalCount
            pullMerged: nodes{
                ... on PullRequest
                {
                    url
                    createdAt
                    repository
                    {
                       name
                      primaryLanguage {
                        name
                      }
                      stargazers {
                        totalCount
                      }
                    }
                }
            }
          }
        }
      }
    }
    pageInfo {
        endCursor
        hasNextPage
    }
    
  }
}

Now, the results...

Contribution by language

100 top contributors

Key figures

  • Less than 1 contribution per user on average
  • the top 100 users represent 77.2% of contributions (1,482 of 1,920) and almost 100% of stars (major projects)
  • the top 10 users represent more than a third of all contributions (617)

Some thoughts

  • the fabulous Minio project explains the Go enigma
  • other Tunisian-made opensource products (e.g. BigBlueButton...)
  • contributions to famous or promising products (Invoice Ninja, Odoo, PrestaShop) and frameworks (Django, Symfony, Drupal, Apache Spark, Apache Kafka)
  • why don't those major Tunisian opensource contributors give more tech talks to encourage others?

Roadmap

  • mapping the network of relationships (follows)
  • auto-discovery of latent relationships and projects, to foster local collaboration
  • ...

Next steps

release tunisians.py, engage the community.tn, improve, contribute more, iterate

Thank you for your attention

Bibliography

  • Wikimedia / Wikipedia
  • Felipe Hoffa, https://medium.com/@hoffa/the-top-github-projects-per-country-92c275e19409 https://medium.com/@hoffa/github-top-countries-201608-13f642493773
  • GitHub
  • Stackoverflow
  • M. D.

Opensource business models

By Badreddine Ladjemi
