Introduction to Parallel Execution in Python

Michael Jalkio

San Diego Python - March 22, 2018

"Python is Fast,
but the Internet is Really Slow"

name,facebook
United Way Worldwide,UnitedWay
Task Force for Global Health,TheTaskForceforGlobalHealth
Feeding America,FeedingAmerica
Salvation Army,SalvationArmyUSA
YMCA of the USA,YMCA
St. Jude Children’s Hospital,stjude
Food for the Poor,FoodForThePoor
Boys & Girls Club of America,bgca.clubs
Catholic Charities USA,catholiccharitiesusa

import facebook


class FacebookClient:
    """Simple class to get basic information on Facebook Pages."""

    def __init__(self, access_token):
        """Initialize GraphAPI object."""
        self.graph = facebook.GraphAPI(access_token=access_token,
                                       version='2.7')

    def get_page_fan_count(self, page_id):
        """Return number of fans for the given page."""
        page = self.graph.get_object(id=page_id, fields='fan_count')
        return page['fan_count']

    def get_page_about(self, page_id):
        """Return some information about the given page."""
        page = self.graph.get_object(id=page_id, fields='about')
        return page['about']

Disclaimer

This is not good code!

import os

import pandas as pd
from dotenv import load_dotenv

from facebook_client import FacebookClient

load_dotenv()
fb = FacebookClient(access_token=os.getenv('FACEBOOK_ACCESS_TOKEN'))
nonprofit_df = pd.read_csv('nonprofit_facebook.csv')

nonprofit_df['fan_count'] = nonprofit_df['facebook'].map(fb.get_page_fan_count)
nonprofit_df['about'] = nonprofit_df['facebook'].map(fb.get_page_about)
nonprofit_df.to_csv('output.csv', index=False)
(venv) Michaels-MacBook-Pro:parallel-python-tutorial michael$ python main.py 
name,facebook,fan_count,about
United Way Worldwide,UnitedWay,212501,"To live better, we must Live United."
Task Force for Global Health,TheTaskForceforGlobalHealth,1504,"The Task Force for Global Health provides all people with opportunities to lead healthy, productive lives."
Feeding America,FeedingAmerica,603864,Our mission is to feed America's hungry through a nationwide network of member food banks and engage our country in the fight to end hunger.  You can help.
Salvation Army,SalvationArmyUSA,307914,"The Salvation Army is committed to doing the most good for the most people in the most need. The nation's largest faith-based charity, The Salvation Army serves 30 million people each year through a broad array of social services."
YMCA of the USA,YMCA,351368,"The Y: We're for youth development, healthy living and social responsibility."
St. Jude Children’s Hospital,stjude,2216537,"Welcome to the St. Jude Children’s Research Hospital Facebook page. Before you post, please review our posting policy located on the About page."
Food for the Poor,FoodForThePoor,370988,Food For The  Poor feeds millions of hungry people throughout the countries we serve. www.foodforthepoor.org
Boys & Girls Club of America,bgca.clubs,204625,Great Futures Start Here!
Catholic Charities USA,catholiccharitiesusa,95090,Working to reduce poverty in America for over 100 years.

import os
import time

import pandas as pd
from dotenv import load_dotenv

from facebook_client import FacebookClient

load_dotenv()
fb = FacebookClient(access_token=os.getenv('FACEBOOK_ACCESS_TOKEN'))
nonprofit_df = pd.read_csv('nonprofit_facebook.csv')

t0 = time.perf_counter()
nonprofit_df['facebook'].map(fb.get_page_fan_count)
nonprofit_df['facebook'].map(fb.get_page_about)
t1 = time.perf_counter()
print("Sequential map time elapsed: {time} seconds.".format(time=t1 - t0))

t0 = time.perf_counter()
nonprofit_df['facebook'].map(lambda x: x + 'lol')
nonprofit_df['facebook'].map(lambda x: 500)
t1 = time.perf_counter()
print("Non-network map time elapsed: {time} seconds.".format(time=t1 - t0))
(venv) Michaels-MacBook-Pro:parallel-python-tutorial michael$ python timing.py 
Sequential map time elapsed: 10.18712910101749 seconds.
Non-network map time elapsed: 0.0004884299705736339 seconds.
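
To put the two timings in perspective, here is a quick back-of-the-envelope calculation. It assumes the nine pages from the table, two calls per page, and one HTTP request per call; none of that is measured here, only the two elapsed times come from the output above.

```python
# Elapsed times copied from the timing.py output above.
sequential_seconds = 10.18712910101749
non_network_seconds = 0.0004884299705736339

# Assumption: nine pages, two lookups each, one request per lookup.
pages = 9
calls_per_page = 2
requests_made = pages * calls_per_page

# Average wall-clock cost of a single request.
seconds_per_request = sequential_seconds / requests_made

# How many times slower the network-bound maps are than the in-memory ones.
slowdown = sequential_seconds / non_network_seconds

print("Requests made: {n}".format(n=requests_made))
print("Seconds per request: {s:.2f}".format(s=seconds_per_request))
print("Network maps are about {x:.0f}x slower.".format(x=slowdown))
```

Under those assumptions each request costs a bit over half a second, and almost all of that is waiting, not computing.
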
from multiprocessing import Pool

t0 = time.perf_counter()
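# Four worker processes share the requests; each pool.map call blocks
# until every page has been fetched.
with Pool(processes=4) as pool:
    pool.map(fb.get_page_fan_count, nonprofit_df['facebook'])
    pool.map(fb.get_page_about, nonprofit_df['facebook'])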
t1 = time.perf_counter()
print("Multiprocessing pool time elapsed: {time} seconds.".format(time=t1 - t0))
(venv) Michaels-MacBook-Pro:parallel-python-tutorial michael$ python timing.py 
Multiprocessing pool time elapsed: 3.7498577430378646 seconds.
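
The same effect can be reproduced without a Facebook access token. The sketch below is an illustration, not the talk's code: `fake_fan_count` and `FAKE_LATENCY_SECONDS` are made-up stand-ins for a slow API call, and it swaps in `multiprocessing.pool.ThreadPool` (a thread-backed pool with the same `map` interface as `Pool`) so it runs anywhere without pickling concerns.

```python
import time
from multiprocessing.pool import ThreadPool

# Hypothetical stand-in for one network round trip.
FAKE_LATENCY_SECONDS = 0.05


def fake_fan_count(page_id):
    """Pretend to call a slow API: wait, then return a made-up number."""
    time.sleep(FAKE_LATENCY_SECONDS)
    return len(page_id)


page_ids = ['UnitedWay', 'FeedingAmerica', 'SalvationArmyUSA', 'YMCA',
            'stjude', 'FoodForThePoor', 'bgca.clubs', 'catholiccharitiesusa']

# One call after another: total time is roughly latency * number of pages.
t0 = time.perf_counter()
sequential_results = [fake_fan_count(page_id) for page_id in page_ids]
sequential_elapsed = time.perf_counter() - t0

# Four workers wait on four calls at once; map keeps the input order.
t0 = time.perf_counter()
with ThreadPool(processes=4) as pool:
    pooled_results = pool.map(fake_fan_count, page_ids)
pooled_elapsed = time.perf_counter() - t0

print("Sequential: {s:.2f} s, pooled: {p:.2f} s".format(
    s=sequential_elapsed, p=pooled_elapsed))
```

The results are identical either way; only the waiting overlaps.
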

Dask

Dask is a flexible parallel computing library
for analytic computing.

import dask.dataframe as dd

# With the default blocksize, a CSV this small ends up in a single partition.
nonprofit_df_dask = dd.read_csv('nonprofit_facebook.csv')
t0 = time.perf_counter()
nonprofit_df_dask['facebook'].map(fb.get_page_fan_count,
                                  meta=('fan_count', int)).compute()
nonprofit_df_dask['facebook'].map(fb.get_page_about,
                                  meta=('about', str)).compute()
t1 = time.perf_counter()
print("Dask time elapsed: {time} seconds.".format(time=t1 - t0))
(venv) Michaels-MacBook-Pro:parallel-python-tutorial michael$ python timing.py 
Dask time elapsed: 10.75200344400946 seconds.
import dask.dataframe as dd

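# blocksize is in bytes: a tiny value splits the small CSV into several
# partitions, which Dask can then process in parallel.
nonprofit_df_dask = dd.read_csv('nonprofit_facebook.csv', blocksize=200)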
t0 = time.perf_counter()
nonprofit_df_dask['facebook'].map(fb.get_page_fan_count,
                                  meta=('fan_count', int)).compute()
nonprofit_df_dask['facebook'].map(fb.get_page_about,
                                  meta=('about', str)).compute()
t1 = time.perf_counter()
print("Dask (fixed) time elapsed: {time} seconds.".format(time=t1 - t0))
(venv) Michaels-MacBook-Pro:parallel-python-tutorial michael$ python timing.py 
Dask (fixed) time elapsed: 3.777425266976934 seconds.
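
Why does the partition count matter so much? Dask applies `map` one partition at a time, so a single partition leaves nothing to run side by side. The sketch below is a rough analogy in plain Python, not how Dask is actually implemented: `slow_lookup`, `split_into_partitions`, and `run` are hypothetical helpers, and a thread pool stands in for Dask's scheduler.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_lookup(page_id):
    """Hypothetical slow call: wait briefly, then return a dummy value."""
    time.sleep(0.05)
    return page_id.upper()


def split_into_partitions(items, n_partitions):
    """Cut the list into at most n_partitions contiguous chunks."""
    size = -(-len(items) // n_partitions)  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def process_partition(partition):
    """Rows inside one partition are handled one after another."""
    return [slow_lookup(page_id) for page_id in partition]


def run(items, n_partitions):
    """Process partitions in parallel; return (results, elapsed seconds)."""
    partitions = split_into_partitions(items, n_partitions)
    t0 = time.perf_counter()
    with ThreadPoolExecutor(max_workers=4) as executor:
        chunks = list(executor.map(process_partition, partitions))
    elapsed = time.perf_counter() - t0
    return [row for chunk in chunks for row in chunk], elapsed


page_ids = ['UnitedWay', 'FeedingAmerica', 'SalvationArmyUSA', 'YMCA',
            'stjude', 'FoodForThePoor', 'bgca.clubs', 'catholiccharitiesusa']

one_partition_results, one_elapsed = run(page_ids, n_partitions=1)
four_partition_results, four_elapsed = run(page_ids, n_partitions=4)

print("1 partition: {a:.2f} s, 4 partitions: {b:.2f} s".format(
    a=one_elapsed, b=four_elapsed))
```

With one partition the four workers are there, but three of them have nothing to do.
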

Pros & Cons

import os

import pandas as pd
from dotenv import load_dotenv

from facebook_client import FacebookClient

load_dotenv()
fb = FacebookClient(access_token=os.getenv('FACEBOOK_ACCESS_TOKEN'))
nonprofit_df = pd.read_csv('nonprofit_facebook.csv')

nonprofit_df['fan_count'] = nonprofit_df['facebook'].map(fb.get_page_fan_count)
nonprofit_df['about'] = nonprofit_df['facebook'].map(fb.get_page_about)
nonprofit_df.to_csv('output.csv', index=False)

The Future...

Pandas on Ray

Make Pandas faster
by replacing one line of your code.

Thank you!
