Lisa Jung
IBM
Upkar Lidder
IBM
> ulidder@us.ibm.com
> @lidderupk
> upkar.dev
1. Create an IBM Cloud account using THIS URL.
2. Check your email and activate your account. Once activated, log back into your IBM Cloud account using the link above.
3. If you already have an account, use the above URL to sign into your IBM Cloud account.
http://bit.ly/pixie-sign
1. What is Apache Spark?
2. Why do we need it? A historical context
3. Basic architecture and components
4. Spark data structures
5. Commonly used APIs to work with DataFrames
6. Spark and PixieDust demo
Apache Spark is a framework for Big Data processing distributed across clusters, similar to MapReduce. It provides high-level APIs in Java, Scala, Python and R.
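A minimal sketch of that high-level API in Python (assuming pyspark is installed and a local session; the numbers are only illustrative):

from pyspark.sql import SparkSession

# start (or reuse) a local Spark session
spark = SparkSession.builder.appName("intro").getOrCreate()

# distribute a small collection across the cluster and sum it in parallel
rdd = spark.sparkContext.parallelize(range(1, 101))
print(rdd.sum())  # 5050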
Era of Big Data ...
ETL tools are old school ...
Hadoop is cool, but ...
Spark SQL - full SQL:2003 support, DataFrames and Datasets.
MLlib - Spark's scalable machine learning library. Supports classification, regression, decision trees, recommendation, and clustering, with feature transformation, ML pipelines, model evaluation, and hyper-parameter optimization and tuning.
Spark Streaming - brings Apache Spark's language-integrated API to stream processing, letting you write streaming jobs the same way you write batch jobs. It supports Java, Scala and Python.
Spark GraphX - API for graphs and graph-parallel computation.
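For instance, a minimal sketch of the Spark SQL component in action (the people DataFrame is made up for illustration):

# a tiny DataFrame for illustration
people = spark.createDataFrame([('Alice', 34), ('Bob', 45)], ['name', 'age'])

# register it as a SQL view and query it with plain SQL
people.createOrReplaceTempView('people')
spark.sql('SELECT name FROM people WHERE age > 40').show()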
| Transformations | Actions |
| --- | --- |
| select | count |
| distinct | collect |
| sum | save |
| filter | show |
| limit | more ... |
| groupBy | |
| more ... | |
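A quick sketch of the difference, with df standing in for any Spark DataFrame: transformations only build up an execution plan, and actions trigger it.

# transformations - lazy, nothing is computed yet
adults = df.select('name', 'age').filter('age >= 18').distinct()

# actions - the plan above executes here
print(adults.count())
adults.show(5)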
A DataFrame is a collection of pyspark.sql.Row objects.
from pyspark.sql import Row

# each Row becomes one record in the DataFrame
california = Row(state='California', abbr='CA')
arizona = Row(state='Arizona', abbr='AZ')
states = spark.createDataFrame([california, arizona])
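An action such as show() materializes it (note that in Spark 2.x, Row sorts keyword fields alphabetically, so abbr prints before state):

states.show()
# +----+----------+
# |abbr|     state|
# +----+----------+
# |  CA|California|
# |  AZ|   Arizona|
# +----+----------+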
# df4 is a sample DataFrame with null values (from the pyspark API docs)
# drop(how='any', thresh=None, subset=None)
df5 = df4.na.drop()
df5.show()

# fill(value, subset=None)
df4.na.fill(50).show()
df5.na.fill(False).show()

# replace(to_replace, value=<no value>, subset=None)
df4.na.replace(10, 20).show()
df4.na.replace('Alice', None).show()
df4.na.replace(['Alice', 'Bob'], ['A', 'B'], 'name').show()
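The snippets above assume df4 and df5 from the pyspark API docs; a self-contained sketch of the same calls on a made-up DataFrame:

# a hypothetical frame with missing values
df = spark.createDataFrame([('Alice', 10), ('Bob', None), (None, 36)], ['name', 'age'])

df.na.drop().show()                         # keeps only the row with no nulls
df.na.fill(50).show()                       # fills the null age with 50
df.na.replace('Alice', 'A', 'name').show()  # rewrites matching names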
# pyspark.sql.functions provides built-in column functions and udf()
import pyspark.sql.functions as func
# ---------------------------------------
# Cleanse age (enforce numeric data type)
# ---------------------------------------
import re
from pyspark.sql import types

def fix_age(col):
    """
    input: pyspark.sql.types.Column
    output: the numeric value represented by col or None
    """
    try:
        return int(col)
    except ValueError:
        # handle values such as 'age-33'
        match = re.match(r'^age-(\d+)$', col)
        if match:
            try:
                return int(match.group(1))
            except ValueError:
                return None
        return None

# register the user defined function
fix_age_UDF = func.udf(lambda c: fix_age(c), types.IntegerType())
customer_df = customer_df.withColumn("AGE", fix_age_UDF(customer_df["AGE"]))
customer_df
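customer_df comes from the lab dataset; to try the UDF in isolation, a hypothetical frame works just as well:

demo_df = spark.createDataFrame([('33',), ('age-41',), ('unknown',)], ['AGE'])
demo_df.withColumn('AGE', fix_age_UDF(demo_df['AGE'])).show()
# AGE becomes 33, 41 and null respectively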
# ------------------------------
# Derive gender from salutation
# ------------------------------
def deriveGender(col):
    """ input: pyspark.sql.types.Column
        output: "male", "female" or "unknown"
    """
    if col in ['Mr.', 'Master.']:
        return 'male'
    elif col in ['Mrs.', 'Miss.']:
        return 'female'
    else:
        return 'unknown'

# register the user defined function
deriveGenderUDF = func.udf(lambda c: deriveGender(c), types.StringType())

# create a new column by deriving GENDER from GenderCode
customer_df = customer_df.withColumn("GENDER", deriveGenderUDF(customer_df["GenderCode"]))

# cache() marks the DataFrame for in-memory reuse; it is materialized on the next action
customer_df.cache()
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline

# defining the filter condition is lazy - nothing happens yet
condition_gen_x_y = "GENERATION = 'Gen_X' or GENERATION = 'Gen_Y'"

# filter is a transformation - still nothing happens
boomers_df = customer_df.filter(condition_gen_x_y)

# show() is an action - something happens now!
boomers_df.groupBy('GENERATION').count().show()

# convert the Spark DataFrame to a pandas DataFrame
boomers_df = boomers_df.toPandas()

# pandas evaluates eagerly
boomers_df.groupby('GENERATION')['GENERATION'].count().plot(kind='bar')
plt.show()
# textFile() and map() are lazy transformations - nothing runs yet
lines = sc.textFile("data.txt")
lineLengths = lines.map(lambda s: len(s))
# reduce() is an action - the file is read and mapped here
totalLength = lineLengths.reduce(lambda a, b: a + b)
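If lineLengths were needed again later, calling lineLengths.persist() before the reduce would keep it in memory after the first computation, as the Spark programming guide suggests for this example:

lineLengths.persist()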
PixieDust is an open source helper library that works as an add-on to Jupyter notebooks to improve the user experience of working with data.
A single API, display(), lets you visualize your Spark object in different ways: table, charts, maps, etc.
display(raw_df)
# Spark CSV loading
from pyspark.sql import SparkSession
try:
    from urllib import urlretrieve
except ImportError:
    # urlretrieve moved to urllib.request in Python 3
    from urllib.request import urlretrieve

data_url = "https://data.cityofnewyork.us/api/views/e98g-f8hy/rows.csv?accessType=DOWNLOAD"
urlretrieve(data_url, "building.csv")

spark = SparkSession.builder.getOrCreate()
building_df = spark.read\
    .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat')\
    .option('header', True)\
    .load("building.csv")
building_df

# PixieDust CSV loading - a single call
import pixiedust
pixiedust.sampleData(data_url)

Code from Data Analysis with Python, David Taieb
http://bit.ly/pixie-lab
Use Apache Spark, PixieDust and Jupyter Notebooks to analyze and visualize customer purchase data from GitHub. Run the notebook on a cluster of distributed nodes on IBM Cloud.
http://bit.ly/pixie-sign
Grab the FULL URL from: http://bit.ly/pixie-lab-notebook
https://training.databricks.com/visualapi.pdf - Slides explaining transformations and actions with visuals
https://spark.apache.org/docs/2.3.3/api/python/index.html - Spark official docs
https://docs.databricks.com/spark/latest/dataframes-datasets/index.html - Databricks training
https://developer.ibm.com/patterns/category/spark/?fa=date%3ADESC&fb= - IBM Code Patterns
Upkar Lidder, IBM
@lidderupk
https://github.com/lidderupk/
ulidder@us.ibm.com