Apache Spark
Taras Lehinevych
Agenda
Apache Spark
Spark Survey 2015
Resilient Distributed Datasets (RDD)
Dataset – variable or object: a collection of records, e.g. lines of text or key/value pairs
Distributed: partitioned across the nodes of a cluster and processed in parallel
Resilient: fault-tolerant – a lost partition can be recomputed from its lineage of transformations
RDD
text_file = sc.textFile("hdfs://...")
counts = text_file.flatMap(lambda line: line.split(" ")) \
.map(lambda word: (word, 1)) \
.reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")
val textFile = sc.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Python
Scala
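As a sketch of what the pipeline above computes, here is the same word count in plain Python (no Spark needed; assumes the lines already fit in a local list):

```python
from collections import Counter

lines = ["to be or", "not to be"]

# flatMap: split each line into words and flatten the result
words = [word for line in lines for word in line.split(" ")]

# map + reduceByKey: pair each word with 1, then sum the counts per word
counts = Counter(words)
# dict(counts) == {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In Spark the same steps run partition by partition across the cluster, with `reduceByKey` shuffling partial sums between nodes.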
Why Not Python + RDD?
RDD
Advantage: full control over the computation – arbitrary lambdas, compile-time type safety in Scala
Disadvantage: opaque to the engine, so no automatic optimization; in Python every lambda crosses the JVM/Python boundary, adding serialization overhead
Performance
DataFrame
joined = users.join(events, users.id == events.uid)
filtered = joined.filter(events.date >= "2016-04-23")
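Because the API is declarative, Spark's optimizer is free to push the date filter below the join. A plain-Python sketch of the logical result (the rows are made up; `id`, `uid`, and `date` are the column names from the snippet):

```python
users = [{"id": 1, "name": "ann"}, {"id": 2, "name": "bob"}]
events = [
    {"uid": 1, "date": "2016-04-22"},
    {"uid": 2, "date": "2016-04-25"},
]

# join on users.id == events.uid
joined = [
    {**u, **e}
    for u in users
    for e in events
    if u["id"] == e["uid"]
]

# filter on events.date >= "2016-04-23"
filtered = [row for row in joined if row["date"] >= "2016-04-23"]
# filtered == [{'id': 2, 'name': 'bob', 'uid': 2, 'date': '2016-04-25'}]
```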
DataFrame
import org.apache.spark.sql.functions.col

val textFile = sc.textFile("hdfs://...")
// Creates a DataFrame having a single column named "line"
val df = textFile.toDF("line")
val errors = df.filter(col("line").like("%ERROR%"))
// Counts all the errors
errors.count()
// Counts errors mentioning MySQL
errors.filter(col("line").like("%MySQL%")).count()
// Fetches the MySQL errors as an array of strings
errors.filter(col("line").like("%MySQL%")).collect()
from pyspark.sql import Row
from pyspark.sql.functions import col

textFile = sc.textFile("hdfs://...")
# Creates a DataFrame having a single column named "line"
df = textFile.map(lambda r: Row(r)).toDF(["line"])
errors = df.filter(col("line").like("%ERROR%"))
# Counts all the errors
errors.count()
# Counts errors mentioning MySQL
errors.filter(col("line").like("%MySQL%")).count()
# Fetches the MySQL errors as an array of strings
errors.filter(col("line").like("%MySQL%")).collect()
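SQL `LIKE '%ERROR%'` is just a substring match, so the log analysis above can be sketched in plain Python over a local list of lines:

```python
lines = [
    "INFO starting up",
    "ERROR MySQL connection refused",
    "ERROR disk full",
]

# like("%ERROR%"): keep lines containing the substring
errors = [line for line in lines if "ERROR" in line]
error_count = len(errors)  # counts all the errors

# like("%MySQL%") + collect(): errors mentioning MySQL, as a list of strings
mysql_errors = [line for line in errors if "MySQL" in line]
```

The difference in Spark is that `count()` and `collect()` are actions that trigger distributed execution, while the `filter` calls only build up a query plan.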
DataFrame
Advantage: declarative, relational API that the Catalyst optimizer can rewrite; near-identical performance in Python and Scala
Disadvantage: not type-safe – column names and types are checked only at runtime
Dataset
Preview in Spark 1.6
Best of both worlds: the typed, functional API of RDDs with the optimized execution of DataFrames
Dataset
NO PYTHON SUPPORT
Dataset
val lines = sc.textFile("/wikipedia")
val words = lines
.flatMap(_.split(" "))
.filter(_ != "")
val counts = words
.groupBy(_.toLowerCase)
.map(w => (w._1, w._2.size))
RDDs
Datasets
val lines = sqlContext.read.text("/wikipedia").as[String]
val words = lines
.flatMap(_.split(" "))
.filter(_ != "")
val counts = words
.groupBy(_.toLowerCase)
.count()
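Both versions compute the same thing: case-insensitive word frequencies. A plain-Python sketch of the `groupBy(_.toLowerCase)` plus count step:

```python
from collections import defaultdict

words = ["Spark", "spark", "RDD"]

# groupBy(_.toLowerCase): bucket words by their lowercased form
groups = defaultdict(list)
for w in words:
    groups[w.lower()].append(w)

# count(): size of each group
counts = {key: len(group) for key, group in groups.items()}
# counts == {"spark": 2, "rdd": 1}
```

The Dataset version can additionally let Spark plan the aggregation (its built-in `count()`) instead of materializing each group, which is where the performance win comes from.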
Dataset
Performance optimization
Custom encoders
Python Support
Unification of DataFrames with Datasets
Summary
DataFrame is the best option for Python and production
Waiting for Dataset + Python
Sources
Databricks Blog - databricks.com/blog
Cloudera Engineering Blog - blog.cloudera.com
Spark Community (mailing list)
Contacts
Website - https://taraslehinevych.me
Email - info@taraslehinevych.me
Twitter - @lehinevych
Thank you
Questions?