Talk #1
Spark Terminology
Why does Spark exist?
How do we use it?
Reductions
Impressions
Bacchus
Spark Architecture - Part I
Spark Lifecycle
Spark Execution - Part I
Talk #2+
Distributed Systems
Ecosystem & History
Interlude: DataFrames
Spark Programming
Structured APIs, RDDs, Distributed Variables
Spark Architecture - Part II
Spark Execution - Part II
Spark Ecosystem
Interlude: Columnar Formats
Spark Optimization
Catalyst & Tungsten Internals
Lots of jargon
Bold terms can be found in Spark Terminology and Concepts
Two part talk (maybe more)
Part I
What Spark is, what problems it solves, some origin stories on how it was developed, and where it’s being used at TVision
Part II
Programming with Spark: its features, internals, and the ecosystem
Spark Origins and Fundamentals
Let's look at the official site
Spark has clearly been heavily influenced by modern functional programming
Provides simple, battle-tested solutions to a lot of industrial problems that languages like Haskell don't (yet)
PySpark
API for writing Spark in Python.
If you’re using the structured APIs, your code should run just about as fast as if you had written it in Scala.
Spark SQL
Provides DataFrames API + Catalyst optimizer
-- BatchJobScript.hs
processBigData :: BigInput -> BigOutput
-- batch.crontab
10 10 * * * ./run-batch-job.sh input.txt
On a single computer
On many computers
"To solve problems at scale, paradoxically, you have to know the smallest details."
Alan Eustace (Former Engineering Head @ Google)
"Imagine your average computer stays up for 3 years before it experiences some hardware or operating system failure, at which point it keels over. That's not such a big deal, except if you are running a computation on 1000s of machines that takes on the order of a day.
You will run into some sort of failure during that computation. You have to be prepared for failure at the software level because when the computations are large enough, you will experience failures across machines.”
Jeff Dean (co-author of MapReduce, BigTable, Spanner, TensorFlow, The Universe)
Quote from https://youtu.be/quSmkZtty4o?t=392
Map
You count up shelf #1, I count up shelf #2
Whenever one of us finishes we move to next uncounted shelf
The more people we get, the faster it goes.
Reduce
Now we get together and add our individual counts
Example from https://news.ycombinator.com/item?id=2849163
Provides a reliable, scalable, maintainable way to process lots of data on lots of cheap, commodity hardware
Write in a functional style
Map
Apply said function to distributed data (which is spread across many computers)
Reduce
Aggregate transformed data and get a result
Shuffle
Redistribute data on cluster
From Jeff Dean's Building Software Systems at Google and Lessons Learned lecture
Slide from Jeff Dean's Building Software Systems at Google and Lessons Learned lecture
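To make the Map / Reduce / Shuffle steps concrete, here is a minimal PySpark sketch of a distributed word count (the SparkContext sc and the input path are assumptions for illustration):

lines = sc.textFile("books/*.txt")                 # input, split into partitions across the cluster

counts = (lines
    .flatMap(lambda line: line.split())            # Map: one record per word
    .map(lambda word: (word, 1))                   # Map: key-value pairs for counting
    .reduceByKey(lambda a, b: a + b))              # Shuffle + Reduce: sum the counts per word

counts.saveAsTextFile("word-counts")               # action: write the results back out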
If we hadn’t had to deal with failures (of computers), if we had a perfectly reliable set of computers to run (our code) on, we probably never would've implemented MapReduce. Without having failures, the support code (that MapReduce provides) just isn’t complicated.
Sanjay Ghemawat (co-author of GFS, MapReduce, BigTable)
Quote from https://youtu.be/quSmkZtty4o?t=339
[Diagram (repeated across two slides): a workflow of chained jobs f, g, h: Input is read from HDFS, each job writes its Intermediate Output back to HDFS before the next job reads it, and the final Output lands in HDFS]
MapReduce is still lower level than programmers like me want to write software with
Don’t want to always have to reason about whether we need a broadcast hash join, map-side merge join, or whatever
Implementing a complex processing job using the raw MapReduce APIs is actually quite hard and laborious
Compositional interface for mixing complex procedural operations and relational queries
DataFrame API
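As a hedged sketch of that compositional interface (events_df and its columns are hypothetical), the same query can be written as chained DataFrame operations or as SQL, and both go through the same optimizer:

from pyspark.sql import functions as F

errors_per_host = (events_df
    .filter(F.col("level") == "Error")             # relational-style predicate
    .withColumn("host", F.lower(F.col("host")))    # procedural column manipulation
    .groupBy("host")
    .count()
    .orderBy(F.col("count").desc()))

events_df.createOrReplaceTempView("events")
errors_per_host_sql = spark.sql("""
    SELECT lower(host) AS host, count(*) AS count
    FROM events
    WHERE level = 'Error'
    GROUP BY lower(host)
    ORDER BY count DESC
""")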
Optimization #1
The choice of join algorithm can make a big difference to the performance of a batch job
Spark, Flink, and Hive have query optimizers
Optimization #2
Hive, Spark DataFrames, and Impala also use vectorized execution
Spark generates JVM bytecode and Impala uses LLVM to generate native code for these inner loops.
Optimization #3
If a function contains only a simple filtering condition, or it just selects some fields from a record, then there is significant CPU overhead in calling the function on every record
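A hedged illustration of that overhead (logs_df and its level column are hypothetical): an opaque Python UDF forces Spark to call back into Python for every record, while the equivalent built-in column expression can be optimized and code-generated:

from pyspark.sql import functions as F
from pyspark.sql.types import BooleanType

# Opaque function: invoked once per record, outside the optimizer's reach
is_error = F.udf(lambda level: level == "Error", BooleanType())
slow = logs_df.filter(is_error(F.col("level")))

# Built-in expression: Catalyst can optimize it and run it in generated code
fast = logs_df.filter(F.col("level") == "Error")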
Determine who is watching, what they're watching, and whether they're paying attention
See RACR Flow and Huginn Reduction Flow
From Device to Backend to Redshift
Spark Application
Job you want to run on Spark, which consists of a driver process and a set of executor processes
Driver
The driver process runs your main() function, sits on a node in the cluster, and is responsible for:
Maintaining all relevant information during the lifetime of the Spark Application
Responding to the user's program or input
Analyzing, distributing, and scheduling work across the executors
Executors
Responsible for actually carrying out the work that the driver assigns them. Each executor is responsible for:
Executing code assigned to it by the driver
Reporting the state of the computation back to the driver node
Spark Session
You control your Spark Application through a driver process called the SparkSession.
The SparkSession instance is the way Spark executes user-defined manipulations across the cluster. There is a one-to-one correspondence between a SparkSession and a Spark Application.
In Scala and Python, a Spark session is available as spark when you start Spark in the console / Spark Shell.
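Outside the shell, a minimal sketch of obtaining the SparkSession (the application name here is hypothetical):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("my-app")        # hypothetical application name
    .getOrCreate())           # returns the existing session if one is already running

sc = spark.sparkContext       # lower-level entry point used by the RDD examples later on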
Communication
Initiation
Standby
Launch
Execution
Completion
Spark Programming and Internals
Interlude: DataFrames
Spark Programming
Structured APIs
RDDs
Distributed Variables
Spark Ecosystem
Interlude: Columnar Formats
Spark at TVision - Part II
EMR
Spark UI
Spark Optimization
Catalyst & Tungsten Internals
MapReduce Joins
Further Resources
To Jeff Dean, "NP" means "No Problemo"
Jeff Dean's IDE doesn't do code analysis, it does code appreciation
Jeff Dean's PIN is the last 4 digits of pi
Google Search was Jeff Dean's N(ew G)oogler Project
Jeff Dean invented MapReduce so he could sort his fan mail
Emacs' preferred editor is Jeff Dean
Jeff Dean doesn't exist, he's actually an advanced AI created by Jeff Dean
Jeff Dean compiles and runs his code before submitting, but only to check for compiler and CPU bugs
Partitions
Atomic pieces of the dataset. One or many per compute node
Dependencies
Models the relationship between this RDD's partitions and the partitions of the RDD(s) it was derived from.
rdd = sc.textFile(...)
filtered = rdd.map(...) \
    .filter(...) \
    .persist()
count = filtered.count()
reduced = filtered.reduce(...)
Failure
Recovery
Problem:
You collect lots of application logs and would like to analyze error events.
Before you can do this, you need to remove rows corresponding to other events (INFO, DEBUG, etc).
You have a cluster at your disposal to do this processing. Write a driver program that collects and aggregates all
error events.
Example from RDD Fundamentals video
Error, ts, msg1
Warn, ts, msg2
Error, ts, msg1
Info, ts, msg8
Warn, ts, msg2
Info, ts, msg8
Error, ts, msg3
Info, ts, msg5
Info, ts, msg5
Error, ts, msg4
Warn, ts, msg9
Error, ts, msg1
app.log
$
Driver Program
Cluster
Partitions
Error, ts, msg1
Warn, ts, msg2
Error, ts, msg1
Info, ts, msg8
Warn, ts, msg2
Info, ts, msg8
Error, ts, msg4
Warn, ts, msg9
Error, ts, msg1
Error, ts, msg3
Info, ts, msg5
Info, ts, msg5
logLinesRDD = sc.textFile(
    "app.log",
    minPartitions=4
)
Partition #1
Partition #2
Partition #3
Partition #4
Error, ts, msg1
Error, ts, msg1
Error, ts, msg4
Error, ts, msg1
Error, ts, msg3
errorsRDD = logLinesRDD.filter(
    lambda log: log.startswith("Error")
)
Partition #1
Partition #2
Partition #3
Partition #4
Error, ts, msg1
Error, ts, msg1
Error, ts, msg3
Error, ts, msg4
Error, ts, msg1
cleanedRDD = errorsRDD.coalesce(2)
Partition #1
Partition #2
result = cleanedRDD.collect()
write_to_log("error.log", data=result)
$ cat error.log
Error, ts, msg1
Error, ts, msg1
Error, ts, msg3
Error, ts, msg4
Error, ts, msg1
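Putting the steps together, a sketch of the whole driver program (write_to_log is the example's hypothetical helper):

logLinesRDD = sc.textFile("app.log", minPartitions=4)                  # 4 partitions spread over the cluster
errorsRDD = logLinesRDD.filter(lambda log: log.startswith("Error"))    # keep only error events
cleanedRDD = errorsRDD.coalesce(2).persist()                           # shrink to 2 partitions and cache
result = cleanedRDD.collect()                                          # action: pull the rows back to the driver
write_to_log("error.log", data=result)                                 # hypothetical helper from the example
print(cleanedRDD.count(), "error events")                              # a second action reuses the cached RDD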
[Diagram: lineage of the example job: app.log on the local driver → logLines → errors → cleaned → error.log on the local driver, with each stage's partitions narrowing from mixed Error/Warn/Info rows down to only Error rows]
[Diagram: the same lineage (logLines → errors → cleaned), with the input read from Blocks 1-4 and the results written to s3://err1.log and s3://err2.log]
data Partition = Partition
  { partitionIndex :: Int }

data Partitioner =
  HashPartitioner | RangePartitioner | ..

data DependencyFlavor =
  Narrow | Shuffle | None

data Dependency f a = Dependency
  { parent :: forall f. (RDD f) => f a
  , flavor :: DependencyFlavor }

class RDD f where
  {-# MINIMAL getDeps, getPartitions, compute #-}
  getPartitions :: [Partition]
  compute :: Partition -> TaskCtx -> [a]
  getDeps :: [Dependency f a]
  getPreferredLoc :: Partition -> [Text]
  getPartitioner :: Maybe Partitioner

-- Data Source API
parallelize :: (RDD f) => [a] -> f a

-- Transformations
intersection :: (RDD f) => f a -> f a -> f a
cartesian :: (RDD f) => f a -> f a -> f a

-- Actions
count :: (RDD f) => f a -> Long

data TaskState =
  Completed | Interrupted | RunningLocally

data TaskCtx = TaskCtx
  { state :: TaskState
  , attemptNum :: Int
  , partitionId :: Int
  , stageId :: Int
  , ..more config and state }
Method | Implementation | Note |
---|---|---|
Partitions | One per HDFS block | |
Dependencies | None | Base/Input RDD |
Compute | Read corresponding block | |
Preferred Location | HDFS block | |
Partitioner | None | Just one partition per block, no repartitioning going on |
Method | Implementation | Note |
---|---|---|
Partitions | Same as parent | |
Dependencies | One to one (narrow) | |
Compute | Filter | Go to parent's partition and filter it |
Preferred Location | None | Ask parent |
Partitioner | None | Probably parent partitioner |
Method | Implementation | Note |
---|---|---|
Partitions | One per reduce task | |
Dependencies | Shuffle on each parent | |
Compute | Read and join shuffled data | |
Preferred Location | None (sometimes inherit) | Typically has to get data over network. Sometimes aligns to Parent RDD's location |
Partitioner | HashPartitioner |
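A hedged way to poke at some of these properties from PySpark (input path hypothetical):

rdd = sc.textFile("app.log", minPartitions=4)              # base / input RDD
errors = rdd.filter(lambda line: line.startswith("Error"))
counts = errors.map(lambda line: (line.split(",")[0], 1)) \
               .reduceByKey(lambda a, b: a + b)             # introduces a shuffle dependency

print(counts.getNumPartitions())                            # Partitions
print(counts.partitioner)                                   # Partitioner (set by reduceByKey, None for the base RDD)
print(counts.toDebugString())                               # Dependencies: the lineage back to the input RDD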
Transformations
instructions you provide Spark about how you would like to modify a DataFrame / RDD
Lazy, not evaluated until an action is called
Can be "narrow" or "wide"
Actions
instructions to Spark to compute a result from a series of transformations
Eager (force evaluation)
Upon calling an action, Spark creates, optimizes, and runs an execution plan.
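A small sketch of that laziness (nothing runs until an action is called):

df = spark.range(1_000_000)                        # transformation: no job runs yet
evens = df.filter(df["id"] % 2 == 0)               # still lazy: only a plan is built
doubled = evens.withColumn("x2", evens["id"] * 2)  # still lazy

doubled.count()                                    # action: the plan is optimized and executed now
doubled.show(5)                                    # another action: triggers another job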
[Animated slides: money-borrowing analogy: a request to borrow $100 from one person vs. pooling $25 from each of four people, followed by requests of $50 each that only some can cover]
Visualized DAG
Resolved DAG
Transformations with (usually) Narrow dependencies (e.g. map, filter, union)
Transformations with (usually) Wide dependencies, which might cause a shuffle (e.g. groupByKey, reduceByKey, join, repartition)
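A hedged sketch of the difference: the narrow pipeline below needs no data movement between partitions, while the grouped version should show an Exchange (shuffle) in its physical plan:

df = spark.range(1_000_000)

narrow = df.filter(df["id"] > 100).select(df["id"] * 2)     # narrow: each output partition depends on one input partition
wide = df.groupBy((df["id"] % 10).alias("bucket")).count()  # wide: rows with the same key must be brought together

narrow.explain()   # no Exchange operator expected
wide.explain()     # an Exchange (shuffle) should appear before the final aggregation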
Given the following program, how should Spark execute it? (How to divvy up tasks, i.e., computations on a partition)
[DAG of the program: Narrow → Narrow → Narrow → Wide → Action]
Without Pipelining: Narrow → Narrow → Narrow → Wide → Action, each narrow transformation executed as its own step
With Pipelining: the narrow transformations are fused into a single task, giving Narrow → Wide → Action
Partitioning your data into a greater number of partitions means that more can be executed in parallel
Tasks = Cores = Slots
1 Task 1 Partition
1 Slot 1 Core
Stages
Jobs
Actions
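A hedged sketch for relating partitions to available parallelism:

print(spark.sparkContext.defaultParallelism)   # roughly the number of cores (slots) available to the app

df = spark.range(10_000_000)
print(df.rdd.getNumPartitions())               # number of tasks a stage over this data would need

more_parallel = df.repartition(64)             # more partitions -> more tasks can run at once (full shuffle)
fewer_chunks = df.coalesce(4)                  # fewer, larger partitions without a full shuffle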
Documentation in Confluence
Go to history server on QA Cluster (EMR): https://console.aws.amazon.com/elasticmapreduce/home?region=us-east-1#cluster-details:j-1RT410A48AI05
Explanation from PySpark Architecture video
ex: collect()
ex: aws s3 cp
user Python code
| | Pandas | Spark |
|---|---|---|
| Column | df['col'] | df['col'] |
| Mutability | Mutable | Immutable |
| Add column | df['c'] = df['a'] + df['b'] | df.withColumn('c', df['a'] + df['b']) |
| Rename column | df.columns = ['a', 'b'] | df.select(df['col1'].alias('a'), df['col2'].alias('b')) |
| Value Count | df['col'].value_counts() | df.groupBy(df['col']).count().orderBy('count', ascending=False) |

Slide from Announcing Koalas Open Source Project
From Databricks' A Tale of Three Apache Spark APIs
def topHashtags(tweets, n):
    # tweets: RDD of tweet objects with a .text attribute; returns a list of (hashtag, count) pairs
    return (tweets
        .flatMap(lambda t: t.text.split())        # split the text into words
        .filter(lambda w: w.startswith("#"))      # keep only hashtag words
        .map(lambda w: w.lower())                 # normalize hashtags
        .map(lambda w: (w, 1))                    # create tuples for counting
        .reduceByKey(lambda a, b: a + b)          # accumulate counters
        .top(n, key=lambda pair: pair[1]))        # return the top n hashtags by count
Example from Quill article
Integer RDD
String RDD
Double RDD
>>> import findspark
>>> findspark.init()
>>> import pyspark
>>> spark = pyspark.sql.SparkSession.builder.appName('Spark Tech Talk').getOrCreate()
# Spark computation on two tables
>>> t1 = spark.range(2000000)
>>> t2 = spark.range(2000000)
>>> result = t1.join(t2, on=t1.id == t2.id).groupBy().count()
# See execution plan
>>> result.explain()
== Physical Plan ==
*(6) HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition
+- *(5) HashAggregate(keys=[], functions=[partial_count(1)])
+- *(5) Project
+- *(5) SortMergeJoin [id#0L], [id#2L], Inner
:- *(2) Sort [id#0L ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(id#0L, 200)
: +- *(1) Range (0, 2000000, step=1, splits=12)
+- *(4) Sort [id#2L ASC NULLS FIRST], false, 0
+- ReusedExchange [id#2L], Exchange hashpartitioning(id#0L, 200)
# Takes a few seconds
>>> result.show()
+-------+
| count|
+-------+
|2000000|
+-------+
...
>>> result.explain(extended=True)
== Parsed Logical Plan ==
Aggregate [count(1) AS count#19L]
+- Join Inner, (id#0L = id#2L)
:- Range (0, 2000000, step=1, splits=Some(12))
+- Range (0, 2000000, step=1, splits=Some(12))
== Analyzed Logical Plan ==
count: bigint
Aggregate [count(1) AS count#19L]
+- Join Inner, (id#0L = id#2L)
:- Range (0, 2000000, step=1, splits=Some(12))
+- Range (0, 2000000, step=1, splits=Some(12))
== Optimized Logical Plan ==
Aggregate [count(1) AS count#19L]
+- Project
+- Join Inner, (id#0L = id#2L)
:- Range (0, 2000000, step=1, splits=Some(12))
+- Range (0, 2000000, step=1, splits=Some(12))
== Physical Plan ==
*(6) HashAggregate(keys=[], functions=[count(1)], output=[count#19L])
+- Exchange SinglePartition
+- *(5) HashAggregate(keys=[], functions=[partial_count(1)], output=[count#22L])
+- *(5) Project
+- *(5) SortMergeJoin [id#0L], [id#2L], Inner
:- *(2) Sort [id#0L ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(id#0L, 200)
: +- *(1) Range (0, 2000000, step=1, splits=12)
+- *(4) Sort [id#2L ASC NULLS FIRST], false, 0
+- ReusedExchange [id#2L], Exchange hashpartitioning(id#0L, 200)
See Spark SQL Joins for basic, API level joins
See Mastering Spark SQL: Broadcast Joins for Spark broadcast details
ShuffledHashJoin is still useful when:
Any single partition of the build side can fit in memory
The build side is much smaller than the stream side, so building a hash table on the smaller side should be faster than sorting the bigger side
Sort Merge Join is more robust:
Shuffled Hash Join requires the hashed table to fit in memory, while Sort Merge Join can spill to disk
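A hedged sketch of steering Spark from a Sort Merge Join to a Broadcast Hash Join when one side is small (the table sizes here are arbitrary):

from pyspark.sql.functions import broadcast

big = spark.range(10_000_000).withColumnRenamed("id", "key")
small = spark.range(1_000).withColumnRenamed("id", "key")

default_join = big.join(small, "key")            # typically SortMergeJoin, unless Spark auto-broadcasts the small side
hinted_join = big.join(broadcast(small), "key")  # BroadcastHashJoin: the big side is never shuffled

hinted_join.explain()                            # look for BroadcastHashJoin in the physical plan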