Apache Spark
@ TVision
Outline
Talk #1
- What is Spark?
- Spark Terminology
- Why does Spark exist?
- How do we use it?
  - Reductions
  - Impressions
  - Bacchus
- Spark Architecture - Part I
- Spark Lifecycle
- Spark Execution - Part I
Talk #2+
- Distributed Systems
  - Ecosystem & History
- Interlude: DataFrames
- Spark Programming
  - Structured APIs, RDDs, Distributed Variables
- Spark Architecture - Part II
- Spark Execution - Part II
- Spark Ecosystem
- Interlude: Columnar Formats
- Spark Optimization
  - Catalyst & Tungsten Internals
Logistics
- Lots of jargon
  - Bold terms can be found in Spark Terminology and Concepts
- Two-part talk (maybe more)
  - Part I
    - What Spark is, what problems it solves, some origin stories on how it was developed, and where it’s being used at TVision
  - Part II
    - Programming with Spark, Spark's features, internals, and the ecosystem
Part I
Spark Origins and Fundamentals
What is Spark?
Let's look at the official site
Spark for Programmers
- Framework for managing + coordinating tasks on data distributed across a cluster
- Able to mix complex procedural operations and relational queries in a compositional way
- Nice abstractions that make distributed programming easy to work with
Spark for FP
- Spark has clearly been heavily influenced by modern functional programming
- Provides simple, battle-tested solutions to a lot of industrial problems that languages like Haskell don't (yet)
Terminology Dump
Architecture
- Cluster
  - A group of computers that pools its resources together so that we can use the cumulative resources as if they were a single computer
- Cluster manager
  - Manages the cluster of machines that Spark uses to execute tasks
  - Examples include Spark's standalone cluster manager, Hadoop YARN, and Apache Mesos
  - Spark is cluster-manager agnostic. While it natively supports the Hadoop YARN cluster manager, it requires nothing from Hadoop itself.
Platform
- Google File System (GFS)
  - A proprietary distributed file system that stores data on commodity machines
- MapReduce
  - A programming model and implementation for processing and generating huge data sets with a parallel, distributed algorithm on a cluster
- Hadoop
  - Open-source GFS/MR implementation:
    - Hadoop Distributed File System (HDFS)
    - Hadoop YARN, a cluster manager
    - Hadoop MapReduce
  - Spark does not provide its own storage; it runs on top of something like HDFS or S3 instead
Spark Platform
- Spark
  - Like Hadoop MapReduce, Spark is an open-source, distributed processing system. However, unlike Hadoop MapReduce, Spark uses directed acyclic graphs (DAGs) for execution plans and in-memory caching for datasets.
- PySpark
  - API for writing Spark in Python
  - If you're using the structured APIs, your code should run just about as fast as if you had written it in Scala
- Spark SQL
  - Provides the DataFrame API + the Catalyst optimizer
Spark Engine / "Compiler"
- Catalyst
  - Spark's logical planner and relational optimizer for operations on the Structured APIs; maintains type information through planning and processing
- Tungsten
  - Spark's physical planner and execution optimizer (applied after Catalyst optimization), which improves memory and CPU efficiency
Why does Spark exist?
What is Batch Processing?
- Takes a large amount of input data, runs a job to process it, and produces some output data
- Jobs often take a while (e.g., a few minutes to days)
- Scheduled to run periodically
- The primary performance measure of a batch job is usually throughput
  - How much input data is processed per unit time (e.g., 99 GB/sec)
-- BatchJobScript.hs
processBigData :: BigInput -> BigOutput
-- batch.crontab
10 10 * * * ./run-batch-job.sh input.txt
Scaling Batch Processing
- Example: In 2003, Google was indexing the web
- 20+ billion web pages * 20 KB/page = 400+ Terabytes of data
Scaling Batch Processing
On a single computer
- Ex. Process and store 400+ TB
- Say they had 1 computer that could read 50 MB/sec from disk
  - ~3 months just to read the web
  - ~1000 disks just to store the web
  - Even more to do anything with it
BAD
Scaling Batch Processing
On many computers
- Ex. Process and store 400+ TB
- 1000 machines, < 3 hours
GOOD
Scaling Batch Processing
On many computers
- Bad news: programming complications
  - Communication and coordination
  - Recovering from machine failure
  - Status reporting
  - Debugging
  - Optimization
  - Locality
- Need to figure all of these out for one problem, then solve them all over again for the next processing problem you have
"To solve problems at scale, paradoxically, you have to know the smallest details."
Alan Eustace (Former Engineering Head @ Google)
The Paradox of Scale
Necessary Fault Tolerance
"Imagine your average computer stays up for 3 years before it experiences some hardware or operating system failure, at which point it keels over. That's not such a big deal, except if you are running a computation on 1000s of machines that takes on the order of a day.
You will run into some sort of failure during that computation. You have to be prepared for failure at the software level because when the computations are large enough, you will experience failures across machines."
Jeff Dean (co-author of MapReduce, BigTable, Spanner, TensorFlow, The Universe)
Quote from https://youtu.be/quSmkZtty4o?t=392
MapReduce
- A programming model and implementation for processing and generating huge data sets with a parallel, distributed algorithm on a cluster
ELI5: MapReduce
- Example: a parallel program for counting all the books in a library
- Map
  - You count up shelf #1, I count up shelf #2
  - Whenever one of us finishes, we move on to the next uncounted shelf
  - The more people we get, the faster it goes
- Reduce
  - Now we get together and add up our individual counts
Example from https://news.ycombinator.com/item?id=2849163
MapReduce
- Provides a reliable, scalable, maintainable way to process lots of data on lots of cheap, commodity hardware
- Write in a functional style (see the sketch after this list)
  - Map
    - Apply a function to distributed data (spread across many computers)
  - Reduce
    - Aggregate the transformed data and get a result
  - Shuffle
    - Redistribute data on the cluster
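To make the Map / Shuffle / Reduce steps concrete, here is a minimal word-count sketch written against PySpark's RDD API; the paths and app name are illustrative assumptions, not something from the talk.

# Minimal word-count sketch of the map -> shuffle -> reduce pattern (paths are illustrative)
from pyspark import SparkContext

sc = SparkContext(appName="wordcount-sketch")

counts = (sc.textFile("hdfs:///data/books/*.txt")      # distributed input, one record per line
            .flatMap(lambda line: line.split())        # map: line -> words
            .map(lambda word: (word, 1))               # map: word -> (word, 1)
            .reduceByKey(lambda a, b: a + b))          # shuffle + reduce: sum the counts per word

counts.saveAsTextFile("hdfs:///data/book-wordcounts")  # write distributed output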
Example: Render Map Tiles
From Jeff Dean's Building Software Systems at Google and Lessons Learned lecture
Parallel MapReduce
From Jeff Dean's Building Software Systems at Google and Lessons Learned lecture
Reality of Machine Failure
If we hadn’t had to deal with failures (of computers), if we had a perfectly reliable set of computers to run (our code) on, we probably never would've implemented MapReduce. Without having failures, the support code (that MapReduce provides) just isn’t complicated.
Sanjay Ghemawat (co-author of GFS, MapReduce, BigTable)
Quote from https://youtu.be/quSmkZtty4o?t=339
Problem with MapReduce
- Materialization of intermediate state
  - Mappers are often redundant (by map/fold fusion)
    - They just read back the same file that was just written by a reducer and prepare it for the next stage of partitioning and sorting. In many cases, the mapper code could simply be part of the previous reducer.
  - Overkill for temporary data
    - Storing intermediate state in a distributed filesystem means those files are replicated across several nodes, which is often unnecessary for temporary data
  - Jobs can only start when all tasks in the preceding jobs have completed. Waiting slows down the execution of the workflow as a whole.
[Diagram: MapReduce vs. Spark dataflow for a pipeline of functions f, g, h: MapReduce writes each job's intermediate output back to HDFS before the next job reads it, while Spark reads the input from HDFS once, keeps intermediate output out of HDFS, and writes only the final output]
Spark
- Storing and reading data in memory is much faster, but adds a lot of complexity in a distributed setting
- Invented the RDD abstraction to solve this (see the sketch below)
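As a minimal sketch of the difference (the file path and record format are assumed for illustration), the intermediate RDD below is cached in executor memory and reused by two actions instead of being re-written to and re-read from HDFS:

# Keep an intermediate RDD in memory across two actions instead of materializing it to HDFS
from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="in-memory-pipeline")

raw = sc.textFile("hdfs:///logs/app.log")              # read the input once from HDFS
parsed = raw.map(lambda line: line.split(","))         # transformation, kept in memory
parsed.persist(StorageLevel.MEMORY_ONLY)               # cache instead of writing back out

errors = parsed.filter(lambda rec: rec[0] == "Error")  # both downstream computations
print(errors.count())                                  # reuse the cached intermediate RDD
print(parsed.count())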
Need for Recomputation
- Spark, Flink, and Tez avoid writing intermediate state to HDFS, so they take a different approach to tolerating faults
- If a machine fails and the intermediate state on that machine is lost, it is recomputed from other data that is still available
  - Intermediate state over input if possible
  - Requires deterministic operations
- To enable this recomputation, the framework must keep track of how a given piece of data was computed: which input partitions it used, and which operators were applied to it
  - Spark uses RDDs to track the lineage (ancestry) of data
Complexity of MapReduce
- MapReduce is still lower level than programmers like me want to write software with
- Don't want to always have to reason about whether we need a broadcast hash join, map-side merge join, or whatever
- Implementing a complex processing job using the raw MapReduce APIs is actually quite hard and laborious
- RDDs provide some help, but not a lot
Spark SQL
- Compositional interface for mixing complex procedural operations and relational queries (see the sketch below)
- DataFrame API
  - Can perform relational operations on both external data sources and Spark's built-in distributed collections
- Catalyst
  - Extensible optimizer
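A minimal sketch of that compositional style, assuming a hypothetical events dataset (the path, column names, and UDF are illustrative): a procedural Python UDF sits in the middle of an otherwise relational pipeline.

# Mix procedural code (a Python UDF) with relational operations in one DataFrame pipeline
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("spark-sql-sketch").getOrCreate()
events = spark.read.json("s3://some-bucket/events/")          # hypothetical external source

@F.udf(returnType=StringType())
def normalize(name):                                           # arbitrary procedural logic
    return name.strip().lower()

(events
    .where(F.col("event_type") == "impression")                # relational: optimizable filter
    .withColumn("channel", normalize(F.col("channel")))        # procedural: opaque UDF
    .groupBy("channel")
    .count()
    .show())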
Towards Declarative Languages
Optimization #1
- The choice of join algorithm can make a big difference to the performance of a batch job
- Spark, Flink, and Hive have query optimizers
Towards Declarative Languages
Optimization #2
- Hive, Spark DataFrames, and Impala also use vectorized execution
  - Iterating over data in a tight inner loop that is friendly to CPU caches, and avoiding function calls
- Spark generates JVM bytecode and Impala uses LLVM to generate native code for these inner loops
Towards Declarative Languages
Optimization #3
- If a function contains only a simple filtering condition, or it just selects some fields from a record, then there is significant CPU overhead in calling the function on every record
- If such simple filtering and mapping operations are expressed in a declarative way, the query optimizer can take advantage of column-oriented storage layouts and read only the required columns from disk (see the sketch below)
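A small sketch of the difference (paths and columns are illustrative): the declarative version gives Catalyst a chance to prune columns and push the filter down to the scan, while the lambda version is opaque to the optimizer.

# Declarative filter/projection (optimizable) vs. an opaque Python function (not optimizable)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pushdown-sketch").getOrCreate()
df = spark.read.parquet("s3://some-bucket/events/")            # hypothetical columnar dataset

declarative = df.select("user_id").where(df["score"] > 0.5)    # Catalyst can push the filter and
declarative.explain()                                          # projection into the scan itself

opaque = df.rdd.filter(lambda row: row.score > 0.5)            # a black box to the optimizer:
opaque.count()                                                 # every column of every row is read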
Spark at TVision
Reduction
- Transform raw content and presence data (ACR and CV) into second-by-second observations
- Determine who is watching, whether they're paying attention, and what they're watching
- See RACR Flow and Huginn Reduction Flow
Spark + Tracker
Reduction
From Device to Backend to Redshift
Impressions
- Ad and Program Impressions
- Next-generation ranking build
Bacchus
- Run ad-hoc Spark jobs on flintrock (maybe Kubernetes in future)
  - Process ground-truth videos
- Cloud CV processing
  - Run OpenVINO on nodes in the cluster
  - End-to-end testing from videos to CV to reduced CV
Spark Architecture
Part I
Spark Architecture
- Spark Application
  - The job you want to run on Spark, which consists of a driver process and a set of executor processes
- Driver
  - The driver process runs your main() function, sits on a node in the cluster, and is responsible for:
    - Maintaining all relevant information during the lifetime of the Spark Application
    - Responding to a user's program or input
    - Analyzing, distributing, and scheduling work across the executors
  - It must interface with the cluster manager in order to actually get physical resources and launch executors
- Executors
  - Responsible for actually carrying out the work that the driver assigns them. Each executor is responsible for:
    - Executing code assigned to it by the driver
    - Reporting the state of the computation on that executor back to the driver node (i.e., success/failure and results)
  - Each Spark Application has its own separate executor processes
- Spark Session
  - You control your Spark Application through a driver process called the SparkSession
  - The SparkSession instance is the way Spark executes user-defined manipulations across the cluster. There is a one-to-one correspondence between a SparkSession and a Spark Application.
  - In Scala and Python, a Spark session is available as spark when you start Spark in the console / Spark Shell
Spark Components
Communication
Spark Lifecycle
Lifecycle of a Spark Application
Initiation
- A request gets made to the cluster manager driver node asking for resources
  - Asking for resources for the Spark driver process only
- The cluster manager accepts this offer and places the driver onto a node in the cluster
- The client process that submitted the original job exits, and the application is off and running on the cluster
Lifecycle of a Spark Application
Standby
Lifecycle of a Spark Application
Launch
- The driver process on the cluster begins running user code
- User code must create a SparkSession, which initializes a Spark cluster (e.g., driver + executors)
- The SparkSession will subsequently communicate with the cluster manager, asking it to launch Spark executor processes across the cluster
- The cluster manager responds by launching the executor processes and sends the relevant information about their locations to the driver process
Lifecycle of a Spark Application
Execution
- The driver and the workers communicate among themselves, executing code and moving data around
- The driver schedules tasks onto each worker, and each worker responds with the status of those tasks and success or failure
Lifecycle of a Spark Application
Completion
- Once the Spark Application completes, the driver process exits with either success or failure
- The cluster manager then shuts down the executors in that Spark cluster for the driver, at which point you can see the success or failure of the Spark Application by asking the cluster manager (see the sketch below)
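A minimal sketch of that lifecycle from the user-code side (the app name and spark-submit flags are illustrative): creating the SparkSession acquires executors via the cluster manager, the action schedules tasks on them, and stop() ends the application so the executors can be shut down.

# Lifecycle sketch: typically launched with something like
#   spark-submit --master yarn --deploy-mode cluster lifecycle_sketch.py
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("lifecycle-sketch")
         .getOrCreate())                   # driver asks the cluster manager for executors

print(spark.range(1_000_000).count())      # action: driver schedules tasks on the executors

spark.stop()                               # completion: executors for this application shut down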
Part II
Spark Programming and Internals
Outline
- RDDs
- Spark Execution
- Interlude: DataFrames
- Spark Architecture II
- PySpark Architecture
- Spark Programming
  - Structured APIs
  - RDDs
  - Distributed Variables
- Spark Ecosystem
- Interlude: Columnar Formats
- Spark at TVision - Part II
  - EMR
  - Spark UI
- Spark Optimization
  - Catalyst & Tungsten Internals
- MapReduce Joins
- Further Resources
Jeff Dean Facts
- To Jeff Dean, "NP" means "No Problemo"
- Jeff Dean's IDE doesn't do code analysis, it does code appreciation
- Jeff Dean's PIN is the last 4 digits of pi
- Google Search was Jeff Dean's N(ew G)oogler Project
- Jeff Dean invented MapReduce so he could sort his fan mail
- Emacs' preferred editor is Jeff Dean
- Jeff Dean doesn't exist, he's actually an advanced AI created by Jeff Dean
- Jeff Dean compiles and runs his code before submitting, but only to check for compiler and CPU bugs
RDD Fundamentals
Resilient Distributed Dataset
- Represents an immutable, partitioned collection of elements that can be operated on in parallel
- RDDs are made up of:
  - Partitions
    - Atomic pieces of the dataset; one or many per compute node
  - A function for computing the dataset based on its parent RDDs
  - Dependencies
    - Model the relationship between this RDD and its partitions and the RDD(s) it was derived from
  - Metadata about its partitioning scheme and data placement
How to get an RDD?
- Two ways (see the sketch below)
  - Parallelizing an existing collection in your driver program
    - SparkContext.parallelize()
  - Referencing a dataset in an external storage system, such as HDFS, S3, or Postgres
    - Data Source API
    - SparkContext.textFile('hdfs://data.txt')
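A minimal sketch of both approaches (the paths are illustrative):

# Two ways to get an RDD: parallelize a driver-side collection, or reference external storage
from pyspark import SparkContext

sc = SparkContext(appName="rdd-sources")

nums = sc.parallelize([1, 2, 3, 4, 5], numSlices=2)   # distribute a local collection
lines = sc.textFile("hdfs:///data/data.txt")          # reference a dataset in external storage

print(nums.sum())
print(lines.count())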
RDDs
- Computations on RDDs are represented as a lineage graph, a DAG representing the computations done on the RDD
  - This representation/DAG is what Spark analyzes to do optimizations
rdd = sc.textFile(...)
filtered = rdd.map(...)\
              .filter(...)\
              .persist()
count = filtered.count()
reduced = filtered.reduce(...)
Ex. Recomputing RDDs
Failure
Ex. Recomputing RDDs
Recovery
Various RDDs
Example Program w/ RDDs
Problem:
You collect lots of application logs and would like to analyze error events.
Before you can do this, you need to remove rows corresponding to other events (INFO, DEBUG, etc.).
You have a cluster at your disposal to do this processing. Write a driver program that collects and aggregates all error events.
Example from RDD Fundamentals video
Example Program w/ RDDs
Error, ts, msg1
Warn, ts, msg2
Error, ts, msg1
Info, ts, msg8
Warn, ts, msg2
Info, ts, msg8
Error, ts, msg3
Info, ts, msg5
Info, ts, msg5
Error, ts, msg4
Warn, ts, msg9
Error, ts, msg1
app.log
$
Driver Program
Cluster
Partitions
Error, ts, msg1
Warn, ts, msg2
Error, ts, msg1
Info, ts, msg8
Warn, ts, msg2
Info, ts, msg8
Error, ts, msg4
Warn, ts, msg9
Error, ts, msg1
Error, ts, msg3
Info, ts, msg5
Info, ts, msg5
logLinesRDD = sc.textFile(
"app.log",
minPartitions=4
)
Partition #1
Partition #2
Partition #3
Partition #4
Error, ts, msg1
Error, ts, msg1
Error, ts, msg4
Error, ts, msg1
Error, ts, msg3
errorsRDD = logLinesRDD.filter(
    lambda log: log.startswith("Error")
)
Partition #1
Partition #2
Partition #3
Partition #4
Error, ts, msg1
Error, ts, msg1
Error, ts, msg3
Error, ts, msg4
Error, ts, msg1
cleanedRDD = errorsRDD.coalesce(2)
Partition #1
Partition #2
result = cleanedRDD.collect()
write_to_log("error.log", data=result)
$ cat error.log
Error, ts, msg1
Error, ts, msg1
Error, ts, msg3
Error, ts, msg4
Error, ts, msg1
Example: On Driver
[Diagram: app.log is read on the local driver into the logLines RDD, filtered down to errors, coalesced into cleaned, and written back out as error.log on the local driver]
Example: From HDFS to S3
[Diagram: the same logLines → errors → cleaned pipeline, but reading the input from four HDFS blocks and writing the two cleaned partitions out to s3://err1.log and s3://err2.log]
data Partition = Partition
  { partitionIndex :: Int }

data Partitioner =
  HashPartitioner | RangePartitioner | ...

data DependencyFlavor =
  Narrow | Shuffle | None

data Dependency f a = Dependency
  { parent :: f a
  , flavor :: DependencyFlavor }

class RDD f where
  {-# MINIMAL getDeps, getPartitions, compute #-}
  getPartitions   :: f a -> [Partition]
  compute         :: f a -> Partition -> TaskCtx -> [a]
  getDeps         :: f a -> [Dependency f a]
  getPreferredLoc :: f a -> Partition -> [Text]
  getPartitioner  :: f a -> Maybe Partitioner

-- Data Source API
parallelize :: (RDD f) => [a] -> f a
-- Transformations
intersection :: (RDD f) => f a -> f a -> f a
cartesian :: (RDD f) => f a -> f b -> f (a, b)
-- Actions
count :: (RDD f) => f a -> Int64

data TaskState =
  Completed | Interrupted | RunningLocally

data TaskCtx = TaskCtx
  { state :: TaskState
  , attemptNum :: Int
  , partitionId :: Int
  , stageId :: Int
  -- ... more config and state
  }
HadoopRDD
Method | Implementation | Note |
---|---|---|
Partitions | One per HDFS block | |
Dependencies | None | Base/Input RDD |
Compute | Read corresponding block | |
Preferred Location | HDFS block | |
Partitioner | None | Just partition per block, no repartitioning going on |
FilteredRDD
Method | Implementation | Note |
---|---|---|
Partitions | Same as parent | |
Dependencies | One to one (narrow) | |
Compute | Filter | Go to parent's partition and filter it |
Preferred Location | None | Ask parent |
Partitioner | None | Probably parent partitioner |
JoinedRDD
Method | Implementation | Note |
---|---|---|
Partitions | One per reduce task | |
Dependencies | Shuffle on each parent | |
Compute | Read and join shuffled data | |
Preferred Location | None (sometimes inherit) | Typically has to get data over network. Sometimes aligns to Parent RDD's location |
Partitioner | HashPartitioner |
Specialized Connector RDDs
- CassandraRDD
  - Pushdown predicate and projection
    - Pushes filters down into Cassandra so you only read the columns/rows matching the predicate, rather than reading the full data into Spark and filtering afterwards (as with HDFS)
  - More on these kinds of optimizations later
Spark Execution
- Transformations
  - Instructions you provide Spark about how you would like to modify a DataFrame / RDD
  - Lazy: not evaluated until an action is called
  - Can be "narrow" or "wide" (see the sketch after this list)
- Actions
  - Instructions to Spark to compute a result from a series of transformations
  - Eager (force evaluation)
  - Upon calling an action, Spark creates, optimizes, and runs an execution plan
- Narrow Transformations
  - Transformations for which each input partition contributes to one output partition (AKA narrow dependencies)
- Wide Transformations / Shuffles
  - Transformations for which input partitions contribute to many output partitions (AKA wide dependencies)
  - Whenever Spark performs a shuffle, it must write results to disk (AKA shuffle persistence)
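A minimal sketch of the distinction (the data is made up): map is narrow, reduceByKey is wide and introduces a shuffle, and nothing runs until the action at the end.

# Narrow vs. wide transformations, and the action that triggers execution
from pyspark import SparkContext

sc = SparkContext(appName="narrow-vs-wide")

words = sc.parallelize(["spark", "hadoop", "spark", "flink"], numSlices=2)
pairs = words.map(lambda w: (w, 1))               # narrow: each input partition -> one output partition
counts = pairs.reduceByKey(lambda a, b: a + b)    # wide: records with the same key must be co-located (shuffle)

print(counts.collect())                           # action: eager, runs the whole lineage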
Dependencies
Narrow or Wide?
- You ask your friend for $100, who has exactly $100 to give you
$100
Can you lend me $100?
Narrow or Wide?
- You ask your friend for $100, who has exactly $100 to give you
$100
Can you lend me $100?
Narrow!
Narrow or Wide?
- You ask your friends for $100, each of whom gives you $25 of their $100
$100
$100
$100
$100
Can someone lend me $100?
$100
$25
$25
$25
$25
Narrow or Wide?
- You ask your friends for $100, each of whom gives you $25 of their $100
$100
$100
$100
$100
Can someone lend me $100?
$100
$25
$25
$25
$25
Narrow!
Narrow or Wide?
- You ask your friends for $50, two of whom have $50 to give you
$50
Can I borrow $50 each?
$50
$0
$0
$50
$50
Narrow or Wide?
- You ask your friends for $50, two of whom have $50 to give you
$50
Can I borrow $50 each?
$50
$0
$0
$50
$50
Narrow!
Narrow or Wide?
- You ask your friends, two of whom have $50, for $50 in $25 increments for each hand
$50
Can I borrow $50 each?
$50
$50
$50
$25
$25
$25
$25
Narrow or Wide?
- You ask your friends, two of whom have $50, for $50 in $25 increments for each hand
$50
Can I borrow $50 each?
$50
$50
$50
Wide!
$25
$25
$25
$25
Ex. Model dependencies
Visualized DAG
Ex. Model dependencies
Resolved DAG
- The B to G join is narrow because groupByKey already partitions the keys and places them appropriately in B after shuffling. Thus operations like join can sometimes be narrow and sometimes be wide.
Transformations
Transformations with (usually) Narrow dependencies:
- map
- mapValues
- flatMap
- filter
- mapPartitions
- mapPartitionsWithIndex
Transformations with (usually) Wide dependencies: (might cause a shuffle)
- cogroup
- groupWith
- join
- leftOuterJoin
- rightOuterJoin
- groupByKey
- reduceByKey
- combineByKey
- distinct
- intersection
- repartition
- coalesce
Example Spark Job
Narrow
Narrow
Narrow
Wide
Action
Given the following program, how should Spark execute it?
(How to divvy up tasks, i.e., computations on a partition)
Task per Transformation
Narrow
Narrow
Narrow
Wide
Action
Task per Transformation
- Too many tasks
- Lots of intermediate state
- High overhead
  - Each operation on the i-th partition loops over its input individually
Task per Output Partition
Narrow
Narrow
Narrow
Wide
Action
Task per Output Partition
- Fewer tasks (4 < 12)
- Less intermediate state
- Each task has to do a lot more
- Wide transformations / Shuffles
- Have to recompute all input tasks if any part of shuffle fails
Stages of Tasks per Shuffle
Narrow
Wide
Action
Stages of Tasks per Shuffle
- Pipelining
  - An operation that Spark automatically performs on narrow transformations, allowing multiple transformations to be performed in-memory
    - AKA no data movement
Without Pipelining
With Pipelining
Spark Job Terms
- Spark job
  - Each Spark application is made up of one or more Spark jobs. Spark jobs within an application are executed serially (unless you use threading to launch multiple actions in parallel).
  - Actions always return results. Each job breaks down into a series of stages, the number of which depends on how many shuffle operations need to take place.
Spark Job Terms
- Stages
  - Represent groups of tasks that can be executed together to compute the same operation on multiple machines
  - In general, Spark will try to pack as much work as possible (i.e., as many transformations as possible inside your job) into the same stage, but the engine starts new stages after every shuffle
Spark Job Terms
- Tasks
  - A unit of computation applied to a unit of data (the partition). Each task corresponds to a combination of blocks of data and a set of transformations that will run on a single executor.
  - If there is one big partition in our dataset, we will have one task. If there are 1,000 little partitions, we will have 1,000 tasks that can be executed in parallel.
  - Partitioning your data into a greater number of partitions means that more can be executed in parallel (see the sketch below)
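As a rough sketch of how these pieces line up (the numbers are illustrative): one action spawns one job, the shuffle splits it into two stages, and each stage runs one task per partition.

# One action -> one job; the reduceByKey shuffle splits it into two stages;
# tasks per stage follow the partition count
from pyspark import SparkContext

sc = SparkContext(appName="jobs-stages-tasks")

rdd = sc.parallelize(range(1_000_000), numSlices=8)                 # 8 partitions -> 8 tasks in the first stage
grouped = rdd.map(lambda x: (x % 10, 1)).reduceByKey(lambda a, b: a + b)

print(grouped.count())   # the action spawns the job; the shuffle boundary starts a second stage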
The Spark Shuffle
- A physical repartitioning of the data
  - Ex. sorting a DataFrame, or grouping data that was loaded from a file by key (which requires sending records with the same key to the same node)
  - This type of repartitioning requires coordinating across executors to move data around. Spark starts a new stage after each shuffle, and keeps track of what order the stages must run in to compute the final result.
  - Ex. reduce-by-key, where the input data for each key first needs to be brought together from many nodes
Shuffle Steps
- "Source" tasks (those sending data) write shuffle files to their local disks during their execution stage
- The grouping and reduction stage launches and runs tasks that fetch their corresponding records from each shuffle file and perform that computation
  - Ex. fetches and processes the data for a specific range of keys
- Saving the shuffle files to disk lets:
  - Spark run this stage later in time than the source stage (e.g., if there are not enough executors to run both at the same time)
  - The engine re-launch reduce tasks on failure without rerunning all the input tasks
Shuffle Persistence
- Shuffle persistence
  - The step of saving shuffle files to disk
  - Allows new jobs running over data that's already been shuffled to skip re-running the "source" side of the shuffle
    - Because the shuffle files were already written to disk earlier, Spark knows that it can use them to run the later stages of the job
Spark Architecture
Part II
Spark Hardware Hierarchy
- Cluster, Driver, and Executors
- Cores / Slots
  - Available threads to process partitions
  - NOT physical CPU cores on each machine (unfortunate terminology by Spark)
- Working memory is utilized by Spark workloads
- Disks are used for:
  - Persistence to disk and spills for the workload
  - Shuffle partitions for shuffle stages
Spark Software Hierarchy
Tasks = Cores = Slots
- 1 Task ↔ 1 Partition
- 1 Slot ↔ 1 Core
- Actions
  - Eager
  - Made of transformations (lazy): narrow or wide (shuffle)
  - Spawn jobs
- Jobs
  - Spawn stages
- Stages
  - Spawn tasks
  - All tasks in the same stage do the same thing
- Tasks
  - Do work and utilize hardware
  - The only part that uses hardware; the rest is orchestration
Spark UI / History Server
Documentation in Confluence
Go to history server on QA Cluster (EMR): https://console.aws.amazon.com/elasticmapreduce/home?region=us-east-1#cluster-details:j-1RT410A48AI05
PySpark Architecture
PySpark
- Thin library that sits on top of the Java API, which sits on top of the Scala core engine
Explanation from PySpark Architecture video
Example: Pyspark Program
ex: collect()
ex: aws s3 cp
user Python code
Interlude: DataFrames
DataFrames vs SQL Tables
- Table
  - A set of records (rows), i.e., a relation
  - Transformations defined by relational algebra
  - "Protects users from needing to know how the data is organized in the machine, makes it possible for users to specify high-level queries, and leads to an inexhaustible number of optimization techniques"
DataFrames vs SQL Tables
- DataFrames
  - Multiple definitions, dependent on implementation
DataFrame APIs
- R DataFrames
- Python Pandas
- Haskell Frames
- Spark DataFrames
- Koalas: Pandas API on Spark DataFrames
Spark vs Pandas DataFrames
Operation | Pandas | Spark |
---|---|---|
Column access | df['col'] | df['col'] |
Mutability | Mutable | Immutable |
Add column | df['c'] = df['a'] + df['b'] | df.withColumn('c', df['a'] + df['b']) |
Rename columns | df.columns = ['a', 'b'] | df.select(df['col1'].alias('a'), df['col2'].alias('b')) |
Value counts | df['col'].value_counts() | df.groupBy(df['col']).count().orderBy('count', ascending=False) |
Slide from Announcing Koalas Open Source Project
Spark Programming
Spark APIs
- Low Level
  - RDDs
  - Distributed Variables (see the sketch after this list)
    - Broadcast variables
    - Accumulators
- High Level (Structured APIs)
  - DataFrames
  - Datasets
- Third Party
  - Frameless
  - Quill
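A minimal sketch of the two distributed-variable types named above (the lookup table and tags are made up):

# Broadcast variables: read-only data shipped once to every executor
# Accumulators: write-only counters aggregated back on the driver
from pyspark import SparkContext

sc = SparkContext(appName="distributed-variables")

lookup = sc.broadcast({"#spark": "big data", "#haskell": "fp"})   # shared read-only lookup table
unknown_tags = sc.accumulator(0)                                  # counter updated from tasks

def categorize(tag):
    if tag not in lookup.value:
        unknown_tags.add(1)
    return lookup.value.get(tag, "unknown")

tags = sc.parallelize(["#spark", "#haskell", "#cobol"])
print(tags.map(categorize).collect())          # the action runs the tasks that update the accumulator
print("unknown tags:", unknown_tags.value)     # read the aggregated value on the driver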
Spark APIs Compared
From Databricks' A Tale of Three Apache Spark APIs
RDDs
- Can be cached in-memory, which is a massive win for iterative algorithms
- Type-safe in the implementation language (Scala)
- A lot like Scala collections
  - Except distributed, lazy, and immutable
def topHashtags(tweets: RDD[Tweet], n: Int): Array[(String, Int)] =
  tweets
    .flatMap(_.text.split("\\s+"))   // split text into words
    .filter(_.startsWith("#"))       // keep hashtag words
    .map(_.toLowerCase)              // normalize hashtags
    .map((_, 1))                     // create tuples for counting
    .reduceByKey(_ + _)              // accumulate counters
    .top(n)(Ordering.by(_._2))       // return the top n hashtags by count
Example from Quill article
RDDs
- Catch errors at compile-time
Integer RDD
String RDD
Double RDD
When to use RDDs
- Need low-level API and control
- Want compile-time typechecking
Quill
- https://medium.com/@fwbrasil/quill-spark-a-type-safe-scala-api-for-spark-sql-2672e8582b0d
Spark Ecosystem
Spark Ecosystem
- Spark Libraries
  - MLlib
  - Spark Streaming
  - GraphFrames
    - Graph processing using the Cypher graph query language (Spark 3.0)
- Third Party Libraries
  - Flint
    - Time-series library for Spark
Interlude: Columnar Formats
- https://towardsdatascience.com/demystify-hadoop-data-formats-avro-orc-and-parquet-e428709cf3bb
- Row-based formats are better for write-heavy workloads, since appending is easier
- Parquet (see the sketch below)
  - On disk
  - Record shredding and assembly algorithm based on Dremel
- Arrow
  - In-memory
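A minimal sketch (the path and schema are made up) of writing a DataFrame as Parquet and reading back a single column; the physical plan from explain() should show only that column being scanned, which is the point of a columnar on-disk format.

# Columnar on-disk format: reading one column does not require scanning whole rows
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("columnar-sketch").getOrCreate()

df = spark.createDataFrame([(1, "a", 0.5), (2, "b", 0.9)], ["id", "label", "score"])
df.write.mode("overwrite").parquet("/tmp/events.parquet")

spark.read.parquet("/tmp/events.parquet").select("score").explain()   # plan reads only `score`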
Spark Optimization
What is Spark?
- https://www.youtube.com/watch?v=RmUn5vHlevc
  - Explains Catalyst (the video is fantastic)
- Pure functions, fixed points, immutable trees, rewrites
- Transformations
  - Two kinds:
    - Transform trees without changing the type of tree (e.g., expression -> expression, logical plan -> logical plan, or physical plan -> physical plan)
    - Transform a tree into a different type of tree (used for logical plan -> physical plan)
>>> import findspark
>>> findspark.init()
>>> import pyspark
>>> spark = pyspark.sql.SparkSession.builder.appName('Spark Tech Talk').getOrCreate()
# Spark computation on two tables
>>> t1 = spark.range(2000000)
>>> t2 = spark.range(2000000)
>>> result = t1.join(t2, on=t1.id == t2.id).groupBy().count()
# See execution plan
>>> result.explain()
== Physical Plan ==
*(6) HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition
+- *(5) HashAggregate(keys=[], functions=[partial_count(1)])
+- *(5) Project
+- *(5) SortMergeJoin [id#0L], [id#2L], Inner
:- *(2) Sort [id#0L ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(id#0L, 200)
: +- *(1) Range (0, 2000000, step=1, splits=12)
+- *(4) Sort [id#2L ASC NULLS FIRST], false, 0
+- ReusedExchange [id#2L], Exchange hashpartitioning(id#0L, 200)
# Takes a few seconds
>>> result.show()
+-------+
| count|
+-------+
|2000000|
+-------+
Ex. Spark Plan and Execution
...
>>> result.explain(extended=True)
== Parsed Logical Plan ==
Aggregate [count(1) AS count#19L]
+- Join Inner, (id#0L = id#2L)
:- Range (0, 2000000, step=1, splits=Some(12))
+- Range (0, 2000000, step=1, splits=Some(12))
== Analyzed Logical Plan ==
count: bigint
Aggregate [count(1) AS count#19L]
+- Join Inner, (id#0L = id#2L)
:- Range (0, 2000000, step=1, splits=Some(12))
+- Range (0, 2000000, step=1, splits=Some(12))
== Optimized Logical Plan ==
Aggregate [count(1) AS count#19L]
+- Project
+- Join Inner, (id#0L = id#2L)
:- Range (0, 2000000, step=1, splits=Some(12))
+- Range (0, 2000000, step=1, splits=Some(12))
== Physical Plan ==
*(6) HashAggregate(keys=[], functions=[count(1)], output=[count#19L])
+- Exchange SinglePartition
+- *(5) HashAggregate(keys=[], functions=[partial_count(1)], output=[count#22L])
+- *(5) Project
+- *(5) SortMergeJoin [id#0L], [id#2L], Inner
:- *(2) Sort [id#0L ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(id#0L, 200)
: +- *(1) Range (0, 2000000, step=1, splits=12)
+- *(4) Sort [id#2L ASC NULLS FIRST], false, 0
+- ReusedExchange [id#2L], Exchange hashpartitioning(id#0L, 200)
Detailed Logical/Physical Plan
Spark SQL Engine
- From declarative queries to RDDs
- Understanding Query Plans and Spark UIs
MapReduce Joins
Reduce-side Joins and Grouping
- Sort-merge
- GROUP BY
- Skew Join / Sharded Join
Map-side Joins
- Broadcast Hash Join
- Partitioned Hash Join
- Merge Join
Broadcast Hash Join
- Known by different names
  - Map-side join (Hadoop community)
  - Star-schema join
  - Replicated join
- Join a large table (fact) with relatively small tables (dimensions) to avoid sending all the data of the large table over the network (see the sketch below)
See Spark SQL Joins for basic, API-level joins
See Mastering Spark SQL: Broadcast Joins for Spark broadcast details
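A minimal sketch (table names and sizes are illustrative) of asking Spark for a broadcast hash join; tables smaller than spark.sql.autoBroadcastJoinThreshold are broadcast automatically, and the broadcast() hint forces it explicitly.

# Broadcast the small dimension table so the large fact table is never shuffled
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join-sketch").getOrCreate()

facts = spark.range(10_000_000).withColumnRenamed("id", "dim_id")                          # large "fact" table
dims = spark.createDataFrame([(i, "dim_%d" % i) for i in range(100)], ["dim_id", "name"])  # small "dimension" table

joined = facts.join(broadcast(dims), on="dim_id")   # hint: build a BroadcastHashJoin
joined.explain()                                    # physical plan should show BroadcastHashJoin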
Spark Joins
- Optimizing Apache Spark SQL Joins
  - Basic
    - Shuffle Hash Join
    - Broadcast Hash Join
    - Cartesian Join
  - Special
    - Theta Join
    - One-to-Many Join
- Working with Skewed Data: The Iterative Broadcast
  - Iterative Broadcast
Shuffled Hash Join
- Was removed in favor of sort-merge join in 1.6, but re-added in 2.0
- ShuffledHashJoin is still useful when:
  - Any single partition of the build side can fit in memory
  - The build side is much smaller than the stream side; building a hash table on the smaller side should be faster than sorting the bigger side
- Sort Merge Join is more robust
  - Shuffled Hash Join requires the hashed table to fit in memory
  - Sort Merge Join can spill to disk
Apache Spark
By Heneli Kailahi