Apache Spark

@ TVision

Outline

 Talk #1

  • What is Spark?
  • Spark Terminology

  • Why does Spark exist?

  • How do we use it?

    • Reductions

    • Impressions

    • Bacchus

  • Spark Architecture - Part I

  • Spark Lifecycle

  • Spark Execution - Part I

 Talk #2+

  • Distributed Systems

    • Ecosystem & History

  • Interlude: DataFrames

  • Spark Programming

    • Structured APIs, RDDs, Distributed Variables

  • Spark Architecture - Part II

  • Spark Execution - Part II

  • Spark Ecosystem

  • Interlude: Columnar Formats

  • Spark Optimization

    • Catalyst & Tungsten Internals

Logistics

  • Lots of jargon

  • Two part talk (maybe more)

    • Part I

      • What Spark is, what problems it solves, some origin stories on how it was developed, and where it’s being used at TVision

    • Part II

      • Programming with Spark, Spark's features, internals, and the ecosystem

Part I

Spark Origins and Fundamentals

What is Spark?

What is Spark?

Let's look at the official site

Spark for Programmers

  • Framework for managing + coordinating tasks on data distributed across a cluster
  • Able to mix complex procedural operations and relational queries in a compositional way
  • Nice abstractions that make distributed programming easy to work with

Spark for FP

  • Spark has clearly been heavily influenced by modern functional programming

  • Provides simple, battle-tested solutions to a lot of industrial problems that languages like Haskell don't (yet)

Terminology Dump

Architecture

  • Cluster
    • A group of computers that pool their resources so that we can use the cumulative resources as if they were a single computer
  • Cluster manager 
    • manages the cluster of machines that Spark uses to execute tasks.
      • Examples include Spark's standalone cluster manager, Hadoop YARN, and Apache Mesos.
        • Spark is cluster-manager agnostic. While it natively supports the Hadoop YARN cluster manager, it requires nothing from Hadoop itself.

Platform

  • Google File System
    • a proprietary distributed file-system that stores data on commodity machines
  • MapReduce
    • a programming model and implementation for processing and generating huge data sets with a parallel, distributed algorithm on a cluster
  • Hadoop
    • Open source GFS/MR implementation
      • Hadoop Distributed File System (HDFS)
      • Hadoop YARN – a cluster manager
      • Hadoop MapReduce
    • Spark does not provide its own storage; it runs on top of something like HDFS or S3 instead

Spark Platform

  • Spark
    • Like Hadoop MapReduce, Spark is an open-source, distributed processing system. However, unlike Hadoop MapReduce, Spark uses directed acyclic graphs (DAG) for execution plans and in-memory caching for datasets.
  • PySpark

    • API for writing Spark applications in Python.

    • If you’re using the structured APIs, your code should run just about as fast as if you had written it in Scala.

  • Spark SQL

    • Provides DataFrames API + Catalyst optimizer

Spark Engine / "Compiler"

  • Catalyst
    • Spark's logical planner and relational optimizer for operations on the Structured APIs; it maintains its own type information through planning and processing
  • Tungsten
    • Spark's physical planner and execution optimizer (runs after Catalyst optimization), which improves memory and CPU efficiency

Why does Spark exist?

What is Batch Processing?

  • Takes a large amount of input data, runs a job to process it, and produces some output data.
  • Jobs often take a while (ex. few minutes to days)
    • Scheduled to run periodically
    • The primary performance measure of a batch job is usually throughput
      • time to process input dataset of a certain size (ex. 99 GB/sec)
-- BatchJobScript.hs
processBigData :: BigInput -> BigOutput

-- batch.crontab
10 10 * * * ./run-batch-job.sh input.txt

Scaling Batch Processing

  • Example: In 2003, Google was indexing the web
    • 20+ billion web pages * 20 KB/page = 400+ Terabytes of data

Batch Processing

On single computer

  • Ex. Process and store 400+ TB
  • Say they had 1 computer that could read 50 MB/sec from disk 
    • 3 months to read the web
  • ~1000 disks just to store the web
    • Even more to do anything with it

BAD

Scaling Batch Processing

On many computers

  • Ex. Process and store 400+ TB
  • 1000 machines, < 3 hours

GOOD
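The numbers on the last few slides can be checked with a quick calculation. This is a back-of-the-envelope sketch: it assumes decimal units (1 KB = 1,000 bytes) and a perfectly even split of the work across machines, neither of which is stated on the slides.

```python
# Figures taken from the slides; decimal units are an assumption.
PAGES = 20e9              # 20+ billion web pages
BYTES_PER_PAGE = 20e3     # 20 KB per page
DISK_READ_RATE = 50e6     # 50 MB/sec read from one disk
MACHINES = 1000

total_bytes = PAGES * BYTES_PER_PAGE
single_machine_seconds = total_bytes / DISK_READ_RATE
# Assumes the data splits perfectly evenly and coordination is free.
cluster_seconds = single_machine_seconds / MACHINES

total_terabytes = total_bytes / 1e12
single_machine_days = single_machine_seconds / 86_400
cluster_hours = cluster_seconds / 3_600

print(f"{total_terabytes:.0f} TB total")
print(f"{single_machine_days:.1f} days on one machine")
print(f"{cluster_hours:.1f} hours on {MACHINES} machines")
```

Reading alone takes about three months on one disk and a little over two hours on 1000 machines, which is where the "3 months" and "< 3 hours" figures come from.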

Scaling Batch Processing

On many computers

  • Bad news:
    • Programming complications
      • Communication and coordination
      • Recovering from machine failure
      • Status reporting
      • Debugging
      • Optimization
      • Locality
    • Need to figure out all of these for one problem, then solve them again for every other processing problem you have

"To solve problems at scale, paradoxically, you have to know the smallest details"

Alan Eustace (Former Engineering Head @ Google)

The Paradox of Scale

Necessary Fault Tolerance

“Imagine your average computer stays up for 3 years before it experiences some hardware or operating system failure at which point it keels over. That’s not such a big deal, except if you are running a computation on 1000s of machines that takes on the order of a day.

You will run into some sort of failure during that computation. You have to be prepared for failure at the software level because when the computations are large enough, you will experience failures across machines.”

Jeff Dean (co-author of MapReduce, BigTable, Spanner, TensorFlow, The Universe)
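The argument in the quote can be made concrete with a small probability estimate. This sketch assumes that machines fail independently and that "stays up for 3 years" means a per-day failure probability of 1/(3 × 365); both are simplifications, and the 1000-machine, one-day job is just the smallest case the quote describes.

```python
MTBF_DAYS = 3 * 365      # roughly one failure per machine every 3 years (from the quote)
MACHINES = 1000          # "1000s of machines": take the low end
JOB_DAYS = 1             # "on the order of a day"

# Assumption: each machine fails independently with probability 1/MTBF_DAYS per day.
p_machine_survives = (1 - 1 / MTBF_DAYS) ** JOB_DAYS
p_any_failure = 1 - p_machine_survives ** MACHINES
expected_failures = MACHINES * JOB_DAYS / MTBF_DAYS

print(f"expected failures during the job: {expected_failures:.2f}")
print(f"chance of at least one failure:   {p_any_failure:.0%}")
```

Even at the low end the job is more likely than not to see a failure, and the probability climbs quickly as the machine count or job length grows.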

MapReduce

  • A programming model and implementation for processing and generating huge data sets with a parallel, distributed algorithm on a cluster

ELI5: MapReduce

  • Example: Parallel program for counting all the books in a library
    • Map

      • You count up shelf #1, I count up shelf #2

        • Whenever one of us finishes we move to next uncounted shelf

        • The more people we get, the faster it goes.

    • Reduce

      • Now we get together and add our individual counts

MapReduce

  • Provides a reliable, scalable, maintainable way to process lots of data on lots of cheap, commodity hardware

    • Write in a functional style

      • Map

        • Apply a function to distributed data (which is spread across many computers)

      • Reduce

        • Aggregate transformed data and get a result

      • Shuffle

        • Redistribute data on cluster
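The three phases above can be sketched on a single machine in plain Python. This is only an illustration of the programming model, not of Hadoop's implementation; the shelf data and function names are made up for the example.

```python
from collections import defaultdict
from functools import reduce

def map_phase(shelf):
    # Map: emit a (key, 1) pair for every book on one shelf.
    return [(genre, 1) for genre in shelf]

def shuffle_phase(mapped_shelves):
    # Shuffle: bring all pairs that share a key together.
    grouped = defaultdict(list)
    for pairs in mapped_shelves:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: aggregate the values collected for each key.
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in grouped.items()}

# Hypothetical library: each inner list is one shelf, each entry a book's genre.
shelves = [
    ["fiction", "history", "fiction"],
    ["science", "history"],
    ["fiction"],
]

# Each shelf could be mapped by a different worker, independently of the others.
mapped = [map_phase(shelf) for shelf in shelves]
counts = reduce_phase(shuffle_phase(mapped))
print(counts)
```

Because `map_phase` looks at one shelf at a time and `reduce_phase` at one key at a time, both phases parallelize without the workers needing to talk to each other; only the shuffle moves data between them.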

Example: Render Map Tiles

Parallel MapReduce

Reality of Machine Failure

If we hadn’t had to deal with failures (of computers), if we had a perfectly reliable set of computers to run (our code) on, we probably never would've implemented MapReduce. Without having failures, the support code (that MapReduce provides) just isn’t complicated.

Sanjay Ghemawat (co-author of GFS, MapReduce, BigTable)

Problem with MapReduce

  • Materialization of intermediate state
    • Mappers are often redundant (by map/fold fusion)
      • They just read back the same file that was just written by a reducer, and prepare it for the next stage of partitioning and sorting. In many cases, the mapper code could be part of the previous reducer
    • Overkill for temporary data
      • Storing in a distributed filesystem means those files are replicated across several nodes, which is often overkill for such temporary data
    • Jobs can only start when all tasks in the preceding jobs have completed. Waiting slows down the execution of the workflow as a whole.

MapReduce

[Diagram labels: HDFS, Input, f, Intermediate Output, g, Intermediate Output, h, Output]

Spark

[Diagram labels: HDFS, Input, f, Intermediate Output, g, Intermediate Output, h, Output]

Spark

  • Storing and reading data in memory is much faster but adds a lot of complexity in a distributed setting
    • Invented RDD abstraction to solve this

Need for Recomputation

  • Spark, Flink, and Tez avoid writing intermediate state to HDFS, so they take a different approach to tolerating faults
    • If a machine fails and the intermediate state on that machine is lost, it is recomputed from other data that is still available
      • Prefer recomputing from earlier intermediate state; fall back to the original input if necessary
    • Requires deterministic operations
  • To enable this recomputation, the framework must keep track of how a given piece of data was computed—which input partitions it used, and which operators were applied to it
    • Spark uses RDDs for tracking ancestry of data
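A minimal sketch of the recomputation idea, in plain Python rather than Spark. The `LineageNode` class and its fields are hypothetical; the point is only that a dataset which remembers its parent and its (deterministic) compute function can rebuild a lost partition on demand.

```python
class LineageNode:
    """A toy dataset that remembers how each of its partitions is computed."""

    def __init__(self, compute, parent=None):
        # compute takes a partition index for a base node,
        # or the parent's partition data for a derived node.
        self.compute = compute
        self.parent = parent
        self.cache = {}   # partition index -> materialized data

    def partition(self, i):
        if i not in self.cache:
            source = self.parent.partition(i) if self.parent else i
            self.cache[i] = self.compute(source)
        return self.cache[i]

INPUT = [[1, 2, 3], [4, 5, 6]]   # two input partitions, standing in for HDFS blocks

base = LineageNode(lambda i: list(INPUT[i]))
doubled = LineageNode(lambda rows: [r * 2 for r in rows], parent=base)
large = LineageNode(lambda rows: [r for r in rows if r > 4], parent=doubled)

before = [large.partition(i) for i in range(2)]

# Simulate losing the machine that held partition 1 of the derived datasets.
del large.cache[1]
del doubled.cache[1]

# The lost partition is rebuilt by walking the lineage back to data that survived.
after = [large.partition(i) for i in range(2)]
print(before == after)
```

The recovery only gives the same answer because every compute function is deterministic, which is the requirement listed above.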

Complexity of MapReduce

  • MapReduce is still lower level than programmers like me want to write software with

    • Don’t want to always have to reason about whether we need a broadcast hash join, map-side merge join, or whatever

    • Implementing a complex processing job using the raw MapReduce APIs is actually quite hard and laborious

  • RDDs provide some help but not a lot

Spark SQL

  • Compositional interface for mixing complex procedural operations and relational queries

    • DataFrame API

      • Can perform relational operations on both external data sources and Spark’s built-in distributed collections.
    • Catalyst
      • Extensible optimizer

Towards Declarative Languages

Optimization #1

  • The choice of join algorithm can make a big difference to the performance of a batch job

    • Spark, Flink, and Hive have query optimizers

Towards Declarative Languages

Optimization #2

  • Hive, Spark DataFrames, and Impala also use vectorized execution

    • Iterating over data in a tight inner loop that is friendly to CPU caches, and avoiding function calls
  • Spark generates JVM bytecode and Impala uses LLVM to generate native code for these inner loops.

Towards Declarative Languages

Optimization #3

  • If a function contains only a simple filtering condition, or it just selects some fields from a record, then there is significant CPU overhead in calling the function on every record

    • If such simple filtering and mapping operations are expressed in a declarative way, the query optimizer can take advantage of column-oriented storage layouts and read only the required columns from disk
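A toy comparison of the two layouts, counting how many stored values each approach has to touch. The records and field names are invented for the example, and real columnar formats add encoding and compression that this ignores.

```python
# Hypothetical log records.
records = [
    {"id": 1, "level": "Error", "msg": "disk full"},
    {"id": 2, "level": "Info", "msg": "started"},
    {"id": 3, "level": "Error", "msg": "timeout"},
]

# Row-oriented: every field of every record is read, although only two are needed.
row_values_read = sum(len(r) for r in records)
row_result = [r["id"] for r in records if r["level"] == "Error"]

# Column-oriented: the same data stored as one list per column.
columns = {name: [r[name] for r in records] for name in records[0]}

# Because the query is declarative, the needed columns are known before any data is read.
needed = ["level", "id"]
column_values_read = sum(len(columns[name]) for name in needed)
column_result = [i for i, lvl in zip(columns["id"], columns["level"]) if lvl == "Error"]

print(row_values_read, column_values_read)
```

An opaque function passed to `map` or `filter` hides which fields it uses, so the engine has to hand it whole records; a declarative `select` and `where` exposes that information to the optimizer.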

Spark at TVision

Reduction

  • Transform raw content and presence (ACR and CV) into second-by-second observations
    • Determine who is watching and whether they're paying attention, and what they're watching

Spark + Tracker

Reduction

From Device to Backend to Redshift

Impressions

  • Ad and Program Impressions
    • Next generation ranking build

Bacchus

  • Run ad-hoc Spark jobs on Flintrock (maybe Kubernetes in future)
    • Process ground truth videos
  • Cloud CV processing
    • Run OpenVINO on nodes in cluster
  • End-to-end testing from videos to CV to reduced CV

Spark Architecture

Part I

Spark Architecture

  • Spark Application

    • Job you want to run on Spark, which consists of a driver process and a set of executor processes

  • Driver

    • The driver process runs your main() function, sits on a node in the cluster, and is responsible for:

      • Maintaining all relevant information during the lifetime of the  Spark Application

      • Responding to a user’s program or input
      • Analyzing, distributing, and scheduling work across the executors
    • It must interface with the cluster manager in order to actually get physical resources and launch executors.
  • Executors 

    • Responsible for actually carrying out the work that the driver assigns them. Each executor is responsible for:

      • Executing code assigned to it by the driver

      • Reporting the state of the computation on that executor back to the driver node (i.e. success/failure and results)
    • Each Spark Application has its own separate executor processes.
  • Spark Session 

    • You control your Spark Application through a driver process called the SparkSession.

      • The SparkSession instance is the way Spark executes user-defined manipulations across the cluster. There is a one-to-one correspondence between a SparkSession and a Spark Application.

    • In Scala and Python, a Spark session is available as spark when you start Spark in the console / Spark Shell.

Spark Components

Communication

Spark Lifecycle

Lifecycle of Spark Application

Initiation

  1. A request is made to the cluster manager driver node asking for resources
    • Asking for resources for the Spark driver process only
  2. Cluster manager accepts this offer and places the driver onto a node in the cluster
  3. The client process that submitted the original job exits and the application is off and running on the cluster

Lifecycle of Spark Application

Standby

Lifecycle of Spark Application

Launch

  1. Driver process on the cluster begins running user code
    • User code must provide a SparkSession that initializes a Spark cluster (e.g., driver + executors)
  2. The SparkSession will subsequently communicate with the cluster manager, asking it to launch Spark executor processes across the cluster
  3. The cluster manager responds by launching the executor processes and sends the relevant information about their locations to the driver process

Lifecycle of Spark Application

Execution

  1. The driver and the workers communicate among themselves, executing code and moving data around
  2. The driver schedules tasks onto each worker, and each worker responds with the status of those tasks and success or failure

Lifecycle of Spark Application

Completion

  1. Once the Spark Application completes, the driver process exits with either success or failure
  2. The cluster manager then shuts down the executors in that Spark cluster for the driver, at which point you can see the success or failure of the Spark Application by asking the cluster manager

Part II

Spark Programming and Internals

Outline

  • RDDs
  • Spark Execution
  • Interlude: DataFrames
  • Spark Architecture II
  • PySpark Architecture
  • Spark Programming

    • Structured APIs

    • RDDs

    • Distributed Variables

  • Spark Ecosystem

  • Interlude: Columnar Formats

  • Spark at TVision - Part II

    • EMR

    • Spark UI

  • Spark Optimization

    • Catalyst & Tungsten Internals

  • MapReduce Joins

  • Further Resources

Jeff Dean Facts

  • To Jeff Dean, "NP" means "No Problemo"

  • Jeff Dean's IDE doesn't do code analysis, it does code appreciation

  • Jeff Dean's PIN is the last 4 digits of pi

  • Google Search was Jeff Dean's N(ew G)oogler Project

  • Jeff Dean invented MapReduce so he could sort his fan mail

  • Emacs' preferred editor is Jeff Dean

  • Jeff Dean doesn't exist, he's actually an advanced AI created by Jeff Dean

  • Jeff Dean compiles and runs his code before submitting, but only to check for compiler and CPU bugs

RDD Fundamentals

Resilient Distributed Dataset

  • Represent an immutable, partitioned collection of elements that can be operated on in parallel.
  • RDDs are made up of:
    • Partitions

      • Atomic pieces of the dataset. One or many per compute node

    • A function for computing the dataset based on its parent RDDs.
    • Dependencies

      • Models relationship between this RDD and its partitions with the RDD(s) it was derived from.

    • Metadata about its partitioning scheme and data placement

How to get an RDD?

  • Two ways
    • Parallelizing an existing collection in your driver program
      • SparkContext.parallelize()
    • Referencing a dataset in an external storage system, such as HDFS, S3, or Postgres
      • Data Source API
        • SparkContext.textFile('hdfs://data.txt')

RDDs

  • Computations on RDDs are represented as a lineage graph, a DAG representing the computations done on the RDD.
    • This representation/DAG is what Spark analyzes to do optimizations.
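rdd = sc.textFile(...)
filtered = (
    rdd.map(...)
       .filter(...)
       .persist()  # keep the filtered data around for reuse
)
count = filtered.count()        # first action: computes and persists `filtered`
reduced = filtered.reduce(...)  # second action: reuses the persisted data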

Ex. Recomputing RDDs

Failure

Ex. Recomputing RDDs

Recovery

Various RDDs

Example Program w/ RDDs

Problem:

You collect lots of application logs and would like to analyze error events.

 

Before you can do this, you need to remove rows corresponding to other events (INFO, DEBUG, etc).

 

You have a cluster at your disposal to do this processing. Write a driver program that collects and aggregates all error events.

Example from RDD Fundamentals video

Example Program w/ RDDs

Error, ts, msg1
Warn,  ts, msg2
Error, ts, msg1
Info,  ts, msg8
Warn,  ts, msg2
Info,  ts, msg8
Error, ts, msg3
Info,  ts, msg5
Info,  ts, msg5
Error, ts, msg4
Warn,  ts, msg9
Error, ts, msg1
app.log

Driver Program

Cluster

Partitions

Error, ts, msg1
Warn,  ts, msg2
Error, ts, msg1
Info, ts, msg8
Warn, ts, msg2
Info, ts, msg8
Error, ts, msg4
Warn,  ts, msg9
Error, ts, msg1
Error, ts, msg3
Info,  ts, msg5
Info,  ts, msg5
logLinesRDD = sc.textFile(
  "app.log", 
  minPartitions=4
)

Partition #1

Partition #2

Partition #3

Partition #4

Error, ts, msg1

Error, ts, msg1
Error, ts, msg4

Error, ts, msg1
Error, ts, msg3


errorsRDD = logLinesRDD.filter(
  # each element is a raw text line, so parse out the first field
  lambda line: line.split(",")[0].strip() == "Error"
)

Partition #1

Partition #2

Partition #3

Partition #4

Error, ts, msg1

Error, ts, msg1
Error, ts, msg3
Error, ts, msg4
Error, ts, msg1
cleanedRDD = errorsRDD.coalesce(2)

Partition #1

Partition #2

result = cleanedRDD.collect()
write_to_log("error.log", data=result)
$ cat error.log
Error, ts, msg1
Error, ts, msg1
Error, ts, msg3
Error, ts, msg4
Error, ts, msg1
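The same pipeline can be traced without a cluster by modelling partitions as plain Python lists. The helper functions below are stand-ins written for this sketch, not Spark APIs; they only mimic what `textFile`, `filter`, `coalesce`, and `collect` do to the partition layout.

```python
LOG_LINES = [
    "Error, ts, msg1", "Warn, ts, msg2", "Error, ts, msg1",
    "Info, ts, msg8", "Warn, ts, msg2", "Info, ts, msg8",
    "Error, ts, msg3", "Info, ts, msg5", "Info, ts, msg5",
    "Error, ts, msg4", "Warn, ts, msg9", "Error, ts, msg1",
]

def split_into_partitions(lines, n):
    # Contiguous chunks, similar in spirit to textFile(..., minPartitions=n).
    size = -(-len(lines) // n)  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]

def filter_partitions(partitions, predicate):
    # Narrow: every output partition depends on exactly one input partition.
    return [[x for x in part if predicate(x)] for part in partitions]

def coalesce(partitions, n):
    # Merge neighbouring partitions whole, without redistributing single records.
    size = -(-len(partitions) // n)
    return [sum(partitions[i:i + size], [])
            for i in range(0, len(partitions), size)]

def collect(partitions):
    # Bring every partition back to the driver as one list.
    return [x for part in partitions for x in part]

log_lines = split_into_partitions(LOG_LINES, 4)
errors = filter_partitions(
    log_lines, lambda line: line.split(",")[0].strip() == "Error")
cleaned = coalesce(errors, 2)
result = collect(cleaned)

print([len(p) for p in errors])   # filtering leaves the partitions uneven
print(result)
```

Filtering leaves one partition empty and the others lopsided, which is why the example coalesces before collecting.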

Example: On Driver

[Diagram: app.log (on local driver), logLines, errors, cleaned, error.log (on local driver)]

Example: From HDFS to S3

[Diagram: Block 1, Block 2, Block 3, Block 4, logLines, errors, cleaned, s3://err1.log, s3://err2.log]

{-# LANGUAGE ExistentialQuantification #-}
import Data.Int  (Int64)
import Data.Text (Text)

-- Type-level sketch of the RDD interface: signatures only, no implementations.
newtype Partition = Partition
  { partitionIndex :: Int }
data Partitioner =
  HashPartitioner | RangePartitioner  -- among others
data DependencyFlavor =
  Narrow | Shuffle | None
-- A dependency points at some parent RDD and records how it is consumed
data Dependency a = forall p. RDD p => Dependency
  { parent :: p a
  , flavor :: DependencyFlavor }
class RDD f where
  {-# MINIMAL getDeps, getPartitions, compute #-}
  getPartitions   :: f a -> [Partition]
  compute         :: f a -> Partition -> TaskCtx -> [a]
  getDeps         :: f a -> [Dependency a]
  getPreferredLoc :: f a -> Partition -> [Text]
  getPreferredLoc _ _ = []      -- default: no locality preference
  getPartitioner  :: f a -> Maybe Partitioner
  getPartitioner _ = Nothing    -- default: no partitioner

-- Data Source API
parallelize  :: (RDD f) => [a] -> f a
-- Transformations
intersection :: (RDD f) => f a -> f a -> f a
cartesian    :: (RDD f) => f a -> f b -> f (a, b)
-- Actions
count        :: (RDD f) => f a -> Int64
data TaskState =
  Completed | Interrupted | RunningLocally
data TaskCtx = TaskCtx
  { state       :: TaskState
  , attemptNum  :: Int
  , partitionId :: Int
  , stageId     :: Int }  -- plus more config and state

HadoopRDD

Method | Implementation | Note
Partitions | One per HDFS block |
Dependencies | None | Base/Input RDD
Compute | Read corresponding block |
Preferred Location | HDFS block |
Partitioner | None | Just partition per block, no repartitioning going on

FilteredRDD

Method | Implementation | Note
Partitions | Same as parent |
Dependencies | One to one (narrow) |
Compute | Filter | Go to parent's partition and filter it
Preferred Location | None | Ask parent
Partitioner | None | Probably parent partitioner

JoinedRDD

Method | Implementation | Note
Partitions | One per reduce task |
Dependencies | Shuffle on each parent |
Compute | Read and join shuffled data |
Preferred Location | None (sometimes inherit) | Typically has to get data over network. Sometimes aligns to parent RDD's location
Partitioner | HashPartitioner |

Specialized Connector RDDs

  • CassandraRDD
    • Pushdown predicate and projection
      • Push down filters into Cassandra so you only select columns/rows matching predicate
      • More on these kinds of optimizations later
    • Rather than reading full data into Spark and filtering after (ex. HDFS)

Spark Execution

  • Transformations 

    • instructions you provide Spark about how you would like to modify a DataFrame / RDD

    • Lazy, not evaluated until an action is called

    • Can be "narrow" or "wide"

  • Actions

    • instructions to Spark to compute a result from a series of transformations

    • Eager (force evaluation)

      • Upon calling an action, Spark creates, optimizes, and runs an execution plan.

  • Narrow Transformations
    • Transformations for which each input partition contributes to one output partition (AKA narrow dependencies)
  • Wide Transformations / Shuffles
    • Transformations for which input partitions contribute to many output partitions (AKA wide dependencies).
    • Whenever Spark performs a shuffle, it must write results to disk (AKA shuffle persistence).
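Laziness can be illustrated with a tiny plain-Python class that records transformations and only runs them when an action is called. `LazyDataset` is a made-up name for this sketch and is far simpler than Spark's planner: there is no optimization step, only deferred execution.

```python
class LazyDataset:
    """Records transformations; nothing runs until an action is called."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = steps

    # Transformations: return a new dataset with one more recorded step.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, fn):
        return LazyDataset(self._source, self._steps + (("filter", fn),))

    # Action: build the pipeline from the recorded steps and force evaluation.
    def collect(self):
        rows = iter(self._source)
        for kind, fn in self._steps:
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return list(rows)

calls = []

def noisy_double(x):
    calls.append(x)   # record every invocation so laziness is observable
    return x * 2

plan = LazyDataset([1, 2, 3, 4]).map(noisy_double).filter(lambda x: x > 4)
calls_before_action = list(calls)   # still empty: only a plan exists so far

result = plan.collect()             # the action triggers the work
print(calls_before_action, result)
```

Deferring execution is what gives the engine the chance to look at the whole chain of transformations before deciding how to run it.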

Dependencies

Narrow or Wide?

  • You ask your friend for $100, who has exactly $100 to give you
    • Narrow!
  • You ask your friends for $100, each of whom gives you $25 of their $100
    • Narrow!
  • You ask your friends for $50, two of whom have $50 to give you
    • Narrow!
  • You ask your friends, two of whom have $50, for $50 in $25 increments for each hand
    • Wide!

Ex. Model dependencies

Visualized DAG

Ex. Model dependencies

Resolved DAG

  • The B to G join is narrow because groupByKey already partitions the keys and places them appropriately in B after shuffling. Thus operations like join can sometimes be narrow and sometimes be wide.
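The narrow/wide distinction comes down to how many output partitions each input partition feeds. The sketch below encodes the lending examples as dependency maps; the `classify` helper and the variable names are invented for illustration.

```python
def classify(dependency_map):
    """dependency_map: input partition index -> set of output partitions it feeds."""
    widest = max(len(outputs) for outputs in dependency_map.values())
    return "narrow" if widest <= 1 else "wide"

# One friend hands over everything to you.
one_friend = {0: {0}}

# Four friends each give part of their money, all to the same place (like a coalesce).
four_friends_one_target = {0: {0}, 1: {0}, 2: {0}, 3: {0}}

# Two friends each give their $50 to one hand.
two_friends_one_hand_each = {0: {0}, 1: {1}}

# Two friends each split their $50 across both hands (like a repartition by key).
two_friends_both_hands = {0: {0, 1}, 1: {0, 1}}

for name, deps in [
    ("one friend", one_friend),
    ("four friends, one target", four_friends_one_target),
    ("two friends, one hand each", two_friends_one_hand_each),
    ("two friends, both hands", two_friends_both_hands),
]:
    print(f"{name}: {classify(deps)}")
```

Many inputs feeding one output is still narrow; what makes a dependency wide is a single input being split across several outputs.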

Transformations

Transformations with (usually) Narrow dependencies:

  • map
  • mapValues
  • flatMap
  • filter
  • mapPartitions
  • mapPartitionsWithIndex

 

Transformations with (usually) Wide dependencies: (might cause a shuffle)

  • cogroup
  • groupWith
  • join
  • leftOuterJoin
  • rightOuterJoin
  • groupByKey
  • reduceByKey
  • combineByKey
  • distinct
  • intersection
  • repartition
  • coalesce (only when shuffle=True; otherwise it is narrow)

Example Spark Job

Narrow → Narrow → Narrow → Wide → Action

Given the following program, how should Spark execute it?

(How to divvy up tasks, AKA computations on a partition)

Task per transition


  • Too many tasks
  • Lots of intermediate state
  • High Overhead
    • Each operation on i-th partition loops over input individually

Task per Output Partition


  • Fewer tasks (4 < 12)
  • Less intermediate state
  • Each task has to do a lot more
  • Wide transformations / Shuffles
    • Have to recompute all input tasks if any part of shuffle fails 

Stages of Tasks per Shuffle

Narrow → Wide → Action

  • Pipelining
    • operation that Spark automatically performs on narrow transformations that allows multiple transformations to be performed in-memory
      • AKA no data movement

Without Pipelining

With Pipelining
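A plain-Python sketch of the difference, restricted to `map`-style functions on a single partition. Both function names are made up for this example; the pass counter stands in for the cost of re-scanning and materializing intermediate data.

```python
def run_without_pipelining(partition, functions):
    # One full pass, and one materialized intermediate list, per transformation.
    passes = 0
    data = list(partition)
    for fn in functions:
        data = [fn(x) for x in data]
        passes += 1
    return data, passes

def run_with_pipelining(partition, functions):
    # One pass: each record flows through every function before the next is read.
    out = []
    for x in partition:
        for fn in functions:
            x = fn(x)
        out.append(x)
    return out, 1

functions = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
partition = [1, 2, 3]

unpipelined, unpipelined_passes = run_without_pipelining(partition, functions)
pipelined, pipelined_passes = run_with_pipelining(partition, functions)

print(unpipelined == pipelined, unpipelined_passes, pipelined_passes)
```

This only works because each function needs nothing beyond the current record, which is exactly the property narrow transformations have.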

Spark Job Terms

 

  • Spark job
    • Each Spark application is made up of one or more Spark jobs. Spark jobs within an application are executed serially (unless you use threading to launch multiple actions in parallel).
    • Actions always return results. Each job breaks down into a series of stages, the number of which depends on how many shuffle operations need to take place.

Spark Job Terms

  • Stages - represent groups of tasks that can be executed together to compute the same operation on multiple machines.
    • In general, Spark will try to pack as much work as possible (i.e., as many transformations as possible inside your job) into the same stage, but the engine starts new stages after every shuffle.

Spark Job Terms

  • Tasks
    • A unit of computation applied to a unit of data (the partition). Each task corresponds to a combination of blocks of data and a set of transformations that will run on a single executor.
      • If there is one big partition in our dataset, we will have one task. If there are 1,000 little partitions, we will have 1,000 tasks that can be executed in parallel.
    • Partitioning your data into a greater number of partitions means that more can be executed in parallel
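As a rough rule of thumb from the definitions above, a job has one more stage than it has shuffles, and each stage runs one task per partition. The helper below is a simplified sketch under that assumption; `plan_job` and its parameters are hypothetical, and real plans can differ (for example when shuffle output is reused).

```python
def plan_job(transformations, input_partitions, shuffle_partitions):
    """transformations: ordered list of 'narrow' / 'wide' labels for one job."""
    # Every wide transformation ends the current stage and starts a new one.
    num_stages = 1 + sum(1 for kind in transformations if kind == "wide")
    # First stage: one task per input partition.
    # Later stages: one task per partition produced by the preceding shuffle.
    tasks_per_stage = [input_partitions] + [shuffle_partitions] * (num_stages - 1)
    return num_stages, tasks_per_stage

num_stages, tasks_per_stage = plan_job(
    ["narrow", "narrow", "narrow", "wide"],
    input_partitions=4,
    shuffle_partitions=2,
)
print(num_stages, tasks_per_stage)
```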

The Spark Shuffle

  • A physical repartitioning of the data
    • Ex. Sorting a DataFrame, or grouping data that was loaded from a file by key (which requires sending records with the same key to the same node).
      • This type of repartitioning requires coordinating across executors to move data around. Spark starts a new stage after each shuffle, and keeps track of what order the stages must run in to compute the final result.
  • Ex. reduce-by-key
    • Where input data for each key needs to first be brought together from many nodes

Shuffle Steps

  1. “Source” tasks (those sending data) write shuffle files to their local disks during their execution stage.
  2. The grouping and reduction stage launches and runs tasks that fetch their corresponding records from each shuffle file and perform that computation
    • Ex. fetch and process the data for a specific range of keys
  3. Saving the shuffle files to disk lets:
    • Spark run this stage later in time than the source stage
      • Ex. if there are not enough executors to run both at the same time
    • The engine re-launch reduce tasks on failure without rerunning all the input tasks
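The steps above can be imitated on one machine with temporary files standing in for each executor's local disk. Everything here is an assumption made for the sketch: the file naming, the JSON encoding, and the CRC-based partitioning are not how Spark lays out shuffle data.

```python
import json
import tempfile
import zlib
from collections import defaultdict
from pathlib import Path

def source_task(task_id, records, num_reducers, shuffle_dir):
    # Step 1: bucket records by target reducer and write the buckets to disk.
    buckets = defaultdict(list)
    for key, value in records:
        reducer_id = zlib.crc32(key.encode("utf-8")) % num_reducers
        buckets[reducer_id].append([key, value])
    for reducer_id in range(num_reducers):
        path = shuffle_dir / f"shuffle_{task_id}_{reducer_id}.json"
        path.write_text(json.dumps(buckets[reducer_id]))

def reduce_task(reducer_id, num_sources, shuffle_dir):
    # Step 2: fetch this reducer's bucket from every source task, then aggregate.
    totals = defaultdict(int)
    for task_id in range(num_sources):
        path = shuffle_dir / f"shuffle_{task_id}_{reducer_id}.json"
        for key, value in json.loads(path.read_text()):
            totals[key] += value
    return dict(totals)

partitions = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("b", 1), ("c", 1)],
]
NUM_REDUCERS = 2

with tempfile.TemporaryDirectory() as tmp:
    shuffle_dir = Path(tmp)
    for task_id, records in enumerate(partitions):
        source_task(task_id, records, NUM_REDUCERS, shuffle_dir)
    # Step 3: the files outlive the source tasks, so reduce tasks can run later...
    results = [reduce_task(r, len(partitions), shuffle_dir)
               for r in range(NUM_REDUCERS)]
    # ...and a failed reduce task can be re-run without redoing the source side.
    rerun = reduce_task(0, len(partitions), shuffle_dir)

merged = {key: total for part in results for key, total in part.items()}
print(merged)
```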

Shuffle Persistence

  • Shuffle Persistence
    • The step of saving files to disk
    • Allows new jobs running over data that’s already been shuffled to skip re-running the “source” side of the shuffle.
      • Because the shuffle files were already written to disk earlier, Spark knows that it can use them to run the later stages of the job.

Spark Architecture

Part II

Spark Hardware Hierarchy

Spark Hardware Hierarchy

  • Cluster, Driver, and Executors
  • Cores / Slots
    • available threads to process partitions
    • NOT physical CPU cores on each machine (unfortunate terminology by Spark)
  • Working memory is utilized by Spark workloads
  • Disks used for:
    • Persistence to disks and spills for workload
    • Shuffle partitions for shuffle stages

Spark Software Hierarchy

Tasks = Cores = Slots

1 Task 1 Partition

1 Slot 1 Core

Stages

Jobs

Actions

  • Actions are eager
    • Made of transformations (lazy)
      • Narrow
      • Wide / Shuffle
    • Spawn jobs
      • Spawn stages
        • Spawn tasks
          • Do work and utilize hardware
            • Only part that uses hardware, rest for orchestration
          • All tasks in same stage do the same thing

Spark UI / History Server

PySpark Architecture

PySpark

  • Thin library that sits on top of the Java API, which sits on top of the Scala core engine

Explanation from PySpark Architecture video

Example: PySpark Program

ex: collect()

ex: aws s3 cp

user Python code

Interlude: DataFrames

DataFrames vs SQL Tables

  • Table
    • a set of records (rows) / a relation
  • Transformations defined by relational algebra
    • "Protects users from needing to know how the data is organized in the machine, and makes it possible for users to specify high-level queries, and leads to an inexhaustible number of optimization techniques"

DataFrames vs SQL Tables

  • DataFrames
    •  Multiple definitions dependent on implementation

DataFrame APIs

Spark vs Pandas DataFrames

Operation | Pandas | Spark
Column | df['col'] | df['col']
Mutability | Mutable | Immutable
Add column | df['c'] = df['a'] + df['b'] | df.withColumn('c', df['a'] + df['b'])
Rename column | df.columns = ['a', 'b'] | df.select(df['col1'].alias('a'), df['col2'].alias('b'))
Value count | df['col'].value_counts() | df.groupBy(df['col']).count().orderBy('count', ascending=False)

Spark Programming

Spark APIs

  • Low Level
    • RDD
    • Distributed Variables
      • Broadcast vars
      • Accumulators
  • High Level (Structured APIs)
    • DataFrames
    • Datasets
  • Third Party

Spark APIs Compared

RDDs

  • Can be cached in-memory, which is a massive win for iterative algorithms
  • Type-safe in implementation language (Scala)
  • A lot like scala collections
    • Except distributed, lazy, immutable
import org.apache.spark.rdd.RDD

case class Tweet(text: String)  // minimal stand-in for the article's Tweet type

def topHashtags(tweets: RDD[Tweet], n: Int): Array[(String, Int)] =
  tweets
    .flatMap(_.text.split("\\s+"))   // split it into words
    .filter(_.startsWith("#"))       // filter hashtag words
    .map(_.toLowerCase)              // normalize hashtags
    .map((_, 1))                     // create tuples for counting
    .reduceByKey((a, b) => a + b)    // accumulate counters
    .top(n)(Ordering.by[(String, Int), Int](_._2)) // return ordered top hashtags

Example from Quill article

RDDs

  • Catch errors at compile-time

Integer RDD

String RDD

Double RDD

When to use RDDs

  • Low-level API and control
  • Compile time typechecking

Quill

  • https://medium.com/@fwbrasil/quill-spark-a-type-safe-scala-api-for-spark-sql-2672e8582b0d

Spark Ecosystem

Spark Ecosystem

  • Spark Libraries
    • MLlib
    • Spark Streaming
    • GraphFrames
      • Graph processing using Cypher graph query language (Spark 3.0)
  • Third Party Libraries
    • Flint
      • Time-series Library for Spark

Interlude: Columnar

Spark Optimization

What is Spark?

  • https://www.youtube.com/watch?v=RmUn5vHlevc
    • Explains catalyst (Video is fantastic)
      • Pure functions, fixed points, immutable trees, rewrites
    • Transformations
      • Two kinds
        • transform trees without changing the type of tree (ex. expression -> expression, logical plan -> logical plan, or physical plan -> physical plan)
        • transform tree into different type of tree (used for logical plan -> physical plan)
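The four ideas listed for the video (pure functions, fixed points, immutable trees, rewrites) fit in a few lines of plain Python. This is a toy in the spirit of a rule-based optimizer, not Catalyst's actual rules or tree classes; expressions are nested tuples and every name is invented for the sketch.

```python
def transform_up(tree, rule):
    # Rebuild the (immutable) tree bottom-up, applying the rule at every node.
    if isinstance(tree, tuple):
        op, *children = tree
        tree = (op, *(transform_up(child, rule) for child in children))
    return rule(tree)

# Rules are pure functions: node in, (possibly rewritten) node out.
def fold_constants(node):
    if isinstance(node, tuple) and node[0] == "add":
        _, left, right = node
        if isinstance(left, int) and isinstance(right, int):
            return left + right
    return node

def drop_add_zero(node):
    if isinstance(node, tuple) and node[0] == "add":
        _, left, right = node
        if right == 0:
            return left
        if left == 0:
            return right
    return node

def optimize(tree, rules):
    # Apply the batch of rules repeatedly until the tree stops changing (a fixed point).
    while True:
        new_tree = tree
        for rule in rules:
            new_tree = transform_up(new_tree, rule)
        if new_tree == tree:
            return tree
        tree = new_tree

# (x + (1 + -1)) + (2 + 3), where "x" stands for a column reference.
expr = ("add", ("add", "x", ("add", 1, -1)), ("add", 2, 3))
optimized = optimize(expr, [fold_constants, drop_add_zero])
print(optimized)
```

Both rules here map an expression to an expression, so they are examples of the first kind of transformation; turning a logical plan into a physical plan would be the second kind.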
>>> import findspark
>>> findspark.init()
>>> import pyspark
>>> spark = pyspark.sql.SparkSession.builder.appName('Spark Tech Talk').getOrCreate()

# Spark computation on two tables
>>> t1 = spark.range(2000000)
>>> t2 = spark.range(2000000)
>>> result = t1.join(t2, on=t1.id == t2.id).groupBy().count()

# See execution plan
>>> result.explain()
== Physical Plan ==
*(6) HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition
   +- *(5) HashAggregate(keys=[], functions=[partial_count(1)])
      +- *(5) Project
         +- *(5) SortMergeJoin [id#0L], [id#2L], Inner
            :- *(2) Sort [id#0L ASC NULLS FIRST], false, 0
            :  +- Exchange hashpartitioning(id#0L, 200)
            :     +- *(1) Range (0, 2000000, step=1, splits=12)
            +- *(4) Sort [id#2L ASC NULLS FIRST], false, 0
               +- ReusedExchange [id#2L], Exchange hashpartitioning(id#0L, 200)

# Takes a few seconds
>>> result.show()
+-------+
|  count|
+-------+
|2000000|
+-------+

Ex. Spark Plan and Execution

...
>>> result.explain(extended=True)
== Parsed Logical Plan ==
Aggregate [count(1) AS count#19L]
+- Join Inner, (id#0L = id#2L)
   :- Range (0, 2000000, step=1, splits=Some(12))
   +- Range (0, 2000000, step=1, splits=Some(12))

== Analyzed Logical Plan ==
count: bigint
Aggregate [count(1) AS count#19L]
+- Join Inner, (id#0L = id#2L)
   :- Range (0, 2000000, step=1, splits=Some(12))
   +- Range (0, 2000000, step=1, splits=Some(12))

== Optimized Logical Plan ==
Aggregate [count(1) AS count#19L]
+- Project
   +- Join Inner, (id#0L = id#2L)
      :- Range (0, 2000000, step=1, splits=Some(12))
      +- Range (0, 2000000, step=1, splits=Some(12))

== Physical Plan ==
*(6) HashAggregate(keys=[], functions=[count(1)], output=[count#19L])
+- Exchange SinglePartition
   +- *(5) HashAggregate(keys=[], functions=[partial_count(1)], output=[count#22L])
      +- *(5) Project
         +- *(5) SortMergeJoin [id#0L], [id#2L], Inner
            :- *(2) Sort [id#0L ASC NULLS FIRST], false, 0
            :  +- Exchange hashpartitioning(id#0L, 200)
            :     +- *(1) Range (0, 2000000, step=1, splits=12)
            +- *(4) Sort [id#2L ASC NULLS FIRST], false, 0
               +- ReusedExchange [id#2L], Exchange hashpartitioning(id#0L, 200)

Detailed Logical/Physical Plan

Spark SQL Engine

Spark SQL Engine

MapReduce Joins

Reduce-side Joins and Grouping

  • Sort-merge
  • GROUP BY
  • Skew Join / Sharded Join

Map-side Joins

  • Broadcast Hash Join
  • Partitioned Hash Join
  • Merge Join

Broadcast Hash Join

  • Also known as
    • Map-side join (Hadoop community)
    • Star-schema join
    • Replicated join
  • Join a large table (fact) with relatively small tables (dimensions) to avoid sending all data of the large table over the network
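A single-machine sketch of the idea: the small table is turned into a hash table once, and every partition of the large table is joined against it locally. The table contents and column names are hypothetical, and in Spark the lookup table would be shipped to each executor rather than shared in one process.

```python
def broadcast_hash_join(fact_partitions, dimension_rows, key):
    # Build the hash table once from the small (dimension) side.
    lookup = {}
    for row in dimension_rows:
        lookup.setdefault(row[key], []).append(row)

    # Join each partition of the large (fact) table in place: no fact rows move.
    joined = []
    for partition in fact_partitions:
        out = []
        for fact in partition:
            for dim in lookup.get(fact[key], []):
                out.append({**fact, **dim})
        joined.append(out)
    return joined

# Hypothetical data: a large fact table in two partitions and a small dimension table.
impressions = [
    [{"ad_id": 1, "network_id": 10}, {"ad_id": 2, "network_id": 20}],
    [{"ad_id": 3, "network_id": 10}, {"ad_id": 4, "network_id": 99}],
]
networks = [
    {"network_id": 10, "name": "A"},
    {"network_id": 20, "name": "B"},
]

joined = broadcast_hash_join(impressions, networks, key="network_id")
print([len(partition) for partition in joined])
```

The output keeps the partitioning of the large table, which is the sign that no shuffle of the fact side was needed; the row with no matching network is dropped because this is an inner join.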

See Spark SQL Joins for basic, API level joins

See Mastering Spark SQL: Broadcast Joins for Spark broadcast details

Spark Joins

Shuffled Hash Join

  • Was removed in favor of Sort Merge Join in 1.6, but re-added in 2.0
    • Shuffled Hash Join is still useful when:

      • Any partition of the build side can fit in memory

      • The build side is much smaller than the stream side, so building a hash table on the smaller side should be faster than sorting the bigger side

    • Sort Merge Join is more robust

      • Shuffled Hash Join requires the hash table to fit in memory

      • Sort Merge Join can spill to disk
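The trade-off is easier to see with both algorithms side by side on one already-shuffled partition. These are toy, in-memory versions written for this comparison (the function names and data are invented); they show the access patterns, not Spark's operators.

```python
def hash_join(build, stream):
    # The whole build side is held in a hash table, so it must fit in memory.
    table = {}
    for key, value in build:
        table.setdefault(key, []).append(value)
    return [(key, s_val, b_val)
            for key, s_val in stream
            for b_val in table.get(key, [])]

def sort_merge_join(left, right):
    # Both sides are sorted, then walked in step; only the current run of
    # equal keys needs to be held, which is what makes spilling possible.
    left = sorted(left, key=lambda kv: kv[0])
    right = sorted(right, key=lambda kv: kv[0])
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Find the run of right rows sharing this key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == left_key:
                j_end += 1
            # Pair every left row in the matching run with that right run.
            while i < len(left) and left[i][0] == left_key:
                for _, right_val in right[j:j_end]:
                    out.append((left_key, left[i][1], right_val))
                i += 1
            j = j_end
    return out

big = [(2, "b"), (1, "a"), (3, "d"), (2, "c")]
small = [(3, "y"), (2, "x"), (4, "z")]

merged = sort_merge_join(big, small)
hashed = hash_join(build=small, stream=big)
print(sorted(merged) == sorted(hashed))
```

Both produce the same rows; the hash join skips the sort but depends on the build side fitting in memory.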