Apache Spark

@ TVision

Outline

Talk #1

What is Spark?
Spark Terminology
Why does Spark exist?
How do we use it?
- Reductions
- Impressions
- Bacchus
Spark Architecture - Part I
Spark Lifecycle
Spark Execution - Part I

Talk #2+

Distributed Systems
- Ecosystem & History
Interlude: DataFrames
Spark Programming
- Structured APIs, RDDs, Distributed Variables
Spark Architecture - Part II
Spark Execution - Part II
Spark Ecosystem
Interlude: Columnar Formats
Spark Optimization
- Catalyst & Tungsten Internals

Logistics

Lots of jargon
- Bold terms can be found in Spark Terminology and Concepts
Two part talk (maybe more)
- Part I
  - What Spark is, what problems it solves, some origin stories on how it was developed, and where it’s being used at TVision
- Part II
  - Programming with Spark, Spark's features, internals, and the ecosystem

Part I

Spark Origins and Fundamentals

What is Spark?

Let's look at the official site

Spark for Programmers

Framework for managing + coordinating tasks on data distributed across a cluster
Able to mix complex procedural operations and relational queries in a compositional way
Nice abstractions that make distributed programming easy to work with

Spark for FP

Spark has clearly been heavily influenced by modern functional programming
Provides simple, battle-tested solutions to a lot of industrial problems that languages like Haskell don't (yet)

Terminology Dump

Architecture

Cluster
- A group of computers that pool their resources together so that we can use all the cumulative resources as if they were a single computer
Cluster manager
- manages the cluster of machines that Spark uses to execute tasks.
  - Examples include Spark's standalone cluster manager, Hadoop YARN, and Apache Mesos.
    - Spark is cluster manager agnostic. While it natively supports the Hadoop YARN cluster manager, it requires nothing from Hadoop itself.

Platform

Google File System
- a proprietary distributed file-system that stores data on commodity machines
MapReduce
- a programming model and implementation for processing and generating huge data sets with a parallel, distributed algorithm on a cluster
Hadoop
- Open source GFS/MR implementation
  - Hadoop Distributed File System (HDFS)
  - Hadoop YARN – a cluster manager
  - Hadoop MapReduce
- Spark does not provide its own storage; it runs on top of something like HDFS or S3 instead

Spark Platform

Spark
- Like Hadoop MapReduce, Spark is an open-source, distributed processing system. However, unlike Hadoop MapReduce, Spark uses directed acyclic graphs (DAG) for execution plans and in-memory caching for datasets.
PySpark
- API for writing Spark programs in Python.
- If you’re using the structured APIs, your code should run just about as fast as if you had written it in Scala.
Spark SQL
- Provides DataFrames API + Catalyst optimizer

Spark Engine / "Compiler"

Catalyst
- Spark's logical planner and relational optimizer for operations on the Structured APIs; it maintains its own type information throughout planning and processing
Tungsten
- Spark's physical execution optimizer (runs after Catalyst optimization), which improves memory and CPU efficiency

Why does Spark exist?

What is Batch Processing?

Takes a large amount of input data, runs a job to process it, and produces some output data.
Jobs often take a while (ex. few minutes to days)
- Scheduled to run periodically
- The primary performance measure of a batch job is usually throughput
  - time to process input dataset of a certain size (ex. 99 GB/sec)

-- BatchJobScript.hs
processBigData :: BigInput -> BigOutput

-- batch.crontab
10 10 * * * ./run-batch-job.sh input.txt

Scaling Batch Processing

Example: In 2003, Google was indexing the web
- 20+ billion web pages * 20 KB/page = 400+ Terabytes of data

Batch Processing

On single computer

Ex. Process and store 400+ TB

Say they had 1 computer that could read 50 MB/sec from disk
- 3 months to read the web
~1000 disks just to store the web
- Even more to do anything with it

BAD

Scaling Batch Processing

On many computers

Ex. Process and store 400+ TB

1000 machines, < 3 hours

GOOD
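The back-of-the-envelope numbers on these slides can be reproduced in a few lines. This is a rough sketch that assumes decimal units (1 KB = 1,000 bytes), perfectly even work splitting, and zero coordination overhead; the variable names are illustrative.

```python
# Back-of-the-envelope estimate using the figures from the slides.
PAGES = 20e9            # 20+ billion web pages
BYTES_PER_PAGE = 20e3   # 20 KB per page
READ_RATE = 50e6        # one machine reads 50 MB/sec from disk

total_bytes = PAGES * BYTES_PER_PAGE               # roughly 400 TB
single_machine_seconds = total_bytes / READ_RATE
single_machine_days = single_machine_seconds / 86_400

# Assumption: 1000 machines split the work evenly with no overhead.
cluster_hours = single_machine_seconds / 1000 / 3600

print(f"{total_bytes / 1e12:.0f} TB total")
print(f"1 machine: {single_machine_days:.0f} days")
print(f"1000 machines: {cluster_hours:.1f} hours")
```

Under these assumptions a single machine needs about three months just to read the data, while 1000 machines finish in a little over two hours.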

Scaling Batch Processing

On many computers

Bad news:
- Programming complications
  - Communication and coordination
  - Recovering from machine failure
  - Status reporting
  - Debugging
  - Optimization
  - Locality
- You need to figure all of these out for one problem, then solve them again for every other processing problem you have

"To solve problems at scale, paradoxically, you have to know the smallest details"

Alan Eustace (Former Engineering Head @ Google)

The Paradox of Scale

Necessary Fault Tolerance

“Imagine your average computer stays up for 3 years before it experiences some hardware or operating system failure at which point it keels over. That’s not such a big deal, except if you are running a computation on 1000s of machines that takes on the order of a day.

You will run into some sort of failure during that computation. You have to be prepared for failure at the software level because when the computations are large enough, you will experience failures across machines.”

Jeff Dean (co-author of MapReduce, BigTable, Spanner, TensorFlow, The Universe)

Quote from https://youtu.be/quSmkZtty4o?t=392
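The arithmetic behind the quote can be worked out directly. This sketch takes the quote's figures (3 years of uptime per machine, a one-day computation) and assumes exactly 1000 machines with independent, memoryless failures; both the machine count and the failure model are simplifying assumptions, not something the talk specifies.

```python
import math

MTTF_DAYS = 3 * 365   # an average machine stays up about 3 years
MACHINES = 1000       # assumption: the low end of "1000s of machines"
JOB_DAYS = 1          # computation takes on the order of a day

# Expected number of machine failures while the job is running.
expected_failures = MACHINES * JOB_DAYS / MTTF_DAYS

# Assumption: failures are independent and exponentially distributed.
p_at_least_one = 1 - math.exp(-expected_failures)

print(f"expected failures during the job: {expected_failures:.2f}")
print(f"P(at least one failure): {p_at_least_one:.0%}")
```

Even at 1000 machines the job is more likely than not to see a failure, and the probability climbs quickly as the cluster grows.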

MapReduce

A programming model and implementation for processing and generating huge data sets with a parallel, distributed algorithm on a cluster

ELI5: MapReduce

Example: Parallel program for counting all the books in a library
- Map
  - You count up shelf #1, I count up shelf #2
    - Whenever one of us finishes we move to next uncounted shelf
    - The more people we get, the faster it goes.
- Reduce
  - Now we get together and add our individual counts

Example from https://news.ycombinator.com/item?id=2849163

MapReduce

Provides a reliable, scalable, maintainable way to process lots of data on lots of cheap, commodity hardware
- Write in a functional style
  - Map
    - Apply a function to distributed data (which is spread across many computers)
  - Reduce
    - Aggregate transformed data and get a result
  - Shuffle
    - Redistribute data on cluster
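The three phases above can be sketched on a single machine. This is a toy illustration of the programming model in plain Python, not Hadoop's or Google's API; the library data and function names are made up for the example.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical library: each shelf plays the role of one partition.
shelves = [
    ["fiction", "history", "fiction"],
    ["science", "fiction"],
    ["history", "science", "science"],
]

def map_phase(shelf):
    """Map: emit a (key, 1) pair for every book on one shelf."""
    return [(genre, 1) for genre in shelf]

def shuffle_phase(mapped_shelves):
    """Shuffle: bring all values for the same key together."""
    grouped = defaultdict(list)
    for pairs in mapped_shelves:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: aggregate the values collected for each key."""
    return {key: reduce(lambda a, b: a + b, values) for key, values in grouped.items()}

# Each shelf could be mapped by a different worker in parallel.
mapped = [map_phase(shelf) for shelf in shelves]
counts = reduce_phase(shuffle_phase(mapped))
print(counts)
```

The map step never looks at more than one shelf, which is what lets a framework spread it across many machines.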

Example: Render Map Tiles

From Jeff Dean's Building Software Systems at Google and Lessons Learned lecture

Parallel MapReduce

From Jeff Dean's Building Software Systems at Google and Lessons Learned lecture

Slide from Jeff Dean's Building Software Systems at Google and Lessons Learned lecture

Reality of Machine Failure

If we hadn’t had to deal with failures (of computers), if we had a perfectly reliable set of computers to run (our code) on, we probably never would've implemented MapReduce. Without having failures, the support code (that MapReduce provides) just isn’t complicated.

Sanjay Ghemawat (co-author of GFS, MapReduce, BigTable)

Quote from https://youtu.be/quSmkZtty4o?t=339

Problem with MapReduce

Materialization of intermediate state
- Mappers are often redundant (by map/fold fusion)
  - They just read back the same file that was just written by a reducer, and prepare it for the next stage of partitioning and sorting. In many cases, the mapper code could be part of the previous reducer
- Overkill for temporary data
  - Storing in a distributed filesystem means those files are replicated across several nodes, which is often overkill for such temporary data
- Jobs can only start when all tasks in the preceding jobs have completed. Waiting slows down the execution of the workflow as a whole.

[Diagram: data flow for MapReduce and for Spark, each labeled with HDFS, Input, Intermediate Output, and Output]

Spark

Storing and reading data in memory is much faster but adds a lot of complexity in a distributed setting
- Invented RDD abstraction to solve this

Need for Recomputation

Spark, Flink, and Tez avoid writing intermediate state to HDFS, so they take a different approach to tolerating faults
- If a machine fails and the intermediate state on that machine is lost, it is recomputed from other data that is still available
  - Recompute from a prior intermediate state rather than the original input, if possible
- Requires deterministic operations
To enable this recomputation, the framework must keep track of how a given piece of data was computed—which input partitions it used, and which operators were applied to it
- Spark uses RDDs for tracking ancestry of data
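A minimal sketch of lineage-based recomputation, assuming one-to-one (narrow) dependencies and deterministic functions. `LineageNode` and `INPUT` are hypothetical names for this illustration and are not part of Spark's API.

```python
class LineageNode:
    """A dataset that remembers how each of its partitions was computed."""

    def __init__(self, compute, parent=None):
        self.compute = compute   # deterministic: (partition index, parent rows) -> rows
        self.parent = parent     # the dataset this one was derived from
        self.cache = {}          # partition index -> materialized rows

    def get(self, index):
        # If the cached copy is missing, rebuild it from the parent's partition.
        if index not in self.cache:
            source = self.parent.get(index) if self.parent else None
            self.cache[index] = self.compute(index, source)
        return self.cache[index]

# Stands in for durable input, e.g. blocks on HDFS.
INPUT = {0: [1, 2, 3], 1: [4, 5, 6]}

base = LineageNode(lambda i, _: list(INPUT[i]))
doubled = LineageNode(lambda i, rows: [r * 2 for r in rows], parent=base)
filtered = LineageNode(lambda i, rows: [r for r in rows if r > 4], parent=doubled)

before = filtered.get(1)   # computes and caches partition 1 along the whole chain

# Simulate losing the machine that held partition 1 of every dataset.
for node in (base, doubled, filtered):
    node.cache.pop(1, None)

after = filtered.get(1)    # rebuilt from the input by replaying the lineage
```

Because every step is deterministic, the recomputed partition is identical to the one that was lost.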

Complexity of MapReduce

MapReduce is still lower level than programmers like me want to write software with
- Don’t want to always have to reason about whether we need a broadcast hash join, map-side merge join, or whatever
- Implementing a complex processing job using the raw MapReduce APIs is actually quite hard and laborious
RDDs provide some help but not a lot

Spark SQL

Compositional interface for mixing complex procedural operations and relational queries
- DataFrame API
  - Can perform relational operations on both external data sources and Spark’s built-in distributed collections.
- Catalyst
  - Extensible optimizer

Towards Declarative Languages

Optimization #1

The choice of join algorithm can make a big difference to the performance of a batch job
- Spark, Flink, and Hive have query optimizers

Towards Declarative Languages

Optimization #2

Hive, Spark DataFrames, and Impala also use vectorized execution
- Iterating over data in a tight inner loop that is friendly to CPU caches, and avoiding function calls
Spark generates JVM bytecode and Impala uses LLVM to generate native code for these inner loops.

Towards Declarative Languages

Optimization #3

If a function contains only a simple filtering condition, or it just selects some fields from a record, then there is significant CPU overhead in calling the function on every record
- If such simple filtering and mapping operations are expressed in a declarative way, the query optimizer can take advantage of column-oriented storage layouts and read only the required columns from disk

Spark at TVision

Reduction

Transform raw content and presence (ACR and CV) into second-by-second observations
- Determine who is watching and whether they're paying attention, and what they're watching

See RACR Flow and Huginn Reduction Flow

Spark + Tracker

See Ingest Tracker Spark Overview

Reduction

From Device to Backend to Redshift

Impressions

Ad and Program Impressions
- Next generation ranking build

Bacchus

Run ad-hoc Spark jobs on Flintrock (maybe Kubernetes in the future)
- Process ground truth videos
Cloud CV processing
- Run OpenVINO on nodes in cluster
End-to-end testing from videos to CV to reduced CV

Spark Architecture

Part I

Spark Architecture

Spark Application
- Job you want to run on Spark, which consists of a driver process and a set of executor processes
Driver
- The driver process runs your main() function, sits on a node in the cluster, and is responsible for:
  - Maintaining all relevant information during the lifetime of the Spark Application
  - Responding to a user’s program or input
  - Analyzing, distributing, and scheduling work across the executors
- It must interface with the cluster manager in order to actually get physical resources and launch executors.

Executors
- Responsible for actually carrying out the work that the driver assigns them. Each executor is responsible for:
  - Executing code assigned to it by the driver
  - Reporting the state of the computation on that executor back to the driver node (ie success/failure and results)
- Each Spark Application has its own separate executor processes.

Spark Session
- You control your Spark Application through a driver process called the SparkSession.
  - The SparkSession instance is the way Spark executes user-defined manipulations across the cluster. There is a one-to-one correspondence between a SparkSession and a Spark Application.
- In Scala and Python, a Spark session is available as spark when you start Spark in the console / Spark Shell.

Spark Components

Communication

Spark Lifecycle

Lifecycle of Spark Application

Initiation

Request gets made to the cluster manager driver node asking for resources
- Asking for resources for the Spark driver process only
Cluster manager accepts this offer and places the driver onto a node in the cluster
The client process that submitted the original job exits and the application is off and running on the cluster

Lifecycle of Spark Application

Standby

Lifecycle of Spark Application

Launch

Driver process on the cluster begins running user code
- User code must provide a SparkSession that initializes a Spark cluster (e.g., driver + executors)
The SparkSession will subsequently communicate with the cluster manager, asking it to launch Spark executor processes across the cluster
The cluster manager responds by launching the executor processes and sends the relevant information about their locations to the driver process

Lifecycle of Spark Application

Execution

The driver and the workers communicate among themselves, executing code and moving data around
The driver schedules tasks onto each worker, and each worker responds with the status of those tasks and success or failure

Lifecycle of Spark Application

Completion

Once the Spark Application completes, the driver process exits with either success or failure
The cluster manager then shuts down the executors in that Spark cluster for the driver, at which point you can see the success or failure of the Spark Application by asking the cluster manager

Part II

Spark Programming and Internals

Outline

RDDs
Spark Execution
Spark Architecture II
PySpark Architecture
Interlude: DataFrames
Spark Programming
- Structured APIs
- RDDs
- Distributed Variables
Spark Ecosystem

Interlude: Columnar Formats
Spark at TVision - Part II
- EMR
- Spark UI
Spark Optimization
- Catalyst & Tungsten Internals
MapReduce Joins
Further Resources

Jeff Dean Facts

To Jeff Dean, "NP" means "No Problemo"
Jeff Dean's IDE doesn't do code analysis, it does code appreciation
Jeff Dean's PIN is the last 4 digits of pi
Google Search was Jeff Dean's N(ew G)oogler Project
Jeff Dean invented MapReduce so he could sort his fan mail
Emacs' preferred editor is Jeff Dean
Jeff Dean doesn't exist, he's actually an advanced AI created by Jeff Dean
Jeff Dean compiles and runs his code before submitting, but only to check for compiler and CPU bugs

RDD Fundamentals

Resilient Distributed Dataset

Represent an immutable, partitioned collection of elements that can be operated on in parallel.
RDDs are made up of:
- Partitions
  - Atomic pieces of the dataset. One or many per compute node
- A function for computing the dataset based on its parent RDDs.
- Dependencies
  - Models relationship between this RDD and its partitions with the RDD(s) it was derived from.
- Metadata about its partitioning scheme and data placement

How to get an RDD?

Two ways
- Parallelizing an existing collection in your driver program
  - SparkContext.parallelize()
- Referencing a dataset in an external storage system, such as HDFS, S3, or Postgres
  - Data Source API
    - SparkContext.textFile('hdfs://data.txt')

RDDs

Computations on RDDs are represented as a lineage graph, a DAG representing the computations done on the RDD.
- This representation/DAG is what Spark analyzes to do optimizations.

rdd = sc.textFile(...)
filtered = (
    rdd.map(...)
       .filter(...)
       .persist()   # keep the filtered data around for both actions below
)
count = filtered.count()
reduced = filtered.reduce(...)

Ex. Recomputing RDDs

Failure

Ex. Recomputing RDDs

Recovery

Various RDDs

Example Program w/ RDDs

Problem:

You collect lots of application logs and would like to analyze error events.

Before you can do this, you need to remove rows corresponding to other events (INFO, DEBUG, etc).

You have a cluster at your disposal to do this processing. Write a driver program that collects and aggregates all error events.

Example from RDD Fundamentals video

Example Program w/ RDDs

Error, ts, msg1
Warn,  ts, msg2
Error, ts, msg1

Info,  ts, msg8
Warn,  ts, msg2
Info,  ts, msg8

Error, ts, msg3
Info,  ts, msg5
Info,  ts, msg5

Error, ts, msg4
Warn,  ts, msg9
Error, ts, msg1

app.log

Driver Program

Cluster

Partitions

Error, ts, msg1
Warn,  ts, msg2
Error, ts, msg1

Info, ts, msg8
Warn, ts, msg2
Info, ts, msg8

Error, ts, msg4
Warn,  ts, msg9
Error, ts, msg1

Error, ts, msg3
Info,  ts, msg5
Info,  ts, msg5

logLinesRDD = sc.textFile(
  "app.log", 
  minPartitions=4
)

Partition #1

Partition #2

Partition #3

Partition #4

Error, ts, msg1

Error, ts, msg1

Error, ts, msg4

Error, ts, msg1

Error, ts, msg3

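errorsRDD = logLinesRDD.filter(
  # each element is a whole line of text, so compare its first comma-separated field
  lambda log: log.split(",")[0].strip() == "Error"
)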

Partition #1

Partition #2

Partition #3

Partition #4

Error, ts, msg1

Error, ts, msg1

Error, ts, msg3
Error, ts, msg4
Error, ts, msg1

cleanedRDD = errorsRDD.coalesce(2)

Partition #1

Partition #2

result = cleanedRDD.collect()
write_to_log("error.log", data=result)

$ cat error.log

Error, ts, msg1
Error, ts, msg1

Error, ts, msg3

Error, ts, msg4
Error, ts, msg1
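The same pipeline can be traced without a cluster. This is a plain-Python sketch that mimics the partitioning in the slides; the helper functions are illustrative stand-ins for `textFile`, `filter`, `coalesce`, and `collect`, not the Spark implementations.

```python
LOG_LINES = [
    "Error, ts, msg1", "Warn, ts, msg2", "Error, ts, msg1",
    "Info, ts, msg8", "Warn, ts, msg2", "Info, ts, msg8",
    "Error, ts, msg3", "Info, ts, msg5", "Info, ts, msg5",
    "Error, ts, msg4", "Warn, ts, msg9", "Error, ts, msg1",
]

def split_into_partitions(lines, n):
    """Split lines into contiguous, roughly equal partitions."""
    size = -(-len(lines) // n)   # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]

def filter_partition(partition):
    """Keep only error events; runs independently on each partition."""
    return [line for line in partition if line.split(",")[0].strip() == "Error"]

def coalesce(partitions, n):
    """Merge neighbouring partitions down to n partitions without reshuffling rows."""
    size = -(-len(partitions) // n)
    return [sum(partitions[i:i + size], []) for i in range(0, len(partitions), size)]

log_lines = split_into_partitions(LOG_LINES, 4)
errors = [filter_partition(p) for p in log_lines]
cleaned = coalesce(errors, 2)
result = [line for partition in cleaned for line in partition]   # like collect()
```

Filtering leaves one partition empty and the others small, which is why coalescing to fewer partitions before collecting makes sense here.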

Example: On Driver

[Diagram: app.log on the local driver is loaded into logLines, filtered to errors, coalesced into cleaned, and written to error.log on the local driver]

Example: From HDFS to S3

[Diagram: Blocks 1-4 are loaded into logLines, filtered to errors, coalesced into cleaned, and written to s3://err1.log and s3://err2.log]

-- Haskell-flavored pseudocode for the RDD interface
data Partition = Partition
  { partitionIndex :: Int }
data Partitioner = 
  HashPartitioner | RangePartitioner | ..
data DependencyFlavor =
  Narrow | Shuffle | None
data Dependency f a = Dependency
  { parent :: forall f. (RDD f) => f a
  , flavor :: DependencyFlavor }

class RDD f where
  {-# MINIMAL getDeps, getPartitions, compute #-}
  getPartitions   :: [Partition]
  compute         :: Partition -> TaskCtx -> [a]
  getDeps         :: [Dependency f a]
  getPreferredLoc :: Partition -> [Text]
  getPartitioner  :: Maybe Partitioner

-- Data Source API  
parallelize  :: (RDD f) => [a] -> f a
-- Transformations
intersection :: (RDD f) => f a -> f a -> f a
cartesian    :: (RDD f) => f a -> f b -> f (a, b)
-- Actions
count        :: (RDD f) => f a -> Long

data TaskState = 
  Completed | Interrupted | RunningLocally
data TaskCtx = TaskCtx
  { state :: TaskState
  , attemptNum :: Int
  , partitionId :: Int
  , stageId :: Int, ..more config and state }

HadoopRDD

Method	Implementation	Note
Partitions	One per HDFS block
Dependencies	None	Base/Input RDD
Compute	Read corresponding block
Preferred Location	HDFS block
Partitioner	None	Just partition per block, no repartitioning going on

FilteredRDD

Method	Implementation	Note
Partitions	Same as parent
Dependencies	One to one (narrow)
Compute	Filter	Go to parent's partition and filter it
Preferred Location	None	Ask parent
Partitioner	None	Probably parent partitioner

JoinedRDD

Method	Implementation	Note
Partitions	One per reduce task
Dependencies	Shuffle on each parent
Compute	Read and join shuffled data
Preferred Location	None (sometimes inherit)	Typically has to get data over network. Sometimes aligns to Parent RDD's location
Partitioner	HashPartitioner

Specialized Connector RDDs

CassandraRDD
- Pushdown predicate and projection
  - Push down filters into Cassandra so you only select columns/rows matching predicate
  - More on these kinds of optimizations later
- Rather than reading full data into Spark and filtering after (ex. HDFS)

Spark Execution

Transformations
- instructions you provide Spark about how you would like to modify a DataFrame / RDD
- Lazy, not evaluated until an action is called
- Can be "narrow" or "wide"
Actions
- instructions to Spark to compute a result from a series of transformations
- Eager (force evaluation)
  - Upon calling an action, Spark creates, optimizes, and runs an execution plan.
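A small sketch of the lazy/eager split, written in plain Python. `LazyDataset` is a hypothetical class for illustration only; it records a plan on each transformation and runs it only when an action is called.

```python
class LazyDataset:
    """Records transformations and only runs them when an action is called."""

    def __init__(self, data, plan=()):
        self._data = data
        self._plan = plan   # tuple of (kind, function) steps, never mutated

    # Transformations: return a new dataset with a longer plan and do no work.
    def map(self, fn):
        return LazyDataset(self._data, self._plan + (("map", fn),))

    def filter(self, fn):
        return LazyDataset(self._data, self._plan + (("filter", fn),))

    # Actions: execute the recorded plan and return a result.
    def collect(self):
        rows = iter(self._data)
        for kind, fn in self._plan:
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return list(rows)

    def count(self):
        return len(self.collect())

calls = []

def traced_square(x):
    calls.append(x)   # record that real work happened
    return x * x

ds = LazyDataset(range(5)).map(traced_square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)   # nothing has run yet
result = ds.collect()              # the action triggers execution
```

Deferring the work is what gives an engine the chance to look at the whole plan before running any of it.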

Narrow Transformations
- Transformations for which each input partition contributes to one output partition (AKA narrow dependencies)

Wide Transformations / Shuffles
- Transformations for which input partitions contribute to many output partitions (AKA wide dependencies).
- Whenever Spark performs a shuffle, it must write results to disk (AKA shuffle persistence).
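The two definitions can be checked mechanically by looking at which output partitions each input partition feeds. This sketch uses integer keys so that `hash(key)` is stable between runs; the function names are made up for the illustration.

```python
def classify(dependency_map):
    """dependency_map: input partition index -> set of output partitions it feeds.

    Narrow: every input partition contributes to at most one output partition.
    Wide:   some input partition contributes to several output partitions.
    """
    is_narrow = all(len(outs) <= 1 for outs in dependency_map.values())
    return "narrow" if is_narrow else "wide"

def map_dependencies(partitions):
    # map/filter keep each row in its own partition: input i feeds output i only.
    return {i: {i} for i in range(len(partitions))}

def group_by_key_dependencies(partitions, num_outputs):
    # Hash partitioning sends each key to output partition hash(key) % num_outputs.
    return {
        i: {hash(key) % num_outputs for key, _ in rows}
        for i, rows in enumerate(partitions)
    }

partitions = [[(1, "a"), (2, "b")], [(1, "c"), (4, "d")], [(3, "e"), (2, "f")]]

print(classify(map_dependencies(partitions)))
print(classify(group_by_key_dependencies(partitions, 2)))
```

Grouping by key is wide here because a single input partition holds keys that belong to different output partitions.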

Dependencies

Narrow or Wide?

You ask your friend for $100, who has exactly $100 to give you

Can you lend me $100?

Narrow!

Narrow or Wide?

You ask your friends for $100, each of whom gives you $25 of their $100

Can someone lend me $100?

Narrow!

Narrow or Wide?

You ask your friends for $50, two of whom have $50 to give you

Can I borrow $50 each?

Narrow!

Narrow or Wide?

You ask your friends, two of whom have $50, for $50 in $25 increments for each hand

Can I borrow $50 each?

Wide!

Ex. Model dependencies

Visualized DAG

Ex. Model dependencies

Resolved DAG

The B to G join is narrow because groupByKey already partitions the keys and places them appropriately in B after shuffling. Thus operations like join can sometimes be narrow and sometimes be wide.
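Why a join can be narrow once both sides share a partitioner can be sketched in plain Python. The helpers and sample data below are hypothetical, and integer keys are used so that `hash(key)` is stable between runs.

```python
def hash_partition(pairs, num_partitions):
    """Shuffle step: place each (key, value) pair in partition hash(key) % num_partitions."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions

def join_copartitioned(left_parts, right_parts):
    """Join partition i of the left only with partition i of the right (no data movement)."""
    joined = []
    for left, right in zip(left_parts, right_parts):
        lookup = {}
        for key, value in right:
            lookup.setdefault(key, []).append(value)
        joined.append([(k, (lv, rv)) for k, lv in left for rv in lookup.get(k, [])])
    return joined

users = [(1, "ann"), (2, "bob"), (3, "cy")]
visits = [(1, "/home"), (3, "/docs"), (1, "/faq")]

# Both sides are partitioned the same way, so matching keys already sit together.
user_parts = hash_partition(users, 2)
visit_parts = hash_partition(visits, 2)
joined = join_copartitioned(user_parts, visit_parts)
```

Each output partition reads exactly one partition from each parent, which is the narrow case; without the shared partitioner, every output partition would have to fetch rows from every input partition.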

Transformations

Transformations with (usually) Narrow dependencies:

map
mapValues
flatMap
filter
mapPartitions
mapPartitionsWithIndex

Transformations with (usually) Wide dependencies: (might cause a shuffle)

cogroup
groupWith
join
leftOuterJoin
rightOuterJoin
groupByKey
reduceByKey
combineByKey
distinct
intersection
repartition
coalesce (when called with shuffle=True)

Example Spark Job

[Diagram: example job made of narrow transformations, a wide transformation, and an action]

Given the following program, how should Spark execute it?

(How to divvy up tasks, AKA computations on a partition)

Task per transition

Too many tasks
Lots of intermediate state
High Overhead
- Each operation on i-th partition loops over input individually

Task per Output Partition

Fewer tasks (4 < 12)
Less intermediate state
Each task has to do a lot more
Wide transformations / Shuffles
- Have to recompute all input tasks if any part of shuffle fails

Stages of Tasks per Shuffle

Pipelining
- operation that Spark automatically performs on narrow transformations that allows multiple transformations to be performed in-memory
  - AKA no data movement

Without Pipelining

With Pipelining
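A toy comparison of the two strategies on one partition. Each step is modeled as a function from one row to a list of output rows (so a filter returns either one row or none); this modeling choice and the function names are assumptions made for the sketch.

```python
def without_pipelining(partition, steps):
    """Run each narrow transformation as its own pass, materializing every intermediate result."""
    intermediates = []
    rows = list(partition)
    for step in steps:
        rows = [out for row in rows for out in step(row)]
        intermediates.append(rows)
    return rows, intermediates

def with_pipelining(partition, steps):
    """Push each row through all steps before touching the next row."""
    def run(row, remaining):
        if not remaining:
            yield row
            return
        for out in remaining[0](row):
            yield from run(out, remaining[1:])
    return [out for row in partition for out in run(row, steps)]

steps = [
    lambda x: [x + 1],                     # map
    lambda x: [x] if x % 2 == 0 else [],   # filter
    lambda x: [x * 10],                    # map
]

unfused, intermediates = without_pipelining([1, 2, 3, 4], steps)
fused = with_pipelining([1, 2, 3, 4], steps)
```

Both versions produce the same rows; the pipelined one simply never builds the per-step intermediate lists.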

Spark Job Terms

Spark job
- Each Spark application is made up of one or more Spark jobs. Spark jobs within an application are executed serially (unless you use threading to launch multiple actions in parallel).
- Actions always return results. Each job breaks down into a series of stages, the number of which depends on how many shuffle operations need to take place.

Spark Job Terms

Stages - represent groups of tasks that can be executed together to compute the same operation on multiple machines.
- In general, Spark will try to pack as much work as possible (i.e., as many transformations as possible inside your job) into the same stage, but the engine starts new stages after every shuffle.

Spark Job Terms

Tasks
- A unit of computation applied to a unit of data (the partition). Each task corresponds to a combination of blocks of data and a set of transformations that will run on a single executor.
  - If there is one big partition in our dataset, we will have one task. If there are 1,000 little partitions, we will have 1,000 tasks that can be executed in parallel.
- Partitioning your data into a greater number of partitions means that more can be executed in parallel
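How a job breaks into stages and tasks can be sketched from a list of transformations tagged narrow or wide. The job below and its partition counts are made up for illustration, and the wide operation is attributed to the stage that reads the shuffled data.

```python
def split_into_stages(transformations):
    """Group consecutive narrow transformations; start a new stage at every shuffle."""
    stages, current = [], []
    for name, kind in transformations:
        if kind == "wide":
            stages.append(current)   # the shuffle closes the current stage
            current = [name]         # the reading side of the shuffle opens the next one
        else:
            current.append(name)
    stages.append(current)
    return stages

job = [
    ("textFile", "narrow"), ("map", "narrow"), ("filter", "narrow"),
    ("reduceByKey", "wide"), ("map", "narrow"),
    ("sortByKey", "wide"),
]
stages = split_into_stages(job)

# One task per partition per stage. Assumption: partition counts are known up front.
partitions_per_stage = [4, 2, 2]
total_tasks = sum(partitions_per_stage)

for stage, num_tasks in zip(stages, partitions_per_stage):
    print(f"{num_tasks} tasks running {stage}")
```

Two shuffles give three stages, and the task count of each stage follows directly from its number of partitions.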

The Spark Shuffle

A physical repartitioning of the data
- Ex. Sorting a DataFrame, or grouping data that was loaded from a file by key (which requires sending records with the same key to the same node).
  - This type of repartitioning requires coordinating across executors to move data around. Spark starts a new stage after each shuffle, and keeps track of what order the stages must run in to compute the final result.
Ex. reduce-by-key
- Where input data for each key needs to first be brought together from many nodes

Shuffle Steps

“Source” tasks (those sending data) write shuffle files to their local disks during their execution stage.
The grouping and reduction stage launches and runs tasks that fetch their corresponding records from each shuffle file and perform that computation
- Ex. fetches and processes the data for a specific range of keys
Saving the shuffle files to disk lets:
- Spark run this stage later in time than the source stage
  - Ex. if there are not enough executors to run both at the same time
- The engine re-launch reduce tasks on failure without rerunning all the input tasks
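The shuffle steps can be imitated with files in a temporary directory. This is a simplified sketch: the JSON file layout, the file naming, and the function names are assumptions for illustration and do not reflect Spark's actual shuffle file format.

```python
import json
import tempfile
from pathlib import Path

def run_map_task(map_id, records, num_reducers, shuffle_dir):
    """Source task: bucket records by key and write one shuffle file per reducer."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in records:
        buckets[hash(key) % num_reducers].append([key, value])
    for reducer_id, bucket in enumerate(buckets):
        path = shuffle_dir / f"map{map_id}_reduce{reducer_id}.json"
        path.write_text(json.dumps(bucket))

def run_reduce_task(reducer_id, num_maps, shuffle_dir):
    """Reduce task: fetch this reducer's bucket from every map output and sum per key."""
    totals = {}
    for map_id in range(num_maps):
        path = shuffle_dir / f"map{map_id}_reduce{reducer_id}.json"
        for key, value in json.loads(path.read_text()):
            totals[key] = totals.get(key, 0) + value
    return totals

inputs = [[(1, 10), (2, 5)], [(1, 1), (3, 7)]]   # two input partitions, integer keys
map_runs = 0
with tempfile.TemporaryDirectory() as tmp:
    shuffle_dir = Path(tmp)
    for map_id, records in enumerate(inputs):
        run_map_task(map_id, records, 2, shuffle_dir)
        map_runs += 1
    first_attempt = run_reduce_task(1, len(inputs), shuffle_dir)
    # Simulated failure: the reduce task is re-launched; the map tasks are not rerun.
    retry = run_reduce_task(1, len(inputs), shuffle_dir)
```

Because the map outputs are already on disk, the retried reduce task only re-reads files.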

Shuffle Persistence

Shuffle Persistence
- The step of saving files to disk
- Allows new jobs running over data that’s already been shuffled to skip re-running the “source” side of the shuffle.
  - Because the shuffle files were already written to disk earlier, Spark knows that it can use them to run the later stages of the job.

More on persistence

Spark Architecture

Part II

Spark Hardware Hierarchy

Cluster, Driver, and Executors
Cores / Slots
- available threads to process partitions
- NOT physical CPU cores on each machine (unfortunate terminology by Spark)
Working memory is utilized by Spark workloads
Disks used for:
- Persistence to disks and spills for workload
- Shuffle partitions for shuffle stages

Spark Software Hierarchy

Tasks = Cores = Slots

1 Task 1 Partition

1 Slot 1 Core

Stages

Jobs

Actions

Actions are eager
- Made of transformations (lazy)
  - Narrow
  - Wide / Shuffle
- Spawn jobs
  - Spawn stages
    - Spawn tasks
      - Do work and utilize hardware (the only part that uses hardware; the rest is for orchestration)
      - All tasks in same stage do the same thing

Spark UI / History Server

Documentation in Confluence

Go to history server on QA Cluster (EMR): https://console.aws.amazon.com/elasticmapreduce/home?region=us-east-1#cluster-details:j-1RT410A48AI05

PySpark Architecture

PySpark

Thin library that sits on top of the Java API, which sits on top of the Scala core engine

Explanation from PySpark Architecture video

Example: PySpark Program

ex: collect()

ex: aws s3 cp

user Python code

Interlude: DataFrames

DataFrames vs SQL Tables

Table
- a set of records (rows) / a relation
Transformations defined by relational algebra
- "Protects users from needing to know how the data is organized in the machine, and makes it possible for users to specify high-level queries, and leads to an inexhaustible number of optimization techniques"

DataFrames vs SQL Tables

DataFrames
- Multiple definitions dependent on implementation

DataFrame APIs

R DataFrames
Python Pandas
Haskell Frames
Spark DataFrames
- Koalas: pandas API on Spark DataFrames
  - https://github.com/databricks/koalas

Spark vs Pandas DataFrames

	Pandas	Spark
Column	df['col']	df['col']
Mutability	Mutable	Immutable
Add column	df['c'] = df['a'] + df['b']	df.withColumn('c', df['a'] + df['b'])
Rename column	df.columns = ['a', 'b']	df.select(df['col1'].alias('a'), df['col2'].alias('b'))
Value Count	df['col'].value_counts()	df.groupBy(df['col']).count().orderBy('count', ascending=False)

Slide from Announcing Koalas Open Source Project

Spark Programming

Spark APIs

Low Level
- RDD
- Distributed Variables
  - Broadcast vars
  - Accumulators
High Level (Structured APIs)
- DataFrames
- Datasets
Third Party
- Frameless
- Quill

Spark APIs Compared

From Databricks' A Tale of Three Apache Spark APIs

RDDs

Can be cached in-memory, which is a massive win for iterative algorithms
Type-safe in implementation language (Scala)
A lot like Scala collections
- Except distributed, lazy, immutable

import org.apache.spark.rdd.RDD

case class Tweet(text: String)  // minimal stand-in for the article's Tweet type

def topHashtags(tweets: RDD[Tweet], n: Int): Array[(String, Int)] =
  tweets
    .flatMap(_.text.split("\\s+"))   // split it into words
    .filter(_.startsWith("#"))       // filter hashtag words
    .map(_.toLowerCase)              // normalize hashtags
    .map((_, 1))                     // create tuples for counting
    .reduceByKey((a, b) => a + b)    // accumulate counters
    .top(n)(Ordering.by(_._2))       // return the n most frequent hashtags

Example from Quill article

RDDs

Catch errors at compile-time

Integer RDD

String RDD

Double RDD

When to use RDDs

Low-level API and control
Compile time typechecking

Quill

https://medium.com/@fwbrasil/quill-spark-a-type-safe-scala-api-for-spark-sql-2672e8582b0d

Spark Ecosystem

Spark Libraries
- MLlib
- Spark Streaming
- GraphFrames
  - Graph processing using Cypher graph query language (Spark 3.0)
Third Party Libraries
- Flint
  - Time-series Library for Spark

Interlude: Columnar

https://towardsdatascience.com/demystify-hadoop-data-formats-avro-orc-and-parquet-e428709cf3bb
- Row-based formats are better for write-heavy workloads, since appending is easier
Parquet
- on disk
  - Record shredding and assembly algorithm based on Dremel
Arrow
- In-memory
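The difference between the two layouts can be shown with plain Python containers. This is an in-memory toy with made-up records; it does not model how Parquet or Arrow actually encode data.

```python
records = [
    {"id": 1, "level": "Error", "msg": "msg1"},
    {"id": 2, "level": "Warn", "msg": "msg2"},
    {"id": 3, "level": "Error", "msg": "msg3"},
]

# Row-oriented layout: each record is stored together, so appending is one write.
row_store = list(records)
row_store.append({"id": 4, "level": "Info", "msg": "msg8"})

def to_columns(rows):
    """Column-oriented layout: all values of one column are stored together."""
    return {name: [row[name] for row in rows] for name in rows[0]}

column_store = to_columns(row_store)

# Reading one column touches a single list instead of every full record.
error_count = sum(1 for level in column_store["level"] if level == "Error")
```

Appending to the column layout would mean touching every column list, which is the write-side cost the first bullet refers to.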

Spark Optimization

What is Spark?

https://www.youtube.com/watch?v=RmUn5vHlevc
- Explains catalyst (Video is fantastic)
  - Pure functions, fixed points, immutable trees, rewrites
- Transformations
  - Two kinds
    - transform trees without changing the type of tree (ex. expression -> expression, logical plan -> logical plan, or physical plan -> physical plan)
    - transform tree into different type of tree (used for logical plan -> physical plan)
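The first kind of transformation (expression -> expression) can be sketched with immutable trees and a rewrite rule applied until a fixed point. The node classes and the rule below are hypothetical and far simpler than Catalyst's own; they only illustrate the technique.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Lit:
    value: int

@dataclass(frozen=True)
class Col:
    name: str

@dataclass(frozen=True)
class Add:
    left: object
    right: object

def transform_up(node, rule):
    """Apply a rule bottom-up, building a new immutable tree of the same type."""
    if isinstance(node, Add):
        node = Add(transform_up(node.left, rule), transform_up(node.right, rule))
    return rule(node)

def fold_constants(node):
    """Rewrite rule: Add(Lit, Lit) becomes a Lit, and x + 0 becomes x."""
    if isinstance(node, Add):
        if isinstance(node.left, Lit) and isinstance(node.right, Lit):
            return Lit(node.left.value + node.right.value)
        if node.right == Lit(0):
            return node.left
    return node

def to_fixed_point(tree, rule, max_iterations=100):
    """Re-apply the rule until the tree stops changing."""
    for _ in range(max_iterations):
        rewritten = transform_up(tree, rule)
        if rewritten == tree:
            return tree
        tree = rewritten
    return tree

# (x + (1 + -1)) + (2 + 3)
expr = Add(Add(Col("x"), Add(Lit(1), Lit(-1))), Add(Lit(2), Lit(3)))
optimized = to_fixed_point(expr, fold_constants)
```

Because the trees are immutable and the rule is a pure function, checking for the fixed point is just an equality comparison.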

>>> import findspark
>>> findspark.init()
>>> import pyspark
>>> spark = pyspark.sql.SparkSession.builder.appName('Spark Tech Talk').getOrCreate()

# Spark computation on two tables
>>> t1 = spark.range(2000000)
>>> t2 = spark.range(2000000)
>>> result = t1.join(t2, on=t1.id == t2.id).groupBy().count()

# See execution plan
>>> result.explain()
== Physical Plan ==
*(6) HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition
   +- *(5) HashAggregate(keys=[], functions=[partial_count(1)])
      +- *(5) Project
         +- *(5) SortMergeJoin [id#0L], [id#2L], Inner
            :- *(2) Sort [id#0L ASC NULLS FIRST], false, 0
            :  +- Exchange hashpartitioning(id#0L, 200)
            :     +- *(1) Range (0, 2000000, step=1, splits=12)
            +- *(4) Sort [id#2L ASC NULLS FIRST], false, 0
               +- ReusedExchange [id#2L], Exchange hashpartitioning(id#0L, 200)

# Takes a few seconds
>>> result.show()
+-------+
|  count|
+-------+
|2000000|
+-------+

Ex. Spark Plan and Execution

...
>>> result.explain(extended=True)
== Parsed Logical Plan ==
Aggregate [count(1) AS count#19L]
+- Join Inner, (id#0L = id#2L)
   :- Range (0, 2000000, step=1, splits=Some(12))
   +- Range (0, 2000000, step=1, splits=Some(12))

== Analyzed Logical Plan ==
count: bigint
Aggregate [count(1) AS count#19L]
+- Join Inner, (id#0L = id#2L)
   :- Range (0, 2000000, step=1, splits=Some(12))
   +- Range (0, 2000000, step=1, splits=Some(12))

== Optimized Logical Plan ==
Aggregate [count(1) AS count#19L]
+- Project
   +- Join Inner, (id#0L = id#2L)
      :- Range (0, 2000000, step=1, splits=Some(12))
      +- Range (0, 2000000, step=1, splits=Some(12))

== Physical Plan ==
*(6) HashAggregate(keys=[], functions=[count(1)], output=[count#19L])
+- Exchange SinglePartition
   +- *(5) HashAggregate(keys=[], functions=[partial_count(1)], output=[count#22L])
      +- *(5) Project
         +- *(5) SortMergeJoin [id#0L], [id#2L], Inner
            :- *(2) Sort [id#0L ASC NULLS FIRST], false, 0
            :  +- Exchange hashpartitioning(id#0L, 200)
            :     +- *(1) Range (0, 2000000, step=1, splits=12)
            +- *(4) Sort [id#2L ASC NULLS FIRST], false, 0
               +- ReusedExchange [id#2L], Exchange hashpartitioning(id#0L, 200)

Detailed Logical/Physical Plan

Spark SQL Engine

From declarative queries to RDDs
Understanding Query Plans and Spark UIs

MapReduce Joins

Reduce-side Joins and Grouping

Sort-merge
GROUP BY
Skew Join / Sharded Join

Map-side Joins

Broadcast Hash Join
Partitioned Hash Join
Merge Join

Broadcast Hash Join

Different names
- Map-side join - Hadoop community
- Star-schema join
- Replicated join
Join a large table (fact) with relatively small tables (dimensions) to avoid sending all data of the large table over the network
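A minimal sketch of the idea in plain Python, assuming the dimension table has unique keys and fits in memory. The tables and the function name are made up for the example; Spark's own broadcast mechanics are not modeled.

```python
def broadcast_hash_join(fact_partitions, dimension_rows):
    """Join each partition of the large table against an in-memory copy of the small table."""
    # "Broadcast": build the hash table once; every partition uses the same copy.
    lookup = {key: value for key, value in dimension_rows}
    joined = []
    for partition in fact_partitions:   # each partition could live on a different executor
        joined.append(
            [(key, fact, lookup[key]) for key, fact in partition if key in lookup]
        )
    return joined

# Large fact table, already split into partitions across the cluster.
fact_partitions = [
    [(1, "view"), (2, "click")],
    [(3, "view"), (1, "click")],
]
# Small dimension table that is cheap to copy to every node.
dimension_rows = [(1, "ads"), (2, "news")]

joined = broadcast_hash_join(fact_partitions, dimension_rows)
```

The fact rows never leave their partitions; only the small table travels.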

See Map-Side Join in Spark

See Spark SQL Joins for basic, API level joins

See Mastering Spark SQL: Broadcast Joins for Spark broadcast details

Spark Joins

Optimizing Apache Spark SQL Joins
- Basic
  - Shuffle Hash Join
  - Broadcast Hash Join
  - Cartesian Join
- Special
  - Theta Join
  - One to Many Join
Working with Skewed Data: The Iterative Broadcast
- Iterative Broadcast

Shuffled Hash Join

Was removed in favor of sort merge join in 1.6, but re-added in 2.0
- ShuffledHashJoin is still useful when:
  - Any partition of the build side could fit in memory
  - The build side is much smaller than the stream side, so building a hash table on the smaller side should be faster than sorting the bigger side.
- Sort Merge Join is more robust
  - Shuffled Hash Join requires the hashed table to fit in memory
  - Sort Merge Join can spill to disk
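The two strategies can be compared on a single partition after the shuffle. This is a simplified in-memory sketch with made-up data; in particular, Python's `sorted` stands in for an external sort that could spill to disk.

```python
def hash_join(build, stream):
    """Build a hash table on the smaller side, then probe it with the stream side."""
    table = {}
    for key, value in build:
        table.setdefault(key, []).append(value)   # the whole build side sits in memory
    return [(key, b, s) for key, s in stream for b in table.get(key, [])]

def sort_merge_join(left, right):
    """Sort both sides by key, then advance two cursors in step."""
    left, right = sorted(left), sorted(right)
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on each side, then emit every pairing.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            out.extend((lk, l[1], r[1]) for l in left[i:i_end] for r in right[j:j_end])
            i, j = i_end, j_end
    return out

small = [(1, "a"), (2, "b")]
big = [(2, "x"), (1, "y"), (2, "z"), (3, "w")]

hashed = hash_join(small, big)
merged = sort_merge_join(small, big)
```

Both return the same matches; the hash join skips the sort but has to hold the build side in memory.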