Spark architecture

In a local installation, the cores of a single machine serve as both master and slaves (see the sketch below).
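
A minimal sketch of such a local "cluster" in PySpark; the app name and core count are illustrative, not values from the slides:

    from pyspark import SparkContext

    # "local[4]" runs the master and the workers inside a single process,
    # using 4 cores of this machine as the slaves.
    sc = SparkContext(master="local[4]", appName="local-demo")
    print(sc.defaultParallelism)  # number of cores available for tasks
    sc.stop()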

Hardware organization

Spatial software organization

  • The driver runs on the master
  • It executes the "main()" code of your program.
  • The Cluster Master manages the computation resources.
  • Each worker manages a single core.
  • Each RDD is partitioned among the workers.
  • Workers manage partitions and executors.
  • Executors execute tasks on their partition; they are myopic (each sees only its own partition). See the sketch below.
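
A minimal sketch of partitioning (assumes an existing SparkContext sc; the data and partition count are illustrative):

    # An RDD is split into partitions, one or more per worker core.
    rdd = sc.parallelize(range(12), numSlices=4)
    print(rdd.getNumPartitions())  # -> 4
    print(rdd.glom().collect())    # each inner list is one partition's contents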

Spatial organization (more detail)

  • SparkContext (sc) is the abstraction that encapsulates the cluster for the driver node (and the programmer).
  • Worker nodes manage resources in a single slave machine.
  • Worker nodes communicate with the cluster manager.
  • Executors are the processes that can perform tasks (see the configuration sketch below).
  • Cache refers to the local memory on the slave machine.
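
A hedged sketch of how the driver is pointed at a cluster manager; the master URL and executor memory below are placeholder values, not settings from the slides:

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setMaster("spark://master-host:7077")  # hypothetical cluster manager URL
            .setAppName("cluster-demo")
            .set("spark.executor.memory", "2g"))    # memory given to each executor
    sc = SparkContext(conf=conf)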

Materialization

  • Consider the pipeline:
    RDD1 -> map(x: x*x) -> RDD2 -> reduce(x,y: x+y) -> float (in head node)

  • RDD1 -> RDD2 is a lineage.
  • RDD2 can be consumed as it is being generated.
  • It does not have to be materialized, i.e., stored in memory (see the sketch below).
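
A minimal sketch of this pipeline (assumes a SparkContext sc): reduce() consumes the squared values as they are produced, so RDD2 is never stored as a whole:

    rdd1 = sc.parallelize([1, 2, 3, 4])
    rdd2 = rdd1.map(lambda x: x * x)         # lineage: RDD1 -> RDD2
    total = rdd2.reduce(lambda x, y: x + y)  # result returned to the head node
    print(total)                             # -> 30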

RDD Processing 

  • RDDs, by default, are not materialized
  • They do materialize if cached or otherwise persisted (see the sketch below).
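
A minimal sketch (reusing rdd2 from the sketch above): without caching, rdd2 would be recomputed for every action; .cache() marks it for materialization:

    rdd2.cache()
    rdd2.count()  # first action: computes rdd2 and keeps its partitions in memory
    rdd2.count()  # second action: served from the cached partitions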

Temporal organization
RDD Graph and Physical plan

Recall the spatial organization: a stage ends when the RDD needs to be materialized.

Terms and concepts of execution

  • RDDs are partitioned across workers.
  • RDD graph defines the Lineage of the RDDs.
  • SparkContext divides the RDD graph into stages, which define the execution plan (or physical plan); see the lineage printout below.
  • A task corresponds to one stage, restricted to one partition.
  • An executor is a process that performs tasks.
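
A minimal sketch of inspecting the plan (assumes a SparkContext sc): toDebugString() prints the RDD graph, and indentation changes mark stage boundaries, here the shuffle introduced by reduceByKey:

    pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
    counts = pairs.reduceByKey(lambda x, y: x + y)  # forces a new stage
    print(counts.toDebugString().decode())          # lineage, grouped by stage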

Summary

  • Spark computation is broken into tasks.
  • Spatial organization: different data partitions reside on different machines.
  • Temporal organization: the computation is broken into a sequence of stages.
  • Next: persistence and checkpointing.

Persistence and Checkpointing

Levels of persistence

  • Caching is useful for retaining intermediate results.
  • On the other hand, caching can consume a lot of memory.
  • If memory is exhausted, caches can be evicted, spilled to disk, etc.
  • If needed again, the cached RDD is recomputed or read back from disk.
  • The generalization of .cache() is .persist(), which has many options (see the sketch below).
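
A minimal sketch of these options (assumes an existing RDD rdd); the chosen storage levels are just two of the possibilities listed in the programming guide linked below:

    from pyspark import StorageLevel

    rdd.persist(StorageLevel.MEMORY_AND_DISK)  # spill to disk when memory is exhausted
    rdd.unpersist()                            # drop the cached partitions
    rdd.persist(StorageLevel.DISK_ONLY)        # keep partitions on disk only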

Storage Levels

.cache() is the same as .persist(MEMORY_ONLY)

http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence

Checkpointing

  • Spark is fault tolerant: if a slave machine crashes, its RDDs will be recomputed.
  • If hours of computation were completed before the crash, all of that computation needs to be redone.
  • Checkpointing reduces this problem by storing the materialized RDD on a remote disk.
  • On recovery, the RDD is read back from the disk.
  • It is recommended to cache an RDD before checkpointing it (see the sketch below).
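
A minimal sketch of checkpointing (assumes a SparkContext sc); the checkpoint directory is a placeholder and would normally live on HDFS:

    sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")  # hypothetical path
    rdd = sc.parallelize(range(1000)).map(lambda x: x * x)
    rdd.cache()       # recommended: avoids computing the RDD a second time
    rdd.checkpoint()  # mark the RDD; it is written out by the next action
    rdd.count()       # triggers the computation and the checkpoint write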

http://takwatanabe.me/pyspark/generated/generated/pyspark.RDD.checkpoint.html

Checkpoint vs. Persist to disk

  • Cache materializes the RDD and keeps it in memory (and/or on disk). But the lineage (computing chain) of the RDD, i.e., the sequence of operations that generated it, is remembered, so that if node failures cause parts of the cached RDD to be lost, they can be regenerated.
  • Checkpoint saves the RDD to an HDFS file and forgets the lineage completely. This allows long lineages to be truncated and the data to be saved reliably in HDFS, which is naturally fault tolerant through replication (see the sketch below).
  • https://github.com/JerryLead/SparkInternals/blob/master/markdown/english/6-CacheAndCheckpoint.md
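
A minimal sketch of lineage truncation (assumes a SparkContext sc with a checkpoint directory already set, as above):

    rdd = sc.parallelize(range(100)).map(lambda x: x + 1).map(lambda x: x * 2)
    rdd.checkpoint()
    print(rdd.toDebugString().decode())  # full lineage of map steps
    rdd.count()                          # action triggers the checkpoint write
    print(rdd.isCheckpointed())          # -> True
    print(rdd.toDebugString().decode())  # lineage now rooted at the checkpoint file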

By Yoav Freund