Lightning-fast cluster computing
What is Spark?
Apache Spark is a fast and general engine for large-scale data processing.
When to use it:
Spark modules:
Spark SQL, Spark Streaming, MLlib (machine learning), and GraphX (graph processing).
RDD (Resilient Distributed Dataset): Spark's most important abstraction.
Fault tolerant thanks to lineage! Each RDD records the chain of transformations used to build it, so a lost partition can be recomputed from its source instead of being restored from a replica.
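To make the lineage idea concrete, here is a toy sketch in plain Python (not Spark code; `ToyRDD` and its fields are invented for illustration): each dataset only remembers its parent and the transformation that derives it, so any lost result can be rebuilt by walking the chain.

```python
# Toy illustration (not the Spark API): an RDD remembers how it was derived
# from its parent (its "lineage"), so lost data can be recomputed on demand.

class ToyRDD:
    def __init__(self, parent=None, transform=None, data=None):
        self.parent = parent        # upstream dataset (None for the source)
        self.transform = transform  # how this dataset is derived from it
        self._data = data           # only the source holds raw data

    def compute(self):
        # Recompute this dataset by replaying the lineage chain.
        if self.parent is None:
            return list(self._data)
        return [self.transform(x) for x in self.parent.compute()]

source = ToyRDD(data=[1, 2, 3])
doubled = ToyRDD(parent=source, transform=lambda x: x * 2)
plus_one = ToyRDD(parent=doubled, transform=lambda x: x + 1)

# Even if plus_one's cached results were lost, they can be rebuilt:
print(plus_one.compute())  # [3, 5, 7]
```

The point of the sketch: no copy of `plus_one`'s data needs to exist anywhere; the recipe for producing it is enough for fault tolerance.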
Two types of operations can be applied to RDDs: transformations (lazy; return a new RDD) and actions (trigger computation and return a result to the driver).
Some operations trigger a shuffle, which reorganizes the data across the nodes. Shuffles are really expensive: they involve disk I/O, serialization, and network traffic.
(join, cogroup, and all *ByKey operations, e.g. reduceByKey, groupByKey)
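The transformation/action split above can be mimicked with plain Python generators (an analogy only, not the Spark API): building the pipeline costs nothing; materializing it is what triggers the work.

```python
# Plain-Python analogy (not Spark code): "transformations" are lazy,
# "actions" force evaluation.

data = range(1, 6)

# "Transformations": nothing is computed yet; we only describe a pipeline.
mapped = (x * 10 for x in data)
filtered = (x for x in mapped if x > 20)

# "Action": consuming the pipeline finally triggers the computation,
# just as collect() or count() would on an RDD.
result = list(filtered)
print(result)  # [30, 40, 50]
```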
All functions passed to aggregation operations (reduce, fold, etc.) should be commutative and associative, because partial results from different partitions are combined in an arbitrary order!
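A small plain-Python sketch (the `parallel_reduce` helper is invented here to mimic per-partition reduction) shows why: an associative and commutative function like addition gives the same answer for any partitioning, while subtraction does not.

```python
# Why reduce functions must be commutative and associative: partial results
# per partition are combined in an arbitrary order, so an order-sensitive
# function gives partition-dependent answers.
from functools import reduce

def parallel_reduce(partitions, fn):
    # Reduce each partition independently, then combine the partial results,
    # mimicking how a cluster would aggregate.
    partials = [reduce(fn, p) for p in partitions]
    return reduce(fn, partials)

add = lambda a, b: a + b  # commutative and associative: safe
sub = lambda a, b: a - b  # neither: unsafe

split_a = [[1, 2, 3], [4, 5, 6]]
split_b = [[1, 2], [3, 4], [5, 6]]

# Addition: same answer regardless of how the data was partitioned.
print(parallel_reduce(split_a, add), parallel_reduce(split_b, add))  # 21 21

# Subtraction: the answer changes with the partitioning.
print(parallel_reduce(split_a, sub), parallel_reduce(split_b, sub))  # 3 1
```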
Important notes: