Apache Spark

Lightning-fast cluster computing

What is Spark:
A fast and general engine for large-scale data processing

When to use it:

  • When you have terabytes/petabytes of data
  • When your algorithms are iterative (work in cycles)
  • When you need support for live (streaming) data processing

Spark modules
 

  1. Spark Streaming:
    Receives live data, divides it into small batches, and generates results; works with Kafka, Flume, HDFS and more
  2. Spark SQL:
    Offers an abstraction over SQL tables and other data sources:
    DataFrames; we can run SQL commands on them or work with them as RDDs (a small example follows this list)
  3. MLlib: machine learning algorithms, random data generation, ...
  4. GraphX: introduces a new abstraction on top of RDDs, the Graph,
    as well as a wide list of algorithms and builders for graphs
  5. Bagel: Spark's implementation of Google's Pregel graph model; will be superseded by GraphX
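
A minimal Spark SQL sketch in spark-shell terms, assuming a Spark 2.x+ shell where `spark` is the predefined SparkSession and "people.json" is a hypothetical file with `name` and `age` fields:

  // Load a data source as a DataFrame, query it with SQL or the DataFrame API,
  // and drop down to the underlying RDD when needed
  val people = spark.read.json("people.json")
  people.createOrReplaceTempView("people")                     // expose it as a SQL table
  spark.sql("SELECT name FROM people WHERE age > 21").show()   // plain SQL
  people.filter("age > 21").select("name").show()              // same query, DataFrame API
  val asRdd = people.rdd                                       // the DataFrame as an RDD of Rows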


RDD: Spark's most important abstraction

(Resilient Distributed Datasets)

Fault tolerant thanks to lineage: each RDD remembers how it was derived from other datasets, so lost partitions can be recomputed!
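
Every RDD can print its own lineage; a small sketch, assuming a spark-shell session where `sc` is the predefined SparkContext and "README.md" is a hypothetical input file:

  // The debug string shows the chain of parent RDDs Spark would replay
  // to rebuild a lost partition
  val counts = sc.textFile("README.md")
    .flatMap(_.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
  println(counts.toDebugString)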

2 types of operations can be applied to RDDs (a small sketch follows the list):

  1. Actions: return a value to the driver
    reduce, collect, count, ...
  2. Transformations: return another RDD
    map, reduceByKey, filter, ...
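
A minimal sketch of both kinds of operations, assuming a spark-shell session (`sc` is the predefined SparkContext):

  val nums    = sc.parallelize(1 to 10)     // an RDD of Ints
  val squares = nums.map(n => n * n)        // transformation: returns a new RDD, nothing runs yet
  val even    = squares.filter(_ % 2 == 0)  // another lazy transformation
  val total   = even.reduce(_ + _)          // action: runs the chain, returns 220 to the driver
  val values  = even.collect()              // action: Array(4, 16, 36, 64, 100)
  val howMany = even.count()                // action: 5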

Some operations trigger a shuffle task,
which needs to reorganize the data across nodes (really expensive: it involves disk I/O and networking).

(join, cogroup, and all the *ByKey operations)

Functions passed to reduce-style operations (reduce, reduceByKey, ...) must be commutative and associative!
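
A small sketch of a shuffling, reduce-style operation in spark-shell terms (`sc` is the predefined SparkContext):

  // reduceByKey shuffles the data so all values for a key meet on one node.
  // (_ + _) is commutative and associative, so per-partition partial sums
  // can be computed before the shuffle and combined in any order afterwards
  val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)))
  val sums  = pairs.reduceByKey(_ + _)
  sums.collect()                            // Array((a,4), (b,6)) -- key order may vary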

Important notes:

  • All transformations are lazy; nothing is computed until an action is called
  • We can persist (cache) the data in memory
  • Don't use mutable state from the scope of the driver program; use broadcast and accumulator variables instead (see the sketch after these notes)
  • Data is saved on disk in HDFS or other storage formats
  • We can use checkpoints that save RDDs to disk/memory; Spark checks whether the data is already on disk or in memory and only regenerates the RDD when it is not
  • Computation runs on different nodes: the work is divided into tasks (one task per partition), and tasks run inside executor JVMs on the worker nodes
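
A sketch tying these notes together, assuming a spark-shell session (`sc` is the predefined SparkContext, Spark 2.x accumulator API, paths are hypothetical):

  import org.apache.spark.storage.StorageLevel

  val logs = sc.textFile("hdfs:///logs/app.log")
  logs.persist(StorageLevel.MEMORY_ONLY)                 // keep the RDD in memory across actions

  val skipLevels = sc.broadcast(Set("DEBUG", "TRACE"))   // read-only value shipped once to every node
  val errorCount = sc.longAccumulator("errors")          // counter that tasks can only add to

  val kept = logs.filter { line =>
    val level = line.split(" ").headOption.getOrElse("")
    if (level == "ERROR") errorCount.add(1)              // note: updates inside a transformation may
    !skipLevels.value.contains(level)                    // run more than once if a task is retried
  }

  sc.setCheckpointDir("hdfs:///tmp/checkpoints")         // where checkpointed RDDs are written
  kept.checkpoint()                                      // checkpoint before the action so it takes effect
  println(s"kept ${kept.count()} lines, saw ${errorCount.value} ERROR lines")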

Apache Spark

By Corneliu Caraman