for dummies
Yegor Bondar
The main Big Data challenges:
Batch processing
Real Time Processing
Apache Hadoop is an open-source software framework used for distributed storage and processing of big data sets using the MapReduce programming model. It consists of computer clusters built from commodity hardware.
Lightning-fast cluster computing
Main release year: 2014
// Create Spark Configuration
val conf = new SparkConf()
.setAppName("Spark App")
.setMaster("local[*]")
// Create Spark Context
val sc = new SparkContext(conf)
// Read all lines from file
val lines = sc.textFile(filePath).cache()
// Apply transformations
val modifiedLines = lines.map(_.split("_"))
// Apply action
println(modifiedLines.count())
Resilient Distributed Dataset (aka RDD) is the primary data abstraction in Apache Spark and the core of Spark
RDD[String]
RDD[(String, Int)]
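For example, a minimal sketch of where these two types come from (assuming the `sc` from the snippet above; the input path and the line-length pairing are illustrative only):
import org.apache.spark.rdd.RDD

val lines: RDD[String] = sc.textFile("hdfs://path/to/input.txt")              // an RDD[String]
val lineLengths: RDD[(String, Int)] = lines.map(line => (line, line.length))  // an RDD[(String, Int)]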
transformation: RDD => RDD
transformation: RDD => Seq[RDD]
Transformations are lazy operations on an RDD that create one or many new RDDs, e.g. map, filter, reduceByKey, join, cogroup, randomSplit.
val r00 = sc.parallelize(0 to 9)                       // RDD[Int]: 0..9
val r01 = sc.parallelize(0 to 90 by 10)                // RDD[Int]: 0, 10, ..., 90
val r10 = r00 cartesian r01                            // RDD[(Int, Int)]: all (r00, r01) pairs
val r11 = r00.map(n => (n, n))                         // RDD[(Int, Int)]: each value paired with itself
val r12 = r00 zip r01                                  // RDD[(Int, Int)]: element-wise pairs
val r13 = r01.keyBy(_ / 20)                            // RDD[(Int, Int)]: keyed by value / 20
val r20 = Seq(r11, r12, r13).foldLeft(r10)(_ union _)  // union of all the pair RDDs
action: RDD => a value
Actions are RDD operations that produce non-RDD values. They materialize a value in a Spark program.
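For example, a minimal sketch (assuming the `sc` from the earlier snippet) of three common actions and the plain values they return:
val nums = sc.parallelize(1 to 100)
val total: Long = nums.count()            // action: returns a Long
val firstFive: Array[Int] = nums.take(5)  // action: returns an Array[Int]
val sum: Int = nums.reduce(_ + _)         // action: returns an Int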
DAG (Directed Acyclic Graph) is the scheduling layer of Apache Spark that implements stage-oriented scheduling. It transforms a logical execution plan (i.e. the RDD lineage of dependencies built using RDD transformations) into a physical execution plan.
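The lineage behind that logical plan can be inspected on any RDD with toDebugString; a small sketch (the input path is illustrative), where reduceByKey introduces the shuffle boundary between stages:
val counts = sc.textFile("hdfs://path/to/input.txt")
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
println(counts.toDebugString) // prints the RDD lineage that the DAG scheduler splits into stages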
public class WordCount extends Configured implements Tool {

    public static void main(String args[]) throws Exception {
        int res = ToolRunner.run(new WordCount(), args);
        System.exit(res);
    }

    public int run(String[] args) throws Exception {
        // configuration stuff ....
        job.setMapperClass(Map.class);
        job.setCombinerClass(Reduce.class);
        job.setReducerClass(Reduce.class);
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(LongWritable key, Text value,
                        Mapper.Context context) throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
JavaRDD<String> textFile = sc.textFile("hdfs://...");
JavaPairRDD<String, Integer> counts = textFile
.flatMap(s -> Arrays.asList(s.split(" ")).iterator())
.mapToPair(word -> new Tuple2<>(word, 1))
.reduceByKey((a, b) -> a + b);
counts.saveAsTextFile("hdfs://...");
val textFile = sc.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
text_file = sc.textFile("hdfs://...")
counts = text_file.flatMap(lambda line: line.split(" ")) \
.map(lambda word: (word, 1)) \
.reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")
Input
Challenges
JSON objects that represent the average signal value for a given Web Mercator Grid zoom level.
{
"9#91#206" : {
"(1468,3300,13)" : -96.65605103479673,
"(1460,3302,13)" : -107.21621616908482,
"(1464,3307,13)" : -97.89720813468034
},
"9#90#206" : {
"(1447,3310,13)" : -113.03223673502605,
"(1441,3301,13)" : -108.92957834879557
},
"9#90#207" : {
"(1449,3314,13)" : -112.97138977050781,
"(1444,3314,13)" : -115.83310953776042,
"(1440,3313,13)" : -109.2352180480957
}
}
def prepareDataset(inputHolder: InputHolder,
                   processingCandidates: List[Info])(implicit sc: SparkContext): RDD[(Tile, GenSeq[GrdPointState])] = {
  sc.binaryFiles(inputHolder.getPath.getOrElse("hdfs://path/to/files"), minPartitions = 4).filter {
    case (grdPath, _) => processingCandidates.exists(inf => grdPath.contains(inf.path))
  }.flatMap {
    case (path, bytes) =>
      log(s"CoverageParser.prepareDataset - Loading dataset for $path")
      val grdInMemory = GdalGrdParser.gdal.allocateFile(bytes.toArray())
      val infOpt = GdalGrdParser.gdal.info(grdInMemory)
      val tileToPoints = ... // collect points from files on each node
      tileToPoints
  }.reduceByKey(_ ++ _).mapValues(points => AveragedCollector.collect(Seq(points)))
}
def aggregateAndPersistTiles(inputHolder: InputHolder,
                             dataSet: RDD[(Tile, GenSeq[GrdPointState])])(implicit rootPath: String): Unit = {
  dataSet.mapPartitions { it =>
    it.toSeq.map {
      case (_, avgPoints) => ZoomedCollector.collectGrd(level.aggZoom)(avgPoints)
    }.iterator
  }.map { tileStates =>
    ZoomedCollector.collectTiles(level.groupZoom)(tileStates)
  }.map { tileStates =>
    tileStates.seq.toMap.asJson.noSpaces
  }.saveAsTextFile(s"$rootPath/tiles_result_${System.currentTimeMillis()}")
}
Dataset | Single JVM implementation | Spark implementation (local cluster)
---|---|---
14 M points | 7 minutes & 2 GB heap | 1 minute
25 M points | 12 minutes & 2 GB heap | 1.5 minutes
Pros
Cons
Input
Challenges
{
"technology" : "2g",
"cell_id" : "10051",
"site_id" : "UK1835",
"cell_name" : "KA1051A",
"latitude" : "49.14",
"longitude" : "35.87777777777778",
"antennaheight" : "49",
"azimuth" : "35",
"antennagain" : "17.5",
"lac" : "56134",
"Site name" : "UK1854"
}
{
"SDCCH_NBR" : "23",
"bts_id" : "5",
"site_id" : "18043",
"cell_id" : "10051",
"bsc_id" : "1311",
"technology" : "2G",
"MCC" : "255",
"element-meta" : "BTSID",
"date" : "2016-10-31 03:03",
"bcc" : "2",
"SERVERID" : "259089846018",
"HO_MSMT_PROC_MD" : "0",
"element_type" : "BTS",
"vendor" : "ZTE",
"ncc" : "1",
"cell_name" : "KA1051A"
}
Composed Key:
2G#KS#KHA#1311#18043#KA1051A
Composed Value:
Join of the input JSONs
BTS
BSC
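A hedged sketch of how such a composed key and value could be produced, assuming the two inputs above have already been parsed into hypothetical cellRecords and btsRecords of type RDD[Map[String, String]] and are joined on cell_id; the "KS"/"KHA" region tokens do not appear in the sample JSON and stay as placeholders:
// Sketch only: field names follow the sample JSON above; the join key (cell_id) and the
// "KS"/"KHA" tokens are assumptions, not something shown in the slides.
def composeKey(cell: Map[String, String], bts: Map[String, String]): String =
  Seq(bts("technology"), "KS", "KHA", bts("bsc_id"), bts("site_id"), cell("cell_name")).mkString("#")

val cellByCellId: RDD[(String, Map[String, String])] = cellRecords.map(r => (r("cell_id"), r))
val btsByCellId: RDD[(String, Map[String, String])] = btsRecords.map(r => (r("cell_id"), r))

// Composed value: the join of the two input JSONs, here merged into one map.
val composed: RDD[(String, Map[String, String])] =
  cellByCellId.join(btsByCellId).map { case (_, (cell, bts)) => (composeKey(cell, bts), cell ++ bts) }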
Input
Challenges
Spark SQL — Batch and Streaming Queries Over Structured Data on Massive Scale
@tailrec
def applyStages(stages: List[KpiTransformationStage],
                inputTableAlias: String, stage: Int, df: DataFrame): DataFrame = {
  stages match {
    case section :: xs =>
      val query = addFromClause(section.toSqlQuery, inputTableAlias)
      val outputRes = sqlContext.sql(query)
      outputRes.registerTempTable(s"Stage${stage}Output")
      applyStages(xs, s"Stage${stage}Output", stage + 1, outputRes)
    case Nil =>
      df
  }
}
// ...
registerUdfs(params)(sqlContext)
// ...
dataFrame.registerTempTable("Input")
{
"transformations": [
{
"name": "TOPOLOGY_KPI_MESH50_TOTAL_section",
"stages": [
{
"sql": [
"select rowkey, nr_cells, nr_urban_cells, nr_total_cells,",
"getA(nr_small_cells, nr_total_cells) as aHealth,",
"getB(nr_dense_urban_cells, nr_total_cells) as bHealth"
]
}
]
}
]
}
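The getA/getB calls in the SQL above are user-defined functions registered by registerUdfs; their real formulas are not shown in the slides, so the sketch below registers placeholder ratio implementations only to illustrate the wiring into the SQLContext:
def registerUdfs(params: Map[String, String])(sqlContext: SQLContext): Unit = {
  // Placeholder bodies: the actual health formulas are not part of the slides.
  sqlContext.udf.register("getA", (smallCells: Long, totalCells: Long) =>
    if (totalCells == 0L) 0.0 else smallCells.toDouble / totalCells)
  sqlContext.udf.register("getB", (denseUrbanCells: Long, totalCells: Long) =>
    if (totalCells == 0L) 0.0 else denseUrbanCells.toDouble / totalCells)
}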
Apache Spark can be used locally without any complex setup.
val conf: SparkConf = new SparkConf()
.setMaster("local[*]")
.setAppName("TestApp")
val sc: SparkContext = new SparkContext(conf)
...
val lines = sc.textFile("src/main/resources/data/data.csv")
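For a standalone sbt project, one dependency is enough to run the snippet above (the version shown is only an example):
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.0"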
Jupyter, Apache Zeppelin