Apache Spark

usage in the real world

Yegor Bondar

KEY NOTES

 

  • Real world use cases
  • All right... I want to start working with Apache Spark!
  • Conclusions

Use cases

Batch processing

Real-time processing

Real world use cases

  • Distributed processing of large file sets
  • Processing of streaming data from different telecom network topologies
  • Running calculations on different datasets based on an external meta configuration

Distributed processing

Input

  1. A set of binary data files stored in HDFS
  2. Each file represents geodata plus network cell values
  3. The files have a custom format
  4. The data can be parallelized

Challenges

  1. Parse the custom data format
  2. Calculate different aggregation values
  3. Store the result as JSON back to HDFS

Desired output

JSON objects which represent the average signal value for a certain Web Mercator grid zoom level.

{
  "9#91#206" : {
    "(1468,3300,13)" : -96.65605103479673,
    "(1460,3302,13)" : -107.21621616908482,
    "(1464,3307,13)" : -97.89720813468034
  },
  "9#90#206" : {
    "(1447,3310,13)" : -113.03223673502605,
    "(1441,3301,13)" : -108.92957834879557
  },
  "9#90#207" : {
    "(1449,3314,13)" : -112.97138977050781,
    "(1444,3314,13)" : -115.83310953776042,
    "(1440,3313,13)" : -109.2352180480957
  }
}
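
The keys of the form "(x,y,zoom)" above look like Web Mercator tile indices. A minimal sketch of the standard "slippy map" conversion from a latitude/longitude pair to such a key, assuming this is how the grid keys are derived:

  import scala.math._

  // Maps a (lat, lon) pair to Web Mercator tile indices at the given zoom level
  // using the standard "slippy map" formula; the key format mirrors the output above.
  def toTileKey(latDeg: Double, lonDeg: Double, zoom: Int): String = {
    val latRad = toRadians(latDeg)
    val n      = 1 << zoom                       // number of tiles per axis: 2^zoom
    val x      = ((lonDeg + 180.0) / 360.0 * n).toInt
    val y      = ((1.0 - log(tan(latRad) + 1.0 / cos(latRad)) / Pi) / 2.0 * n).toInt
    s"($x,$y,$zoom)"
  }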

Data Parallelism

Implementation

  def prepareDataset(inputHolder: InputHolder,
                     processingCandidates: List[Info])(implicit sc: SparkContext): RDD[(Tile, GenSeq[GrdPointState])] = {
    // Read the raw binary files from HDFS and keep only the configured candidates.
    sc.binaryFiles(inputHolder.getPath.getOrElse("hdfs://path/to/files"), minPartitions = 4).filter {
      case (grdPath, _) => processingCandidates.exists(inf => grdPath.contains(inf.path))
    }.flatMap {
      case (path, bytes) =>

        log(s"CoverageParser.prepareDataset - Loading dataset for $path")

        // Parse the custom binary format with the GDAL-based parser available on each node.
        val grdInMemory = GdalGrdParser.gdal.allocateFile(bytes.toArray())

        val infOpt = GdalGrdParser.gdal.info(grdInMemory)

        val tileToPoints = ... // collect points from the files on each node

        tileToPoints
    }.reduceByKey(_ ++ _).mapValues(points => AveragedCollector.collect(Seq(points)))
  }

  def aggregateAndPersistTiles(inputHolder: InputHolder,
                               dataSet: RDD[(Tile, GenSeq[GrdPointState])])(implicit rootPath: String): Unit = {
    dataSet.mapPartitions { it =>
      it.toSeq.map {
        case (_, avgPoints) => ZoomedCollector.collectGrd(level.aggZoom)(avgPoints)
      }.iterator
    }.map { tileStates =>
      ZoomedCollector.collectTiles(level.groupZoom)(tileStates)
    }.map { tileStates =>
      // Serialize each group of tiles to compact JSON and write it back to HDFS.
      tileStates.seq.toMap.asJson.noSpaces
    }.saveAsTextFile(s"$rootPath/tiles_result_${System.currentTimeMillis()}")
  }
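
A possible way to wire the two steps above together in a driver program (the input holder, the candidate list and the paths are placeholder assumptions):

  // Hypothetical driver code: InputHolder, the Info list and the paths are placeholders.
  implicit val sc: SparkContext = new SparkContext(
    new SparkConf().setMaster("local[*]").setAppName("coverage-aggregation"))
  implicit val rootPath: String = "hdfs://path/to/output"

  val inputHolder: InputHolder = ???            // provided by the surrounding application
  val candidates: List[Info]   = ???            // metadata describing which files to process

  val dataset = prepareDataset(inputHolder, candidates)
  aggregateAndPersistTiles(inputHolder, dataset)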

Results

Dataset       Single JVM implementation   Spark implementation (local cluster)
14 M points   7 minutes & 2G heap         1 minute
25 M points   12 minutes & 2G heap        1.5 minutes

Pros

  • Built-in parallelism
  • Ways to improve performance

 

Cons

  • A third-party tool has to be installed on each node to parse the binary files

Data Stream Processing

Input

  1. JSON messages from Apache Kafka topics
  2. Each message represents data from a network topology element (cell, controller)
  3. The aggregated JSON object should be persisted to either a Kafka topic or HBase

Challenges

  1. Aggregate different messages to build the final object
  2. Process Kafka topics in an efficient manner
  3. Ensure reliability

Network Topology

{
  "technology" : "2g",
  "cell_id" : "10051",
  "site_id" : "UK1835",
  "cell_name" : "KA1051A",
  "latitude" : "49.14",
  "longitude" : "35.87777777777778",
  "antennaheight" : "49",
  "azimuth" : "35",
  "antennagain" : "17.5",
  "lac" : "56134",
  "Site name" : "UK1854"
}
{
  "SDCCH_NBR" : "23",
  "bts_id" : "5",
  "site_id" : "18043",
  "cell_id" : "10051",
  "bsc_id" : "1311",
  "technology" : "2G",
  "MCC" : "255",
  "element-meta" : "BTSID",
  "date" : "2016-10-31 03:03",
  "bcc" : "2",
  "SERVERID" : "259089846018",
  "HO_MSMT_PROC_MD" : "0",
  "element_type" : "BTS",
  "vendor" : "ZTE",
  "ncc" : "1",
  "cell_name" : "KA1051A"
}

Composed key:

2G#KS#KHA#1311#18043#KA1051A

Composed value:

Join of the input JSONs

[Diagram: BTS, BSC, Output]

Spark Streaming Design


  type Record = Map[String, String]

  def process(config: Map[String, String])(implicit ssc: StreamingContext): Unit = {

    val primary = primaryStream(ssc)
    val secondary = secondaryStream(ssc)

    val cacheElement = getCacheInstance()

    transformStream(primary, secondary, ssc).foreachRDD { rdd =>

      logger.info(s"TopologyUpdateService.process - New RDD[${rdd.getNumPartitions}] Empty[${rdd.isEmpty}].")

      // Primary records whose secondary part has not arrived yet - they will be retried later.
      val toRetry = for {
        (_, (primarySeq, secondaryOpt)) <- rdd if secondaryOpt.isEmpty && primarySeq.forall(_.isSecondaryRequired)
        rec <- primarySeq
      } yield rec

      // Primary records that can be enriched with the secondary data and persisted right away.
      val toPersist = for {
        (_, (primarySeq, secondaryOpt)) <- rdd if secondaryOpt.nonEmpty || !primarySeq.forall(_.isSecondaryRequired)
        rec <- primarySeq
      } yield rec.enrich(secondaryOpt.asInstanceOf[Option[SecondaryRecord]])

      persist(mapToOutput(toPersist))

      // Cache the incomplete records so they can be re-processed with a later batch.
      toRetry.foreachPartition { it =>
        it.foreach { el =>
          cacheElement.cache("retry", el.data)
        }
      }

    }

  }

Stream processing code snippet
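
The primaryStream helper used above is not expanded in the slides. A minimal sketch using Spark's Kafka direct-stream integration (the spark-streaming-kafka-0-10 module) could look like the following; the broker address, group id and topic name are assumptions.

  import org.apache.kafka.common.serialization.StringDeserializer
  import org.apache.spark.streaming.StreamingContext
  import org.apache.spark.streaming.dstream.DStream
  import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
  import org.apache.spark.streaming.kafka010.KafkaUtils
  import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

  // Hypothetical sketch: reads the raw JSON messages of the primary topic as a DStream[String].
  def primaryStream(ssc: StreamingContext): DStream[String] = {
    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "localhost:9092",           // assumption: local broker
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "topology-aggregator",      // assumption
      "auto.offset.reset"  -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    KafkaUtils
      .createDirectStream[String, String](
        ssc,
        PreferConsistent,
        Subscribe[String, String](Seq("topology-primary"), kafkaParams)) // topic name is an assumption
      .map(_.value())
  }

secondaryStream would follow the same pattern with a different topic.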

Conclusion

  • Spark provides built-in Kafka support;
  • Spark provides built-in storage for caching messages;
  • It is easy to build retry logic;
  • It is easy to join different streams (see the sketch below).
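
The stream-joining point can be illustrated with one possible shape of the transformStream helper used earlier: key both streams, group the primary one and left-outer-join it with the secondary one, so each group of primary records may or may not find its secondary record. PrimaryRecord, SecondaryRecord and their key field are assumptions about the surrounding code; only SecondaryRecord and the data/isSecondaryRequired members appear in the original snippet.

  import org.apache.spark.streaming.StreamingContext
  import org.apache.spark.streaming.dstream.DStream

  // Hypothetical record types used only for this sketch.
  case class PrimaryRecord(key: String, data: Map[String, String], isSecondaryRequired: Boolean)
  case class SecondaryRecord(key: String, data: Map[String, String])

  // Hypothetical sketch of transformStream: a keyed left outer join of the two streams.
  def transformStream(primary: DStream[PrimaryRecord],
                      secondary: DStream[SecondaryRecord],
                      ssc: StreamingContext): DStream[(String, (Iterable[PrimaryRecord], Option[SecondaryRecord]))] = {
    val keyedPrimary   = primary.map(p => p.key -> p).groupByKey()
    val keyedSecondary = secondary.map(s => s.key -> s)
    keyedPrimary.leftOuterJoin(keyedSecondary)
  }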

Meta Calculations

Input

  1. An abstract dataset
  2. The user should be able to configure how the dataset is processed

Challenges

  1. Apply a custom function to transform the data
  2. Apply different transformations based on the configuration
  3. The engine should be flexible

Spark SQL

Spark SQL — Batch and Streaming Queries Over Structured Data on Massive Scale

Spark SQL performance

Implementation of SQL Queries

    @tailrec
    def applyStages(stages: List[KpiTransformationStage],
                    inputTableAlias: String, stage: Int, df: DataFrame): DataFrame = {
      stages match {
        case section :: xs =>
          val query = addFromClause(section.toSqlQuery, inputTableAlias)
          val outputRes = sqlContext.sql(query)
          outputRes.registerTempTable(s"Stage${stage}Output")
          applyStages(xs, s"Stage${stage}Output", stage + 1, outputRes)
        case Nil =>
          df
      }
    }

    // ...
    registerUdfs(params)(sqlContext)
    // ...
    dataFrame.registerTempTable("Input")

Example of the external transformation configuration (JSON):

{
      "transformations": [
          {
            "name": "TOPOLOGY_KPI_section",
            "stages": [
              {
                "sql": [
                  "select rowkey, nr_cells, nr_urban_cells, nr_total_cells,",
                  "getA(nr_small_cells, nr_total_cells) as aHealth,",
                  "getB(nr_dense_urban_cells, nr_total_cells) as bHealth"
                ]
              }
            ]
          }
      ]
}
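
The registerUdfs call above is not expanded in the slides. A minimal sketch, assuming the getA/getB functions referenced in the configuration compute simple ratios of the cell counters (the formulas are assumptions, only the UDF names come from the configuration):

  import org.apache.spark.sql.SQLContext

  // Hypothetical implementation of registerUdfs: the ratio formulas are assumptions.
  def registerUdfs(params: Map[String, String])(sqlContext: SQLContext): Unit = {
    sqlContext.udf.register("getA", (smallCells: Long, totalCells: Long) =>
      if (totalCells == 0L) 0.0 else smallCells.toDouble / totalCells)

    sqlContext.udf.register("getB", (denseUrbanCells: Long, totalCells: Long) =>
      if (totalCells == 0L) 0.0 else denseUrbanCells.toDouble / totalCells)
  }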

Conclusions

  • The immutable nature of DataFrames/Datasets provides a lot of functionality;
  • You work in terms of 2D tables: rows, columns and cells;
  • DataFrame performance is better than that of raw RDDs;
  • UDFs (User Defined Functions) are a powerful feature for custom transformations.

Starter Kit

Apache Spark can be used locally without any complex setup.

  1. Add the Spark library as a dependency (see the sbt sketch after the snippet below)
  2. Run it in local[*] mode
import org.apache.spark.{SparkConf, SparkContext}

val conf: SparkConf = new SparkConf()
                        .setMaster("local[*]")
                        .setAppName("TestApp")

val sc: SparkContext = new SparkContext(conf)

...

val lines = sc.textFile("src/main/resources/data/data.csv")
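
For step 1, a minimal sbt dependency declaration could look like this; the Spark version is an assumption and should match the cluster you target.

// build.sbt (sketch)
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"      % "2.1.0",   // version is an assumption
  "org.apache.spark" %% "spark-sql"       % "2.1.0",   // for the Spark SQL examples
  "org.apache.spark" %% "spark-streaming" % "2.1.0"    // for the streaming examples
)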

Starter Kit

Notebook environments: Jupyter, Apache Zeppelin

Conclusions

  • Big Data does not always mean a big amount of data;
  • Apache Spark is a replacement for the Hadoop MapReduce framework;
  • Different use cases can be covered using built-in Apache Spark functionality.
