Introduction to Apache Spark

Jason Foster - Orion Health

About Me

  • Senior Software Engineer at Orion Health
  • 10+ years in Healthcare IT, 20+ years in engineering
  • Variety of other industry experience, including Mutual Funds, Telecom and HR
  • Part of team developing BI & Analytics platform at Orion

About Orion Health

Orion Health is a global, independently owned eHealth software company with proven experience in delivering interoperable, connected solutions for healthcare facilities, organizations and regions.

The Scottsdale location is focused on BI & Analytics on a Big Data platform.

High-Level Agenda

  • Background, Components and Architecture
  • Key Concepts and Spark Applications
  • Installation and Spark Tooling
  • Spark Programming
  • Real-World Use Case

What is Apache Spark?

  • Open-Source cluster computing framework for data analytics
  • Originally developed in the AMPLab at UC Berkeley (2009), open-sourced in 2010, donated to the Apache Software Foundation in 2013
  • Compatible with Hadoop HDFS
  • Designed to be faster and more general purpose than Hadoop MapReduce

MapReduce

MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.

MapReduce Example

Determine the maximum temperature for each city in our data set, which is split across four input files.

File 1:           File 2:           File 3:           File 4:
Toronto, 20       Toronto, 21       Toronto, 19       Toronto, 22
Whitby, 25        Whitby, 24        Whitby, 26        Whitby, 26
Brooklyn, 22      Brooklyn, 23      Brooklyn, 19      Brooklyn, 21
Rome, 32          Rome, 35          Rome, 34          Rome, 34
Toronto, 4        Toronto, 5        Toronto, 6        Toronto, 2
Rome, 33          Rome, 36          Rome, 31          Rome, 30
Brooklyn, 18      Brooklyn, 14      Brooklyn, 16      Brooklyn, 20

1. Map() for each file returns the maximum for each city in that file

File 1:           File 2:           File 3:           File 4:
Toronto, 20       Toronto, 21       Toronto, 19       Toronto, 22
Whitby, 25        Whitby, 24        Whitby, 26        Whitby, 26
Brooklyn, 22      Brooklyn, 23      Brooklyn, 19      Brooklyn, 21
Rome, 33          Rome, 36          Rome, 34          Rome, 34

2. Reduce() returns the maximum for each city across all of the Map() results

Toronto, 22
Whitby, 26
Brooklyn, 23
Rome, 36
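
For comparison, here is a sketch of how the same job might look in Spark (the API is covered later in this deck). The file name is illustrative and assumes one "City, temperature" reading per line.

val readings = sc.textFile("temperatures.txt")            // e.g. "Toronto, 20"
val maxTemps = readings
  .map(line => line.split(","))
  .map(parts => (parts(0).trim, parts(1).trim.toInt))     // (city, temperature) pairs
  .reduceByKey((a, b) => math.max(a, b))                  // keep the maximum per city
maxTemps.collect().foreach(println)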

Advantages Over MapReduce

  • Single platform contains multiple tools
  • Interactive Queries
  • Improved Performance
  • Simpler Infrastructure Management
  • Clean, concise APIs in Scala, Java and Python

Spark Components

  • Spark Core
  • Spark Streaming
  • Spark Machine Learning (MLlib)
  • Spark SQL
  • GraphX

Spark Streaming

  • Extension of the core Spark API
  • Fault-tolerant, high throughput processing of real-time data
  • Diverse data ingestion (Kafka, Flume, Twitter, ZeroMQ, socket)
  • Complex processing of batched data (map, reduce, join)
  • Output to filesystem, dashboards, databases or to other Spark tooling
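
As a quick sketch, a streaming word count over text arriving on a socket might look like the following (the host, port and batch interval are illustrative):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))        // 10-second micro-batches
val lines = ssc.socketTextStream("localhost", 9999)    // ingest text from a socket
val counts = lines.flatMap(_.split(" "))
                  .map(word => (word, 1))
                  .reduceByKey(_ + _)
counts.print()                                         // write each batch to stdout
ssc.start()
ssc.awaitTermination()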

Machine Learning (MLlib)

  • MLlib is a Spark implementation of some common machine learning algorithms and utilities
  • Standard component of Spark
  • Includes common algorithms for classification, regression, clustering and collaborative filtering
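
For example, clustering numeric feature vectors with MLlib's k-means might look like this sketch (the input path, number of clusters and iteration count are illustrative):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// One whitespace-separated feature vector per line, e.g. "0.0 0.1 0.2"
val data = sc.textFile("features.txt")
             .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
             .cache()
val model = KMeans.train(data, 2, 20)    // k = 2 clusters, 20 iterations
println("Cost: " + model.computeCost(data))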

Spark SQL

  • Alpha component of Spark 1.0.2
  • Allows relational queries expressed in SQL, HiveQL, or Scala to be executed using Spark
  • Based on a special type of RDD, SchemaRDD
  • SchemaRDDs can be created from Parquet, JSON or results of HiveQL
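
A minimal sketch (the file and schema are made up): give an RDD of case classes a schema, register it as a table, and query it with SQL. Note that in Spark 1.0.x the registration method is registerAsTable.

import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD                    // implicit RDD -> SchemaRDD conversion

val people = sc.textFile("people.txt")               // lines like "Alice,30"
               .map(_.split(","))
               .map(p => Person(p(0), p(1).trim.toInt))
people.registerAsTable("people")

val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
adults.map(row => "Name: " + row(0)).collect().foreach(println)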

Spark GraphX

  • Alpha component of Spark
  • Enables users to interactively load, transform and compute on massive graph structures
  • Fault-tolerant, In-Memory
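
A small sketch of building a property graph and running a built-in algorithm (the vertices and edges are made up):

import org.apache.spark.graphx._

val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))

val graph = Graph(vertices, edges)
println("edges: " + graph.numEdges)
val ranks = graph.pageRank(0.001).vertices           // (vertexId, rank) pairs
ranks.collect().foreach(println)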

Spark Architecture

  • Resilient Distributed Dataset (RDD)
  • Directed Acyclic Graph (DAG) Execution Engine
  • ClosureCleaner

Closure & ClosureCleaner

  • A closure is a table storing a reference to each of the non-local variables of a function
  • Scala sometimes errs on the side of capturing too many outer variables
  • ClosureCleaner traverses the object at runtime and prunes the unnecessary references
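
An illustrative (made-up) example of the kind of over-capture involved: referencing a field inside a closure drags the whole enclosing object into it. The ClosureCleaner prunes what it safely can, but copying the field to a local val keeps the closure small in the first place.

class WordFilter(val keyword: String) {
  def matching(lines: org.apache.spark.rdd.RDD[String]) = {
    // Using `keyword` directly would capture `this` (the whole WordFilter);
    // a local copy means only the string is shipped to the executors.
    val kw = keyword
    lines.filter(line => line.contains(kw))
  }
}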

Concept - RDD

  • Spark's primary abstraction
  • Collection of objects spread across a cluster
  • Allows for in-memory computations on large datasets
  • Fault-tolerant - Automatically rebuilt on failure
  • Controllable Persistence

RDD Operations

  • Transformations - create new datasets from input (e.g. map, flatMap, filter, union, join, reduceByKey)
  • Actions - return a value after executing calculations on the dataset (e.g. reduce, collect, take, count)

Types of RDDs

  • Parallelized collections that are based on existing Scala collections
  • Hadoop datasets that are created from the files stored on HDFS

Concept - Execution Engine

  • Graph of tasks to execute and where to execute them
  • Related to RDD lineage

Spark Application Anatomy

  • Spark applications are independent sets of processes on a cluster, coordinated by the SparkContext in a Driver program
  • SparkContext can connect to several types of cluster managers
  • Once connected, Spark acquires executors on cluster nodes
  • It then sends application code (defined by JAR or Python files passed to SparkContext) to the executors
  • Finally, SparkContext sends tasks for the executors to run
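
A minimal sketch of a standalone driver program (the application name, master URL and input path are illustrative); it would typically be packaged as a JAR and launched with bin/spark-submit:

import org.apache.spark.{SparkConf, SparkContext}

object MyApp {
  def main(args: Array[String]) {
    // The driver: its SparkContext coordinates the executors on the cluster
    val conf = new SparkConf()
      .setAppName("My Spark App")
      .setMaster("spark://master-host:7077")   // or "local[*]" on a single machine
    val sc = new SparkContext(conf)

    val data = sc.textFile("hdfs:///data/input.txt")
    println("lines: " + data.count())

    sc.stop()
  }
}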

Installing Spark

  • Spark runs on both Windows and Unix
  • All that is required is a JVM (Java 6+) with java on the system path, or JAVA_HOME set correctly
  • Get the binaries (http://spark.apache.org/downloads.html)
  • Untar
$ wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2-bin-hadoop2.tgz
$ tar xvf spark-1.0.2-bin-hadoop2.tgz
$ ln -s spark-1.0.2-bin-hadoop2 spark

You will also need some sort of data source on which Spark can operate, such as HDFS, Cassandra or plain text files.

What's In the Box?

  • bin - directory containing scripts related to the Spark shell and submitting jobs to Spark
  • sbin - directory containing artifacts related to clusters
  • conf - directory containing configuration files
  • logs - directory containing Spark log files
  • Also a web console for the standalone cluster (http://localhost:8080)
-rw-r--r--@ 318K Jul 25 15:30 CHANGES.txt
-rw-r--r--@  29K Jul 25 15:30 LICENSE
-rw-r--r--@  22K Jul 25 15:30 NOTICE
-rw-r--r--@ 4.1K Jul 25 15:30 README.md
-rw-r--r--@  35B Jul 25 15:30 RELEASE
drwxr-xr-x@ 612B Jul 25 15:30 bin
drwxr-xr-x@ 340B Aug 21 16:03 conf
drwxr-xr-x@ 238B Jul 25 15:30 ec2
drwxr-xr-x@ 102B Jul 25 15:30 examples
drwxr-xr-x@ 238B Jul 25 15:30 lib
drwxr-xr-x  476B Sep  4 20:05 logs
drwxr-xr-x@ 306B Jul 25 15:30 python
drwxr-xr-x@ 544B Jul 25 15:30 sbin
drwxr-xr-x  2.6K Aug 28 13:19 work

Using Spark Shell

  • Simple way to learn the API
  • Powerful tool to analyze data interactively
  • There is a shell for Scala (bin/spark-shell) and a shell for Python (bin/pyspark)
$ ./bin/spark-shell
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.0.2
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_65)
Type in expressions to have them evaluated.
Type :help for more information.
14/09/07 16:33:35 WARN NativeCodeLoader: Unable to load native-hadoop library for your 
platform... using builtin-java classes where applicable
Spark context available as sc.

scala>

Spark Shell Demo

Basic Spark Operations

  • Creating RDDs
scala> val textFile = sc.textFile("README.md")
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3
scala> val fruits = sc.parallelize(List("apples", "bananas"))
fruits: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize 
at <console>:12
  • Transformations
scala> val sparkLines = textFile.filter(line => line.contains("spark"))
sparkLines: org.apache.spark.rdd.RDD[String] = FilteredRDD[2] at filter 
at <console>:14
  • Actions
scala> sparkLines.count()
res0: Long = 8

Lazy Evaluation

  • Spark does not actually execute transformations until it sees an action
  • Internally, it records metadata to indicate that the operation has been requested
  • This applies both to loading RDDs and to Transformations
  • Another improvement over MapReduce
  • Users are free to organize their programs into smaller, more manageable operations
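
A small illustration (the file name is made up): the first two lines only record lineage, and no data is read or filtered until the action on the last line runs.

val logs = sc.textFile("app.log")                // nothing is read yet
val errors = logs.filter(_.contains("ERROR"))    // still nothing executed
val firstError = errors.first()                  // action: load and filter happen now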

Passing Functions to Spark

  • API relies heavily on passing functions in the driver program to run on the cluster
  • Applies to most Transformations and some Actions
  • Different mechanisms in different languages

Function Passing - Scala

  • Anonymous function syntax
val lines = sc.textFile("README.md")
val totalLength = lines.map(x => x.length()).reduce((x, y) => x + y)
  • Static methods in a global singleton object
object CounterHelper {
  def myLengthFunction(x: String): Int = x.length

  def myAccumulator(x: Int, y: Int): Int = x + y
}

val lines = sc.textFile("README.md")
val totalLength = lines.map(x => CounterHelper.myLengthFunction(x))
                       .reduce((x, y) => CounterHelper.myAccumulator(x, y))

Function Passing - Java

  • Represented by classes implementing interfaces in org.apache.spark.api.java.function package
  • Two ways to implement:

    Implement the Function interfaces inline, in your own anonymous inner class or named class and pass an instance of it to Spark

    or

    Use Java 8 Lambda expressions

     

Function Passing Java Examples

  • Inline anonymous inner class

JavaRDD<String> lines = sc.textFile("README.md");
JavaRDD<Integer> lineLengths = lines.map(new Function<String, Integer>() {
  public Integer call(String s) { return s.length(); }
});
int totalLength = lineLengths.reduce(new Function2<Integer, Integer, Integer>() {
  public Integer call(Integer a, Integer b) { return a + b; }
});

  • Named classes implementing the Function interfaces

class GetLength implements Function<String, Integer> {
  public Integer call(String s) { return s.length(); }
}
class Sum implements Function2<Integer, Integer, Integer> {
  public Integer call(Integer a, Integer b) { return a + b; }
}

JavaRDD<String> lines = sc.textFile("README.md");
JavaRDD<Integer> lineLengths = lines.map(new GetLength());
int totalLength = lineLengths.reduce(new Sum());

  • Java 8 Lambda Expressions

JavaRDD<String> lines = sc.textFile("README.md");
JavaRDD<Integer> lineLengths = lines.map(s -> s.length());
int totalLength = lineLengths.reduce((a, b) -> a + b);

Common Transformations

  • filter() - takes a function and returns an RDD whose elements pass the filter function
  • map() - takes a function and returns an RDD whose elements are the result of the function being applied to each element in the original RDD
  • flatMap() - takes a function that returns a sequence (or iterator) for each element, and returns an RDD containing all of those results flattened into a single collection
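
A quick sketch contrasting the three (the input values are made up):

val lines = sc.parallelize(Seq("to be or", "not to be"))
val words = lines.flatMap(line => line.split(" "))       // "to", "be", "or", "not", "to", "be"
val lengths = words.map(word => word.length)             // 2, 2, 2, 3, 2, 2
val longWords = words.filter(word => word.length > 2)    // "not"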

More Common Transformations

  • distinct() - removes duplicates
  • union() - produce an RDD containing all elements from both RDDs
  • intersection() - produce an RDD containing only elements found in both RDDs
  • subtract() - produce an RDD containing only the elements of this RDD that do not appear in the other RDD
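
A sketch with made-up values (result ordering may vary):

val a = sc.parallelize(Seq(1, 2, 2, 3))
val b = sc.parallelize(Seq(3, 4))
a.distinct().collect()        // Array(1, 2, 3)
a.union(b).collect()          // Array(1, 2, 2, 3, 3, 4) - duplicates are kept
a.intersection(b).collect()   // Array(3)
a.subtract(b).collect()       // Array(1, 2, 2)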

Common RDD Actions

  • reduce() - takes a function which operates on two elements of the same type within an RDD and returns a new element of the same type 
  • fold() - like reduce() but also takes a "zero value" to be used for the initial call on each partition
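
For example (values are made up):

val nums = sc.parallelize(Seq(1, 2, 3, 4))
val sum = nums.reduce((x, y) => x + y)       // 10
val sum2 = nums.fold(0)((x, y) => x + y)     // 10, with 0 as the zero value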

More Common RDD Actions

  • aggregate() - takes a zero value (like fold), a function to combine the elements from an RDD with the accumulator and a function to merge two accumulators
val result = input.aggregate((0, 0))(
               (x, y) => (x._1 + y, x._2 + 1),
               (x, y) => (x._1 + y._1, x._2 + y._2))
val avg = result._1 / result._2.toDouble

More Common RDD Actions

  • foreach() - Apply a provided function to each element of the RDD
  • collect() - Return all elements from an RDD
  • count() - Returns the number of elements in an RDD
  • take(num) - Returns the first num elements from an RDD
  • top(num) - Returns the largest num elements from an RDD, using natural ordering
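
A quick sketch (values are made up):

val nums = sc.parallelize(Seq(5, 1, 4, 2, 3))
nums.count()            // 5
nums.collect()          // Array(5, 1, 4, 2, 3)
nums.take(2)            // Array(5, 1) - the first two elements
nums.top(2)             // Array(5, 4) - the two largest, by natural ordering
nums.foreach(println)   // runs on the executors, not collected to the driver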

Persistence (Caching)

  • Recall that Spark RDDs use lazy evaluation
  • Without persistence, using the same RDD in multiple actions causes Spark to recompute the RDD and all of its dependencies each time
  • When an RDD is persisted, the nodes that compute it store their partitions
  • Spark re-computes lost partitions on failure (when needed)
  • Can also replicate across nodes to mitigate the performance hit in the event of failure

Persistence is a key tool for iterative algorithms and fast interactive use

Avoid re-computing by utilizing persistence
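
A minimal sketch of caching a reused RDD (file name as in the earlier shell examples):

import org.apache.spark.storage.StorageLevel

val sparkLines = sc.textFile("README.md").filter(_.contains("spark"))

sparkLines.cache()                                     // shorthand for persist(MEMORY_ONLY)
// sparkLines.persist(StorageLevel.MEMORY_AND_DISK)    // or choose a level explicitly

sparkLines.count()       // first action computes and caches the partitions
sparkLines.collect()     // reuses the cache instead of re-reading and re-filtering
sparkLines.unpersist()   // remove it from the cache when done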

Spark Persistence - Levels

  • MEMORY_ONLY - Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.
  • MEMORY_ONLY_SER - Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.
  • MEMORY_AND_DISK - Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
  • MEMORY_AND_DISK_SER - Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed.
  • DISK_ONLY - Store the RDD partitions only on disk.

Spark Persistence - Levels

Level                 Space Used   CPU Time   In Memory   On Disk   Comments
MEMORY_ONLY           High         Low        Yes         No
MEMORY_ONLY_SER       Low          High       Yes         No
MEMORY_AND_DISK       High         Medium     Some        Some      Spills to disk if there is too much data
MEMORY_AND_DISK_SER   Low          High       Some        Some      Spills to disk if there is too much data
DISK_ONLY             Low          High       No          Yes

Spark Persistence - Notes

  • If you try to cache too much, Spark automatically evicts old partitions using a Least Recently Used (LRU) policy
  • For memory-only levels, evicted partitions are recomputed the next time they are accessed; for memory-and-disk levels they are written out to disk instead
  • So asking Spark to cache too much data will not break a job
  • Caching data unnecessarily can still evict useful partitions and lead to excessive re-computation
  • Use unpersist() to manually remove an RDD from the cache

Working With Pair RDDs

  • groupByKey() - groups together values with the same key; the values are returned as an iterable per key
  • reduceByKey() - returns an RDD where the values for each key are aggregated using the given reduce function
  • sortByKey() - returns an RDD sorted by key
  • keys() - returns an RDD of just the keys
  • values() - returns an RDD of just the values
  • mapValues() - applies a function to each value of a pair RDD without changing the key

Working with Pair RDDs

  • join() - performs an inner join between two RDDs (also a rightOuterJoin() and leftOuterJoin())
  • subtractByKey() - removes elements with a key present in the other RDD
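
A small sketch of these operations on made-up pair RDDs (result ordering may vary):

val scores = sc.parallelize(Seq(("p1", 2), ("p1", 1), ("p2", 5)))
val names  = sc.parallelize(Seq(("p1", "Alice"), ("p2", "Bob")))

scores.reduceByKey(_ + _).collect()              // Array((p1,3), (p2,5))
scores.groupByKey().mapValues(_.sum).collect()   // same result, but less efficient
scores.sortByKey().keys.collect()                // Array(p1, p1, p2)
scores.join(names).collect()                     // Array((p1,(2,Alice)), (p1,(1,Alice)), (p2,(5,Bob)))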

Use Case - Readmission Risk

As part of the Affordable Care Act, the Centers for Medicare & Medicaid Services (CMS) began the Bundled Payments for Care Improvement initiative (BPCI) and introduced several models to be tested in the US.

In BPCI Model 2, the selected episodes of care will include the inpatient stay in the acute care hospital and all related services during the episode. The episode will end either 30, 60, or 90 days after hospital discharge. 

Research has shown that bundled payments can align incentives for providers – hospitals, post-acute care providers, doctors, and other practitioners – to partner closely across all specialties and settings that a patient may encounter to improve the patient’s experience of care during a hospital stay in an acute care hospital, and during post-discharge recovery.

  • Instead of being paid per procedure, hospitals now are reimbursed once for the entire episode of care.
  • Forces hospitals, and everyone else involved in the episode during and after a procedure, to focus on outcomes

Use Case - Readmission Risk

  • One measure that hospitals could assess in this model is the risk of readmission after a procedure
  • The goal would be to identify the highest-risk patients to help control costs and achieve better outcomes

Use Case - Readmission Risk

  • Hemoglobin < 12 g/dL  =  1 point
  • Sodium level < 135 mEq/L  =  1 point
  • Non-elective admission  =  1 point
  • Length of stay >= 5 days  =  2 points
  • 1-5 Admissions last 12 months  =  2 points
  • > 5 Admissions last 12 months  =  5 points
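
As a plain (non-Spark) sketch, the rules above amount to a simple scoring function; the case class and field names are made up for illustration, assuming the per-patient values have already been gathered:

case class PatientSummary(hemoglobin: Double, sodium: Double, nonElective: Boolean,
                          lengthOfStay: Int, admissionsLast12Months: Int)

def readmissionScore(p: PatientSummary): Int = {
  var score = 0
  if (p.hemoglobin < 12) score += 1                // g/dL
  if (p.sodium < 135) score += 1                   // mEq/L
  if (p.nonElective) score += 1
  if (p.lengthOfStay >= 5) score += 2
  if (p.admissionsLast12Months > 5) score += 5
  else if (p.admissionsLast12Months >= 1) score += 2
  score
}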

Readmission Risk Scoring

Cassandra Source Schema

Get abnormal lab results

Scala

val results = sc.cassandraTable("pjug", "pat_results").cache()
val abnormalHGB = results
    .filter(x => x.getString("test_name") == "HGB")
    .filter(y => y.getDouble("test_value") < 12)
    
val abnormalNA = results
    .filter(x => x.getString("test_name") == "NA")
    .filter(y => y.getDouble("test_value") < 135)
Java

JavaRDD<CassandraRow> abnormalHGB = patients.filter(new Function<CassandraRow, Boolean>() {
	@Override
	public Boolean call(CassandraRow row) throws Exception {
		return ("HGB".equals(row.getString("test_name")) &&
			(row.getDouble("test_value") < 12));
	}
});

JavaRDD<CassandraRow> abnormalNA = patients.filter(new Function<CassandraRow, Boolean>() {
	@Override
	public Boolean call(CassandraRow row) throws Exception {
		return ("NA".equals(row.getString("test_name")) &&
			(row.getDouble("test_value") < 135));
	}
});


Find Admissions (1 - 5 times)

Scala

val admit1to5 = encounters
    .select("patient_id")
    .map(row => (row.getString("patient_id"), 1))
    .reduceByKey((x,y) => x + y)
    .filter(row => (row._2 >= 1 && row._2 <= 5))
    .map(x => (x._1, 2))
Java

JavaPairRDD<String, Integer> admit1to5 = encounters
    .select("patient_id")
    .mapToPair(new PairFunction<CassandraRow, String, Integer>() {
	@Override
	public Tuple2<String, Integer> call(CassandraRow arg0) throws Exception {
		return new Tuple2<String, Integer>(arg0.getString("patient_id"), 1);
	}}).reduceByKey(new Function2<Integer, Integer, Integer>() {
	@Override
	public Integer call(Integer arg0, Integer arg1) throws Exception {
		return (arg0 + arg1);
	}}).filter(new Function<Tuple2<String,Integer>, Boolean>() {
	@Override
	public Boolean call(Tuple2<String, Integer> arg0) throws Exception {
		return ((arg0._2>=1) && (arg0._2 <=5));
	}}).mapToPair(new PairFunction<Tuple2<String,Integer>, String, Integer>() {
	@Override
	public Tuple2<String, Integer> call(Tuple2<String, Integer> arg0) throws Exception {
		return new Tuple2<String, Integer>(arg0._1, 2);
	}
});


Calculating Total Score

Scala

val scoring = patients
    .union(abnormals)
    .union(longStay)
    .union(nonElective)
    .union(admit1to5)
    .union(admit5orGt)
    .reduceByKey((x,y) => x + y)
Java

JavaRDD<Score> scoresRDD = allPatients
	.union(abnormals)
	.union(longStay)
	.union(nonElective)
	.union(admit1to5)
	.union(admit5orGt)
	.reduceByKey(new Function2<Integer, Integer, Integer>() {
		@Override
		public Integer call(Integer arg0, Integer arg1) throws Exception {
			return arg0 + arg1;
		}}).map(new Function<Tuple2<String, Integer>, Score>() {
		@Override
		public Score call(Tuple2<String, Integer> input) throws Exception {
			return new Score(input._1(), input._2());
		}
});


Readmission Risk

Code Walkthrough and Demo

Q & A

Thank You!

Introduction to Apache Spark

By Jason R. Foster

In this session we will introduce Apache Spark, its origins, installation, architecture, tooling, programming constructs and how it addresses specific problems in Big Data applications. A specific use case will be presented as a demonstration of some of Spark's key concepts and core capabilities.
