
Jason Foster - Orion Healthcare
Orion Health is a global, independently owned eHealth software company with proven experience in delivering interoperable, connected solutions for healthcare facilities, organizations and regions.
The Scottsdale location is focused on BI & Analytics on a Big Data platform.
MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.

Determine maximum temperature for each city in our data set.
File 1:
Toronto, 20
Whitby, 25
Brooklyn, 22
Rome, 32
Toronto, 4
Rome, 33
Brooklyn, 18

File 2:
Toronto, 21
Whitby, 24
Brooklyn, 23
Rome, 35
Toronto, 5
Rome, 36
Brooklyn, 14

File 3:
Toronto, 19
Whitby, 26
Brooklyn, 19
Rome, 34
Toronto, 6
Rome, 31
Brooklyn, 16

File 4:
Toronto, 22
Whitby, 26
Brooklyn, 21
Rome, 34
Toronto, 2
Rome, 30
Brooklyn, 20

1. Map() for each file returns the maximum for each city in the file:

File 1: Toronto, 20 | Whitby, 25 | Brooklyn, 22 | Rome, 33
File 2: Toronto, 21 | Whitby, 24 | Brooklyn, 23 | Rome, 36
File 3: Toronto, 19 | Whitby, 26 | Brooklyn, 19 | Rome, 34
File 4: Toronto, 22 | Whitby, 26 | Brooklyn, 21 | Rome, 34

2. Reduce() returns the maximum for each city across all map results:

Toronto, 22
Whitby, 26
Brooklyn, 23
Rome, 36
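The two phases above can be sketched in plain Java, without Hadoop. This is a local illustration only; the helper names `fileMax` and `merge` are made up for this sketch:

```java
import java.util.*;
import java.util.stream.*;

public class MaxTemperature {
    // Phase 1 ("map"): compute the per-file maximum for each city.
    static Map<String, Integer> fileMax(List<String> lines) {
        return lines.stream()
            .map(l -> l.split(",\\s*"))
            .collect(Collectors.toMap(p -> p[0],
                                      p -> Integer.parseInt(p[1]),
                                      Math::max));   // keep the larger value per city
    }

    // Phase 2 ("reduce"): merge the per-file maxima into a global maximum.
    static Map<String, Integer> merge(List<Map<String, Integer>> partials) {
        Map<String, Integer> result = new HashMap<>();
        for (Map<String, Integer> m : partials)
            m.forEach((city, t) -> result.merge(city, t, Math::max));
        return result;
    }

    public static void main(String[] args) {
        List<String> file1 = List.of("Toronto, 20", "Whitby, 25", "Brooklyn, 22",
                                     "Rome, 32", "Toronto, 4", "Rome, 33", "Brooklyn, 18");
        List<String> file2 = List.of("Toronto, 21", "Whitby, 24", "Brooklyn, 23",
                                     "Rome, 35", "Toronto, 5", "Rome, 36", "Brooklyn, 14");
        System.out.println(merge(List.of(fileMax(file1), fileMax(file2))));
    }
}
```

In a real MapReduce job each map task runs on a different node against its own input split, and the framework shuffles the per-city results to the reducers; only the shape of the computation is the same here.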


$ wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2-bin-hadoop2.tgz
$ tar xvf spark-1.0.2-bin-hadoop2.tgz
$ ln -s spark-1.0.2-bin-hadoop2 spark

You will also need a data source, such as HDFS, Cassandra, or plain text files, for Spark to operate on.
-rw-r--r--@ 318K Jul 25 15:30 CHANGES.txt
-rw-r--r--@ 29K Jul 25 15:30 LICENSE
-rw-r--r--@ 22K Jul 25 15:30 NOTICE
-rw-r--r--@ 4.1K Jul 25 15:30 README.md
-rw-r--r--@ 35B Jul 25 15:30 RELEASE
drwxr-xr-x@ 612B Jul 25 15:30 bin
drwxr-xr-x@ 340B Aug 21 16:03 conf
drwxr-xr-x@ 238B Jul 25 15:30 ec2
drwxr-xr-x@ 102B Jul 25 15:30 examples
drwxr-xr-x@ 238B Jul 25 15:30 lib
drwxr-xr-x 476B Sep 4 20:05 logs
drwxr-xr-x@ 306B Jul 25 15:30 python
drwxr-xr-x@ 544B Jul 25 15:30 sbin
drwxr-xr-x 2.6K Aug 28 13:19 work

$ ./bin/spark-shell
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 1.0.2
/_/
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_65)
Type in expressions to have them evaluated.
Type :help for more information.
14/09/07 16:33:35 WARN NativeCodeLoader: Unable to load native-hadoop library for your
platform... using builtin-java classes where applicable
Spark context available as sc.
scala> val textFile = sc.textFile("README.md")
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3

scala> val fruits = sc.parallelize(List("apples", "bananas"))
fruits: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:12

scala> val sparkLines = textFile.filter(line => line.contains("spark"))
sparkLines: org.apache.spark.rdd.RDD[String] = FilteredRDD[2] at filter at <console>:14

scala> sparkLines.count()
res0: Long = 8

val lines = sc.textFile("README.md")
val lineLengths = lines.map(x => x.length()).reduce((x,y) => x + y)

object CounterHelper {
def myLengthFunction(x: String) : Int = {
return x.length();
}
def myAccumulator(x: Int, y: Int) : Int = {
return x + y
}
}
val lines = sc.textFile("README.md")
val totalLengths = lines.map(x => CounterHelper.myLengthFunction(x))
.reduce((x,y) => CounterHelper.myAccumulator(x, y))

JavaRDD<String> lines = sc.textFile("README.md");
JavaRDD<Integer> lineLengths = lines.map(new Function<String, Integer>() {
public Integer call(String s) { return s.length(); }
});
int totalLength = lineLengths.reduce(new Function2<Integer, Integer, Integer>() {
public Integer call(Integer a, Integer b) { return a + b; }
});

class GetLength implements Function<String, Integer> {
public Integer call(String s) { return s.length(); }
}
class Sum implements Function2<Integer, Integer, Integer> {
public Integer call(Integer a, Integer b) { return a + b; }
}
JavaRDD<String> lines = sc.textFile("README.md");
JavaRDD<Integer> lineLengths = lines.map(new GetLength());
int totalLength = lineLengths.reduce(new Sum());

JavaRDD<String> lines = sc.textFile("README.md");
JavaRDD<Integer> lineLengths = lines.map(s -> s.length());
int totalLength = lineLengths.reduce((a, b) -> a + b);
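The same map/reduce shape works on an ordinary in-memory collection; a minimal sketch with Java 8 streams (no Spark), using a made-up `totalLength` helper:

```java
import java.util.List;

public class LineLengths {
    // Same map/reduce shape as the Spark example, but over a local list.
    static int totalLength(List<String> lines) {
        return lines.stream()
                    .map(String::length)         // map: line -> length
                    .reduce(0, Integer::sum);    // reduce: sum the lengths
    }

    public static void main(String[] args) {
        System.out.println(totalLength(List.of("apache", "spark"))); // prints 11
    }
}
```

The difference in Spark is that `map` and `reduce` run distributed across the cluster, and the `map` step is lazy until an action like `reduce` forces evaluation.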



// Compute an average as a (sum, count) pair with aggregate()
val result = input.aggregate((0, 0))(
  (x, y) => (x._1 + y, x._2 + 1),         // seqOp: fold one element into (sum, count)
  (x, y) => (x._1 + y._1, x._2 + y._2))   // combOp: merge per-partition (sum, count) pairs
val avg = result._1 / result._2.toDouble
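Why two functions? `aggregate` first folds each partition locally, then merges the per-partition results. A local Java sketch of that two-step shape (the `Acc` record and `avg` helper are invented for illustration):

```java
import java.util.List;

public class Average {
    // Accumulator mirrors the (sum, count) tuple in the Scala example.
    record Acc(int sum, int count) {
        Acc add(int x)   { return new Acc(sum + x, count + 1); }          // seqOp
        Acc merge(Acc o) { return new Acc(sum + o.sum, count + o.count); } // combOp
    }

    static double avg(List<List<Integer>> partitions) {
        Acc total = new Acc(0, 0);
        for (List<Integer> part : partitions) {
            Acc local = new Acc(0, 0);
            for (int x : part) local = local.add(x);  // fold within a partition
            total = total.merge(local);               // merge across partitions
        }
        return total.sum() / (double) total.count();
    }

    public static void main(String[] args) {
        System.out.println(avg(List.of(List.of(1, 2, 3), List.of(4, 5)))); // 3.0
    }
}
```

The merge step is needed because partitions are reduced on different executors; their partial `(sum, count)` pairs must still combine into one correct result.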

Persistence is a key tool for iterative algorithms and fast interactive use
Avoid re-computing by utilizing persistence
| Level | Meaning |
|---|---|
| MEMORY_ONLY | Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level. |
| MEMORY_ONLY_SER | Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read. |
| MEMORY_AND_DISK | Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed. |
| MEMORY_AND_DISK_SER | Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed. |
| DISK_ONLY | Store the RDD partitions only on disk. |
| Level | Space Used | CPU Time | In-Memory | On Disk | Comments |
|---|---|---|---|---|---|
| MEMORY_ONLY | High | Low | Y | N | |
| MEMORY_ONLY_SER | Low | High | Y | N | |
| MEMORY_AND_DISK | High | Medium | Some | Some | Spills to disk if too much data |
| MEMORY_AND_DISK_SER | Low | High | Some | Some | Spills to disk if too much data |
| DISK_ONLY | Low | High | N | Y | |


As part of the Affordable Care Act, the Centers for Medicare & Medicaid Services (CMS) began the Bundled Payments for Care Improvement initiative (BPCI) and introduced several models to be tested in the US.
In BPCI Model 2, the selected episodes of care will include the inpatient stay in the acute care hospital and all related services during the episode. The episode will end either 30, 60, or 90 days after hospital discharge.
Research has shown that bundled payments can align incentives for providers – hospitals, post-acute care providers, doctors, and other practitioners – to partner closely across all specialties and settings a patient may encounter, improving the patient's experience of care during the acute care hospital stay and during post-discharge recovery.
Readmission Risk Scoring

Scala:
val results = sc.cassandraTable("pjug", "pat_results").cache()
val abnormalHGB = results
  .filter(x => x.getString("test_name") == "HGB")
  .filter(x => x.getDouble("test_value") < 12)
val abnormalNA = results
  .filter(x => x.getString("test_name") == "NA")
  .filter(x => x.getDouble("test_value") < 135)

Java:
JavaRDD<CassandraRow> abnormalHGB = patients.filter(new Function<CassandraRow, Boolean>() {
  @Override
  public Boolean call(CassandraRow row) throws Exception {
    return "HGB".equals(row.getString("test_name")) &&
           (row.getDouble("test_value") < 12);
  }
});
JavaRDD<CassandraRow> abnormalNA = patients.filter(new Function<CassandraRow, Boolean>() {
  @Override
  public Boolean call(CassandraRow row) throws Exception {
    return "NA".equals(row.getString("test_name")) &&
           (row.getDouble("test_value") < 135);
  }
});
Scala:
val admit1to5 = encounters
  .select("patient_id")
  .map(row => (row.getString("patient_id"), 1))
  .reduceByKey((x,y) => x + y)
  .filter(row => (row._2 >= 1 && row._2 <= 5))
  .map(x => (x._1, 2))

Java:
JavaPairRDD<String, Integer> admit1to5 = encounters
  .select("patient_id")
  .mapToPair(new PairFunction<CassandraRow, String, Integer>() {
    @Override
    public Tuple2<String, Integer> call(CassandraRow arg0) throws Exception {
      return new Tuple2<String, Integer>(arg0.getString("patient_id"), 1);
    }
  })
  .reduceByKey(new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer arg0, Integer arg1) throws Exception {
      return arg0 + arg1;
    }
  })
  .filter(new Function<Tuple2<String, Integer>, Boolean>() {
    @Override
    public Boolean call(Tuple2<String, Integer> arg0) throws Exception {
      return (arg0._2 >= 1) && (arg0._2 <= 5);
    }
  })
  .mapToPair(new PairFunction<Tuple2<String, Integer>, String, Integer>() {
    @Override
    public Tuple2<String, Integer> call(Tuple2<String, Integer> arg0) throws Exception {
      return new Tuple2<String, Integer>(arg0._1, 2);
    }
  });
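The count-then-bucket shape of this pipeline can be illustrated locally with Java streams (no Spark or Cassandra; `score1to5` and the point value are made up for the sketch):

```java
import java.util.*;
import java.util.function.Function;
import java.util.stream.*;

public class AdmitScore {
    // Count admissions per patient, then award 2 points to patients with
    // 1-5 admissions -- the same shape as the RDD pipeline above.
    static Map<String, Integer> score1to5(List<String> patientIds) {
        Map<String, Long> counts = patientIds.stream()   // one entry per admission
            .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
        return counts.entrySet().stream()
            .filter(e -> e.getValue() >= 1 && e.getValue() <= 5)   // keep 1-5 admissions
            .collect(Collectors.toMap(Map.Entry::getKey, e -> 2)); // award 2 points
    }

    public static void main(String[] args) {
        // p1 has 2 admissions (kept), p2 has 6 (dropped by the filter)
        List<String> ids = List.of("p1", "p1", "p2", "p2", "p2", "p2", "p2", "p2");
        System.out.println(score1to5(ids));
    }
}
```

`groupingBy` + `counting` plays the role of `mapToPair` + `reduceByKey`: emit a 1 per admission, then sum per patient key.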
Scala:
val scoring = patients
  .union(abnormals)
  .union(longStay)
  .union(nonElective)
  .union(admit1to5)
  .union(admit5orGt)
  .reduceByKey((x,y) => x + y)

Java:
JavaRDD<Score> scoresRDD = allPatients
  .union(abnormals)
  .union(longStay)
  .union(nonElective)
  .union(admit1to5)
  .union(admit5orGt)
  .reduceByKey(new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer arg0, Integer arg1) throws Exception {
      return arg0 + arg1;
    }
  })
  .map(new Function<Tuple2<String, Integer>, Score>() {
    @Override
    public Score call(Tuple2<String, Integer> input) throws Exception {
      return new Score(input._1(), input._2());
    }
  });
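The union + reduceByKey step amounts to summing each patient's points across all the criterion RDDs. A local sketch of that merge (plain Java; the `combine` helper is invented for illustration):

```java
import java.util.*;

public class RiskScore {
    // Merge per-criterion point maps by summing each patient's points,
    // mirroring union(...) followed by reduceByKey(_ + _).
    static Map<String, Integer> combine(List<Map<String, Integer>> criteria) {
        Map<String, Integer> total = new HashMap<>();
        for (Map<String, Integer> m : criteria)
            m.forEach((patient, pts) -> total.merge(patient, pts, Integer::sum));
        return total;
    }

    public static void main(String[] args) {
        Map<String, Integer> abnormals = Map.of("p1", 1);
        Map<String, Integer> longStay  = Map.of("p1", 3, "p2", 3);
        // p1 scores 1 + 3 = 4; p2 scores 3
        System.out.println(combine(List.of(abnormals, longStay)));
    }
}
```

Starting the union from a base set of all patients (with zero points) guarantees every patient appears in the final scoring, even those matching no risk criterion.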
