Spark Programming
Intro to Spark
Spark started as a research project in the UC Berkeley RAD Lab, which later became the AMPLab.
Spark Programming
Intro to Spark ML
Spark Programming
Natural Language Processing Lab
Spark Programming
TF-IDF & K-Means Lab
Spark Programming
Power Plant Labs
// Sanity check: this lab requires Spark 1.4.0 or later
val versionOk = sc.version.replace(".", "").toInt >= 140
require(versionOk, "Spark 1.4.0+ is required for this lab.")
The first step in any machine learning task is to understand the business need.
As described in the overview, we are trying to predict power output given a set of readings from various sensors in a gas-fired power generation plant.
This is a regression problem, since the label (or target) we are trying to predict, the power output PE, is numeric.
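As a quick orientation (a minimal sketch, assuming the power_plant table used throughout this lab has already been registered), you can confirm that the four sensor readings and the PE label are all numeric columns:
// Sketch: inspect the schema of the power_plant table before modelling
val powerPlant = sqlContext.table("power_plant")
powerPlant.printSchema() // AT, V, AP, RH (sensor readings) and PE (the numeric label)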
%sql
SELECT * FROM power_plant
%sql
DESC power_plant
// Summary statistics (count, mean, stddev, min, max) for every numeric column
val tempDF = sqlContext.table("power_plant").describe()
%sql
SELECT AT AS Temperature, PE AS Power FROM power_plant
%sql
SELECT V AS ExhaustVacuum, PE AS Power FROM power_plant
%sql
SELECT AP AS Pressure, PE AS Power FROM power_plant
%sql
SELECT RH AS Humidity, PE AS Power FROM power_plant
import org.apache.spark.ml.feature.VectorAssembler
// Assemble the four sensor readings into a single "features" vector column
val dataset = sqlContext.table("power_plant")
val vectorizer = new VectorAssembler()
vectorizer.setInputCols(Array("AT", "V", "AP", "RH"))
vectorizer.setOutputCol("features")
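To see what the assembler produces (a quick sketch; the same transform is applied automatically when the pipelines below are fit), transform the dataset and look at the new column:
// Sketch: the assembler packs AT, V, AP and RH into one vector per row
vectorizer.transform(dataset).select("features", "PE").show(5)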
// Create a 20/80 split
val Array(split20, split80) = dataset.randomSplit(Array(0.20, 0.80), 1800009193L)
// Cache our data
val testSet = split20.cache()
val trainingSet = split80.cache()
// materialize our caches
testSet.count()
trainingSet.count()
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.regression.LinearRegressionModel
import org.apache.spark.ml.Pipeline
val lr = new LinearRegression()
// Show the documentation for all of the estimator's parameters
lr.explainParams()
// Configure the label and prediction columns and the training parameters
lr.setPredictionCol("Predicted_PE")
  .setLabelCol("PE")
  .setMaxIter(100)
  .setRegParam(0.1)
// Chain the vectorizer and the linear regression into one pipeline, then fit it
val lrPipeline = new Pipeline()
lrPipeline.setStages(Array(vectorizer, lr))
val lrModel = lrPipeline.fit(trainingSet)
// Pull the fitted LinearRegressionModel back out of the pipeline
val lrm = lrModel.stages(1).asInstanceOf[LinearRegressionModel]
val intercept = lrm.intercept
val weights = lrm.weights.toArray
// Pair each weight with its feature name (every column except the label PE)
val featuresNoLabel = dataset.columns.filter(col => col != "PE")
val coefficients = sc.parallelize(weights).zip(sc.parallelize(featuresNoLabel))
// Build a human-readable equation: y = intercept +/- (|weight| * feature) ...
var equation = s"y = $intercept "
coefficients.sortByKey().collect().foreach { x =>
  val weight = Math.abs(x._1)
  val name = x._2
  val symbol = if (x._1 > 0) "+" else "-"
  equation += s" $symbol ($weight * $name)"
}
println("Linear Regression Equation: " + equation)
val predictionsAndLabels = lrModel.transform(testSet)
val predictions = predictionsAndLabels.select("AT", "V", "AP", "RH", "PE", "Predicted_PE")
display(predictions)
import org.apache.spark.mllib.evaluation.RegressionMetrics
// Evaluate the predictions against the actual PE values in the test set
val rowRDD = predictionsAndLabels.select("Predicted_PE", "PE").rdd
val results = rowRDD.map(r => (r(0).asInstanceOf[Double], r(1).asInstanceOf[Double]))
val metrics = new RegressionMetrics(results)
printf("Root Mean Squared Error: %s\n", metrics.rootMeanSquaredError)
printf("Explained Variance: %s\n", metrics.explainedVariance)
printf("R2: %s\n", metrics.r2)
println("=" * 40)
val tempDF = predictionsAndLabels.selectExpr(
  "PE",
  "Predicted_PE",
  "PE - Predicted_PE AS Residual_Error",
  s"abs(PE - Predicted_PE) / ${metrics.rootMeanSquaredError} AS Within_RMSE")
tempDF.registerTempTable("Power_Plant_RMSE_Evaluation") // for later
display(tempDF)
The query below buckets each test prediction by how many multiples of the RMSE its absolute error falls within:
%sql
SELECT ceiling(Within_RMSE) AS Within_RMSE,
       count(*) AS count
FROM Power_Plant_RMSE_Evaluation
GROUP BY ceiling(Within_RMSE)
Let's try to build a better model by tuning over several regularization parameter values to see if we can get better results.
import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}
import org.apache.spark.ml.evaluation._
// Evaluator: score each candidate model by RMSE on the held-out fold
val regEval = new RegressionEvaluator()
regEval.setLabelCol("PE")
  .setPredictionCol("Predicted_PE")
  .setMetricName("rmse")
// 5-fold cross-validation over the linear regression pipeline
val crossval = new CrossValidator()
crossval.setEstimator(lrPipeline)
crossval.setNumFolds(5)
crossval.setEvaluator(regEval)
// Grid of regularization parameters to try: 0.01, 0.02, ..., 0.10
val regParam = (1 to 10).toArray.map(x => x / 100.0)
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, regParam)
  .build()
crossval.setEstimatorParamMaps(paramGrid)
val cvModel = crossval.fit(trainingSet)
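To see what the cross-validation actually selected (a quick sketch, assuming the best model is the fitted linear regression pipeline built above), you can pull the tuned stage back out of cvModel.bestModel:
import org.apache.spark.ml.PipelineModel
// Sketch: inspect the linear regression stage of the winning pipeline
val bestLr = cvModel.bestModel
  .asInstanceOf[PipelineModel]
  .stages(1)
  .asInstanceOf[LinearRegressionModel]
println(s"Best model intercept: ${bestLr.intercept}")
println(s"Best model weights:   ${bestLr.weights}")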
val predictionsAndLabels = cvModel.transform(testSet)
val result = predictionsAndLabels
.select("Predicted_PE", "PE").rdd
.map(r => (r(0).asInstanceOf[Double], r(1).asInstanceOf[Double]))
val metrics = new RegressionMetrics(result)
printf(s"Root Mean Squared Error: %s\n", metrics.rootMeanSquaredError)
printf(s"Explained Variance: %s\n", metrics.explainedVariance)
printf(s"R2: %s\n", metrics.r2)
println("="*40)
A decision tree builds a model by recursively splitting on the input variables, forming a tree structure. We will start with a single decision tree model.
import org.apache.spark.ml.regression.DecisionTreeRegressor
val dt = new DecisionTreeRegressor()
dt.setLabelCol("PE")
dt.setPredictionCol("Predicted_PE")
dt.setFeaturesCol("features")
dt.setMaxBins(100)
val dtPipeline = new Pipeline()
dtPipeline.setStages(Array(vectorizer, dt))
// Reuse the cross-validator, swapping in the decision tree pipeline and a new grid
crossval.setEstimator(dtPipeline)
val paramGrid = new ParamGridBuilder()
  .addGrid(dt.maxDepth, Array(2, 3))
  .build()
crossval.setEstimatorParamMaps(paramGrid)
val dtModel = crossval.fit(trainingSet)
import org.apache.spark.ml.regression.DecisionTreeRegressionModel
import org.apache.spark.ml.PipelineModel
val predictionsAndLabels = dtModel.bestModel.transform(testSet)
val result = predictionsAndLabels
  .select("Predicted_PE", "PE").rdd
  .map(r => (r(0).asInstanceOf[Double], r(1).asInstanceOf[Double]))
val metrics = new RegressionMetrics(result)
printf("Root Mean Squared Error: %s\n", metrics.rootMeanSquaredError)
printf("Explained Variance: %s\n", metrics.explainedVariance)
printf("R2: %s\n", metrics.r2)
println("=" * 40)
dtModel.bestModel
.asInstanceOf[PipelineModel]
.stages
.last
.asInstanceOf[DecisionTreeRegressionModel]
.toDebugString
WARNING: This could take up to three minutes to run
import org.apache.spark.ml.regression.GBTRegressor
val gbt = new GBTRegressor()
gbt.setLabelCol("PE")
gbt.setPredictionCol("Predicted_PE")
gbt.setFeaturesCol("features")
gbt.setSeed(100088121L)
gbt.setMaxBins(30)
gbt.setMaxIter(30)
val gbtPipeline = new Pipeline()
gbtPipeline.setStages(Array(vectorizer, gbt))
// Reuse the cross-validator with the GBT pipeline and a new parameter grid
crossval.setEstimator(gbtPipeline)
val paramGrid = new ParamGridBuilder()
  .addGrid(gbt.maxDepth, Array(2, 3))
  .build()
crossval.setEstimatorParamMaps(paramGrid)
val gbtModel = crossval.fit(trainingSet)
import org.apache.spark.ml.regression.GBTRegressionModel
val predictionsAndLabels = gbtModel.bestModel.transform(testSet)
val results = predictionsAndLabels
  .select("Predicted_PE", "PE").rdd
  .map(r => (r(0).asInstanceOf[Double], r(1).asInstanceOf[Double]))
val metrics = new RegressionMetrics(results)
printf("Root Mean Squared Error: %s\n", metrics.rootMeanSquaredError)
printf("Explained Variance: %s\n", metrics.explainedVariance)
printf("R2: %s\n", metrics.r2)
println("=" * 40)
// Keep the best model (the tuned GBT pipeline) as the final model for the streaming example below
val finalModel = gbtModel.bestModel
// Imports for the streaming prediction example
import java.nio.ByteBuffer
import java.net._
import java.io._
import scala.io._
import sys.process._
import scala.collection.mutable.SynchronizedQueue
import org.apache.spark.Logging
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SaveMode
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.Minutes
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.StreamingContext.toPairDStreamFunctions
import org.apache.spark.streaming.receiver.Receiver
import org.apache.log4j.Logger
import org.apache.log4j.Level
import sqlContext._
import net.liftweb.json.DefaultFormats
import net.liftweb.json._
Create and start the StreamingContext.
val queue = new SynchronizedQueue[RDD[String]]()
def creatingFunc(): StreamingContext = {
  // Two-second micro-batches; remember the last five minutes of batches
  val batchInterval = Seconds(2)
  val ssc = new StreamingContext(sc, batchInterval)
  ssc.remember(Seconds(300))
  // Each RDD pushed onto the queue becomes one micro-batch of JSON strings
  val dstream = ssc.queueStream(queue)
  dstream.foreachRDD { rdd =>
    if (!rdd.isEmpty()) {
      // Parse the JSON readings, score them with the final model,
      // and append the predictions to a table
      finalModel.transform(read.json(rdd))
        .write
        .mode(SaveMode.Append)
        .saveAsTable("power_plant_predictions")
    }
  }
  ssc
}
val ssc = StreamingContext.getActiveOrCreate(creatingFunc)
ssc.start()
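With the context running, new sensor readings can be pushed onto the queue as JSON strings; each micro-batch is scored by finalModel and appended to the power_plant_predictions table. A minimal sketch (the reading below uses made-up values, purely for illustration):
// Sketch: push one simulated reading (hypothetical values) into the queue
val sampleReading = """{"AT": 25.0, "V": 55.0, "AP": 1010.0, "RH": 70.0}"""
queue += sc.parallelize(Seq(sampleReading))
// After a batch interval or two, the scored rows can be inspected with:
//   %sql SELECT * FROM power_plant_predictions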