ETL With Apache Spark

Jowanza Joseph

@jowanza

About Me

  • Software Engineer at One Click Retail
  • Java Virtual Machine (Scala, Java)
  • Distributed Systems
  • Husband and Father

Agenda

  • A Brief History of ETL
  • A Brief History of Large Data
  • The Value of Scala In This Ecosystem
  • The Lambda Architecture
  • The Kappa Architecture
  • Spark For ETL (Code + Demo)
  • Questions

A Brief History of ETL

Extract

Transform

Load

Slow Data

Why?

Data is more valuable when it's mixed with other data.

A Brief History Of Large Data

How do you store petabyte scale data?

Glaciers

Distributed File System

The Data Lake

Why Scala?

Types

case class Datum(product: String, category: String, revenue: Int)

// toDF() on an RDD of case classes needs the SQLContext implicits in scope
import sqlContext.implicits._

val data = sc.textFile("./data.txt")
  .map(_.split(","))
  .map(r => Datum(r(0), r(1), r(2).toInt))
  .toDF()
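
A quick usage sketch: once the fields are typed and named by the case class, they can be referenced directly.

// Aggregate revenue per category on the DataFrame built above
data.groupBy("category")
  .sum("revenue")
  .show()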

Functional Programming
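
A minimal sketch of that functional style applied to the same data.txt (the field layout is an assumption):

// Pure, composable transformations: each step returns a new collection
val revenueByCategory = sc.textFile("./data.txt")
  .map(_.split(","))
  .filter(_.length == 3)
  .map(r => (r(1), r(2).toInt))
  .reduceByKey(_ + _)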

JVM

Big Data Architecture

Lambda Architecture

Kappa Architecture

Apache Spark™ is a fast and general engine for large-scale data processing.

Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.

import org.apache.spark.{SparkConf, SparkContext}

// Local-mode configuration: 8 threads with 10g of executor and driver memory
val conf = new SparkConf().setAppName("")
  .setMaster("local[8]")
  .set("spark.executor.memory", "10g")
  .set("spark.driver.memory", "10g")

val sc = new SparkContext(conf)

import org.apache.spark.sql.SQLContext

val conf = new SparkConf().setAppName("MY_APP_NAME").setMaster("MASTER")

val sc = new SparkContext(conf)

// SQLContext is the entry point for DataFrames and external data sources
val sqlContext = new SQLContext(sc)

val data = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:mysql://<HOST>:3306/<database>")
  .option("user", "<USERNAME>")
  .option("password", "<PASSWORD>")
  // dbtable takes a table name or a parenthesized subquery
  .option("dbtable", "MYSQL_QUERY")
  .load()

DataFrame
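
A rough sketch of the kind of DataFrame cleanup that follows a load like the one above (the column names here are made up):

import org.apache.spark.sql.functions._

// Hypothetical cleanup: drop rows missing a product, trim whitespace,
// and derive a revenue column from price and quantity
val cleaned = data
  .na.drop(Seq("product"))
  .withColumn("product", trim(col("product")))
  .withColumn("revenue", col("price") * col("quantity"))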

Fix Ugly Data

// Same local-mode config as before, plus the Couchbase connector settings:
// the bucket's password and the address of a Couchbase node
val conf = new SparkConf().setAppName("")
  .setMaster("local[8]")
  .set("spark.executor.memory", "10g")
  .set("spark.driver.memory", "10g")
  .set("com.couchbase.bucket.<>", "password")
  .set("com.couchbase.nodes", "node_ip")


import com.couchbase.spark._
import com.couchbase.client.java.query.N1qlQuery

// N1QL query for the fields we need, skipping documents without an upid or UPC
val query1 = "SELECT search_upid as search_upid, brandName, categoryNamePath as categoryName, " +
  "title, upcs[0] as upc from `indix-items_staging` " +
  "where search_upid is NOT NULL and upcs[0] IS NOT NULL"

// Run the query, collect the results to the driver, and keep the raw JSON strings
val dats = sc.couchbaseQuery(N1qlQuery.simple(query1))
  .collect()
  .map(x => x.value.toString)


import spray.json._

// Target shape for each Couchbase document
case class NewInterpret(categoryName: String, upc: String, brandName: String, title: String, search_upid: String)

// spray-json protocol so the raw JSON strings can be converted into the case class
object NewInterpretProtocol extends DefaultJsonProtocol {
  implicit val interpretFormat = jsonFormat5(NewInterpret)
}

import NewInterpretProtocol._

// Parse each JSON string into a NewInterpret
val testValue = dats.map(x => x.parseJson.convertTo[NewInterpret])

// Back into Spark as a DataFrame (sqlContext.implicits._ must be in scope for toDF)
val endGame = sc.parallelize(testValue).toDF()

val rawData = sqlContext.read
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://<>")
  .option("tempdir", "s3n://<>")
  .option("query", "select...")
  .load()
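
The load step can run through the same package; a sketch of writing a result DataFrame back to Redshift (the table name is a placeholder):

// Write the transformed DataFrame back out through an S3 staging directory
endGame.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://<>")
  .option("dbtable", "output_table")
  .option("tempdir", "s3n://<>")
  .mode("error")
  .save()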

Handy For ETL

UDF

val dataset = Seq((0, "hello"), (1, "world")).toDF("id", "text")

// Define a regular Scala function
val upper: String => String = _.toUpperCase

// Define a UDF that wraps the upper Scala function defined above
// You could also define the function in place, i.e. inside udf
// but separating Scala functions from Spark SQL's UDFs allows for easier testing
import org.apache.spark.sql.functions.udf
val upperUDF = udf(upper)

// Apply the UDF to change the source dataset
scala> dataset.withColumn("upper", upperUDF('text)).show
+---+-----+-----+
| id| text|upper|
+---+-----+-----+
|  0|hello|HELLO|
|  1|world|WORLD|
+---+-----+-----+
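
The same function can also be registered for Spark SQL; a small sketch (the temp table name is made up):

// Register the function so it can be called from SQL as well
sqlContext.udf.register("toUpper", upper)
dataset.registerTempTable("items")
sqlContext.sql("SELECT id, toUpper(text) AS upper FROM items").show()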

Pattern Matching

import scala.util.{Try, Success, Failure}
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.DataFrame

// Assumed container for the fitted model's intercept and slope
case class RegressionReturns(intercept: Double, slope: Double)

def getRegressionReturns(data: DataFrame): RegressionReturns = {
  // Try plus pattern matching: a failed fit falls through to a zeroed result instead of throwing
  Try(new LinearRegression().setMaxIter(100).setRegParam(0.3).fit(data)) match {
    case Success(model) => RegressionReturns(model.intercept, model.coefficients(0))
    case Failure(_) => RegressionReturns(0, 0)
  }
}

Spark Streaming
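
A minimal streaming sketch, assuming the same SparkContext and a socket source on a placeholder host and port:

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Micro-batches every 10 seconds; the same transformations apply to each batch
val ssc = new StreamingContext(sc, Seconds(10))
ssc.socketTextStream("<HOST>", 9999)
  .map(_.split(","))
  .filter(_.length == 3)
  .print()

ssc.start()
ssc.awaitTermination()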

Questions?
