ETL With Apache Spark

Jowanza Joseph

@jowanza

About Me

  • Software Engineer at One Click Retail
  • Java Virtual Machine (Scala, Java)
  • Distributed Systems
  • Husband and Father

Agenda

  • A Brief History of ETL
  • A Brief History of Large Data
  • The Value of Scala In This Ecosystem
  • The Lambda Architecture
  • The Kappa Architecture
  • Spark For ETL (Code + Demo)
  • Questions

A Brief History of ETL

Extract

Transform

Load

Slow Data

Why?

Data is more valuable when it's mixed with other data.

A Brief History Of Large Data

How do you store petabyte scale data?

Glaciers

Distributed File System

The Data Lake

Why Scala?

Types

case class Datum(product: String, category: String, revenue: Int)

// toDF() on an RDD of case classes needs the SQLContext implicits in scope
import sqlContext.implicits._

val data = sc.textFile("./data.txt")
  .map(_.split(","))
  .map(r => Datum(r(0), r(1), r(2).toInt))
  .toDF()
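
A quick usage sketch: once the fields are typed and named by the case class, they can be referenced directly.

// Aggregate revenue per category on the DataFrame built above
data.groupBy("category")
  .sum("revenue")
  .show()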

Functional Programming
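
A minimal sketch of that functional style applied to the same data.txt (the field layout is an assumption):

// Pure, composable transformations: each step returns a new collection
val revenueByCategory = sc.textFile("./data.txt")
  .map(_.split(","))
  .filter(_.length == 3)
  .map(r => (r(1), r(2).toInt))
  .reduceByKey(_ + _)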

JVM

Big Data Architecture

Lambda Architecture

Kappa Architecture

Apache Spark™ is a fast and general engine for large-scale data processing.

Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.

import org.apache.spark.{SparkConf, SparkContext}

// Local-mode configuration: 8 threads with 10g of executor and driver memory
val conf = new SparkConf().setAppName("")
  .setMaster("local[8]")
  .set("spark.executor.memory", "10g")
  .set("spark.driver.memory", "10g")

val sc = new SparkContext(conf)

import org.apache.spark.sql.SQLContext

val conf = new SparkConf().setAppName("MY_APP_NAME").setMaster("MASTER")

val sc = new SparkContext(conf)

// SQLContext is the entry point for DataFrames and external data sources
val sqlContext = new SQLContext(sc)

val data = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:mysql://<HOST>:3306/<database>")
  .option("user", "<USERNAME>")
  .option("password", "<PASSWORD>")
  // dbtable takes a table name or a parenthesized subquery
  .option("dbtable", "MYSQL_QUERY")
  .load()

DataFrame
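
A rough sketch of the kind of DataFrame cleanup that follows a load like the one above (the column names here are made up):

import org.apache.spark.sql.functions._

// Hypothetical cleanup: drop rows missing a product, trim whitespace,
// and derive a revenue column from price and quantity
val cleaned = data
  .na.drop(Seq("product"))
  .withColumn("product", trim(col("product")))
  .withColumn("revenue", col("price") * col("quantity"))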

Fix Ugly Data

// Same local-mode config as before, plus the Couchbase connector settings:
// the bucket's password and the address of a Couchbase node
val conf = new SparkConf().setAppName("")
  .setMaster("local[8]")
  .set("spark.executor.memory", "10g")
  .set("spark.driver.memory", "10g")
  .set("com.couchbase.bucket.<>", "password")
  .set("com.couchbase.nodes", "node_ip")


import com.couchbase.spark._
import com.couchbase.client.java.query.N1qlQuery

// N1QL query for the fields we need, skipping documents without an upid or UPC
val query1 = "SELECT search_upid as search_upid, brandName, categoryNamePath as categoryName, " +
  "title, upcs[0] as upc from `indix-items_staging` " +
  "where search_upid is NOT NULL and upcs[0] IS NOT NULL"

// Run the query, collect the results to the driver, and keep the raw JSON strings
val dats = sc.couchbaseQuery(N1qlQuery.simple(query1))
  .collect()
  .map(x => x.value.toString)


import spray.json._

// Target shape for each Couchbase document
case class NewInterpret(categoryName: String, upc: String, brandName: String, title: String, search_upid: String)

// spray-json protocol so the raw JSON strings can be converted into the case class
object NewInterpretProtocol extends DefaultJsonProtocol {
  implicit val interpretFormat = jsonFormat5(NewInterpret)
}

import NewInterpretProtocol._

// Parse each JSON string into a NewInterpret
val testValue = dats.map(x => x.parseJson.convertTo[NewInterpret])

// Back into Spark as a DataFrame (sqlContext.implicits._ must be in scope for toDF)
val endGame = sc.parallelize(testValue).toDF()

val rawData = sqlContext.read
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://<>")
  .option("tempdir", "s3n://<>")
  .option("query", "select...")
  .load()
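
The load step can run through the same package; a sketch of writing a result DataFrame back to Redshift (the table name is a placeholder):

// Write the transformed DataFrame back out through an S3 staging directory
endGame.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://<>")
  .option("dbtable", "output_table")
  .option("tempdir", "s3n://<>")
  .mode("error")
  .save()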

Handy For ETL

UDF

val dataset = Seq((0, "hello"), (1, "world")).toDF("id", "text")

// Define a regular Scala function
val upper: String => String = _.toUpperCase

// Define a UDF that wraps the upper Scala function defined above
// You could also define the function in place, i.e. inside udf
// but separating Scala functions from Spark SQL's UDFs allows for easier testing
import org.apache.spark.sql.functions.udf
val upperUDF = udf(upper)

// Apply the UDF to change the source dataset
scala> dataset.withColumn("upper", upperUDF('text)).show
+---+-----+-----+
| id| text|upper|
+---+-----+-----+
|  0|hello|HELLO|
|  1|world|WORLD|
+---+-----+-----+
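
The same function can also be registered for Spark SQL; a small sketch (the temp table name is made up):

// Register the function so it can be called from SQL as well
sqlContext.udf.register("toUpper", upper)
dataset.registerTempTable("items")
sqlContext.sql("SELECT id, toUpper(text) AS upper FROM items").show()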

Pattern Matching

import scala.util.{Try, Success, Failure}
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.DataFrame

// Assumed container for the fitted model's intercept and slope
case class RegressionReturns(intercept: Double, slope: Double)

def getRegressionReturns(data: DataFrame): RegressionReturns = {
  // Try plus pattern matching: a failed fit falls through to a zeroed result instead of throwing
  Try(new LinearRegression().setMaxIter(100).setRegParam(0.3).fit(data)) match {
    case Success(model) => RegressionReturns(model.intercept, model.coefficients(0))
    case Failure(_) => RegressionReturns(0, 0)
  }
}

Spark Streaming
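
A minimal streaming sketch, assuming the same SparkContext and a socket source on a placeholder host and port:

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Micro-batches every 10 seconds; the same transformations apply to each batch
val ssc = new StreamingContext(sc, Seconds(10))
ssc.socketTextStream("<HOST>", 9999)
  .map(_.split(","))
  .filter(_.length == 3)
  .print()

ssc.start()
ssc.awaitTermination()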

Questions?
