Apache Spark overview
MapReduce reprise
Welcome to Scala
trait MapReduceAlgorithm[K1, V1, // Generics
                         K2, V2,
                         K3, V3] {
  // One key-value pair to list of key-value pairs
  // (K, V) means pair
  def mrMap(key: K1, value: V1): Seq[(K2, V2)]
  // ....Magic in between...
  // list of values to list of key-value pairs
  def mrReduce(key: K2, values: Seq[V2]): Seq[(K3, V3)]
  // ....
}
WordCount
class WordCountMapReduce
  extends MapReduceAlgorithm[Long, String, String, Int, String, Int] {
  def mrMap(key: Long, line: String) =
    line
      .split("\\W+")       // Seq[String]
      .map { x => (x, 1) } // Seq[(String, Int)]
  def mrReduce(word: String, counts: Seq[Int]) =
    Seq(
      (word, counts.sum)   // (String, Int)
    )                      // Seq[(String, Int)]
}
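The "magic in between" is the shuffle: the framework gathers every pair the mappers emit and groups the values by key before calling mrReduce. A minimal single-machine sketch of that step (runLocally is a hypothetical helper, named here only for illustration):
def runLocally[K1, V1, K2, V2, K3, V3](
    algo: MapReduceAlgorithm[K1, V1, K2, V2, K3, V3],
    input: Seq[(K1, V1)]): Seq[(K3, V3)] = {
  val mapped = input.flatMap { case (k, v) => algo.mrMap(k, v) }   // map phase
  val grouped = mapped.groupBy(_._1)                               // the "magic": group by key
  grouped.toSeq.flatMap { case (k, kvs) => algo.mrReduce(k, kvs.map(_._2)) } // reduce phase
}
// e.g. runLocally(new WordCountMapReduce, Seq((0L, "to be or not to be")))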
MR flaws
- Counterintuitive programming model
- Heavy disk I/O between stages
- Imperative even at a high level
Aside: Tachyon
Scala functional collections
Lingua franca of functional languages
Using higher-order functions to be declarative
val numbers = (1 to 1000) // 1, 2, 3, 4, .... 1000
val twice = numbers map { x => x * 2 } // 2, 4, 6, ... 2000
val even = numbers filter { x => x % 2 == 0 } // 2, 4, 6, ... 1000
val sum = numbers reduce { (x, y) => x + y } // 500500
val craziness = numbers flatMap { x => (1 to x) } // 1, 1, 2, 1, 2, 3, ... 1000
Applicable to a number of collection types (see the sketch after this list):
- Option (a list of 0 or 1 element)
- singly-linked list
- set
- array
- queue
- par* (parallel collections)
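The same combinators carry over unchanged; a small sketch (values picked only for illustration, and .par is built into pre-2.13 Scala - later versions need the scala-parallel-collections module):
val maybe: Option[Int] = Some(21)
val doubled = maybe map { x => x * 2 }       // Some(42)
val filtered = maybe filter { x => x > 100 } // None
val parSum = (1 to 1000).par.map(_ * 2).sum  // 1001000, computed on several cores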
Spark RDD
Quiz: what is an “RDD”?
- distributed collection of objects on disk
- distributed collection of objects in memory
- distributed collection of objects in Cassandra
Resilient Distributed Dataset
trait RDD[T] {
  // Public API
  def filter(p: T => Boolean): RDD[T]
  def map[B](m: T => B): RDD[B]
  def flatMap[B](m: T => Seq[B]): RDD[B]
  // ...
  // Implementation details
}
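Two common ways to get an RDD in the first place, sketched here with a placeholder file path:
val sc: SparkContext // provided by the Spark runtime
val fromFile = sc.textFile("hdfs://some-input.txt") // RDD[String], one element per line
val fromMemory = sc.parallelize(1 to 1000)          // RDD[Int] built from a local collection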
Transforms
trait RDD[A] {
  def map[B](m: A => B): RDD[B]
  def filter(p: A => Boolean): RDD[A]
  def flatMap[B](m: A => Seq[B]): RDD[B]
  def sample: RDD[A]
  def union(other: RDD[A]): RDD[A]
  def intersection(other: RDD[A]): RDD[A]
  def distinct: RDD[A]
  def pipe(command: String): RDD[String]
  def cartesian[B](other: RDD[B]): RDD[(A, B)]
}
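Transforms are lazy: each call only describes a new RDD, and nothing executes until an action runs. A short sketch of chaining them (assuming a SparkContext sc is in scope):
val a = sc.parallelize(1 to 10)
val b = sc.parallelize(5 to 15)
val evens = a.filter(_ % 2 == 0)        // lazy - nothing runs yet
val shared = a.intersection(b).distinct // still lazy
val everything = a.union(b)             // lazy as well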
Pair Transforms
trait PairRDD[K, V] extends RDD[(K, V)] {
  def groupByKey: PairRDD[K, Seq[V]]
  def reduceByKey(f: (V, V) => V): PairRDD[K, V]
  def aggregateByKey[W](zero: W, m: (W, V) => W, p: (W, W) => W): PairRDD[K, W]
  def join[W](other: PairRDD[K, W]): PairRDD[K, (V, W)]
  def cogroup[W](other: PairRDD[K, W]): PairRDD[K, (Seq[V], Seq[W])]
}
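A quick sketch of the pair operations on tiny in-memory data (assuming a SparkContext sc; the keys and numbers are illustrative):
val sales = sc.parallelize(Seq(("apples", 3), ("pears", 2), ("apples", 5)))
val prices = sc.parallelize(Seq(("apples", 0.5), ("pears", 0.8)))
val totals = sales.reduceByKey(_ + _) // ("apples", 8), ("pears", 2)
val joined = totals.join(prices)      // ("apples", (8, 0.5)), ("pears", (2, 0.8))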
Actions
trait RDD[A] {
  def reduce(f: (A, A) => A): A
  def collect: Seq[A]
  def count: Long
  def first: A
  def take(n: Int): Seq[A]
  def takeSample: Seq[A]
  def takeOrdered: Seq[A]
  def saveAsTextFile(file: String)
  def saveAsSequenceFile(file: String)
  def saveAsObjectFile(file: String)
}
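Actions are what actually trigger a job and bring a result back to the driver; a brief sketch (assuming a SparkContext sc):
val numbers = sc.parallelize(1 to 1000)
numbers.count          // 1000 - runs the job
numbers.take(3)        // Array(1, 2, 3)
numbers.reduce(_ + _)  // 500500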
Word Count
val sc: SparkContext // This is our cluster
val linesRDD = sc.textFile("hdfs://file.txt")
val counts = linesRDD.flatMap { x => x.split("\\W+") }
  // I would merge flatMap and map in real life
  .map { x => (x, 1) }
  .reduceByKey { (x, y) => x + y }
// counts is still an RDD - no job has run yet
counts.saveAsTextFile("hdfs://counts.txt") // Action
counts.collect // Another action - may not fit into driver memory
Caching
val words = linesRDD.flatMap(x => x.split("\\W+")).cache
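Caching pays off when one RDD feeds several actions; a sketch reusing linesRDD from the Word Count slide:
val total = words.count                  // first action computes the words and caches them
val distinctTotal = words.distinct.count // later actions reuse the cached partitions instead of re-reading the file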
Under the hood
SQL
Streaming
Machine learning
Graphs
By Yura Taras