Gleb Kanterov
@kanterov
gleb@kanterov.ru
"simple things should be simple, complex things should be possible"
Alan Kay
class RDD[A] {
  def map[B](f: A => B): RDD[B]
  def filter(f: A => Boolean): RDD[A]
  def union(other: RDD[A]): RDD[A]
  def groupBy[B](f: A => B): RDD[(B, Seq[A])]
  def collect(): Array[A]
}
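A minimal sketch of the RDD API above, assuming Spark is on the classpath and a local master; names like `sc` and `nums` are illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[*]").setAppName("rdd-demo"))

    val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))

    val doubledEvens = nums
      .filter(_ % 2 == 0)  // keep even numbers: 2, 4
      .map(_ * 2)          // double them: 4, 8
      .collect()           // materialize on the driver as Array(4, 8)

    println(doubledEvens.mkString(","))
    sc.stop()
  }
}
```

Both `f: A => B` arguments are ordinary Scala functions, so the compiler checks the element types end to end.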
class DataFrame {
  def select(cols: Column*): DataFrame
  def filter(condition: Column): DataFrame
  def col(name: String): Column
  def collect(): Array[Row]
}
// from org.apache.spark.sql.functions
def min(col: Column): Column
def avg(col: Column): Column
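A sketch of the untyped DataFrame API, using the `SparkSession` entry point; the column names `name` and `age` are illustrative. Note that everything is a `Column` over `Row`s, so a misspelled column name or a type mismatch only fails at runtime:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, min}

object DataFrameExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("df-demo").getOrCreate()
    import spark.implicits._

    val people = Seq(("alice", 30), ("bob", 25)).toDF("name", "age")

    people
      .filter(people.col("age") > 26)  // checked at runtime, not compile time
      .select(people.col("name"))
      .show()

    // min/avg come from org.apache.spark.sql.functions
    people.agg(min(people.col("age")), avg(people.col("age"))).show()
    spark.stop()
  }
}
```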
class Dataset[A: Encoder] {
  def select[U1: Encoder, U2: Encoder](
    c1: TypedColumn[A, U1],
    c2: TypedColumn[A, U2]
  ): Dataset[(U1, U2)]
  def map[U: Encoder](
    func: A => U
  ): Dataset[U]
  def filter(f: A => Boolean): Dataset[A]
}
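A sketch of the typed Dataset API, assuming a hypothetical `Person` case class; `select` with `TypedColumn`s yields a `Dataset[(String, Long)]` rather than a `DataFrame` of `Row`s, and `map`/`filter` take plain Scala functions:

```scala
import org.apache.spark.sql.SparkSession

object DatasetExample {
  case class Person(name: String, age: Long)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("ds-demo").getOrCreate()
    import spark.implicits._

    val people = Seq(Person("alice", 30L), Person("bob", 25L)).toDS()

    // TypedColumns carry the result type: Dataset[(String, Long)]
    val pairs = people.select($"name".as[String], $"age".as[Long])

    // ordinary Scala functions, checked at compile time
    val adults = people.filter(_.age >= 18).map(_.name)
    adults.show()
    spark.stop()
  }
}
```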
Provides a more typeful experience
working with Apache Spark
github.com/adelbertc/frameless
val logs = ctx
  .read.format("json")
  //.open("s3://logs")
  .stream("s3://logs")

logs
  .groupBy(logs("user_id"))
  .agg(sum(logs("time")))
  .write.format("jdbc")
  //.save("jdbc:mysql://...")
  .stream("jdbc:mysql://...")
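The slide contrasts a batch job (commented lines) with its streaming twin using a pre-release API. In released Spark the entry points are `readStream`/`writeStream`; a hedged sketch of the same aggregation, with the console sink standing in for JDBC and an assumed `user_id`/`time` schema:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

object StreamingExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("stream-demo").getOrCreate()

    val logs = spark.readStream.format("json")
      .schema("user_id STRING, time LONG")  // streaming sources need an explicit schema
      .load("s3://logs")

    val totals = logs
      .groupBy(logs("user_id"))
      .agg(sum(logs("time")))

    val query = totals.writeStream
      .outputMode("complete")  // aggregations require complete/update output mode
      .format("console")       // a JDBC sink would go through foreachBatch instead
      .start()

    query.awaitTermination()
  }
}
```

The batch and streaming versions share the same `groupBy`/`agg` body; only the source and sink change.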
data-driven
ad-hoc research
models
classifiers
predictors
number crunching
KPI