Apache Spark

 

Gleb Kanterov

@kanterov

gleb@kanterov.ru

 

"simple things should be simple, complex things should be possible"

 

Alan Kay

 
class RDD[A] {
  def map[B](f: A => B): RDD[B]

  def filter(f: A => Boolean): RDD[A]

  def union(other: RDD[A]): RDD[A]

  def groupBy[B](f: A => B): RDD[(B, Seq[A])]

  def collect(): Array[A]
}
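
A minimal sketch of how these primitives compose, assuming an in-scope SparkContext `sc` and placeholder paths:

import org.apache.spark.rdd.RDD

// Combine two days of logs, keep only error lines,
// and group them by their first token (e.g. a host name).
val events: RDD[String] =
  sc.textFile("s3://logs/2016-05-01")
    .union(sc.textFile("s3://logs/2016-05-02"))

val errorsByHost = events
  .filter(_.contains("ERROR"))
  .groupBy(_.split(" ")(0))

errorsByHost.collect().foreach { case (host, lines) =>
  println(s"$host: ${lines.size} errors")
}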
class DataFrame {
  def select(cols: Column*): DataFrame

  def filter(condition: Column): DataFrame

  def col(name: String): Column

  def collect(): Array[Row]
}

def min(col: Column): Column

def avg(col: Column): Column
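
The same style of query with the untyped DataFrame API; `ctx` is an assumed SQLContext (SparkSession in later releases) and the path a placeholder. Column names and types are only checked at runtime:

import org.apache.spark.sql.functions.{avg, min}

val logs = ctx.read.format("json").load("s3://logs")

val stats = logs
  .filter(logs.col("time") > 0)
  .select(min(logs.col("time")), avg(logs.col("time")))

stats.collect()   // Array[Row]; a typo in "time" only fails at runtime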
class Dataset[A: Encoder] {
  def select[U1: Encoder, U2: Encoder](
    c1: TypedColumn[A, U1],
    c2: TypedColumn[A, U2]
  ): Dataset[(U1, U2)]

  def map[U: Encoder](
    func: A => U
  ): Dataset[U]

  def filter(f: A => Boolean): Dataset[A]
}
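
A sketch of the typed API in use; `spark`, the `LogEntry` case class, and the path are assumptions for illustration:

import org.apache.spark.sql.{Dataset, SparkSession}

case class LogEntry(userId: String, time: Long)

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._   // derives Encoders for case classes and tuples

val logs: Dataset[LogEntry] =
  spark.read.json("s3://logs").as[LogEntry]

// filter and map take plain Scala functions over LogEntry
val slow: Dataset[LogEntry] = logs.filter(_.time > 1000)

// typed select keeps the element types: Dataset[(String, Long)]
val pairs = slow.select($"userId".as[String], $"time".as[Long])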

Frameless

Provides a more typeful experience for working with Apache Spark:

  • Statically derived Encoders
  • Columns are safely referenced
  • Mirrors value-level computation at the type level for Dataset methods

github.com/adelbertc/frameless
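
A rough sketch of frameless's TypedDataset, loosely following its README; the exact API differs between versions, and `Log` and `rawLogs` are assumptions here:

import frameless.TypedDataset

case class Log(userId: String, time: Long)

// TypedEncoder[Log] is derived at compile time; no runtime reflection.
val logs: TypedDataset[Log] = TypedDataset.create(rawLogs)   // rawLogs: Dataset[Log] or Seq[Log]

val times = logs.select(logs('time))   // compiles: Log has a `time` field
// logs.select(logs('tmie))            // does not compile: no such column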

 

Structured Streaming

 
val logs = ctx
  .read.format("json")
//.load("s3://logs")            // batch: read a static dataset once
  .stream("s3://logs")          // streaming: same query over an unbounded source

logs
  .groupBy(logs("user_id"))
  .agg(sum(logs("time")))
  .write.format("jdbc")
//.save("jdbc:mysql://...")     // batch: write the result once
  .stream("jdbc:mysql://...")   // streaming: keep the result continuously updated
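
Before release these entry points were renamed: since Spark 2.0, streaming reads and writes go through readStream / writeStream. A sketch of the same pipeline against that API, with a console sink standing in for JDBC (vanilla Spark ships no built-in JDBC streaming sink) and a placeholder path and schema:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum
import org.apache.spark.sql.types.{LongType, StringType, StructType}

val spark = SparkSession.builder().getOrCreate()

// Streaming JSON sources need an explicit schema.
val logSchema = new StructType()
  .add("user_id", StringType)
  .add("time", LongType)

val logs = spark.readStream
  .format("json")
  .schema(logSchema)
  .load("s3://logs")

val totals = logs
  .groupBy("user_id")
  .agg(sum("time").as("total_time"))

totals.writeStream
  .outputMode("complete")   // aggregations require complete/update output mode
  .format("console")        // stand-in sink for the JDBC target above
  .start()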

Analytics

 
  • data-driven

  • ad hoc research

  • models

    • classifiers

    • predictors

  • number crunching

  • KPI

 
  • Google Cloud
    • Storage
    • BigQuery
    • Compute Engine
  • Luigi
  • Avro
  • Spark, Dataflow, BigQuery
  • R, dplyr, ggplot2
  • Tableau, Excel, Shiny, dashboards
