Apache Spark

 

Gleb Kanterov

@kanterov

gleb@kanterov.ru

 

"simple things should be simple, complex things should be possible"

 

Alan Kay

 
class RDD[A] {
  def map[B](f: A => B): RDD[B]

  def filter(f: A => Boolean): RDD[A]

  def union(other: RDD[A]): RDD[A]

  def groupBy[B](f: A => B): RDD[(B, Seq[A])]

  def collect(): Array[A]
}
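
A minimal sketch of how these primitives compose, assuming an in-scope SparkContext `sc` and placeholder paths:

import org.apache.spark.rdd.RDD

// Combine two days of logs, keep only error lines,
// and group them by their first token (e.g. a host name).
val events: RDD[String] =
  sc.textFile("s3://logs/2016-05-01")
    .union(sc.textFile("s3://logs/2016-05-02"))

val errorsByHost = events
  .filter(_.contains("ERROR"))
  .groupBy(_.split(" ")(0))

errorsByHost.collect().foreach { case (host, lines) =>
  println(s"$host: ${lines.size} errors")
}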
class DataFrame {
  def select(cols: Column*): DataFrame

  def filter(condition: Column): DataFrame

  def col(name: String): Column

  def collect(): Array[Row]
}

def min(col: Column): Column

def avg(col: Column): Column
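
The same style of query with the untyped DataFrame API; `ctx` is an assumed SQLContext (SparkSession in later releases) and the path a placeholder. Column names and types are only checked at runtime:

import org.apache.spark.sql.functions.{avg, min}

val logs = ctx.read.format("json").load("s3://logs")

val stats = logs
  .filter(logs.col("time") > 0)
  .select(min(logs.col("time")), avg(logs.col("time")))

stats.collect()   // Array[Row]; a typo in "time" only fails at runtime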
class Dataset[A: Encoder] {
  def select[U1: Encoder, U2: Encoder](
    c1: TypedColumn[A, U1],
    c2: TypedColumn[A, U2]
  ): Dataset[(U1, U2)]

  def map[U: Encoder](
    func: A => U
  ): Dataset[U]

  def filter(f: A => Boolean): Dataset[A]
}
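
A sketch of the typed API in use; `spark`, the `LogEntry` case class, and the path are assumptions for illustration:

import org.apache.spark.sql.{Dataset, SparkSession}

case class LogEntry(userId: String, time: Long)

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._   // derives Encoders for case classes and tuples

val logs: Dataset[LogEntry] =
  spark.read.json("s3://logs").as[LogEntry]

// filter and map take plain Scala functions over LogEntry
val slow: Dataset[LogEntry] = logs.filter(_.time > 1000)

// typed select keeps the element types: Dataset[(String, Long)]
val pairs = slow.select($"userId".as[String], $"time".as[Long])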

Frameless

Provides a more typeful experience for working with Apache Spark:

  • Statically derived Encoders
  • Columns are safely referenced
  • Mirrors value-level computation at the type level for Dataset methods

github.com/adelbertc/frameless
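
A rough sketch of frameless's TypedDataset, loosely following its README; the exact API differs between versions, and `Log` and `rawLogs` are assumptions here:

import frameless.TypedDataset

case class Log(userId: String, time: Long)

// TypedEncoder[Log] is derived at compile time; no runtime reflection.
val logs: TypedDataset[Log] = TypedDataset.create(rawLogs)   // rawLogs: Dataset[Log] or Seq[Log]

val times = logs.select(logs('time))   // compiles: Log has a `time` field
// logs.select(logs('tmie))            // does not compile: no such column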

 

Structured Streaming

 
val logs = ctx
  .read.format("json")
//.load("s3://logs")            // batch: read a static dataset once
  .stream("s3://logs")          // streaming: same query over an unbounded source

logs
  .groupBy(logs("user_id"))
  .agg(sum(logs("time")))
  .write.format("jdbc")
//.save("jdbc:mysql://...")     // batch: write the result once
  .stream("jdbc:mysql://...")   // streaming: keep the result continuously updated
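
Before release these entry points were renamed: since Spark 2.0, streaming reads and writes go through readStream / writeStream. A sketch of the same pipeline against that API, with a console sink standing in for JDBC (vanilla Spark ships no built-in JDBC streaming sink) and a placeholder path and schema:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum
import org.apache.spark.sql.types.{LongType, StringType, StructType}

val spark = SparkSession.builder().getOrCreate()

// Streaming JSON sources need an explicit schema.
val logSchema = new StructType()
  .add("user_id", StringType)
  .add("time", LongType)

val logs = spark.readStream
  .format("json")
  .schema(logSchema)
  .load("s3://logs")

val totals = logs
  .groupBy("user_id")
  .agg(sum("time").as("total_time"))

totals.writeStream
  .outputMode("complete")   // aggregations require complete/update output mode
  .format("console")        // stand-in sink for the JDBC target above
  .start()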

Analytics

 
  • data-driven

  • ad hoc research

  • models

    • classifiers

    • predictors

  • number crunching

  • KPI

 
  • Google Cloud
    • Storage
    • BigQuery
    • Compute Engine
  • Luigi
  • Avro
  • Spark, Dataflow, BigQuery
  • R, dplyr, ggplot2
  • Tableau, Excel, Shiny, dashboards
