Apache Spark


Gleb Kanterov




"Simple things should be simple, complex things should be possible"


Alan Kay

class RDD[A] {
  def map[B](f: A => B): RDD[B]

  def filter(f: A => Boolean): RDD[A]

  def union(other: RDD[A]): RDD[A]

  def groupBy[B](f: A => B): RDD[(B, Seq[A])]

  def collect(): Array[A]
}

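
A minimal sketch of how the RDD operations above chain together. It assumes Spark is on the classpath and a local master is acceptable; the object name `RddSketch` and the sample numbers are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

object RddSketch {
  // Doubles the numbers, keeps those above 4, and groups them by divisibility by 4.
  def run(spark: SparkSession): Map[Boolean, List[Int]] = {
    val sc = spark.sparkContext
    val numbers = sc.parallelize(Seq(1, 2, 3, 4))
    val more = sc.parallelize(Seq(5, 6))

    numbers
      .union(more)          // 1, 2, 3, 4, 5, 6
      .map(_ * 2)           // 2, 4, 6, 8, 10, 12
      .filter(_ > 4)        // 6, 8, 10, 12
      .groupBy(_ % 4 == 0)  // false -> 6, 10; true -> 8, 12
      .collect()            // nothing runs on the cluster until this action
      .map { case (key, values) => key -> values.toList.sorted }
      .toMap
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("rdd-sketch").getOrCreate()
    try println(run(spark))
    finally spark.stop()
  }
}
```

The transformations are lazy; only `collect()` triggers the computation and brings the result back to the driver.
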
class DataFrame {
  def select(cols: Column*): DataFrame

  def filter(condition: Column): DataFrame

  def col(name: String): Column

  def collect(): Array[Row]
}

def min(col: Column): Column

def avg(col: Column): Column
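
A short sketch of the DataFrame methods above together with `min` and `avg`, assuming Spark is on the classpath. The object name `DataFrameSketch`, the column names and the sample rows are illustrative only.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{avg, min}

object DataFrameSketch {
  // Filters rows with value > 1, then aggregates the remaining values.
  def run(spark: SparkSession): Row = {
    import spark.implicits._

    val df = Seq(("a", 1), ("a", 3), ("b", 5)).toDF("key", "value")
    val filtered = df.filter(df.col("value") > 1) // keeps ("a", 3) and ("b", 5)

    filtered
      .select(min(filtered.col("value")), avg(filtered.col("value")))
      .collect()
      .head // a single Row holding the minimum and the average
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("dataframe-sketch").getOrCreate()
    try println(run(spark))
    finally spark.stop()
  }
}
```

Columns are referenced by name, so a misspelled name such as `df.col("valeu")` compiles and only fails at runtime with an `AnalysisException`.
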
class Dataset[A: Encoder] {
  def select[U1: Encoder, U2: Encoder](
    c1: TypedColumn[A, U1],
    c2: TypedColumn[A, U2]
  ): Dataset[(U1, U2)]

  def map[U: Encoder](
    func: A => U
  ): Dataset[U]

  def filter(f: A => Boolean): Dataset[A]
}
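
A small sketch of the typed Dataset methods above, assuming Spark is on the classpath. The `Event` case class and the object name `DatasetSketch` are hypothetical and exist only for this example.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type; its Encoder is derived via spark.implicits._
final case class Event(user: String, durationMs: Long)

object DatasetSketch {
  // Keeps events longer than one second and converts the duration to whole seconds.
  def run(spark: SparkSession): Array[(String, Long)] = {
    import spark.implicits._

    val events = Seq(Event("ann", 1200L), Event("bob", 300L)).toDS()

    events
      .filter(_.durationMs > 1000L)                       // plain Scala predicate on Event
      .select($"user".as[String], $"durationMs".as[Long]) // typed columns give Dataset[(String, Long)]
      .map { case (user, ms) => (user, ms / 1000L) }
      .collect()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("dataset-sketch").getOrCreate()
    try run(spark).foreach(println)
    finally spark.stop()
  }
}
```

`filter` and `map` are checked by the compiler against `Event`, but the column names passed to `select` are still strings that are resolved at runtime.
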



Provide a more typeful experience when working with Apache Spark

  • Statically derived Encoders
  • Columns are safely referenced
  • Mirrors value-level computation to type-level for dataset methods
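
The idea of safely referenced columns can be sketched in plain Scala, without Spark. Every name below (`TypedCol`, `TypedTable`, `Track`) is hypothetical; this is not the API of Spark or of any particular library, only an illustration of carrying the row type and the column type in the signature.

```scala
// A column knows which row type it belongs to (A) and what it yields (U).
final case class TypedCol[A, U](name: String, extract: A => U)

// A tiny in-memory stand-in for a typed dataset.
final class TypedTable[A](rows: Seq[A]) {
  // The result type (U1, U2) is computed from the column types at compile time.
  def select[U1, U2](c1: TypedCol[A, U1], c2: TypedCol[A, U2]): TypedTable[(U1, U2)] =
    new TypedTable(rows.map(row => (c1.extract(row), c2.extract(row))))

  def collect(): Seq[A] = rows
}

final case class Track(title: String, plays: Long)

object TypedColumnsSketch {
  val title: TypedCol[Track, String] = TypedCol[Track, String]("title", _.title)
  val plays: TypedCol[Track, Long] = TypedCol[Track, Long]("plays", _.plays)

  def run(): Seq[(String, Long)] =
    new TypedTable(Seq(Track("intro", 10L), Track("outro", 3L)))
      .select(title, plays)
      .collect()

  def main(args: Array[String]): Unit = run().foreach(println)
}
```

Passing a column defined for a different row type to `select` is rejected by the compiler, which is the kind of mistake a string-based column lookup would only surface at runtime.
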



Structured Streaming

val logs = ctx



  • data-driven

  • adhoc research

  • models

    • classifiers

    • predictors

  • number crunching

  • KPI

  • Google Cloud
    • Storage
    • BigQuery
    • Compute Engine
  • Luigi
  • Avro
  • Spark, Dataflow, BigQuery
  • R, dplyr, ggplot2
  • Tableau, Excel, shiny, dashboard
