Frameless: A More Well-Typed Interface for Spark

About me

Software Engineer on data science/quant team at Coatue Management
Scala for the last 5 years, data engineering for the last 1.5 years
Brooklyn! (by way of Texas)

What is Frameless?

Typelevel project for adding a more well-typed veneer on Apache Spark
https://github.com/typelevel/frameless
Powered by shapeless and cats

A Brief History of Spark APIs

RDD[T]

Resilient Distributed Dataset
Scala collection distributed by Spark under the hood
- FP combinators like `.map`, `.filter`
Fault tolerant and parallel
- But operations not optimized

RDD

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

case class Artist(name: String, age: Int)

val defaultArtists = Seq(
  Artist("Offset", 25),
  Artist("Kanye West", 39),
  Artist("Frank Ocean", 29),
  Artist("John Mayer", 39),
  Artist("Aretha Franklin", 74),
  Artist("Kendrick Lamar", 29),
  Artist("Carly Rae Jepsen", 31))

val spark = SparkSession.builder().master("local[*]").getOrCreate()
val artists = spark.sparkContext.parallelize(defaultArtists)

val (totalAge, totalCount) = artists
  .map(a => (a.age, 1))
  .reduce { case ((a1, c1), (a2, c2)) => (a1 + a2, c1 + c2) }

println(s"Average age: ${totalAge.toDouble / totalCount.toDouble}")

Average age: 38.0

A Brief History of Spark APIs

DataFrame

Fault tolerant & parallel like RDDs
Optimized with Catalyst engine - Spark SQL
- Logical/physical plans
- More efficient queries for just the data you need
Not well-typed - 'untyped' API :(

DataFrame

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.avg

val artists: DataFrame = spark.createDataFrame(defaultArtists)

artists.agg(avg("age")).show

+--------+
|avg(age)|
+--------+
|    38.0|
+--------+

artists.select("genre").show // throws an exception :(

org.apache.spark.sql.AnalysisException: cannot resolve '`genre`' given input columns: [name, age];;
'Project ['genre]
+- LocalRelation [name#0, age#1]
	at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)

A Brief History of Spark APIs

Dataset[T]

Think RDD[T] + DataFrame
- Compile time safety of RDD
- Optimizations of DataFrame
type DataFrame = Dataset[Row]
Efficient internal memory representation with Encoders
Encoder compatibility with common types:
- Int, String, Long, etc.
- java.sql.Date, java.sql.Timestamp
- case classes, tuples

Dataset[T]

import org.apache.spark.sql.Dataset

import spark.implicits._  // import default Encoders

val artists: Dataset[Artist] = spark.createDataset(defaultArtists)

artists
  .filter(_.age > 30)     // typed API, like Scala collections/RDD
  .agg(avg("age")).show   // untyped API from DataFrame

+--------+
|avg(age)|
+--------+
|   45.75|
+--------+

artists.select("genre").show // hmm, still throws an exception...

org.apache.spark.sql.AnalysisException: cannot resolve '`genre`' given input columns: [name, age];;
'Project ['genre]
+- LocalRelation [name#2, age#3]
	at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)

TypedDataset[T]

Wraps Spark's Dataset[T]
- No performance difference at runtime
Type safe columns (not stringly typed!)
- Powered by shapeless
- Witnesses, Selectors, Records (things I don't yet grok)
- Core idea is compile-time evidence
Spark "actions" return `Job`s that need to be explicitly `.run`
Uses TypedEncoders for compile-time checking of encoded types
Limited support for aggregation functions - not 100% API coverage from org.apache.spark.sql.functions
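
To make the `Job` bullet concrete, here is a minimal plain-Scala sketch of the idea: an action hands back a description of the work, and nothing executes until `.run()` is called. `LazyJob`, `count` and `executions` are hypothetical names made up for this illustration; this is not frameless's actual `Job` implementation.

```scala
object JobSketch {
  // Hypothetical stand-in for the idea behind frameless's Job:
  // wraps a computation without running it.
  final class LazyJob[A](thunk: () => A) {
    def run(): A = thunk()
    def map[B](f: A => B): LazyJob[B] = new LazyJob(() => f(thunk()))
  }

  // Counts how many times the underlying "action" actually executed.
  var executions: Int = 0

  // An "action" that returns a LazyJob instead of a result.
  def count(rows: Seq[Int]): LazyJob[Long] = new LazyJob(() => {
    executions += 1
    rows.size.toLong
  })

  val job: LazyJob[String] = count(Seq(25, 39, 29)).map(n => s"$n rows")
  val before: Int = executions    // nothing has run yet
  val result: String = job.run()  // the work happens here
  val after: Int = executions
}
```

Making execution explicit keeps the expensive step visible at the call site, which is presumably why `.show().run` appears in the frameless examples.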

TypedDataset[T]

Type safe columns (instead of stringly typed columns!)

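import frameless._
import frameless.functions.aggregate._ // frameless's typed avg/count, not the org.apache.spark.sql.functions ones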
implicit val sqlContext = spark.sqlContext // required for frameless
val artists: TypedDataset[Artist] = TypedDataset.create(defaultArtists)

artists
  .filter(_.age > 30)
  .select(avg(artists('age))) // typechecked column name!
  .show().run                 // explicit `.run`

+-----+
|   _1|
+-----+
|45.75|
+-----+

artists.filter(_.age > 30).select(artists('name)).show().run

+----------------+
|              _1|
+----------------+
|      Kanye West|
|      John Mayer|
| Aretha Franklin|
|Carly Rae Jepsen|
+----------------+

artists.select(artists('blah)) // doesn't compile

TypedDataset[T]

case class AgeCount(age: Int, count: Long)

artists
  .groupBy(artists('age))
  .agg(count(artists('age)))
  .as[AgeCount] // compile-time `.as`!
  .filter(_.count > 1)
  .show().run

+---+-----+
|age|count|
+---+-----+
| 39|    2|
| 29|    2|
+---+-----+

Safer groupBy, safer .as[T]

TypedEncoder[T]

Static, compile-time encoders, recursively resolved
Think io.circe.Encoder or scodec.Codec
Contrast with Spark Encoders
- Runtime exceptions when Encoder not found
- Still uses some reflection, but less so
- Can't define custom Spark Encoders yet:

"Unable to find encoder for type stored in a Dataset.
Primitive types (Int, String, etc) and Product types (case classes)
are supported by importing spark.implicits._
Support for serializing other types will be added in future releases"
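
A rough sketch of what "static, recursively resolved" means, in plain Scala with no Spark or frameless dependency. `SimpleEncoder` and its instances are hypothetical and only imitate the shape of a type class like `TypedEncoder`; the real thing produces Catalyst expressions rather than strings.

```scala
object EncoderSketch {
  // Hypothetical type class standing in for a compile-time encoder.
  trait SimpleEncoder[A] {
    def encode(a: A): String
  }

  object SimpleEncoder {
    // Summoner: fails at compile time if no instance can be built.
    def apply[A](implicit enc: SimpleEncoder[A]): SimpleEncoder[A] = enc

    // Base cases.
    implicit val intEncoder: SimpleEncoder[Int] = new SimpleEncoder[Int] {
      def encode(a: Int): String = a.toString
    }
    implicit val stringEncoder: SimpleEncoder[String] = new SimpleEncoder[String] {
      def encode(a: String): String = "\"" + a + "\""
    }

    // Recursive cases: built from the encoders of the component types.
    implicit def optionEncoder[A](implicit enc: SimpleEncoder[A]): SimpleEncoder[Option[A]] =
      new SimpleEncoder[Option[A]] {
        def encode(a: Option[A]): String = a.map(enc.encode).getOrElse("null")
      }
    implicit def pairEncoder[A, B](
        implicit ea: SimpleEncoder[A], eb: SimpleEncoder[B]): SimpleEncoder[(A, B)] =
      new SimpleEncoder[(A, B)] {
        def encode(p: (A, B)): String =
          "[" + ea.encode(p._1) + "," + eb.encode(p._2) + "]"
      }
  }

  // The compiler assembles pairEncoder(stringEncoder, optionEncoder(intEncoder)).
  val encoded: String =
    SimpleEncoder[(String, Option[Int])].encode(("Offset", Some(25)))

  // This sketch has no instance for Double, so the next line would not compile:
  // SimpleEncoder[(String, Double)]
}
```

The missing-instance case is the point: an unsupported field type is rejected by the compiler instead of surfacing when the job runs.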

Injection[A, B]

Custom-rolled encoding for types that frameless or Spark might not support out of the box
Effectively a bijection between A and its encoded values in B (a one-to-one correspondence)
Define functions:
- A => B
- B => A

Injection[A, B]

sealed abstract class Genre
object Genre {
  case object HipHop extends Genre
  case object RnB    extends Genre
  case object Soul   extends Genre
  case object Pop    extends Genre
  case object Rock   extends Genre
}

case class ArtistWithGenre(artist: Artist, genre: Genre)

// Won't compile:
// could not find implicit value for parameter encoder:
//   frameless.TypedEncoder[examples.ArtistWithGenre]
val artistsWithGenre: TypedDataset[ArtistWithGenre] = TypedDataset.create(Seq(
  ArtistWithGenre(Artist("Offset",            25), Genre.HipHop),
  ArtistWithGenre(Artist("Kanye West",        39), Genre.HipHop),
  ArtistWithGenre(Artist("Frank Ocean",       29), Genre.RnB),
  ArtistWithGenre(Artist("John Mayer",        39), Genre.Rock),
  ArtistWithGenre(Artist("Aretha Franklin",   74), Genre.Soul),
  ArtistWithGenre(Artist("Kendrick Lamar",    29), Genre.HipHop),
  ArtistWithGenre(Artist("Carly Rae Jepsen",  31), Genre.Pop)))

Injection[A, B]

// define an implicit Injection and frameless will use it
// to create a TypedEncoder
implicit val genreInjection: Injection[Genre, Int] = new Injection[Genre, Int] {
  def apply(genre: Genre): Int = genre match {
    case Genre.HipHop => 1
    case Genre.RnB    => 2
    case Genre.Soul   => 3
    case Genre.Pop    => 4
    case Genre.Rock   => 5
  }

  def invert(i: Int): Genre = i match {
    case 1 => Genre.HipHop
    case 2 => Genre.RnB
    case 3 => Genre.Soul
    case 4 => Genre.Pop
    case 5 => Genre.Rock
  }
}
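
One property worth checking for any injection is the round trip: `invert(apply(a))` should give back `a`. Below is a self-contained sketch of that check. The `Injection` trait here is a local stand-in with the same `apply`/`invert` shape as in the slides (so the snippet runs without frameless), and the two-value `Genre`, `genreToInt` and `roundTrips` are made up for illustration.

```scala
object InjectionSketch {
  // Local stand-in mirroring the apply/invert shape used in the slides.
  trait Injection[A, B] {
    def apply(a: A): B
    def invert(b: B): A
  }

  sealed abstract class Genre
  object Genre {
    case object HipHop extends Genre
    case object Pop    extends Genre
    val all: List[Genre] = List(HipHop, Pop)
  }

  val genreToInt: Injection[Genre, Int] = new Injection[Genre, Int] {
    def apply(g: Genre): Int = g match {
      case Genre.HipHop => 1
      case Genre.Pop    => 2
    }
    def invert(i: Int): Genre = i match {
      case 1 => Genre.HipHop
      case 2 => Genre.Pop
      // Not every Int is a valid code, so fail loudly on anything else.
      case other => throw new IllegalArgumentException(s"Unknown genre code: $other")
    }
  }

  // Round-trip check: every Genre survives encode-then-decode.
  val roundTrips: Boolean =
    Genre.all.forall(g => genreToInt.invert(genreToInt(g)) == g)
}
```

Note that `invert` is only meaningful on values that `apply` can produce; the explicit error case makes that assumption visible instead of leaving it to a `MatchError`.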

Injection[A, B]

import cats.Eq
import cats.implicits._

implicit val genreEq: Eq[Genre] = new Eq[Genre] {
  def eqv(g1: Genre, g2: Genre): Boolean = g1 == g2
}

// Compiles!
val artistsWithGenre: TypedDataset[ArtistWithGenre] = TypedDataset.create(Seq(
  ArtistWithGenre(Artist("Offset",            25), Genre.HipHop),
  ArtistWithGenre(Artist("Kanye West",        39), Genre.HipHop),
  ArtistWithGenre(Artist("Frank Ocean",       29), Genre.RnB),
  ArtistWithGenre(Artist("John Mayer",        39), Genre.Rock),
  ArtistWithGenre(Artist("Aretha Franklin",   74), Genre.Soul),
  ArtistWithGenre(Artist("Kendrick Lamar",    29), Genre.HipHop),
  ArtistWithGenre(Artist("Carly Rae Jepsen",  31), Genre.Pop)))

artistsWithGenre.filter(_.genre === Genre.HipHop).show().run

+-------------------+-----+
|             artist|genre|
+-------------------+-----+
|        [Offset,25]|    1|
|    [Kanye West,39]|    1|
|[Kendrick Lamar,29]|    1|
+-------------------+-----+

Cats instances for Dataset[T]?

Eh... it's tricky. (s/o @jeremyrsmith).
Example: Functor[Dataset]
- Dataset#map needs an implicit Encoder for the result type
- Can't use same trick as Functor[Future] with an implicit ExecutionContext
Example: Monad[Dataset]
- Defined correctly, flatMap could be a Cartesian join
  - Dataset[A] => Dataset[Dataset[B]]
- Read: very easy to blow up your Spark job

(Maybe there's a way out, but it's not obvious.)
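
The Functor problem can be reproduced without Spark. In the sketch below, `Enc` and `Ds` are hypothetical stand-ins for Spark's `Encoder` and `Dataset`, and `Functor` is written out by hand in the cats style. With `Future`, the `ExecutionContext` can be captured once when the instance is created; here the evidence depends on the result type `B`, which is only known per call to `map`.

```scala
import scala.language.higherKinds

object FunctorSketch {
  // cats-style Functor: map must work for every B, with no extra constraints.
  trait Functor[F[_]] {
    def map[A, B](fa: F[A])(f: A => B): F[B]
  }

  // Hypothetical stand-ins for Spark's Encoder and Dataset.
  trait Enc[A]
  object Enc {
    implicit val intEnc: Enc[Int] = new Enc[Int] {}
    implicit val stringEnc: Enc[String] = new Enc[String] {}
  }

  final case class Ds[A](rows: List[A]) {
    // Like Dataset#map: the result type needs an encoder.
    def map[B](f: A => B)(implicit enc: Enc[B]): Ds[B] = Ds(rows.map(f))
  }

  // Does not compile: inside Functor#map there is no Enc[B] for an arbitrary B.
  // implicit val dsFunctor: Functor[Ds] = new Functor[Ds] {
  //   def map[A, B](fa: Ds[A])(f: A => B): Ds[B] = fa.map(f)
  // }

  // List has no such constraint, so its instance is trivial.
  implicit val listFunctor: Functor[List] = new Functor[List] {
    def map[A, B](fa: List[A])(f: A => B): List[B] = fa.map(f)
  }

  // Calling Ds#map directly is fine when the encoder is known at the call site.
  val ages: Ds[Int] = Ds(List(25, 39))
  val labels: Ds[String] = ages.map(a => s"age $a")
}
```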

¯\_(ツ)_/¯

In closing...

Frameless needs users and contributors!
Gitter: https://gitter.im/typelevel/frameless
Examples from slides:
- https://github.com/longcao/frameless-examples
Shoutouts to the real Frameless people:

ｔｙｐｅｓａｆｅｃｏｌｕｍｎｓ

Interested?

What we do: data science @ Coatue
- Terabyte-scale data engineering
- Machine learning
- Quant trading
- NLP
Stack
- Scala
- Spark
- AWS (S3, Redshift, etc.)
- R, Python
- Tableau
Chat with me or email: lcao@coatue.com
Twitter: @oacgnol