Frameless: A More Well-Typed Interface for Spark
About me
- Software Engineer on data science/quant team at Coatue Management
- Scala for the last 5 years, data engineering for the last 1.5 years
- Brooklyn! (by way of Texas)
What is Frameless?
- Typelevel project for adding a more well-typed veneer on Apache Spark
- https://github.com/typelevel/frameless
- Powered by shapeless and cats
A Brief History of Spark APIs
RDD[T]
- Resilient Distributed Dataset
- Scala collection distributed by Spark under the hood
- FP combinators like `.map`, `.filter`
- Fault tolerant and parallel
- But operations not optimized
RDD
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
case class Artist(name: String, age: Int)
val defaultArtists = Seq(
  Artist("Offset", 25),
  Artist("Kanye West", 39),
  Artist("Frank Ocean", 29),
  Artist("John Mayer", 39),
  Artist("Aretha Franklin", 74),
  Artist("Kendrick Lamar", 29),
  Artist("Carly Rae Jepsen", 31))
val spark = SparkSession.builder().master("local[*]").getOrCreate
val artists = spark.sparkContext.parallelize(defaultArtists)
val (totalAge, totalCount) = artists
  .map(a => (a.age, 1))
  .reduce { case ((a1, c1), (a2, c2)) => (a1 + a2, c1 + c2) }
println(s"Average age: ${totalAge.toDouble / totalCount.toDouble}")
Average age: 38.0
A Brief History of Spark APIs
DataFrame
- Fault tolerant & parallel like RDDs
- Optimized with Catalyst engine - Spark SQL
- Logical/physical plans
- More efficient queries for just the data you need
- Not well-typed - 'untyped' API :(
DataFrame
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.avg
val artists: DataFrame = spark.createDataFrame(defaultArtists)
artists.agg(avg("age")).show
+--------+
|avg(age)|
+--------+
| 38.0|
+--------+
artists.select("genre").show // throws an exception :(
org.apache.spark.sql.AnalysisException: cannot resolve '`genre`' given input columns: [name, age];;
'Project ['genre]
+- LocalRelation [name#0, age#1]
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
A Brief History of Spark APIs
Dataset[T]
- Think RDD[T] + DataFrame
- Compile time safety of RDD
- Optimizations of DataFrame
- type DataFrame = Dataset[Row]
- Efficient internal memory representation with Encoders
Encoder compatibility with common types:
- Int, String, Long, etc.
- java.sql.Date, java.sql.Timestamp
- case classes, tuples
Dataset[T]
import org.apache.spark.sql.Dataset
import spark.implicits._ // import default Encoders
val artists: Dataset[Artist] = spark.createDataset(defaultArtists)
artists
  .filter(_.age > 30) // typed API, like Scala collections/RDD
  .agg(avg("age")).show // untyped API from DataFrame
+--------+
|avg(age)|
+--------+
| 45.75|
+--------+
artists.select("genre").show // hmm, still throws an exception...
org.apache.spark.sql.AnalysisException: cannot resolve '`genre`' given input columns: [name, age];;
'Project ['genre]
+- LocalRelation [name#2, age#3]
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
TypedDataset[T]
- Wraps Spark's Dataset[T]
- No performance difference at runtime
- Type safe columns (not stringly typed!)
- Powered by shapeless
- Witnesses, Selectors, Records (things I don't yet grok)
- Core idea is compile-time evidence
- Spark "actions" == return `Job`s that needs to be explicitly `.run`
- Uses TypedEncoders for compile-time checking of encoded types
- Limited support for aggregation functions - not 100% API coverage from org.apache.spark.sql.functions
TypedDataset[T]
- Type safe columns (instead of stringly typed columns!)
import frameless._
implicit val sqlContext = spark.sqlContext // required for frameless
val artists: TypedDataset[Artist] = TypedDataset.create(defaultArtists)
artists
  .filter(_.age > 30)
  .select(avg(artists('age))) // typechecked column name!
  .show().run // explicit `.run`
+-----+
| _1|
+-----+
|45.75|
+-----+
artists.filter(_.age > 30).select(artists('name)).show().run
+----------------+
| _1|
+----------------+
| Kanye West|
| John Mayer|
| Aretha Franklin|
|Carly Rae Jepsen|
+----------------+
artists.select(artists('blah)) // doesn't compile
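Because actions come back as `Job` values, they compose lazily before Spark does any work; a minimal sketch using the `artists` dataset above (assuming frameless's `Job#map`):
import frameless.Job

val total: Job[Long] = artists.count() // no Spark job has run yet
val doubled: Job[Long] = total.map(_ * 2) // still lazy; composes like a value
doubled.run // only now does Spark execute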
TypedDataset[T]
case class AgeCount(age: Int, count: Long)
artists
  .groupBy(artists('age))
  .agg(count(artists('age)))
  .as[AgeCount] // compile-time `.as`!
  .filter(_.count > 1)
  .show().run
+---+-----+
|age|count|
+---+-----+
| 39| 2|
| 29| 2|
+---+-----+
- Safer groupBy, safer .as[T]
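To see what the compile-time `.as` buys, a hedged sketch: a target class whose field types don't line up is rejected by the compiler rather than at runtime (the `WrongAgeCount` class here is hypothetical):
// age is an Int in the data, so this projection is a compile error,
// not a runtime AnalysisException:
case class WrongAgeCount(age: String, count: Long)

// doesn't compile: no evidence the aggregated columns
// can be viewed as WrongAgeCount
// artists
//   .groupBy(artists('age))
//   .agg(count(artists('age)))
//   .as[WrongAgeCount]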
TypedEncoder[T]
- Static, compile-time encoders, recursively resolved
- Think io.circe.Encoder or scodec.Codec
Contrast with Spark Encoders
- Runtime exceptions when Encoder not found
- Still uses some reflection, but less so
- Can't define custom Spark Encoders yet:
"Unable to find encoder for type stored in a Dataset.
Primitive types (Int, String, etc) and Product types (case classes)
are supported by importing spark.implicits._
Support for serializing other types will be added in future releases"
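A minimal sketch of the contrast (the `Bad` class is hypothetical): `TypedEncoder` resolution is ordinary implicit search, so a missing encoder is a compile error rather than the runtime exception quoted above:
import frameless.TypedEncoder

// resolves at compile time: a case class of supported field types
implicitly[TypedEncoder[Artist]]

// an unsupported field is caught by the compiler, not at runtime:
// case class Bad(name: String, lock: java.util.concurrent.locks.Lock)
// implicitly[TypedEncoder[Bad]] // doesn't compile: no TypedEncoder found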
Injection[A, B]
- Custom-rolled encoding for types that might not be supported out of the box by frameless or Spark
- aka bijection, a one-to-one correspondence
- Define functions:
- A => B
- B => A
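For reference, the typeclass is essentially just that pair of functions; a sketch of the shape of `frameless.Injection`:
// roughly the shape of frameless.Injection:
trait Injection[A, B] extends Serializable {
  def apply(a: A): B   // A => B
  def invert(b: B): A  // B => A
}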
Injection[A, B]
sealed abstract class Genre
object Genre {
  case object HipHop extends Genre
  case object RnB extends Genre
  case object Soul extends Genre
  case object Pop extends Genre
  case object Rock extends Genre
}
case class ArtistWithGenre(artist: Artist, genre: Genre)
// Won't compile:
// could not find implicit value for parameter encoder:
// frameless.TypedEncoder[examples.ArtistWithGenre]
val artistsWithGenre: TypedDataset[ArtistWithGenre] = TypedDataset.create(Seq(
  ArtistWithGenre(Artist("Offset", 25), Genre.HipHop),
  ArtistWithGenre(Artist("Kanye West", 39), Genre.HipHop),
  ArtistWithGenre(Artist("Frank Ocean", 29), Genre.RnB),
  ArtistWithGenre(Artist("John Mayer", 39), Genre.Rock),
  ArtistWithGenre(Artist("Aretha Franklin", 74), Genre.Soul),
  ArtistWithGenre(Artist("Kendrick Lamar", 29), Genre.HipHop),
  ArtistWithGenre(Artist("Carly Rae Jepsen", 31), Genre.Pop)))
Injection[A, B]
// define an implicit Injection and frameless will use it
// to create a TypedEncoder
implicit val genreInjection = new Injection[Genre, Int] {
  def apply(genre: Genre): Int = genre match {
    case Genre.HipHop => 1
    case Genre.RnB => 2
    case Genre.Soul => 3
    case Genre.Pop => 4
    case Genre.Rock => 5
  }
  def invert(i: Int): Genre = i match {
    case 1 => Genre.HipHop
    case 2 => Genre.RnB
    case 3 => Genre.Soul
    case 4 => Genre.Pop
    case 5 => Genre.Rock
  }
}
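Frameless also provides an `Injection.apply` constructor taking the two functions directly, so the same instance can be written more compactly (a sketch of the same mapping, as an alternative to the version above):
implicit val genreInjection: Injection[Genre, Int] = Injection(
  {
    case Genre.HipHop => 1
    case Genre.RnB    => 2
    case Genre.Soul   => 3
    case Genre.Pop    => 4
    case Genre.Rock   => 5
  },
  {
    case 1 => Genre.HipHop
    case 2 => Genre.RnB
    case 3 => Genre.Soul
    case 4 => Genre.Pop
    case 5 => Genre.Rock
  }
)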
Injection[A, B]
import cats.Eq
import cats.implicits._
implicit val genreEq: Eq[Genre] = new Eq[Genre] {
  def eqv(g1: Genre, g2: Genre): Boolean = g1 == g2
}
// Compiles!
val artistsWithGenre: TypedDataset[ArtistWithGenre] = TypedDataset.create(Seq(
  ArtistWithGenre(Artist("Offset", 25), Genre.HipHop),
  ArtistWithGenre(Artist("Kanye West", 39), Genre.HipHop),
  ArtistWithGenre(Artist("Frank Ocean", 29), Genre.RnB),
  ArtistWithGenre(Artist("John Mayer", 39), Genre.Rock),
  ArtistWithGenre(Artist("Aretha Franklin", 74), Genre.Soul),
  ArtistWithGenre(Artist("Kendrick Lamar", 29), Genre.HipHop),
  ArtistWithGenre(Artist("Carly Rae Jepsen", 31), Genre.Pop)))
artistsWithGenre.filter(_.genre === Genre.HipHop).show().run
+-------------------+-----+
| artist|genre|
+-------------------+-----+
| [Offset,25]| 1|
| [Kanye West,39]| 1|
|[Kendrick Lamar,29]| 1|
+-------------------+-----+
Cats instances for Dataset[T]?
- Eh... it's tricky. (s/o @jeremyrsmith).
- Example: Functor[Dataset]
- Dataset#map needs an implicit Encoder[A]
- Can't use same trick as Functor[Future] with an implicit ExecutionContext
- Example: Monad[Dataset]
- Defined correctly, flatMap could be a Cartesian join
- Dataset[A] => Dataset[Dataset[B]]
- Read: very easy to blow up your Spark job
(Maybe there's a way out, but it's not obvious.)
¯\_(ツ)_/¯
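A sketch of the `Functor[Dataset]` dead end, using only Spark's own API: `Dataset#map` demands an implicit `Encoder` for the result type, and `Functor`'s signature gives us nowhere to ask for one:
import cats.Functor
import org.apache.spark.sql.Dataset

// doesn't compile: fa.map(f) needs an implicit Encoder[B],
// but Functor's map signature has no slot to require one
// implicit val datasetFunctor: Functor[Dataset] = new Functor[Dataset] {
//   def map[A, B](fa: Dataset[A])(f: A => B): Dataset[B] = fa.map(f)
// }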
In closing...
- Frameless needs users and contributors!
- Gitter: https://gitter.im/typelevel/frameless
- Examples from slides:
- Shoutouts to the real Frameless people:
Interested?
- What we do: data science @ Coatue
- Terabyte-scale data engineering
- Machine learning
- Quant trading
- NLP
- Stack
- Scala
- Spark
- AWS (S3, Redshift, etc.)
- R, Python
- Tableau
- Chat with me or email: lcao@coatue.com
- Twitter: @oacgnol
Frameless: A More Well-Typed Interface for Spark
By longcao
Typelevel Summit 2017 - Brooklyn, NY