Big Data at the Intersection of Typed FP and Abstract Algebra
About me
- Software Engineer on data science team @ Coatue Management
- Scala for the last 6 years, data engineering for the last 2.5
- Based in New York, NY (Brooklyn)
- Things I like:
  - Coffee, beer, whiskey
  - Travel
  - Music (hip hop, synthwave)
  - Sports (soccer, bouldering)
  - Fountain pens (!!)
  - Sneakers
Where I work
- What we do: data engineering @ Coatue
  - Terabyte scale, billions of rows
  - Lambda architecture
  - Typed functional programming
  - NLP
- Stack
  - Scala (cats, shapeless, fs2, http4s, etc.)
  - Spark
  - AWS (EMR, Redshift, etc.)
  - R, Python
  - Tableau
- Offices in NYC and Menlo Park, CA
- Email: lcao@coatue.com, Twitter: @oacgnol
What this talk is about
- Aggregations at small and large scales
  - Small = local JVMs, List[T]
  - Large = big data, e.g. Spark's Dataset[T] or RDD[T]
- Semigroups and monoids for the working data engineer: how are they related to aggregation functions?
  - How these abstractions enable code reuse
  - Defined with laws: a set of axioms that instances of these abstractions (should) obey
First: an aggregation in SQL
> select * from my_table;
 name | x |  y
------+---+------
 a    | 1 |  500
 b    | 2 | 1000
 c    | 3 | 5000
> select sum(x) from my_table;
6
Nice and simple.
Translation to local Scala
val nums = List(1,2,3)
val sum = nums.fold(0) { (x1, x2) =>
  x1 + x2
}
// sum = 6
Still pretty simple.
Translation to Spark
val nums: Dataset[Int] =
  spark.createDataset(List(1, 2, 3))
val sum = nums.rdd.fold(0) { (x1, x2) =>
  x1 + x2
}
// sum = 6
- Folds (reduces) in Spark are distributed
- Parallelization across executors
- Fold within partitions first, then fold final results
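As a rough mental model (my own sketch, not Spark's actual implementation), a distributed fold is like folding each partition locally and then folding the per-partition results:
// Conceptual sketch only: pretend each inner List is one partition's data
val partitions: List[List[Int]] = List(List(1), List(2, 3))
// 1. fold within each partition
val partials: List[Int] = partitions.map(_.fold(0)(_ + _))
// 2. fold the per-partition results into the final value
val total: Int = partials.fold(0)(_ + _)
// total == 6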
Hmm...
val nums = spark.createDataset(List(1,2,3))
val sum = nums.rdd.fold(0) { (x1, x2) =>
  x1 + x2
}
// sum = 6
Can we abstract these folds away for reuse?
val nums = List(1,2,3)
val sum = nums.fold(0) { (x1, x2) =>
  x1 + x2
}
// sum = 6
First pass at a typeclass
trait Combinable[A] {
  def zero: A
  def combine(x: A, y: A): A
}

implicit val addCombinable = new Combinable[Int] {
  def zero: Int = 0
  def combine(x: Int, y: Int): Int = x + y
}

def fold[T](
  list: List[T]
)(implicit m: Combinable[T]): T = {
  list.fold(m.zero) { (t1, t2) =>
    m.combine(t1, t2)
  }
}
fold(List(1,2,3)) == 6
AKA: Monoid[A]
trait Monoid[A] {
  def empty: A
  def combine(x: A, y: A): A
}

implicit val addMonoid = new Monoid[Int] {
  def empty: Int = 0
  def combine(x: Int, y: Int): Int = x + y
}

def fold[T](
  list: List[T]
)(implicit m: Monoid[T]): T = {
  list.fold(m.empty) { (t1, t2) =>
    m.combine(t1, t2)
  }
}
fold(List(1,2,3)) == 6
Monoid laws
- Laws: an extra set of rules that instances should also adhere to
- From Scaladoc:
- List#fold: "a binary operator that must be associative."
- RDD#fold: "using a given associative function [...] for functions that are not commutative, the result may differ from that of a fold applied to a non-distributed collection"
Monoid laws
- Associativity
  - combine(combine(x, y), z) == combine(x, combine(y, z))
- Commutativity
  - combine(x, y) == combine(y, x)
  - Relevant in distributed aggregation, since the order in which partition results are combined isn't guaranteed (see the sketch below)
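To see why commutativity matters, consider string concatenation: it is associative (so it satisfies List#fold's contract) but not commutative, so a distributed fold may legitimately reorder results. A hedged sketch, assuming a local SparkSession named spark:
// String concatenation is associative but NOT commutative
val words = List("a", "b", "c", "d")
words.fold("")(_ + _) // always "abcd" on a local List

// On an RDD, partition results may be combined in any order,
// so e.g. "cdab" is a possible outcome (per the RDD#fold Scaladoc caveat)
val distributed = spark.sparkContext.parallelize(words, numSlices = 4)
distributed.fold("")(_ + _) // may differ from "abcd"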
Monoid laws
- Let's add commutativity to our monoid:
import cats.kernel.CommutativeMonoid

val addMonoid = new CommutativeMonoid[Int] {
  def empty: Int = 0
  def combine(x: Int, y: Int): Int = x + y
}

// Run through cats law testing
checkAll(
  "addingMonoid",
  CommutativeMonoidTests[Int](addMonoid)
    .commutativeMonoid)
Monoid laws
[info] SimpleFoldsTest:
[info] - addingMonoid.commutativeMonoid.associative
[info] - addingMonoid.commutativeMonoid.collect0
[info] - addingMonoid.commutativeMonoid.combine all
[info] - addingMonoid.commutativeMonoid.combineAllOption
[info] - addingMonoid.commutativeMonoid.commutative
[info] - addingMonoid.commutativeMonoid.is id
[info] - addingMonoid.commutativeMonoid.left identity
[info] - addingMonoid.commutativeMonoid.repeat0
[info] - addingMonoid.commutativeMonoid.repeat1
[info] - addingMonoid.commutativeMonoid.repeat2
[info] - addingMonoid.commutativeMonoid.right identity
(yes, all the tests are green and passing)
✅
Monoids and Dataset
- Enforce commutativity as well on the typeclass instance:
def fold[T](
  ds: Dataset[T]
)(implicit cm: CommutativeMonoid[T]): T = {
  ds.rdd.fold(cm.empty) { (t1, t2) =>
    cm.combine(t1, t2)
  }
}
- Given that we've followed Spark's guidance as closely as possible, we can trust it to parallelize the fold at runtime
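A quick usage sketch of this fold, reusing the CommutativeMonoid[Int] from earlier (assuming a SparkSession named spark is in scope):
import cats.kernel.CommutativeMonoid
import spark.implicits._

implicit val addMonoid: CommutativeMonoid[Int] = new CommutativeMonoid[Int] {
  def empty: Int = 0
  def combine(x: Int, y: Int): Int = x + y
}

val nums: Dataset[Int] = spark.createDataset(List(1, 2, 3))
fold(nums) // == 6: folded within each partition first, then across partitions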
How about a custom type?
import cats.implicits._ // for the `combine` syntax on Int and Option[Long]
import cats.kernel.CommutativeSemigroup

case class MyType(
  name: String,
  x: Int,
  y: Option[Long])

// Can't really define empty for MyType
// Semigroup is like a Monoid without an `empty`
val cg = new CommutativeSemigroup[MyType] {
  override def combine(
    mt1: MyType,
    mt2: MyType
  ): MyType = {
    MyType(
      name = mt1.name, // pick a side... hmm...
      x = mt1.x combine mt2.x,
      y = mt1.y combine mt2.y)
  }
}
How about a custom type?
checkAll(
  "CommutativeSemigroup[MyType]",
  CommutativeSemigroupTests[MyType](cg)
    .commutativeSemigroup)
[info] SimpleCustomMonoidsTest:
[info] - CommutativeSemigroup[MyType].commutativeSemigroup.associative
[info] - CommutativeSemigroup[MyType].commutativeSemigroup.combineAllOption
[info] - CommutativeSemigroup[MyType].commutativeSemigroup.commutative *** FAILED ***
[info] GeneratorDrivenPropertyCheckFailedException was thrown during property evaluation.
[info] (Discipline.scala:14)
[info] Falsified after 4 successful property evaluations.
[info] Location: (Discipline.scala:14)
[info] Occurred when passed generated values (
[info] arg0 = MyType(儐,-193145333,None),
[info] arg1 = MyType(,0,None)
[info] )
[info] Label of failing property:
[info] Expected: MyType(,-193145333,None)
[info] Received: MyType(儐,-193145333,None)
😢
How would I do this in SQL?
select
  name,
  sum(x) as x,
  sum(y) as y
from
  my_table
group by name;
- Ultimately, I actually want to combine values for rows by key (name)
Monoid[List[MyType]]
- Let's try translating that into a Monoid in Scala:
implicit val myTypeListMonoid = new Monoid[List[MyType]] {
  override def empty: List[MyType] = List.empty[MyType]
  override def combine(
    x: List[MyType],
    y: List[MyType]
  ): List[MyType] = {
    (x ++ y).groupBy(_.name)
      .map { case (k, myTypes) =>
        myTypes.reduce { (mt1, mt2) =>
          MyType(
            name = mt1.name,
            x = mt1.x combine mt2.x,
            y = mt1.y combine mt2.y)
        }
      }.toList
  }
}
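For intuition, here's roughly how combining two lists behaves with this instance (my own example data, not from the talk):
val left = List(
  MyType("a", 1, Some(500L)),
  MyType("b", 2, Some(1000L)))
val right = List(
  MyType("a", 3, None),
  MyType("c", 5, Some(5000L)))

myTypeListMonoid.combine(left, right)
// one MyType per name, with x and y combined per key, e.g.
// List(MyType("a", 4, Some(500L)), MyType("b", 2, Some(1000L)), MyType("c", 5, Some(5000L)))
// (key ordering isn't guaranteed since it comes from groupBy)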
Monoid[Dataset[MyType]]
implicit def myTypeDatasetMonoid(
  implicit spark: SparkSession
) = new Monoid[Dataset[MyType]] {
  import spark.implicits._

  override def empty: Dataset[MyType] =
    spark.emptyDataset[MyType]

  override def combine(
    ds1: Dataset[MyType],
    ds2: Dataset[MyType]
  ): Dataset[MyType] = {
    ds1.union(ds2).groupByKey(_.name)
      .reduceGroups { (mt1, mt2) =>
        MyType(
          name = mt1.name,
          x = mt1.x combine mt2.x,
          y = mt1.y combine mt2.y)
      }
      .map(_._2)
  }
}
Caveats
- We have to provide concrete instances for each data type and for each collection type (monomorphic)
- Could we abstract more logic away to do this? (Hint: typeclasses, again)
Pulling out a common pattern
- We are grouping by a key and combining only the values
- Try a simple typeclass for this:
trait KeyValue[T, K, V] extends Serializable {
  def to(t: T): (K, V)
  def from(k: K, v: V): T
}
Pulling out a common pattern
object MyType {
  case class Values(
    x: Int,
    y: Option[Long])
}

implicit val kv =
  new KeyValue[MyType, String, MyType.Values] {
    def to(mt: MyType): (String, MyType.Values) =
      (mt.name, MyType.Values(mt.x, mt.y))
    def from(k: String, v: MyType.Values): MyType =
      MyType(k, v.x, v.y)
  }
Pulling out a common pattern
import cats.implicits._
// CommutativeSemigroup that combines
// only the _values_ of MyType
implicit val sg =
  new CommutativeSemigroup[MyType.Values] {
    override def combine(
      v1: MyType.Values,
      v2: MyType.Values
    ): MyType.Values = {
      MyType.Values(
        // delegate to cats Semigroup instances
        x = v1.x combine v2.x,
        y = v1.y combine v2.y)
    }
  }
Now try the new approach...
implicit def kvListMonoid[T, K, V: CommutativeSemigroup](
  implicit kver: KeyValue[T, K, V]
) = new Monoid[List[T]] {
  override def empty: List[T] = List.empty[T]
  override def combine(
    x: List[T],
    y: List[T]): List[T] = {
    (x ++ y).map(kver.to)
      .groupBy(_._1)
      .map { case (k, kvs) =>
        val combined: V = kvs
          .map(_._2)
          .reduce(_ combine _)
        kver.from(k, combined)
      }.toList
  }
}
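With the KeyValue[MyType, String, MyType.Values] and CommutativeSemigroup[MyType.Values] instances above in scope, a List monoid for MyType now comes for free; a sketch reusing the hypothetical left/right lists from the earlier example:
val listMonoid: Monoid[List[MyType]] =
  kvListMonoid[MyType, String, MyType.Values]

listMonoid.combine(left, right)
// same observable behavior as the hand-written Monoid[List[MyType]],
// but nothing about MyType is hard-coded in the monoid itself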
New Monoid[Dataset[T]]
def kvDatasetMonoid[T: Encoder, K: Encoder, V: Encoder: CommutativeSemigroup](
  implicit kver: KeyValue[T, K, V], spark: SparkSession
) = new Monoid[Dataset[T]] {
  import cats.implicits._

  private val tupleEncoder: Encoder[(K, V)] = Encoders.tuple[K, V](
    implicitly[Encoder[K]],
    implicitly[Encoder[V]])

  override def empty: Dataset[T] = spark.emptyDataset[T]

  override def combine(ds1: Dataset[T], ds2: Dataset[T]): Dataset[T] = {
    ds1.union(ds2).map(kver.to(_))(tupleEncoder)
      .groupByKey((kv: (K, V)) => kv._1)
      .reduceGroups { (kv1: (K, V), kv2: (K, V)) =>
        val (k1, v1) = kv1
        val (k2, v2) = kv2
        (k1, v1 combine v2)
      }
      .map { kkv: (K, (K, V)) =>
        val (k, kv) = kkv
        kver.from(k, kv._2)
      }
  }
}
😅
What did we just do?
- Reusable CommutativeSemigroup[V]s
- Reusable combines for List[T] and Dataset[T]
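A usage sketch for the Dataset side, assuming an implicit SparkSession named spark and spark.implicits._ in scope for the Encoders (again reusing the hypothetical left/right data from before):
import spark.implicits._ // provides Encoder[MyType], Encoder[String], Encoder[MyType.Values]

val dsMonoid: Monoid[Dataset[MyType]] =
  kvDatasetMonoid[MyType, String, MyType.Values]

val combined: Dataset[MyType] =
  dsMonoid.combine(spark.createDataset(left), spark.createDataset(right))
// same group-by-name-and-combine behavior as the List version, but distributed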
Caveats, other ideas
- The new monoids we've just defined are a little finicky with respect to law checking (see the talk repo to be linked)
- Note: sometimes unlawful instances can be useful (see: alleycats)
- Also probably need performance tuning
- Shapeless for automatic typeclass derivation?
- Other big data frameworks:
- Flink, Hadoop, Beam, etc.
Further reading
- Semigroups and Monoids in real use
  - twitter/algebird (Aggregators)
  - typelevel/frameless
- Code for this talk:
- Slides:
Thanks for listening!
Big Data at the Intersection of Typed FP and Abstract Algebra
By longcao
Typelevel Summit Boston 2018: https://typelevel.org/event/2018-03-summit-boston/