Avoiding Spark Pitfalls at Scale

About me

  • Software Engineer on data science team at Coatue Management
  • Scala for the last 5.5 years, data engineering for the last 2 years
  • Brooklyn! (by way of Texas)

(I really like coffee and whisky)

[Photos: Yamazaki Distillery; Little Nap Coffee Stand]

What this talk is not about

  • Machine learning, deep learning, AI
  • Real-time streaming
  • Whinefest about Spark's API

What this talk is about

  • Practical lessons learned from production Spark for large scale batch ETL
    • (and dirty, subversive tricks)
  • Spark 2.2.0
  • Know your data & code

  • Skew

  • Execution Engine

  • Configuration

  • API

  • Typeful APIs

Know your data & code

  • Know the cardinality and distribution of your data (quick check sketched below) for:
    • group by operations
    • join keys
    • windowing operations
    • partition keys
  • Repartition to finer grain partitions prior to slow stages to reduce straggler task problem
val ds: Dataset[MyType] = spark.read.parquet(...).as[MyType]

// Double the partition count ahead of the slow map stage
ds.repartition(ds.rdd.getNumPartitions * 2)
  .map(slowFn)
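
To eyeball cardinality and distribution before an expensive join or groupBy, a quick check like the following helps (a minimal sketch; the "key" column name is an assumption):

import org.apache.spark.sql.functions.desc

// Show the 20 heaviest keys: a fast skew check before grouping/joining on "key"
ds.groupBy("key")
  .count()
  .orderBy(desc("count"))
  .show(20)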

Skew

  • What is skew?
    • In the context of data engineering: an uneven key distribution, where some keys appear far more often than others
    • Can adversely affect parallelization of tasks

Skew

  • Data skew: 90% of the hard problems we faced (other 10% probably java.io.Serializable)
    • Often the cause of dreaded OutOfMemory errors
  • Don't repartition on skewed keys
  • SortMergeJoin operations on skewed join keys can OOM unless you have good spill thresholds
    • Or, use a skew join implementation...

Skew

  • Skew join: isolated map (broadcast) join (see the sketch after this list)
    1. Bifurcate the LHS (left-hand side) and RHS:
      • Filter rows with keys that are 'hot' (appear over some hotKeyThreshold times) => (hotLHS, hotRHS)
      • Everything else => (coldLHS, coldRHS)
    2. Broadcast join hotLHS with hotRHS
    3. Regular join coldLHS with coldRHS
    4. Union results of 2 and 3
  • Assumes hotRHS is small enough to be broadcast, or tune hotKeyThreshold accordingly
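
A minimal sketch of the isolated map join above, using the DataFrame API (the skewJoin name, single string join key, and hotKeyThreshold parameter are illustrative assumptions):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{broadcast, col}

def skewJoin(lhs: DataFrame, rhs: DataFrame, key: String, hotKeyThreshold: Long): DataFrame = {
  // Keys appearing more than hotKeyThreshold times on the LHS are 'hot'
  val hotKeys = lhs.groupBy(key).count()
    .filter(col("count") > hotKeyThreshold)
    .select(key)

  // 1. Bifurcate both sides on hot key membership
  val hotLHS  = lhs.join(broadcast(hotKeys), Seq(key), "leftsemi")
  val coldLHS = lhs.join(broadcast(hotKeys), Seq(key), "leftanti")
  val hotRHS  = rhs.join(broadcast(hotKeys), Seq(key), "leftsemi")
  val coldRHS = rhs.join(broadcast(hotKeys), Seq(key), "leftanti")

  // 2. Broadcast join the hot rows (hotRHS must fit in executor memory)
  // 3. Regular (sort-merge) join the cold rows
  // 4. Union the results
  hotLHS.join(broadcast(hotRHS), Seq(key))
    .union(coldLHS.join(coldRHS, Seq(key)))
}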

Execution Engine

  • Timeouts with long-running stages and broadcast joins
    • Increase spark.sql.broadcastTimeout
// Long-running upstream stages can trip the broadcast timeout
val slowBroadcast = dataset
  .map(slowFunction)
  .joinWith(broadcast(otherDataset), joinKey)
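
The timeout (spark.sql.broadcastTimeout, in seconds; default 300 in Spark 2.2) can be raised at runtime; the value below is an illustrative placeholder:

// Give slow upstream stages more headroom before the broadcast times out
spark.conf.set("spark.sql.broadcastTimeout", 36000)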
  • Partitioning and coalescing
    • Beware: coalesce is applied lazily and can silently reduce upstream parallelism
val aggregation = dataset.repartition(1000)
  .groupBy(...)
  .agg(...)
  .coalesce(1) // forces the upstream computation onto 1 partition/task;
               // use .repartition(1) instead

Execution Engine

  • Very long, lazily built queries
    • Beware: Janino codegen errors, Catalyst trouble
    • Can be hard to reason about performance
    • Dirty hack: force evaluations with actions
val veryLongQuery = dataset.groupBy(...)
  .agg(...)
  .map(processRow)
  .joinWith(anotherDataset, ...)
  .filter(filterFn)
  .groupBy(...)
  .agg(...)

// Break up the long query with caching and counting (count is a low-overhead strict action)

val query1 = dataset.groupBy(...)
  .agg(...)
  .cache()

query1.count

val query2 = query1.map(processRow)
  .joinWith(anotherDataset, ...)
  .cache()

query2.count

// etc

Configuration

  • The knobs are there (if you know where to dig)
    • Spark docs + source code + JIRA
  • Parallelism (partitions) - start at 2x available cores
    • spark.default.parallelism
    • spark.sql.shuffle.partitions
  • Spill thresholds - tune high enough not to spill to disk prematurely, but low enough to avoid OOMs
    • spark.shuffle.spill.numElementsForceSpillThreshold
    • spark.sql.windowExec.buffer.spill.threshold
    • spark.sql.sortMergeJoinExec.buffer.spill.threshold
    • SPARK-13450
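
As a rough sketch, these can all be set when building the session; the values below are placeholders to tune for your cluster, not recommendations:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.default.parallelism", 400)    // ~2x available cores
  .config("spark.sql.shuffle.partitions", 400) // ~2x available cores
  // Spill thresholds: number of buffered elements before spilling to disk
  .config("spark.shuffle.spill.numElementsForceSpillThreshold", 5000000)
  .config("spark.sql.windowExec.buffer.spill.threshold", 5000000)
  .config("spark.sql.sortMergeJoinExec.buffer.spill.threshold", 5000000)
  .getOrCreate()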

Configuration

  • Dealing with slow or faulty nodes:
    • Speculation - spark.speculation.*
    • Blacklisting - spark.blacklist.*
    • Task reaping - spark.task.reaper.*
  • Dropped task reporting in the Spark UI?
    • Tune spark.scheduler.listenerbus.eventqueue.size
    • SPARK-15703
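
A sketch of enabling these together at session build time (the flag values are assumptions to adapt per cluster):

import org.apache.spark.sql.SparkSession

val resilientSpark = SparkSession.builder()
  .config("spark.speculation", "true")         // re-launch suspiciously slow tasks
  .config("spark.blacklist.enabled", "true")   // stop scheduling on repeatedly failing nodes
  .config("spark.task.reaper.enabled", "true") // monitor and kill hung tasks
  .config("spark.scheduler.listenerbus.eventqueue.size", 100000) // fewer dropped UI events
  .getOrCreate()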

API

  • map vs. select
    • If just selecting columns, select is more performant (example below)
    • Plain Scala functions aren't optimized by Catalyst
  • Instance members are re-initialized when crossing serialization/encoding boundaries (e.g. shuffle)
    • Beware of expensive members:
case class MyData(a: String, b: Int) {
  // recomputed every time a row is decoded back into MyData, e.g. after a shuffle
  val expensiveMember = { ... }
}
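
For the select vs. map point above, a minimal illustration (myData: Dataset[MyData] is an assumed input):

import spark.implicits._

val selected = myData.select($"a", $"b")   // pure projection: Catalyst can prune and optimize
val mapped   = myData.map(d => (d.a, d.b)) // opaque closure: deserializes every row to MyData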

Typeful APIs

  • QDSL (Quoted Domain Specific Language) for Spark
  • Spark SQL as target language

Interested?

  • What we do: data engineering @ Coatue
  • Stack
    • Scala (cats, shapeless, fs2, http4s, etc.)
    • Spark
    • AWS (EMR, Redshift, etc.)
    • R, Python
    • Tableau
  • Chat with me or email: lcao@coatue.com
  • Twitter: @oacgnol

Avoiding Spark Pitfalls at Scale

By longcao

Scale by the Bay 2017: http://scale.bythebay.io/