Spark Summit Recap

David Mariassy

Conference Facts

  • Organised by Databricks, which was co-founded by Matei Zaharia, the original creator of Spark
  • 1 day of training, 2 days of talks
  • 7 tracks of talks, over 120 presentations in total, and only a fraction of them were shameless sales pitches 🙂
  • Around 2000 attendees

My Spark Summit

  • Of the 7 tracks offered, I focused on:
    • Spark Experience and Use Cases
      • Lessons from the field
      • Battle-tested architectural patterns
      • Common issues and their solutions
    • Developer
      • Productivity enhancements, best practices
    • Technical Deep Dives
      • Spark optimisation techniques

De facto industry consensus on architecture (1/3)

  • Transformation logic:
    • Complexity: NA
    • Owner: NA
  • Infrastructure:
    • Components: Kafka, Kafka Producers/Connectors
    • Owner: Central DevOps Team
  • DE Responsibilities - Low:
    • Onboard new sources needed for stakeholders' data analysis
  • Purpose: Low-latency change data capture system that all teams/services in the company can subscribe to
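
A minimal sketch of what subscribing to this CDC bus could look like from Spark, using Structured Streaming's Kafka source (needs the spark-sql-kafka package on the classpath); the broker address and the orders.cdc topic are hypothetical:

    import org.apache.spark.sql.SparkSession

    object CdcSubscriber {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("cdc-subscriber").getOrCreate()

        // Subscribe to a change-data-capture topic owned by the central team
        val changes = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "kafka:9092") // placeholder brokers
          .option("subscribe", "orders.cdc")               // hypothetical topic
          .load()

        // Kafka delivers binary key/value pairs; consumers deserialise downstream
        changes
          .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")
          .writeStream
          .format("console") // stand-in for the subscribing service's real sink
          .start()
          .awaitTermination()
      }
    }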

De facto industry consensus on architecture (2/3)

  • Transformation logic:
    • Complexity: Low for the transform (T) step, high for the load (L) step
    • Owner: Data Engineers
  • Infrastructure:
    • Components: ETL Execution Engine, Distributed File System, Data Catalog, ACLs, Scheduler, Monitoring
    • Owners: Data Engineers
  • DE Responsibilities - High:
    • Full oversight
  • Purpose:
    • Persist all events in a way that supports efficient data access
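
A sketch of the kind of load job this layer runs, assuming raw JSON events land in a hypothetical staging path; the date partitioning is what makes later access efficient:

    import org.apache.spark.sql.{SaveMode, SparkSession}
    import org.apache.spark.sql.functions.{col, to_date}

    object EventPersister {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("event-persister").getOrCreate()

        // Hypothetical staging area where the Kafka connectors drop raw events
        val events = spark.read.json("hdfs:///staging/events/")

        // Derive a partition column so downstream readers can prune by date
        events
          .withColumn("event_date", to_date(col("event_time")))
          .write
          .mode(SaveMode.Append)
          .partitionBy("event_date")
          .parquet("hdfs:///lake/events/") // lake path, registered in the catalog
      }
    }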

De facto industry consensus on architecture (3/3)

  • Transformation logic:
    • Complexity: High
    • Owner: Data Analysts / Scientists
  • Infrastructure:
    • Components: SQL Execution Engine, Distributed File System, Data Catalog, ACLs, Scheduler, Monitoring
    • Owners: Data Engineers
  • DE Responsibilities - High:
    • Infrastructure management for others
    • Advisory role
  • Purpose:
    • Transform and persist data in a form that can power insights and self-service BI
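
For illustration, a transformation an analyst might own in this layer; the lake.purchases and marts.daily_revenue tables are invented, resolved through the shared catalog:

    import org.apache.spark.sql.SparkSession

    object DailyRevenueMart {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("daily-revenue-mart")
          .enableHiveSupport() // resolve table names via the shared data catalog
          .getOrCreate()

        // The high-complexity business logic lives in SQL owned by analysts
        val revenue = spark.sql(
          """SELECT event_date, country, SUM(amount) AS revenue
            |FROM lake.purchases
            |GROUP BY event_date, country""".stripMargin)

        // Persist in a BI-friendly form that self-service tools can query
        revenue.write.mode("overwrite").saveAsTable("marts.daily_revenue")
      }
    }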

"Pipelines as software" talks

  • How to close the gap between Spark-based data pipeline applications and traditional software?
  • Takeaway points:
    • Abstractions:
      • Problem: Fragmented code base, lots of duplication, low test coverage
      • Solution: Use abstractions to encourage code reuse and make pipelines more uniform, e.g. collect all DataFrameReaders in a shared library so that they are tested once (first sketch after this list) - Session 1
    • Tests:
      • Problem: Most pipelines are not tested, and even those that are tested are rarely designed to cope with the invalid data that will inevitably reach them
      • Solution: Unit test with spark-testing-base (second sketch after this list) - Session 2
      • Solution: Add data validation rules to write-once, run-forever ETL jobs that route invalid records to a "dirty target" (third sketch after this list)
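
First sketch, the shared-library idea: the paths are made up, but the point is that each reader is written and tested once and then reused by every pipeline:

    import org.apache.spark.sql.{DataFrame, SparkSession}

    // Hypothetical shared library: pipelines get their inputs from here,
    // so reading logic is implemented (and unit tested) exactly once
    object SharedReaders {

      def events(spark: SparkSession, date: String): DataFrame =
        spark.read.parquet(s"hdfs:///lake/events/event_date=$date")

      def users(spark: SparkSession): DataFrame =
        spark.read.parquet("hdfs:///lake/users/")
    }

A pipeline then calls SharedReaders.events(spark, "2017-06-05") instead of hand-rolling its own reader.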
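
Second sketch, a minimal unit test built on spark-testing-base; its DataFrameSuiteBase trait supplies a local SparkSession plus DataFrame equality assertions (the ScalaTest base class may differ by version), and the deduplication logic under test is invented for the example:

    import com.holdenkarau.spark.testing.DataFrameSuiteBase
    import org.scalatest.funsuite.AnyFunSuite

    class DedupeSpec extends AnyFunSuite with DataFrameSuiteBase {

      test("dedupe keeps one row per id") {
        val session = spark // stable identifier so implicits can be imported
        import session.implicits._

        val input    = Seq((1, "a"), (1, "a"), (2, "b")).toDF("id", "value")
        val expected = Seq((1, "a"), (2, "b")).toDF("id", "value")

        // The transformation under test (a stand-in for real pipeline logic)
        val result = input.dropDuplicates("id").orderBy("id")

        assertDataFrameEquals(expected, result)
      }
    }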
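
Third sketch, the "dirty target" pattern: the validation rules and paths are illustrative; valid rows flow to the real target while invalid rows are quarantined for inspection instead of failing the job:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.col

    object DirtyTargetWriter {
      // Example rules: reject rows with a missing id or a negative amount
      private val isValid = col("user_id").isNotNull && col("amount") >= 0

      def write(batch: DataFrame): Unit = {
        // Clean rows feed the tables that stakeholders actually query
        batch.filter(isValid)
          .write.mode("append").parquet("hdfs:///lake/purchases/")

        // Bad rows are kept for later inspection rather than killing the job
        batch.filter(!isValid)
          .write.mode("append").parquet("hdfs:///lake/purchases_dirty/")
      }
    }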

"Optimising the Data Lake" talks

  • How to make sure that users of the data lake have access to the right data at low latencies, and can query it efficiently?
  • Takeaway points:
    • Spark SQL for analytics:
      • Problem: Multi-tenant DWHs (where storage and compute are tightly coupled) are hard to scale in a way that fits everyone, and they make it impossible to isolate issues (every problem affects all tenants)
      • Solution: Separate storage (DFS) from compute (Spark) and offer dedicated clusters for teams - Link
    • DFS file optimisations:
      • Problem: Spark performs much better with large Parquet files that are properly partitioned and sorted (see the sketch after this list)
      • Session 1, Session 2
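
A sketch of a compaction job along those lines; the partition count, sort keys and paths are illustrative:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    object CompactEvents {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("compact-events").getOrCreate()

        spark.read.parquet("hdfs:///lake/events_raw/")
          .repartition(200, col("event_date"))           // fewer, larger files
          .sortWithinPartitions("event_date", "user_id") // sorted within files
          .write
          .mode("overwrite")
          .partitionBy("event_date")
          .parquet("hdfs:///lake/events/")
      }
    }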

Links

Questions?
