Transform and persist data in a form that can power insights and self-service BI
"Pipelines as software" talks
How can we close the gap between Spark-based data pipeline applications and traditional software?
Takeaway points:
Abstractions:
Problem: A fragmented code base with a lot of duplication and low test coverage
Solution: Use abstractions to encourage code reuse and to make pipelines more uniform - e.g. collect all DataFrameReaders in a shared library that is tested once (see the sketch below) - Session 1
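A minimal sketch of what such a shared reader library could look like. The object name, dataset, schema and paths are hypothetical; the point is that every pipeline imports one tested reader instead of re-implementing its own spark.read calls.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{StructField, StringType, StructType, TimestampType}

// Hypothetical shared reader library: schemas and read options live in one
// place that is unit tested once and reused by every pipeline.
object SharedReaders {

  // Example schema; in practice this would be maintained alongside the data contract.
  val clickstreamSchema: StructType = StructType(Seq(
    StructField("user_id", StringType, nullable = false),
    StructField("event_type", StringType, nullable = true),
    StructField("event_time", TimestampType, nullable = true)
  ))

  // One canonical way to read raw clickstream data: explicit schema, no costly inference.
  def readClickstream(spark: SparkSession, path: String): DataFrame =
    spark.read
      .schema(clickstreamSchema)
      .option("mode", "PERMISSIVE")
      .json(path)

  // One canonical way to read curated Parquet tables.
  def readCurated(spark: SparkSession, path: String): DataFrame =
    spark.read.parquet(path)
}
```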
Tests:
Problem: Most pipelines are not tested, and even when they are, they are rarely designed to handle the invalid data that will inevitably reach them
Solution: Unit test with spark-testing-base (see the first sketch below) - Session 2
Solution: Add data validation rules to write-once, run-forever ETL jobs so that invalid records are routed to a "dirty target" (see the second sketch below)
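A minimal sketch of a spark-testing-base unit test, assuming a recent version of the library where DataFrameSuiteBase exposes a local SparkSession; the transformation under test (CleanEvents) and its columns are made up for illustration.

```scala
import com.holdenkarau.spark.testing.DataFrameSuiteBase
import org.apache.spark.sql.DataFrame
import org.scalatest.funsuite.AnyFunSuite

// Hypothetical transformation under test: keeps only rows with a non-null user_id.
object CleanEvents {
  def apply(df: DataFrame): DataFrame = df.filter(df("user_id").isNotNull)
}

class CleanEventsTest extends AnyFunSuite with DataFrameSuiteBase {

  test("rows without a user_id are dropped") {
    import spark.implicits._

    val input    = Seq(("u1", "click"), (null, "view"), ("u2", "view"))
                     .toDF("user_id", "event_type")
    val expected = Seq(("u1", "click"), ("u2", "view"))
                     .toDF("user_id", "event_type")

    // DataFrameSuiteBase provides the local SparkSession and DataFrame equality checks.
    assertDataFrameEquals(expected, CleanEvents(input))
  }
}
```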
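A minimal sketch of the "dirty target" idea: instead of silently dropping or failing on bad records, the job splits the input on its validation rules and persists the rejects separately. The object name, rules and paths are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical validation step for a long-running ETL job: rows that violate
// the rules are not lost but written to a separate "dirty" location for inspection.
object ValidateAndSplit {

  def run(spark: SparkSession, inputPath: String, cleanPath: String, dirtyPath: String): Unit = {
    val raw = spark.read.parquet(inputPath)

    // Example rules; real jobs would keep these alongside the data contract.
    val isValid = col("user_id").isNotNull && col("event_time").isNotNull

    val clean = raw.filter(isValid)
    val dirty = raw.filter(!isValid)

    clean.write.mode("append").parquet(cleanPath)
    dirty.write.mode("append").parquet(dirtyPath) // the "dirty target"
  }
}
```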
"Optimising the Data Lake" talks
How do we make sure that users of the data lake have access to the right data at low latency and can query it efficiently?
Takeaway points:
Spark SQL for analytics:
Problem: Multi-tenant DWHs (where storage and compute are tightly coupled) are hard to scale in a way that fits everyone, and they make it impossible to isolate issues (every problem affects everyone)
Solution: Separate storage (DFS) from compute (Spark) and offer dedicated clusters to teams (see the sketch below) - Link
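A minimal sketch of the storage/compute split, assuming the data lake sits in S3-compatible object storage accessed over the s3a connector; the bucket name, app name and credentials provider setting are placeholders.

```scala
import org.apache.spark.sql.SparkSession

// Compute (this Spark cluster) is team-specific and can be resized or recreated
// at will; the data itself lives in shared object storage, untouched by cluster changes.
val spark = SparkSession.builder()
  .appName("analytics-team-cluster")
  // Hypothetical setting for reading a shared data lake over the s3a connector.
  .config("spark.hadoop.fs.s3a.aws.credentials.provider",
          "com.amazonaws.auth.InstanceProfileCredentialsProvider")
  .getOrCreate()

// The same tables are visible from every team's dedicated cluster.
val events = spark.read.parquet("s3a://shared-data-lake/events/")
events.createOrReplaceTempView("events")
spark.sql("SELECT event_type, count(*) AS cnt FROM events GROUP BY event_type").show()
```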
DFS file optimisations:
Problem: Spark performs much better with large files that are stored in Parquet format and are properly partitioned and sorted (see the sketch below)
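A minimal sketch of a compaction job along these lines: many small raw files are rewritten as fewer, larger Parquet files, partitioned by a date column and sorted within partitions so that column min/max statistics support predicate pushdown. Paths and column names are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("compact-events").getOrCreate()

// Hypothetical compaction: read the small raw files and rewrite them as large,
// partitioned, sorted Parquet files that the data lake's readers can scan efficiently.
val raw = spark.read.json("s3a://shared-data-lake/raw/events/")

raw
  .repartition(col("event_date"))       // one write task per date partition => large files
  .sortWithinPartitions(col("user_id")) // sorted files => tighter column statistics
  .write
  .mode("overwrite")
  .partitionBy("event_date")            // directory layout used for partition pruning
  .parquet("s3a://shared-data-lake/curated/events/")
```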