Spark Summit Recap
David Mariassy
Conference Facts
- Organised by Databricks, which was founded by Matei Zaharia, the original creator of Spark
- 1 day of training, 2 days of talks
- 7 tracks of talks, over 120 presentations in total and only a fraction of them are shameless sales pitches 🙂
- Around 2000 attendees
My Spark Summit
Of the 7 tracks on offer, I focused on three:
Spark Experience and Use Cases
- Lessons from the field
- Battle-tested architectural patterns
- Common issues and their solutions
Developer
- Productivity enhancements, best practices
Technical Deep Dives
- Spark optimisation techniques
Spark Experience and Use Cases
De facto industry consensus on architecture (1/3): Ingestion
- Transformation logic:
- Complexity: NA
- Owner: NA
- Infrastructure:
- Components: Kafka, Kafka Producers/Connectors
- Owner: Central DevOps Team
- DE Responsibilities - Low:
- Onboard new sources needed for stakeholders' data analysis
- Purpose: Low-latency change data capture system that all teams/services in the company can subscribe to
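To make the subscription model concrete, here's a minimal sketch of what consuming one of these CDC topics could look like with Spark Structured Streaming. The broker address, topic name and console sink are placeholder choices of mine, not something prescribed at the summit:

```scala
import org.apache.spark.sql.SparkSession

object CdcSubscriber {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("cdc-subscriber").getOrCreate()

    // Subscribe to one of the centrally managed CDC topics
    // (broker address and topic name are hypothetical).
    val changes = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka:9092")
      .option("subscribe", "orders.cdc")
      .load()

    // Kafka delivers binary key/value pairs; cast them so downstream
    // consumers can parse the change events however they need.
    val events = changes.selectExpr(
      "CAST(key AS STRING) AS key",
      "CAST(value AS STRING) AS value",
      "timestamp")

    // Placeholder sink: each subscribing team would pick its own.
    events.writeStream
      .format("console")
      .start()
      .awaitTermination()
  }
}
```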
De facto industry consensus on architecture (2/3): Persistence
- Transformation logic:
- Complexity: Low for transformations (T), high for loading (L)
- Owner: Data Engineers
- Infrastructure:
- Components: ETL Execution Engine, Distributed File System, Data Catalog, ACLs, Scheduler, Monitoring
- Owners: Data Engineers
- DE Responsibilities - High:
- Full oversight
- Purpose:
- Persist all events in a way that supports efficient data access
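A minimal sketch of the kind of write-once persistence job this layer implies, assuming raw events arrive via the Kafka layer above. Topic names and DFS paths are hypothetical; partitioning by event date is one common way to support efficient downstream access:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object PersistEvents {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("persist-events").getOrCreate()

    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka:9092") // hypothetical
      .option("subscribe", "orders.cdc")               // hypothetical
      .load()

    // Derive a partition column so downstream readers can prune by date.
    val events = raw
      .selectExpr("CAST(value AS STRING) AS payload", "timestamp")
      .withColumn("event_date", to_date(col("timestamp")))

    // Append everything to the lake as Parquet, partitioned for efficient access.
    events.writeStream
      .format("parquet")
      .option("path", "hdfs:///lake/events")                   // hypothetical path
      .option("checkpointLocation", "hdfs:///lake/_chk/events")
      .partitionBy("event_date")
      .start()
      .awaitTermination()
  }
}
```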
De facto industry consensus on architecture (3/3): Analytics
- Transformation logic:
- Complexity: High
- Owner: Data Analysts / Scientists
- Infrastructure:
- Components: SQL Execution Engine, Distributed File System, Data Catalog, ACLs, Scheduler, Monitoring
- Owners: Data Engineers
- DE Responsibilities - High:
- Infrastructure management for others
- Advisory role
- Purpose:
- Transform and persist data in a form that can power insights and self-service BI
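A minimal sketch of a layer-3 transformation: pure SQL over catalog tables, written back to a table the BI tool can read. Database and table names are invented for illustration, and I'm assuming a Hive-compatible catalog:

```scala
import org.apache.spark.sql.SparkSession

object DailyRevenue {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("daily-revenue")
      .enableHiveSupport() // resolve tables through the shared data catalog
      .getOrCreate()

    // Hypothetical table registered in the catalog by the persistence layer.
    val report = spark.sql(
      """SELECT event_date, country, SUM(amount) AS revenue
        |FROM lake.orders
        |GROUP BY event_date, country""".stripMargin)

    // Persist the aggregate where self-service BI can pick it up.
    report.write.mode("overwrite").saveAsTable("marts.daily_revenue")
  }
}
```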
"Pipelines as software" talks
- How to close the gap between Spark-based data pipeline applications and traditional software?
Takeaway points:
Abstractions:
- Problem: Fragmented code bases, lots of duplication, low test coverage
- Solution: Use abstractions to encourage code reuse and make pipelines more uniform - e.g. collect all DataFrameReaders in a shared library where they are tested once (see the sketch below) - Session 1
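A minimal sketch of what such a shared reader library might look like; the schema, path and names are illustrative only. The point is that every pipeline imports the same tested reader instead of hand-rolling its own:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types._

// Hypothetical shared library: schemas and read options are defined
// (and unit tested) exactly once, then reused by every pipeline.
object Readers {
  val ordersSchema: StructType = StructType(Seq(
    StructField("user_id", StringType),
    StructField("amount", IntegerType),
    StructField("event_date", DateType)
  ))

  def orders(spark: SparkSession): DataFrame =
    spark.read
      .schema(ordersSchema)           // enforce the schema instead of inferring it
      .parquet("hdfs:///lake/orders") // hypothetical path
}
```

A pipeline then calls `Readers.orders(spark)` and is guaranteed the same schema and options as every other pipeline.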
Tests:
- Problem: Most pipelines are not tested, and even when they are, they are not designed to deal gracefully with the invalid data that will inevitably come their way
- Solution: Unit test with spark-testing-base - Session 2
- Solution: Add data validation rules to write-once, run-forever ETL jobs that send invalid data to a "dirty target" (sketched below)
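A minimal sketch that combines the two solutions: a hypothetical validation rule that routes rows to a "dirty target", plus a unit test for it built on spark-testing-base's DataFrameSuiteBase (which supplies a local SparkSession for tests). The rule and all names are mine, not from the sessions:

```scala
import com.holdenkarau.spark.testing.DataFrameSuiteBase
import org.apache.spark.sql.DataFrame
import org.scalatest.FunSuite

// Hypothetical validation rule: orders without a user_id are invalid
// and belong in the "dirty target" rather than failing the whole job.
object OrderValidation {
  def valid(df: DataFrame): DataFrame = df.filter(df("user_id").isNotNull)
  def dirty(df: DataFrame): DataFrame = df.filter(df("user_id").isNull)
}

class OrderValidationSpec extends FunSuite with DataFrameSuiteBase {
  test("orders without a user_id are routed to the dirty target") {
    import spark.implicits._

    val input = Seq(("u1", 10), (null, 20)).toDF("user_id", "amount")
    val expectedValid = Seq(("u1", 10)).toDF("user_id", "amount")

    // assertDataFrameEquals comes from spark-testing-base.
    assertDataFrameEquals(expectedValid, OrderValidation.valid(input))
    assert(OrderValidation.dirty(input).count() === 1L)
  }
}
```

In the real job, `OrderValidation.dirty(df)` would be appended to a dedicated "dirty" path or table instead of being silently dropped.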
"Optimising the Data Lake" talks
- How to make sure that users of the data lake have access to the right data at low latency and can query it efficiently?
Takeaway points:
Spark SQL for analytics:
- Problem: Multi-tenant DWHs (where storage and compute are tightly coupled) are hard to scale in a way that fits everyone, and make it impossible to isolate issues (every problem affects everyone)
- Solution: Separate storage (DFS) from compute (Spark) and offer dedicated clusters for teams (see the sketch below) - Link
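A sketch of the decoupled setup, assuming a shared object store: each team submits this kind of job to its own cluster, so a heavy query from one team cannot starve another team's workloads. Bucket and app names are made up:

```scala
import org.apache.spark.sql.SparkSession

object TeamAnalytics {
  def main(args: Array[String]): Unit = {
    // Each team runs its own (right-sized) cluster for compute...
    val spark = SparkSession.builder().appName("marketing-analytics").getOrCreate()

    // ...while storage is a shared DFS/object store that all clusters read.
    val events = spark.read.parquet("s3a://shared-lake/events") // hypothetical bucket

    events.groupBy("event_date").count().show()
  }
}
```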
- DFS file optimisations:
Links
- Videos and slides from all talks are available on the Spark Summit website: https://databricks.com/sparkaisummit/europe/schedule
- I uploaded my own conference notes to Confluence: https://yourdelivery.atlassian.net/wiki/spaces/PI/pages/382533794/Spark+Summit+Notes
Questions?