Transform and persist data in a form that can power insights and self-service BI
"Pipelines as software" talks
How can we close the gap between Spark-based data pipeline applications and traditional software?
Takeaway points:
Abstractions:
Problem: A fragmented code base with a lot of duplication and low test coverage
Solution: Use abstractions to encourage code reuse and to make pipelines more uniform - e.g. collect all DataFrameReaders in a shared library that is tested once (see the sketch below) - Session 1
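A minimal sketch of what such a shared reader library could look like. The object name, dataset, schema and paths are hypothetical; the point is that every pipeline imports one tested reader instead of re-implementing its own spark.read calls.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{StructField, StringType, StructType, TimestampType}

// Hypothetical shared reader library: schemas and read options live in one
// place that is unit tested once and reused by every pipeline.
object SharedReaders {

  // Example schema; in practice this would be maintained alongside the data contract.
  val clickstreamSchema: StructType = StructType(Seq(
    StructField("user_id", StringType, nullable = false),
    StructField("event_type", StringType, nullable = true),
    StructField("event_time", TimestampType, nullable = true)
  ))

  // One canonical way to read raw clickstream data: explicit schema, no costly inference.
  def readClickstream(spark: SparkSession, path: String): DataFrame =
    spark.read
      .schema(clickstreamSchema)
      .option("mode", "PERMISSIVE")
      .json(path)

  // One canonical way to read curated Parquet tables.
  def readCurated(spark: SparkSession, path: String): DataFrame =
    spark.read.parquet(path)
}
```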
Tests:
Problem: Most pipelines are not tested, and even when they are, they are rarely designed to handle the invalid data that will inevitably reach them
Solution: Unit test with spark-testing-base (see the first sketch below) - Session 2
Solution: Add data validation rules to write-once, run-forever ETL jobs so that invalid records are routed to a "dirty target" (see the second sketch below)
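A minimal sketch of a spark-testing-base unit test, assuming a recent version of the library where DataFrameSuiteBase exposes a local SparkSession; the transformation under test (CleanEvents) and its columns are made up for illustration.

```scala
import com.holdenkarau.spark.testing.DataFrameSuiteBase
import org.apache.spark.sql.DataFrame
import org.scalatest.funsuite.AnyFunSuite

// Hypothetical transformation under test: keeps only rows with a non-null user_id.
object CleanEvents {
  def apply(df: DataFrame): DataFrame = df.filter(df("user_id").isNotNull)
}

class CleanEventsTest extends AnyFunSuite with DataFrameSuiteBase {

  test("rows without a user_id are dropped") {
    import spark.implicits._

    val input    = Seq(("u1", "click"), (null, "view"), ("u2", "view"))
                     .toDF("user_id", "event_type")
    val expected = Seq(("u1", "click"), ("u2", "view"))
                     .toDF("user_id", "event_type")

    // DataFrameSuiteBase provides the local SparkSession and DataFrame equality checks.
    assertDataFrameEquals(expected, CleanEvents(input))
  }
}
```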
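A minimal sketch of the "dirty target" idea: instead of silently dropping or failing on bad records, the job splits the input on its validation rules and persists the rejects separately. The object name, rules and paths are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical validation step for a long-running ETL job: rows that violate
// the rules are not lost but written to a separate "dirty" location for inspection.
object ValidateAndSplit {

  def run(spark: SparkSession, inputPath: String, cleanPath: String, dirtyPath: String): Unit = {
    val raw = spark.read.parquet(inputPath)

    // Example rules; real jobs would keep these alongside the data contract.
    val isValid = col("user_id").isNotNull && col("event_time").isNotNull

    val clean = raw.filter(isValid)
    val dirty = raw.filter(!isValid)

    clean.write.mode("append").parquet(cleanPath)
    dirty.write.mode("append").parquet(dirtyPath) // the "dirty target"
  }
}
```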
"Optimising the Data Lake" talks
How do we make sure that users of the data lake have access to the right data at low latency and can query it efficiently?
Takeaway points:
Spark SQL for analytics:
Problem: Multi-tenant DWHs (where storage and compute are tightly coupled) are hard to scale in a way that fits everyone, and they make it impossible to isolate issues (every problem affects everyone)
Solution: Separate storage (DFS) from compute (Spark) and offer dedicated clusters to teams (see the sketch below) - Link
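A minimal sketch of the storage/compute split, assuming the data lake sits in S3-compatible object storage accessed over the s3a connector; the bucket name, app name and credentials provider setting are placeholders.

```scala
import org.apache.spark.sql.SparkSession

// Compute (this Spark cluster) is team-specific and can be resized or recreated
// at will; the data itself lives in shared object storage, untouched by cluster changes.
val spark = SparkSession.builder()
  .appName("analytics-team-cluster")
  // Hypothetical setting for reading a shared data lake over the s3a connector.
  .config("spark.hadoop.fs.s3a.aws.credentials.provider",
          "com.amazonaws.auth.InstanceProfileCredentialsProvider")
  .getOrCreate()

// The same tables are visible from every team's dedicated cluster.
val events = spark.read.parquet("s3a://shared-data-lake/events/")
events.createOrReplaceTempView("events")
spark.sql("SELECT event_type, count(*) AS cnt FROM events GROUP BY event_type").show()
```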
DFS file optimisations:
Problem: Spark performs much better with large files that are stored in Parquet format and are properly partitioned and sorted (see the sketch below)
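A minimal sketch of a compaction job along these lines: many small raw files are rewritten as fewer, larger Parquet files, partitioned by a date column and sorted within partitions so that column min/max statistics support predicate pushdown. Paths and column names are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("compact-events").getOrCreate()

// Hypothetical compaction: read the small raw files and rewrite them as large,
// partitioned, sorted Parquet files that the data lake's readers can scan efficiently.
val raw = spark.read.json("s3a://shared-data-lake/raw/events/")

raw
  .repartition(col("event_date"))       // one write task per date partition => large files
  .sortWithinPartitions(col("user_id")) // sorted files => tighter column statistics
  .write
  .mode("overwrite")
  .partitionBy("event_date")            // directory layout used for partition pruning
  .parquet("s3a://shared-data-lake/curated/events/")
```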