Data Pipelines with OpenWhisk
Jowanza Joseph
@jowanza
Agenda
- Problem Space
- Data Pipelines
- Serverless / Event Driven
- Apache OpenWhisk
- Project Architecture
- Kubernetes
- Tips
- Questions
The Problem Space
Ergonomics
The Problem Space
- How do we deploy new pipelines quickly?
- How do we keep the impact on the existing system minimal?
- How do we maintain custom environments and runtimes?
- How do we keep the ergonomics simple?
Data Pipelines
Serverless
Benefits
- Cheap
- Quick to implement and isolate
- Event Driven
- Easy to take advantage of
Costs
- Cloud-specific implementations
- Hard to test / version
- Language-specific
Supported APIs
- JavaScript
- Python
- PHP
- Swift
- Java
- Docker
More Stuff
- Sequences
- Packages
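As a rough illustration of sequences: an OpenWhisk Python action is a function that takes a dictionary and returns a dictionary, and a sequence pipes one action's result into the next. The sketch below imitates that locally; `parse_record`, `enrich_record`, and `run_sequence` are hypothetical names made up for this example, not part of OpenWhisk.

```python
# Minimal local sketch of two actions chained as a sequence.
# All function names here are hypothetical and exist only for illustration.

def parse_record(args):
    """First action: split a raw 'name, value' line into named fields."""
    name, value = args["line"].split(",")
    return {"name": name.strip(), "value": float(value)}


def enrich_record(args):
    """Second action: add a derived field to the parsed record."""
    return {**args, "doubled": args["value"] * 2}


def run_sequence(actions, args):
    """Local stand-in for a sequence: feed each result into the next action."""
    for action in actions:
        args = action(args)
    return args


result = run_sequence([parse_record, enrich_record], {"line": "temp, 21.5"})
print(result)
```

On a real deployment each function would be registered as its own action (with `main` as the entry point) and the chaining would be done by the platform rather than by `run_sequence`.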
Architecture
Nginx
- Load Balancing
- That's pretty much it
Controller
- Akka
- Service Mapping
- Job Queuing
CouchDB
- Which code to run
- History of code that has been run
- Logs
- Authentication details
Consul
- Service Discovery
- That's it
Kafka
- Pub/Sub
- Exactly Once Delivery
- Retention
- Distributed
Invoker
- Docker Container
- Isolation
- Control
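One way to picture how these pieces fit together is a toy, in-memory model: a controller accepts an invocation and queues it, an invoker picks it up and runs the code, and the result is stored as an activation record. This is only a sketch of the flow under simplifying assumptions (a single process, no containers, no authentication); the `ToyWhisk` class and its method names are hypothetical.

```python
# Toy in-memory model of the invocation flow. Not OpenWhisk code:
# the class and method names are hypothetical and chosen for illustration.
from collections import deque
import itertools


class ToyWhisk:
    def __init__(self):
        self.actions = {}       # which code to run
        self.activations = {}   # history of code that has been run
        self.queue = deque()    # job queue between controller and invoker
        self._ids = itertools.count(1)

    def create_action(self, name, fn):
        """Register the code for an action."""
        self.actions[name] = fn

    def controller_invoke(self, name, params):
        """Controller side: validate the request and queue a job."""
        if name not in self.actions:
            raise KeyError(f"unknown action: {name}")
        activation_id = next(self._ids)
        self.queue.append((activation_id, name, params))
        return activation_id

    def invoker_poll(self):
        """Invoker side: take one job, run it, record the activation."""
        activation_id, name, params = self.queue.popleft()
        result = self.actions[name](dict(params))
        self.activations[activation_id] = {"action": name, "result": result}
        return activation_id


whisk = ToyWhisk()
whisk.create_action("hello", lambda p: {"greeting": "Hello " + p["name"]})
aid = whisk.controller_invoke("hello", {"name": "World"})
whisk.invoker_poll()
print(whisk.activations[aid])
```

In the real system the invoker runs each action inside a Docker container, which is where the isolation and control come from; the lambda here simply stands in for that container.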
APIs
- Whisk CLI
- API Gateway
- Supports Versioning
- Supports packaging
Whisk CLI
Whisk API Gateway
https://{APIHOST}/api/v1/namespaces/{namespace}/actions
https://{APIHOST}/api/v1/namespaces/{namespace}/triggers
https://{APIHOST}/api/v1/namespaces/{namespace}/rules
https://{APIHOST}/api/v1/namespaces/{namespace}/packages
https://{APIHOST}/api/v1/namespaces/{namespace}/activations
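The endpoints above can be called from any HTTP client. The sketch below builds (but does not send) a blocking invocation request for an action using only the Python standard library. It assumes HTTP Basic authentication with a `user:password` style key and the `blocking=true` query parameter; the host name, action name, and key are placeholders, and `build_invoke_request` is a hypothetical helper.

```python
# Sketch: build a request that invokes an action through the REST API.
# Host, action name, and auth key below are placeholders, not real values.
import base64
import json
import urllib.request


def build_invoke_request(api_host, namespace, action, auth_key, params):
    """Return a POST request for a blocking invocation of `action`."""
    url = (
        f"https://{api_host}/api/v1/namespaces/{namespace}"
        f"/actions/{action}?blocking=true"
    )
    token = base64.b64encode(auth_key.encode()).decode()
    return urllib.request.Request(
        url,
        data=json.dumps(params).encode(),
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_invoke_request(
    "openwhisk.example.com", "_", "hello", "user:password", {"name": "World"}
)
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req)
```

The same URL pattern applies to triggers, rules, packages, and activations by swapping the last path segment.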
Custom Whisk Actions
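A custom (Docker) action is, as far as I understand the container contract, a small HTTP server listening on port 8080 that answers `POST /init` and `POST /run`, with the invocation parameters arriving under a `value` key. The sketch below shows that shape with the standard library only; `handle_run` is a hypothetical pipeline step, and the details should be checked against the OpenWhisk documentation before relying on them.

```python
# Sketch of a custom action server. Assumes the /init and /run contract
# on port 8080; handle_run is a hypothetical step made up for this example.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def handle_run(params):
    """Hypothetical pipeline step: count the records passed in."""
    return {"count": len(params.get("records", []))}


class ActionHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length) or b"{}")
        if self.path == "/init":
            payload = {"ok": True}           # nothing to initialise here
        elif self.path == "/run":
            payload = handle_run(body.get("value", {}))
        else:
            self.send_error(404)
            return
        data = json.dumps(payload).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(port=8080):
    """Start the server; call this from the container's entry point."""
    HTTPServer(("0.0.0.0", port), ActionHandler).serve_forever()
```

Keeping the business logic in a plain function such as `handle_run` means it can be unit tested without starting the server or building the image.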
How It Works
- Train Models In Spark
- Bundle
- Run via MLeap Runtime
- Profit
How It Works
Benefits
- Type Safety
- Simplified Execution Context
- Similar APIs to Spark
Project Architecture
Really Nice Things
- Logging
- Performance Monitoring
- Shared Actions
- Scaling
Trade Offs
- Excellent Isolation
- Scalability
- Customizability
- OpenWhisk is a little hard
- Some of the semantics are hard to grasp
- MLeap requires extra effort
A Word On Images
Base OpenWhisk Image
A Word On Deployment