Data Pipelines with OpenWhisk
Jowanza Joseph
@jowanza

Agenda
- Problem Space
- Data Pipelines
- Serverless / Event Driven
- Apache OpenWhisk
- Project Architecture
- Kubernetes
- Tips
- Questions
The Problem Space

Ergonomics

The Problem Space
- How do we deploy new pipelines quickly?
- How do we keep the impact on the existing system minimal?
- How do we maintain custom environments and runtimes?
- How do we keep the ergonomics simple?
Data Pipelines

Serverless

Benefits
- Cheap
- Quick to implement and isolate
- Event Driven
- Easy to take advantage of
Costs
- Cloud specific implementations
- Hard to test / version
- Language specific

Supported Runtimes
- JavaScript
- Python
- PHP
- Swift
- Java
- Docker
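
For the Python runtime, an action is conventionally a module that exposes a `main` function which takes a dictionary of parameters and returns a dictionary. A minimal sketch, where the `name` parameter and the greeting text are made up for illustration:

```python
def main(args):
    """Entry point that receives the invocation parameters as a dict."""
    # "name" is a hypothetical parameter chosen for this example.
    name = args.get("name", "stranger")
    # The action result must be a JSON-serializable dict.
    return {"greeting": f"Hello, {name}!"}
```

Because the action is a plain function, it can be exercised locally with `main({"name": "Whisk"})` before it is deployed.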

More Stuff
- Sequences
- Packages
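
A sequence chains actions so that the result of one becomes the input of the next. The sketch below imitates that behavior locally. The `parse` and `count` actions and the `run_sequence` helper are hypothetical names for this example only, not OpenWhisk APIs; on the platform the chaining is done for you once the sequence is defined.

```python
def parse(args):
    """First action: split a raw CSV line into fields."""
    return {"fields": args["line"].split(",")}


def count(args):
    """Second action: count the fields produced by the previous action."""
    return {"field_count": len(args["fields"])}


def run_sequence(actions, params):
    """Local stand-in for a sequence: feed each result to the next action."""
    result = params
    for action in actions:
        result = action(result)
    return result


result = run_sequence([parse, count], {"line": "a,b,c"})
```

Keeping each step a small dict-in, dict-out function is what makes it easy to reorder or reuse steps across pipelines.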
Architecture


Nginx
- Load Balancing
- That's pretty much it
Controller

- Akka
- Service Mapping
- Job Queuing

CouchDB
- Which code to run
- History of code that has been run
- Logs
- Authentication details

Consul
- Service Discovery
- That's it

Kafka
- Pub/Sub
- Exactly Once Delivery
- Retention
- Distributed
Invoker

- Docker Container
- Isolation
- Control
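
The request path described on the preceding slides (the controller queues a job, a pub/sub bus carries it, an invoker runs the code in isolation) can be pictured with a toy in-process model. Everything here is a stand-in chosen for illustration: a dict plays the store that knows which code to run, `queue.Queue` plays the message bus, and a plain function call replaces the Docker container. None of the names are OpenWhisk APIs.

```python
import queue

# Stand-in for the store that knows which code to run.
ACTION_STORE = {
    "hello": lambda params: {"greeting": f"Hello, {params['name']}!"},
}

# Stand-in for the pub/sub bus between controller and invoker.
bus = queue.Queue()

# Stand-in for the history of code that has been run.
activations = []


def controller_invoke(action_name, params):
    """Check that the action exists, then queue an activation message."""
    if action_name not in ACTION_STORE:
        raise KeyError(f"unknown action: {action_name}")
    activation_id = len(activations) + bus.qsize() + 1
    bus.put({"id": activation_id, "action": action_name, "params": params})
    return activation_id


def invoker_poll():
    """Take one message off the bus, run the action, record the result."""
    message = bus.get_nowait()
    code = ACTION_STORE[message["action"]]
    record = {"id": message["id"], "result": code(message["params"])}
    activations.append(record)
    return record


activation_id = controller_invoke("hello", {"name": "Whisk"})
record = invoker_poll()
```

The point of the model is the decoupling: the controller returns an activation id as soon as the message is queued, and the invoker picks the work up independently.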
APIs
- Whisk CLI
- API Gateway
- Supports Versioning
- Supports Packaging
Whisk CLI

Whisk API Gateway
https://{APIHOST}/api/v1/namespaces/{namespace}/actions
https://{APIHOST}/api/v1/namespaces/{namespace}/triggers
https://{APIHOST}/api/v1/namespaces/{namespace}/rules
https://{APIHOST}/api/v1/namespaces/{namespace}/packages
https://{APIHOST}/api/v1/namespaces/{namespace}/activations
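
As a rough sketch of how these endpoints are used, the snippet below builds (but does not send) a blocking invocation request for a single action with Python's standard library. The host `openwhisk.example.com`, the namespace `guest`, the action `hello`, and the `user:secret` key are placeholders, and `build_invoke_request` is a hypothetical helper, not part of any OpenWhisk client.

```python
import base64
import json
import urllib.request


def build_invoke_request(api_host, namespace, action, auth_key, params):
    """Build (but do not send) a blocking POST that invokes one action."""
    url = (
        f"https://{api_host}/api/v1/namespaces/{namespace}"
        f"/actions/{action}?blocking=true"
    )
    # The "user:password" auth key is sent as HTTP Basic credentials.
    token = base64.b64encode(auth_key.encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=json.dumps(params).encode("utf-8"),
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_invoke_request(
    "openwhisk.example.com", "guest", "hello", "user:secret", {"name": "Whisk"}
)
```

Passing the request to `urllib.request.urlopen` would perform the call; it is left out here so the sketch runs without a live deployment.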
Custom Whisk Actions


How It Works
- Train Models In Spark
- Bundle
- Run via MLeap Runtime
- Profit
How It Works

Benefits
- Type Safety
- Simplified Execution Context
- Similar APIs to Spark
Project Architecture

Really Nice Things
- Logging
- Performance Monitoring
- Shared Actions
- Scaling
Trade Offs
- Excellent Isolation
- Scalability
- Customizability
- OpenWhisk is a little hard
- Some of the semantics are hard to grasp
- MLeap requires extra effort
A Word On Images


Base OpenWhisk Image

A Word On Deployment