Kubeflow
Data Science on Steroids
About us
saschagrunert
![](https://s3.amazonaws.com/media-p.slid.es/uploads/1005350/images/5801520/GitHub-Mark-Light-120px-plus.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/1005350/images/5801555/6BEBD8D9-61EA-46D1-94DB-C74230E48925.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/1005350/images/5801560/891AD106-5C76-41AE-BE10-CA666600BF9B.png)
mail@
.de
About us
mbu93
![](https://s3.amazonaws.com/media-p.slid.es/uploads/1005350/images/5801520/GitHub-Mark-Light-120px-plus.png)
The Evolution of Machine Learning
1950
![](https://s3.amazonaws.com/media-p.slid.es/uploads/1005350/images/6496588/snarc.png)
Stochastic Neural Analog Reinforcement Calculator (SNARC) Maze Solver
2000
![](https://s3.amazonaws.com/media-p.slid.es/uploads/1005350/images/6496609/torch.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/1005350/images/6496608/tf.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/1005350/images/6496605/pandas-logo.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/1005350/images/6496606/pytorch-logo.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/1005350/images/6496604/numpy.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/1005350/images/6496607/python-logo.png)
Data Science
Workflow
Data Source Handling
fixed set of technologies
![](https://s3.amazonaws.com/media-p.slid.es/uploads/1005350/images/6496622/jupyter.png)
Exploration
Regression?
Classification?
Supervised?
Unsupervised?
Baseline Modelling
Results
Trained Model
Results to share
Results
Deployment
Automation?
Infrastructure?
Data Exploration
- select the data source
- collect statistical measures
- clean the data
- typically done in notebook systems such as Jupyter
import dill

with open("dataframe.dill", "rb") as fp:
    dataframe = dill.load(fp)

dataframe.sample(frac=1).head(10)
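The statistical measures mentioned above can be collected with pandas directly; a minimal sketch, where an inline frame stands in for the dill-loaded one:

```python
import pandas as pd

# Tiny stand-in for the dill-loaded dataframe from the slide
dataframe = pd.DataFrame({"width": [17, 17, 18], "height": [33, 33, 34]})

# count, mean, std, min, quartiles and max per numeric column
stats = dataframe.describe()
print(stats.loc["mean"])
```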
![](https://s3.amazonaws.com/media-p.slid.es/uploads/726151/images/6495253/pasted-from-clipboard.png)
Data Munging
- creating datasets
- preparing data for cross validation
from fastai.vision import ImageDataBunch

dataset = ImageDataBunch.from_folder("dataset",
                                     train="training",
                                     valid="test")
dataset
ImageDataBunch;
Train: LabelList (3040 items)
x: ImageList
Image (3, 17, 33), Image (3, 17, 33),
Image (3, 17, 33), Image (3, 17, 33),
Image (3, 17, 33)
y: CategoryList
3,3,3,3,3
Path: dataset;
Valid: LabelList (760 items)
x: ImageList
Image (3, 17, 33), Image (3, 17, 33),
Image (3, 17, 33), Image (3, 17, 33),
Image (3, 17, 33)
y: CategoryList
3,3,3,3,3
Path: dataset;
Test: None
Baseline modelling
- "Do we need supervised or unsupervised learning?"
- "Can we solve the problem with classification or regression?"
- mostly experimental work
from fastai.vision import *

res18_learner = cnn_learner(dataset,
                            models.resnet18,
                            pretrained=False,
                            metrics=accuracy)
res18_learner.fit_one_cycle(5)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/726151/images/6495252/pasted-from-clipboard.png)
Evaluation and deployment
- check the model's performance
- select and deploy a model
- create automated training process to update data and model
interp = res18_learner.interpret()
interp.plot_confusion_matrix(figsize=(8, 8))
![](https://s3.amazonaws.com/media-p.slid.es/uploads/726151/images/6495237/pasted-from-clipboard.png)
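The confusion matrix counts how often each true class was predicted as each other class; a minimal pure-Python sketch of that bookkeeping (not fastai's implementation):

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    # rows = actual class, columns = predicted class
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]

y_true = ["cat", "cat", "dog", "dog"]
y_pred = ["cat", "dog", "dog", "dog"]
matrix = confusion_matrix(y_true, y_pred, labels=["cat", "dog"])
# matrix[0][1] counts cats that were predicted as dogs
```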
Yet Another Machine Learning Solution?
YES!
Cloud Native
Commercial off-the-shelf (COTS) applications
An application's awareness that it runs inside something like Kubernetes.
vs
![](https://s3.amazonaws.com/media-p.slid.es/uploads/1005350/images/6496699/kubernetes-logo.png)
“Can we improve
the classic data scientist's workflow
by utilizing Kubernetes?”
![](https://s3.amazonaws.com/media-p.slid.es/uploads/1005350/images/6496717/kubeflow-logo.png)
announced 2017
abstracting machine learning best practices
Deployment and Test Setup
![](https://s3.amazonaws.com/media-p.slid.es/uploads/1005350/images/6496735/suse.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/1005350/images/6229087/crio-logo.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/1005350/images/6496771/logo.svg.png)
SUSE CaaS Platform
github.com/SUSE/skuba
![](https://s3.amazonaws.com/media-p.slid.es/uploads/1005350/images/6496820/openstack.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/1005350/images/6496735/suse.png)
> skuba cluster init --control-plane 172.172.172.7 caasp-cluster
> cd caasp-cluster
> skuba node bootstrap --target 172.172.172.7 caasp-master
> skuba node join --role worker --target 172.172.172.8 caasp-node-1
> skuba node join --role worker --target 172.172.172.24 caasp-node-2
> skuba node join --role worker --target 172.172.172.16 caasp-node-3
> cp admin.conf ~/.kube/config
> kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
caasp-master Ready master 2h v1.15.2 172.172.172.7 <none> SUSE Linux Enterprise Server 15 SP1 4.12.14-197.15-default cri-o://1.15.0
caasp-node-1 Ready <none> 2h v1.15.2 172.172.172.8 <none> SUSE Linux Enterprise Server 15 SP1 4.12.14-197.15-default cri-o://1.15.0
caasp-node-2 Ready <none> 2h v1.15.2 172.172.172.24 <none> SUSE Linux Enterprise Server 15 SP1 4.12.14-197.15-default cri-o://1.15.0
caasp-node-3 Ready <none> 2h v1.15.2 172.172.172.16 <none> SUSE Linux Enterprise Server 15 SP1 4.12.14-197.15-default cri-o://1.15.0
Storage Provisioner
> helm install nfs-client-provisioner \
-n kube-system \
--set nfs.server=caasp-node-1 \
--set nfs.path=/mnt/nfs \
--set storageClass.name=nfs \
--set storageClass.defaultClass=true \
stable/nfs-client-provisioner
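With the provisioner's storage class set as default, workloads can request NFS-backed storage through an ordinary PersistentVolumeClaim; a minimal sketch (claim name and size are placeholders):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-claim
spec:
  storageClassName: nfs
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
```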
![](https://s3.amazonaws.com/media-p.slid.es/uploads/1005350/images/6496856/helm-horizontal-white.png)
> kubectl -n kube-system get pods -l app=nfs-client-provisioner -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nfs-client-provisioner-777997bc46-mls5w 1/1 Running 3 2h 10.244.0.91 caasp-node-1 <none> <none>
Load Balancing
> kubectl apply -f https://raw.githubusercontent.com/google/metallb/v0.8.1/manifests/metallb.yaml
---
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: metallb-system
  name: config
data:
  config: |
    address-pools:
    - name: default
      protocol: layer2
      addresses:
      - 172.172.172.251-172.172.172.251
> kubectl apply -f metallb-config.yml
Deploying Kubeflow
> kfctl init kfapp --config=https://raw.githubusercontent.com/kubeflow/kubeflow/1fa142f8a5c355b5395ec0e91280d32d76eccdce/bootstrap/config/kfctl_existing_arrikto.yaml
> export KUBEFLOW_USER_EMAIL="sgrunert@suse.com"
> export KUBEFLOW_PASSWORD="my-strong-password"
> cd kfapp
> kfctl generate all -V
> kfctl apply all -V
Deployment is driven by the command line tool kfctl
![](https://s3.amazonaws.com/media-p.slid.es/uploads/1005350/images/6496901/kfctl_existing_arrikto-architecture.png)
Improving the Data Scientist's Workflow
Machine Learning Pipelines
- Kubeflow provides reusable end-to-end machine learning workflows via pipelines
- pipeline components are built using Kubeflow's Python SDK
- every pipeline step is executed directly in Kubernetes within its own pod
Basic component using ContainerOp
- simply executes a given command within a container
- input and output can be passed around
from kfp.dsl import ContainerOp

step1 = ContainerOp(name='step1',
                    image='alpine:latest',
                    command=['sh', '-c'],
                    arguments=['echo "Running step"'])
Execution order
- Execution order can be made dependent
- Arrange components using the op1.after(op2) syntax
step1 = ContainerOp(name='step1',
                    image='alpine:latest',
                    command=['sh', '-c'],
                    arguments=['echo "Running step"'])

step2 = ContainerOp(name='step2',
                    image='alpine:latest',
                    command=['sh', '-c'],
                    arguments=['echo "Running step"'])

step2.after(step1)
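Conceptually, .after() only adds a dependency edge to the workflow graph; a minimal pure-Python sketch of how such edges resolve into an execution order (Argo's actual scheduler is more involved):

```python
def execution_order(deps):
    # deps maps each step to the steps it must run after
    order = []

    def visit(step):
        if step in order:
            return
        for dep in deps.get(step, []):
            visit(dep)          # run dependencies first
        order.append(step)

    for step in deps:
        visit(step)
    return order

# step2.after(step1) corresponds to: step2 depends on step1
order = execution_order({"step1": [], "step2": ["step1"]})
```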
Creating a pipeline
- Pipelines are created using the @pipeline decorator and compiled afterwards via compile()
from kfp.compiler import Compiler
from kfp.dsl import ContainerOp, pipeline

@pipeline(name='My pipeline', description='')
def my_pipeline():
    step1 = ContainerOp(name='step1',
                        image='alpine:latest',
                        command=['sh', '-c'],
                        arguments=['echo "Running step"'])
    step2 = ContainerOp(name='step2',
                        image='alpine:latest',
                        command=['sh', '-c'],
                        arguments=['echo "Running step"'])
    step2.after(step1)

if __name__ == '__main__':
    Compiler().compile(my_pipeline, 'pipeline.tar.gz')
When should I use ContainerOp?
- Useful for deployment tasks
- Good option to query for new data
- Running complex training scenarios (making use of training scripts)
Lightweight Python Components
- Directly run python functions in a container
- Limitations:
- need to be standalone (can't use code, imports or variables defined in another scope)
- Imported packages need to be available in the container image.
- Input parameters need to be of type str (default), int or float
Example: Sum of squares
- Code will be processed as a string and passed to the interpreter as below
from kfp.components import func_to_container_op

def sum_sq(a: int, b: int) -> float:
    import numpy as np
    dist = np.sqrt(a**2 + b**2)
    return dist

add_op = func_to_container_op(sum_sq)(1, 2)
python -u -c "def sum_sq(a: int, b: int) -> float:\n\
    import numpy as np\n\
    dist = np.sqrt(a**2 + b**2)\n\
    return dist\n(...)"
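The string-passing mechanism can be mimicked with exec; a minimal sketch using math instead of numpy to stay self-contained (kfp's real serialization differs in detail):

```python
# The function source travels as a plain string, roughly as a
# `python -u -c "<source>"` payload would
source = (
    "def sum_sq(a: int, b: int) -> float:\n"
    "    import math  # imports must live inside the function body\n"
    "    return math.sqrt(a**2 + b**2)\n"
)

namespace = {}
exec(source, namespace)            # re-create the function from its source
result = namespace["sum_sq"](3, 4)
```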
The need for storage
- Lightweight components only accept str, float or int inputs
- To pass objects around they must be stored/loaded
- For that purpose a volume must be attached
from kubernetes import client as k8s

op = func_to_container_op(fit_squeezenet,
                          base_image="alpine:latest")("out/model", False,
                                                      "squeeze.model")

op.add_volume(
    k8s.V1Volume(name='volume',
                 host_path=k8s.V1HostPathVolumeSource(
                     path='/data/out'))).add_volume_mount(
    k8s.V1VolumeMount(name='volume', mount_path="out"))
Upsides and Downsides of Lightweight Components
- Very useful to run simple functions
- Can reduce the effort to create pipelines
- Not suitable for complex functions
- Practical alternative: Google Cloud Platform's options
Getting the pipeline to run
- Pipelines are compiled using the dsl-compile script
- They are then transformed into an Argo workflow and executed
> sudo pip install https://storage.googleapis.com/ml-pipeline/release/0.1.27/kfp.tar.gz
> dsl-compile --py pipeline.py --output pipeline.tar.gz
Getting the pipeline to run
- Pipelines are compiled using the dsl-compile script
- They are then transformed into an Argo workflow and executed
> kubectl get workflow
NAME AGE
my-pipeline-8lwcc 6m47s
Getting the pipeline to run
- Pipelines are compiled using the dsl-compile script
- They are then transformed into an Argo workflow and executed
> argo get my-pipeline-8lwcc
Name: my-pipeline-8lwcc
Namespace: kubeflow
ServiceAccount: pipeline-runner
Status: Succeeded
Created: Tue Aug 27 13:06:06 +0200 (4 minutes ago)
Started: Tue Aug 27 13:06:06 +0200 (4 minutes ago)
Finished: Tue Aug 27 13:06:22 +0200 (4 minutes ago)
Duration: 16 seconds
STEP PODNAME DURATION MESSAGE
✔ my-pipeline-8lwcc
├-✔ step1 my-pipeline-8lwcc-818794353 6s
└-✔ step2 my-pipeline-8lwcc-768461496 7s
Running pipelines from Notebooks
- Create and run pipelines by importing the kfp package
- Can be useful to save notebook resources
- Enables developers to combine prototyping with creating an automated training workflow
from kfp.compiler import Compiler
pipeline_func = training_pipeline
pipeline_filename = pipeline_func.__name__ + '.pipeline.yaml'
Compiler().compile(pipeline_func, pipeline_filename)
Running pipelines from Notebooks
- Create and run pipelines by importing the kfp package
- Can be useful to save notebook resources
- Enables developers to combine prototyping with creating an automated training workflow
import kfp

client = kfp.Client()

try:
    experiment = client.create_experiment("Prototyping")
except Exception:
    experiment = client.get_experiment(experiment_name="Prototyping")
Running pipelines from Notebooks
- Create and run pipelines by importing the kfp package
- Can be useful to save notebook resources
- Enables developers to combine prototyping with creating an automated training workflow
arguments = {'pretrained': 'False'}
run_name = pipeline_func.__name__ + ' test_run'
run_result = client.run_pipeline(experiment.id, run_name, pipeline_filename, arguments)
Continuous Integration Lookout
Kubeflow provides a REST API
That’s it.
https://github.com/saschagrunert/
kubeflow-data-science-on-steroids
Kubeflow - Data Science on Steroids
By Sascha Grunert