Building container images for data analysis

Johannes Köster

HPCW 2020

[Figure: diagram with nodes labeled "dataset" and "results"]

Define software stacks.

Build container images.

Use images for execution.

Issue:

Overhead, explosion of image variants.

Workarounds:

  • not using containers (🗲 reproducibility)
  • no fine-grained containers (🗲 transparency)

Conda package manager

  • language agnostic
  • thousands of available packages from all fields
  • de facto standard in data science

channels:
  - conda-forge
dependencies:
  - matplotlib =3.1.2
  - seaborn =0.10.1
  - scikit-learn =0.23.1
  - python =3.8.1
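
To make the link between an environment definition and a container image concrete, here is a minimal sketch that renders Dockerfile text for one Conda environment. It is an illustration under assumptions, not the tooling presented in this talk: the function name `render_dockerfile`, the file name `environment.yaml`, the prefix `/conda-envs/analysis`, and the base image `continuumio/miniconda3` are all hypothetical choices.

```python
# Environment definition from the slide above, embedded as a string so the
# sketch needs nothing beyond the standard library.
ENV_YAML = """\
channels:
  - conda-forge
dependencies:
  - matplotlib =3.1.2
  - seaborn =0.10.1
  - scikit-learn =0.23.1
  - python =3.8.1
"""


def render_dockerfile(env_file="environment.yaml",
                      env_prefix="/conda-envs/analysis",
                      base_image="continuumio/miniconda3"):
    """Return Dockerfile text that builds one Conda environment into an image.

    All three defaults are hypothetical values chosen for illustration.
    """
    lines = [
        f"FROM {base_image}",
        f"COPY {env_file} /tmp/{env_file}",
        # Create the environment at a fixed prefix, then clear package
        # caches so the resulting image stays smaller.
        f"RUN conda env create --prefix {env_prefix} --file /tmp/{env_file} && \\",
        "    conda clean --all --yes",
        # Put the environment's executables first on PATH.
        f"ENV PATH={env_prefix}/bin:$PATH",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_dockerfile())
```

Saving `ENV_YAML` as `environment.yaml` next to the generated Dockerfile and running `docker build` would then yield one image per environment definition, which is exactly where the overhead and the number of image variants start to grow.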

Using the Conda package manager for building blocks

Conda environment definitions:

build