Optimizing Docker builds for Python applications

Dmitry Figol

Systems Engineer, Cisco Systems

@dmfigol


Docker terminology

Container

  • A lightweight way to package an application with its dependencies
  • Different containers have separate user-space but share the kernel of the host

Docker image

  • Template to create Docker containers
  • Created from Dockerfile
  • Consists of read-only layers
  • Can be uploaded to registry and shared with others

Dockerfile

  • A set of instructions to build an image
  • Starts with a base image
  • Most instructions create a new layer (RUN, COPY and ADD add filesystem layers; others such as CMD or ENV only add metadata), and each result is cached for future builds

FROM debian:stretch

COPY test.txt test.txt

RUN touch file.txt

CMD ["date"]

Docker container

  • Created from a Docker image, with a writable layer added on top
  • Resources are allocated
  • Entrypoint/CMD is executed when the container starts

Registry

A place to store and share tagged images

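Workflow diagram: a Dockerfile is built and tagged into an image; images are pushed to and pulled from a registry; running an image creates a container, which is given resources (storage, networking, etc.) and executes CMD/Entrypoint.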

Focus of this session

Python + Docker

Simplest Dockerfile for Python app

FROM python:3.7

WORKDIR /app

COPY . .

RUN pip install -r requirements.txt

CMD ["python", "main.py"]

Image size: 976 MB

-> % tree .
.
├── Dockerfile
├── main.py
├── my_project
│   ├── __init__.py
│   └── greet.py
├── poetry.lock
├── pyproject.toml
└── requirements.txt
-> % cat requirements.txt
requests
cryptography

Optimization objectives

  • Reducing image size
  • Reducing initial and subsequent build time

Priorities

  • Fast builds during development
  • Small image size for production releases

Selecting base image

Image                              Size     Notes
python:3.7 / python:3.7-stretch    929 MB   Uses glibc and supports manylinux wheels
python:3.7-slim-stretch            147 MB
python:3.7-alpine                  87 MB    Uses musl and does not support manylinux wheels;
                                            Python extensions must be compiled;
                                            dependencies take less space
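
As a rough illustration of the glibc/musl distinction in the table, the following sketch asks the running interpreter which C library it is linked against. It only uses the standard library function platform.libc_ver(); the helper name describe_libc is hypothetical, and the assumption that an empty result means "musl" is only a heuristic (other non-glibc platforms also report an empty name).

```python
import platform


def describe_libc():
    """Return a short note on the C library this interpreter appears to use."""
    name, version = platform.libc_ver()
    if name == "glibc":
        # glibc-based images (stretch, slim-stretch) can use manylinux wheels
        return f"glibc {version}: prebuilt manylinux wheels can be installed"
    # Assumption: an empty or unknown name is treated as "not glibc",
    # which is what one would expect to see on a musl-based Alpine image
    return "not glibc (for example musl): extensions are compiled from source"


if __name__ == "__main__":
    print(describe_libc())
```

Running this inside each candidate base image is one quick way to see which row of the table applies.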

Use slim-stretch as base when you care about build time

Use alpine as base when you care about image size

slim-stretch

FROM python:3.7-slim-stretch

WORKDIR /app

COPY . .

RUN pip install -r requirements.txt

CMD ["python", "main.py"]

976 → 193 MB

alpine

FROM python:3.7-alpine

WORKDIR /app

COPY . .

RUN apk add --no-cache \
    build-base \
    gcc \
    libffi-dev \
    openssl-dev
RUN pip install -r requirements.txt

CMD ["python", "main.py"]

976 → 317 MB

???

Problem

Build dependencies are needed only for compilation, not at runtime, yet they remain in the image and inflate its size

Include only necessary files

Copying the source code

  • Use specific COPY statements instead of a broad "COPY . ."
  • Use .dockerignore to exclude some files when doing COPY

.dockerignore example

**/*.pyc
**/*.pyo
**/*.log
**/__pycache__
docs/_build
**/.ipynb_checkpoints
.venv/
.mypy_cache/
.pytest_cache/
.tox/
**/*.egg-info
pip-wheel-metadata/

slim-stretch

FROM python:3.7-slim-stretch

WORKDIR /app

COPY my_project my_project
COPY main.py .
COPY requirements.txt .

RUN pip install -r requirements.txt

CMD ["python", "main.py"]

193 → 170 MB

alpine

FROM python:3.7-alpine

WORKDIR /app

COPY my_project my_project
COPY main.py .
COPY requirements.txt .

RUN apk add --no-cache \
    build-base \
    gcc \
    libffi-dev \
    openssl-dev
RUN pip install -r requirements.txt

CMD ["python", "main.py"]

317 → 294 MB

Remove unnecessary files

alpine

FROM python:3.7-alpine

WORKDIR /app

COPY my_project my_project
COPY main.py .
COPY requirements.txt .

RUN apk add --no-cache \
    build-base \
    gcc \
    libffi-dev \
    openssl-dev
RUN pip install -r requirements.txt
RUN apk del build-base \
    gcc \
    libffi-dev \
    openssl-dev

CMD ["python", "main.py"]

294 → 294 MB

???

Docker Layers

  • Instructions create read-only layers
  • Adding a layer can never make the image smaller: files deleted in a later layer still take up space in the layer that added them
  • Layers are cached and can be re-used for subsequent builds
  • Layers introduce some overhead
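
The layer behaviour described above can be modelled in a few lines. This is a toy model under stated assumptions, not how Docker stores layers: each layer is a dict of added files plus a set of deletions, sizes are made-up numbers in MB, and the names make_layer, image_size and visible_files are hypothetical.

```python
def make_layer(*steps):
    """Build one layer from ("add", path, size_mb) and ("del", path) steps.

    Only the state at the end of the instruction is stored in the layer,
    so a file added and deleted within the same layer costs nothing.
    """
    added, deleted = {}, set()
    for step in steps:
        if step[0] == "add":
            _, path, size = step
            added[path] = size
            deleted.discard(path)
        else:
            _, path = step
            added.pop(path, None)   # drop it if it was added in this same layer
            deleted.add(path)       # and hide any copy from a lower layer
    return {"added": added, "deleted": deleted}


def image_size(layers):
    """Total size: every layer's added bytes count, deletions never subtract."""
    return sum(sum(layer["added"].values()) for layer in layers)


def visible_files(layers):
    """Files seen inside the container after all layers are stacked."""
    files = {}
    for layer in layers:
        files.update(layer["added"])
        for path in layer["deleted"]:
            files.pop(path, None)
    return files


# Three separate RUN instructions: install, build, delete
separate = [
    make_layer(("add", "build-deps", 200)),
    make_layer(("add", "site-packages", 20)),
    make_layer(("del", "build-deps")),
]
# The same steps combined into a single RUN instruction
combined = [
    make_layer(
        ("add", "build-deps", 200),
        ("add", "site-packages", 20),
        ("del", "build-deps"),
    ),
]

if __name__ == "__main__":
    print(image_size(separate), visible_files(separate))
    print(image_size(combined), visible_files(combined))
```

Both variants expose the same files to the container, but only the combined one avoids paying for the deleted build dependencies.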

Tips

  • Combine multiple RUN statements into a single one
  • If you need to delete files, make sure to delete them in the same layer (instruction) where they were added
  • To benefit from caching, arrange statements in the order from the least changing to the most changing (usually, system-level dependencies and tools, Python dependencies, source code)
  • Don't save dependencies to cache (pip --no-cache-dir option, apk --no-cache option)
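
The ordering tip follows from how the build cache is invalidated: once one instruction misses the cache, every instruction after it is rebuilt as well. The sketch below models just that rule; the function name layers_to_rebuild and the way inputs are represented as sets of file names are assumptions made for illustration.

```python
def layers_to_rebuild(instructions, changed_inputs):
    """Return the instructions that have to be re-executed.

    instructions: list of (instruction_text, set_of_input_files) in
    Dockerfile order. The first instruction whose inputs changed misses
    the cache, and so does everything after it.
    """
    rebuild = []
    cache_valid = True
    for text, inputs in instructions:
        if cache_valid and inputs & changed_inputs:
            cache_valid = False
        if not cache_valid:
            rebuild.append(text)
    return rebuild


# Source code copied before dependencies are installed
broad_copy = [
    ("COPY . .", {"requirements.txt", "main.py", "my_project"}),
    ("RUN pip install -r requirements.txt", set()),
]
# Least-changing inputs first, source code last
ordered = [
    ("COPY requirements.txt .", {"requirements.txt"}),
    ("RUN pip install -r requirements.txt", set()),
    ("COPY my_project my_project", {"my_project"}),
    ("COPY main.py .", {"main.py"}),
]

if __name__ == "__main__":
    print(layers_to_rebuild(broad_copy, {"main.py"}))
    print(layers_to_rebuild(ordered, {"main.py"}))
```

With the broad COPY, editing main.py forces pip install to run again; with the ordered variant only the last COPY is rebuilt.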

slim-stretch

FROM python:3.7-slim-stretch

WORKDIR /app

COPY requirements.txt .

RUN pip install --no-cache-dir -r requirements.txt

COPY my_project my_project
COPY main.py .

CMD ["python", "main.py"]

170 → 166 MB

alpine

FROM python:3.7-alpine

WORKDIR /app

ARG BUILD_DEPS="build-base gcc libffi-dev openssl-dev"
ARG RUNTIME_DEPS="libcrypto1.1 libssl1.1"

COPY requirements.txt .

RUN apk add --no-cache --virtual .build-deps ${BUILD_DEPS} \
 && pip install --no-cache-dir -r requirements.txt \
 && apk del .build-deps \
 && apk add --no-cache ${RUNTIME_DEPS}

COPY my_project my_project
COPY main.py .

CMD ["python", "main.py"]

294 → 106 MB

(Optional) Delete *.pyc files / tests from dependencies

FROM python:3.7-alpine

WORKDIR /app

ARG BUILD_DEPS="build-base gcc libffi-dev openssl-dev"
ARG RUNTIME_DEPS="libcrypto1.1 libssl1.1"

COPY requirements.txt .

RUN apk add --no-cache --virtual .build-deps ${BUILD_DEPS} \
 && pip install --no-cache-dir -r requirements.txt \
 && apk del .build-deps \
 && apk add --no-cache ${RUNTIME_DEPS} \
 && find /usr/local -depth \
        \( \
            \( -type d -a \( -name test -o -name tests \) \) \
            -o \( -type f -a \( -name '*.pyc' -o -name '*.pyo' \) \) \
        \) -exec rm -rf '{}' \+

COPY my_project my_project
COPY main.py .

CMD ["python", "main.py"]

106 → 97 MB
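
The intent of the cleanup performed by the find command above (remove test/tests directories and *.pyc / *.pyo files) can also be written out in Python, where the grouping of the two conditions is explicit. This is a sketch for illustration only, not a replacement for the Dockerfile step; prune_site_packages is a hypothetical helper, and it does not special-case symlinked directories.

```python
import os
import shutil


def prune_site_packages(root):
    """Delete test/tests directories and *.pyc / *.pyo files below root."""
    removed = []
    # topdown=False visits children before their parents, like find -depth
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames:
            if name.endswith((".pyc", ".pyo")):
                path = os.path.join(dirpath, name)
                os.remove(path)
                removed.append(path)
        for name in dirnames:
            if name in ("test", "tests"):
                path = os.path.join(dirpath, name)
                shutil.rmtree(path)
                removed.append(path)
    return removed
```

A call such as prune_site_packages("/usr/local") would mirror the Dockerfile step; the returned list makes it easy to review what was deleted.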

Disadvantages

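  • Complex Dockerfile
  • Little benefit from layer caching: any change to requirements.txt reruns the whole combined RUN, including the installation of build dependencies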

Docker multi-stage

  • Build an intermediate image with all build dependencies and install your application
  • Copy the result (e.g. binary) to a fresh image and label it as a final image

Why?

  • Resulting image is smaller (no build dependencies)
  • Could be faster if the layers with build dependencies are cached

Are multi-stage builds relevant to Python apps?

Somewhat

Python is an interpreted language, so there is no single compiled binary to copy between stages

Idea: use virtual environments to simplify copy between stages

# Stage 1 - Install build dependencies
FROM python:3.7-alpine AS builder

WORKDIR /app

ARG BUILD_DEPS="build-base gcc libffi-dev openssl-dev"

RUN apk add --no-cache ${BUILD_DEPS} \
 && python -m venv .venv \
 && .venv/bin/pip install --no-cache-dir -U pip setuptools

COPY requirements.txt .

RUN .venv/bin/pip install --no-cache-dir -r requirements.txt \
 && find /app/.venv -depth \
        \( \
            \( -type d -a \( -name test -o -name tests \) \) \
            -o \( -type f -a \( -name '*.pyc' -o -name '*.pyo' \) \) \
        \) -exec rm -rf '{}' \+

# Stage 2 - Copy only necessary files to the runner stage
FROM python:3.7-alpine

WORKDIR /app

ARG RUNTIME_DEPS="libcrypto1.1 libssl1.1"
RUN apk add --no-cache ${RUNTIME_DEPS}

COPY --from=builder /app /app
COPY my_project my_project
COPY main.py .

ENV PATH="/app/.venv/bin:$PATH"

CMD ["python", "main.py"]

Python + Docker multi-stage

97 → 101 MB
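
The reason a virtual environment simplifies the copy between stages is that everything pip installs ends up under one directory. The sketch below creates a throwaway environment with the standard venv module and checks that its interpreter reports that directory as sys.prefix. The helper name venv_prefix is hypothetical, and the environment is created without pip to keep the example fast.

```python
import os
import subprocess
import tempfile
import venv


def venv_prefix(env_dir):
    """Create a virtual environment and return the sys.prefix it reports."""
    venv.create(env_dir, with_pip=False, symlinks=(os.name != "nt"))
    bin_dir = "Scripts" if os.name == "nt" else "bin"
    python = os.path.join(env_dir, bin_dir, "python")
    result = subprocess.run(
        [python, "-c", "import sys; print(sys.prefix)"],
        check=True,
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        env_dir = os.path.join(tmp, ".venv")
        same = os.path.realpath(venv_prefix(env_dir)) == os.path.realpath(env_dir)
        print("environment is self-contained under its own directory:", same)
```

Virtual environments are generally not relocatable, which is why the Dockerfile keeps the environment at the same absolute path (/app/.venv) in both stages.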

Idea: Build a custom image with your common build dependencies and tools and store it in the registry

FROM registry.gitlab.com/dmfigol/base-docker-images/python:3.7-alpine AS builder

WORKDIR /app

COPY requirements.txt .

RUN .venv/bin/pip install --no-cache-dir -r requirements.txt \
 && find /app/.venv -depth \
        \( \
            \( -type d -a \( -name test -o -name tests \) \) \
            -o \( -type f -a \( -name '*.pyc' -o -name '*.pyo' \) \) \
        \) -exec rm -rf '{}' \+

FROM python:3.7-alpine

WORKDIR /app

ARG RUNTIME_DEPS="libcrypto1.1 libssl1.1"
RUN apk add --no-cache ${RUNTIME_DEPS}

COPY --from=builder /app /app
COPY my_project my_project
COPY main.py .

ENV PATH="/app/.venv/bin:$PATH"

CMD ["python", "main.py"]

Miscellaneous

Bind mount source code instead of COPY in local dev environment

Add the following environment variables:

  • PYTHONUNBUFFERED=1  # print to stdout without buffering
  • PYTHONDONTWRITEBYTECODE=1  # don't generate *.pyc files
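
A quick way to confirm what these two variables do is to start a child interpreter with and without them and inspect its flags. The sketch below assumes only the standard library; child_flags is a hypothetical helper, and any values already present in the parent environment are removed first so the comparison is clean.

```python
import os
import subprocess
import sys

VARIABLES = ("PYTHONUNBUFFERED", "PYTHONDONTWRITEBYTECODE")


def child_flags(extra_env):
    """Run a child interpreter and report [dont_write_bytecode, unbuffered]."""
    env = {k: v for k, v in os.environ.items() if k not in VARIABLES}
    env.update(extra_env)
    code = "import sys; print(sys.dont_write_bytecode, bool(sys.flags.unbuffered))"
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        check=True,
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return result.stdout.split()


if __name__ == "__main__":
    print("default:", child_flags({}))
    print("with variables:", child_flags({name: "1" for name in VARIABLES}))
```

In a container, unbuffered output means log lines appear in docker logs as soon as they are printed, and skipping bytecode avoids writing *.pyc files into the container's writable layer.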

# Stage 1 - Install build dependencies
FROM python:3.7-alpine AS builder

WORKDIR /app
ENV PATH="/root/.poetry/bin:$PATH"
ARG BUILD_DEPS="build-base gcc libffi-dev openssl-dev git curl"

RUN apk add --no-cache ${BUILD_DEPS} \
 && curl -sSL https://raw.githubusercontent.com/sdispater/poetry/master/get-poetry.py | python \
 && python -m venv .venv \
 && poetry config settings.virtualenvs.in-project true \
 && .venv/bin/pip install --no-cache-dir -U pip setuptools

COPY pyproject.toml .
COPY poetry.lock .

RUN poetry install --no-dev --no-interaction \
 && find /app/.venv -depth \
        \( \
            \( -type d -a \( -name test -o -name tests \) \) \
            -o \( -type f -a \( -name '*.pyc' -o -name '*.pyo' \) \) \
        \) -exec rm -rf '{}' \+

COPY my_project my_project

# Install the project as a package
RUN poetry install --no-dev --no-interaction

# Stage 2 - Copy only necessary files to the runner stage
FROM python:3.7-alpine

WORKDIR /app

ARG RUNTIME_DEPS="libcrypto1.1 libssl1.1"
RUN apk add --no-cache ${RUNTIME_DEPS}

COPY --from=builder /app /app
COPY main.py .

ENV PATH="/app/.venv/bin:$PATH" \
    PYTHONUNBUFFERED=1 \
    PYTHONDONTWRITEBYTECODE=1

CMD ["python", "main.py"]

Example of a Dockerfile using Poetry

Summary

  • Select base image carefully:
    • alpine for smaller image size
    • slim-stretch for faster builds
  • Take into account layer caching
    • Combine different statements into one
    • Delete files in the same statement where they were added
    • Order statements from the least to the most changing
  • Docker multi-stage builds can help you avoid complex deletions and benefit from caching
    • Using a Python virtual environment is recommended in this case

Thank you!

@dmfigol
