Demystifying Containers

About Me

saschagrunert

mail@

.de

About the Series

  • series of blog posts and corresponding talks
  • all about containers from a historic perspective

Part I: Kernel Space

The first talk is scoped to Linux kernel-related topics

Isolated groups of processes running on a single host, which fulfill a set of “common” features.

chroot

1979

Change the root directory of the currently running process (and its children)

A jail is not a security solution

  • current working directory is left unchanged when used via syscall
  • relative paths can still refer to files outside of the new root
  • changes only the root path and nothing else
  • chroot does not stack
  • only CAP_SYS_CHROOT capability needed
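Because the syscall leaves the working directory untouched, a caller has to change into the new root itself. A minimal Python sketch of that pattern, assuming a Linux host; the helper name `run_in_chroot` is hypothetical, and the chroot only succeeds when the process holds CAP_SYS_CHROOT:

```python
import os


def run_in_chroot(new_root, func):
    """Fork a child, chroot it into new_root, chdir to "/", then run func.

    Returns 0 on success, 2 if the kernel refused the chroot or chdir,
    and 1 if func raised some other error.
    """
    pid = os.fork()
    if pid == 0:
        code = 1
        try:
            os.chroot(new_root)  # changes only the root path
            os.chdir("/")        # without this, the old cwd stays reachable
            func()
            code = 0
        except OSError:
            code = 2             # e.g. EPERM without CAP_SYS_CHROOT
        finally:
            os._exit(code)       # never return into the parent's code path
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)
```

Running the jail in a forked child keeps the parent's own root directory unchanged, so the sketch can be tried without affecting the calling process.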

pivot_root

preferred to chroot nowadays

separates old mounts into dedicated directory

rootfs needed for more useful jails

can be extracted from an existing container image

Linux Namespaces

2002

wrap certain global system resources in an abstraction layer

Linux version 3.8 in 2013 made namespaces “container ready”

Seven distinct namespaces available: mnt, pid, net, ipc, uts, user and cgroup

API

three main system calls: clone(2), unshare(2) and setns(2)

proc

populates additional namespace related files

/proc/$PID/ns contains symbolic links to namespaces
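The symbolic links can be inspected without any privileges. A small Python sketch, assuming a mounted procfs; the function name `list_namespaces` is made up for illustration:

```python
import os


def list_namespaces(pid="self"):
    """Return {entry name: link target} for the links in /proc/<pid>/ns."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    try:
        entries = sorted(os.listdir(ns_dir))
    except OSError:
        return result  # no procfs, or the process is not accessible
    for entry in entries:
        try:
            result[entry] = os.readlink(os.path.join(ns_dir, entry))
        except OSError:
            continue  # link vanished or permission denied
    return result
```

Two processes that show the same link target for an entry are members of the same namespace of that type.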

util-linux (https://github.com/karelzak/util-linux) contains dedicated wrapper programs for the mentioned syscalls, like unshare and nsenter, as well as lsns for listing namespaces

Mount (mnt)

2002

isolate the set of mount points seen by a group of processes

CLONE_NEWNS

memory resides in Virtual File System (VFS) 

namespace gets destroyed: memory is unrecoverably lost

keep a file handle on /proc/$PID/ns/mnt to keep the namespace alive

create flexible container filesystem trees

Great read:

shared subtree documentation of the Linux kernel

https://www.kernel.org/doc/Documentation/filesystems/sharedsubtree.txt

UNIX Time-Sharing System (uts)

2006

unshare the domain- and hostname from the current host system

CLONE_NEWUTS
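A rough sketch of unsharing the hostname from Python, assuming Linux with glibc reachable through `ctypes`; the helper name `hostname_in_new_uts` is hypothetical. The call is refused unless the process holds CAP_SYS_ADMIN, in which case the sketch reports failure instead of touching the host:

```python
import ctypes
import os
import socket

CLONE_NEWUTS = 0x04000000  # flag value from <linux/sched.h>


def hostname_in_new_uts(name):
    """Fork a child, move it into a new UTS namespace, set its hostname.

    Returns 0 on success and 2 if the namespace could not be created.
    """
    pid = os.fork()
    if pid == 0:
        code = 2
        try:
            libc = ctypes.CDLL(None, use_errno=True)
            # Only rename after unshare succeeded, so the host is never changed.
            if libc.unshare(CLONE_NEWUTS) == 0:
                socket.sethostname(name)
                code = 0
        except OSError:
            code = 2
        finally:
            os._exit(code)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)
```

The parent process stays in its original UTS namespace, so `socket.gethostname()` there returns the same value before and after the call.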

Interprocess Communication (ipc)

2006

isolate interprocess communication resources:

System V IPC objects and POSIX message queues

CLONE_NEWIPC

Process ID (pid)

2008

gives processes an independent set of process identifiers (PIDs)

CLONE_NEWPID
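One way to see the independent PID sets is the `NSpid` line in `/proc/$PID/status` (available since Linux 4.1), which lists the PID of a process in each nested PID namespace it belongs to, outermost first. A small parser sketch; the function name `parse_nspid` is an illustrative choice:

```python
def parse_nspid(status_text):
    """Extract the PIDs from the NSpid line of /proc/<pid>/status content."""
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            # Fields after the label are PIDs, outermost namespace first.
            return [int(field) for field in line.split()[1:]]
    return []  # older kernels do not expose the field
```

For a process that is PID 1 inside a container, the list would end in 1 while the first entry shows the PID as seen from the namespace that mounted procfs. The text can be obtained with `open("/proc/self/status").read()`.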

Network (net)

2009

virtualize the network stack

CLONE_NEWNET

each namespace contains its own resource properties within /proc/net

contains only a loopback interface on initial creation
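The interfaces visible in the current network namespace can be read from `/proc/net/dev`. A parser sketch, assuming the usual layout of two header lines followed by one `name: counters` line per interface; `interface_names` is a hypothetical helper:

```python
def interface_names(proc_net_dev_text):
    """Return the interface names listed in /proc/net/dev content."""
    names = []
    for line in proc_net_dev_text.splitlines()[2:]:  # skip both header lines
        name, sep, _ = line.partition(":")
        if sep:
            names.append(name.strip())
    return names
```

Fed with the content of `/proc/net/dev` from a freshly created network namespace, this should yield only the loopback interface.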

interfaces can be moved between namespaces

 

private set of IP addresses, own routing table, socket listing, connection tracking table, firewall, …

User ID (user)

2012

isolation of user and group IDs

 

since Linux 3.8, user namespaces can be created without being fully privileged

CLONE_NEWUSER

Use case: unprivileged user outside a namespace while being fully privileged inside

  • /proc/$PID/{u,g}id_map expose mappings for user and group IDs
  • can be written once to define the mappings

inside    outside    length
0         1000       1

Example mapping
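Each line of a map file holds three numbers: the first ID inside the namespace, the first ID outside of it, and the length of the mapped range. A sketch of how such a mapping translates IDs; the helper names `parse_id_map` and `to_outside` are made up for illustration:

```python
def parse_id_map(text):
    """Parse uid_map or gid_map content into (inside, outside, length) tuples."""
    return [
        tuple(int(field) for field in line.split())
        for line in text.splitlines()
        if line.strip()
    ]


def to_outside(inside_id, mappings):
    """Translate an ID inside the namespace to the ID seen outside, or None."""
    for inside, outside, length in mappings:
        if inside <= inside_id < inside + length:
            return outside + (inside_id - inside)
    return None  # the ID is not mapped
```

With the example mapping, root (ID 0) inside the namespace corresponds to the unprivileged ID 1000 outside, and every other inside ID is unmapped.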

Control Group (cgroup)

2016

resource limiting, prioritization, accounting and controlling

 

initial implementation in 2008

major redesign started in 2013

CLONE_NEWCGROUP
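The cgroup membership of a process is exposed in `/proc/$PID/cgroup`, one `hierarchy-ID:controller-list:cgroup-path` entry per line, and a cgroup namespace virtualizes the paths shown there. A parser sketch with a hypothetical function name:

```python
def parse_proc_cgroup(text):
    """Parse /proc/<pid>/cgroup content into (id, controllers, path) tuples."""
    entries = []
    for line in text.splitlines():
        if not line:
            continue
        hierarchy_id, controllers, path = line.split(":", 2)
        entries.append(
            (int(hierarchy_id), controllers.split(",") if controllers else [], path)
        )
    return entries
```

On a cgroup v2 system the controller list is empty and the hierarchy ID is 0, so a line looks like `0::/user.slice`.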

Composing Namespaces

makes “containers” possible
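At the syscall level, composing namespaces means OR-ing the individual CLONE_NEW* flags into one mask that is passed to clone(2) or unshare(2). A sketch of that composition; the dictionary and the `compose` helper are illustrative, while the flag values are the ones from `<linux/sched.h>`:

```python
CLONE_FLAGS = {
    "mnt": 0x00020000,     # CLONE_NEWNS
    "cgroup": 0x02000000,  # CLONE_NEWCGROUP
    "uts": 0x04000000,     # CLONE_NEWUTS
    "ipc": 0x08000000,     # CLONE_NEWIPC
    "user": 0x10000000,    # CLONE_NEWUSER
    "pid": 0x20000000,     # CLONE_NEWPID
    "net": 0x40000000,     # CLONE_NEWNET
}


def compose(names):
    """Combine namespace short names into a single clone/unshare flag mask."""
    mask = 0
    for name in names:
        mask |= CLONE_FLAGS[name]
    return mask
```

A container runtime would typically request most or all of these at once, for example `compose(["mnt", "pid", "net", "ipc", "uts"])`.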

Running a Container

use runc (https://github.com/opencontainers/runc) to run a container from the extracted rootfs

Conclusion

Linux has great isolation techniques built in, and a container runtime combines all of these isolation features

 

NAMESPACES(7)

http://man7.org/linux/man-pages/man7/namespaces.7.html

 

Future topics: runtimes, security, images and orchestration

That’s it.

https://github.com/saschagrunert/demystifying-containers
