Demystifying Containers

About Me

saschagrunert

mail@

.de

About the Series

series of blog posts and corresponding talks
all about containers from a historic perspective

Part I: Kernel Space

First talk scoped to Linux kernel related topics

Containers: isolated groups of processes running on a single host, which fulfill a set of “common” features.

chroot

1979

Change the root directory of the current running process

(and its children)

A jail is not a security solution

current working directory is left unchanged when used via syscall
relative paths can still refer to files outside of the new root
changes only the root path and nothing else
chroot does not stack
only CAP_SYS_CHROOT capability needed
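
The points above can be tried out with a minimal Python sketch. It is not taken from the talk: the function name `try_chroot_jail` and the empty temporary directory used as the new root are assumptions for illustration. The chroot happens in a forked child, so only that process gets the new root, and the explicit `chdir` covers the fact that the syscall leaves the working directory unchanged.

```python
import os
import tempfile


def try_chroot_jail():
    """Fork a child, chroot it into an empty directory and report the outcome."""
    new_root = tempfile.mkdtemp(prefix="jail-")  # hypothetical, empty "rootfs"
    pid = os.fork()
    if pid == 0:
        # Child: only this process (and its children) get the new root.
        try:
            os.chroot(new_root)  # requires CAP_SYS_CHROOT
            os.chdir("/")        # the syscall does not change the cwd for us
            code = 0 if os.listdir("/") == [] else 2
        except OSError:
            code = 1             # typically EPERM for unprivileged users
        os._exit(code)
    _, status = os.waitpid(pid, 0)
    os.rmdir(new_root)           # the parent still sees the original root
    if os.WIFEXITED(status):
        return {0: "jailed", 1: "not permitted"}.get(os.WEXITSTATUS(status), "unexpected")
    return "unexpected"


if __name__ == "__main__":
    print(try_chroot_jail())
```

Run as an unprivileged user, this is expected to report that the call is not permitted. Run with CAP_SYS_CHROOT, the child sees an empty root while the parent is unaffected.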

pivot_root

preferred to chroot nowadays

separates old mounts into dedicated directory

rootfs needed for more useful jails

can be extracted from existing container image

Linux Namespaces

2002

wrap certain global system resources in an abstraction layer

Linux version 3.8 in 2013 made namespaces “container ready”

Seven distinct namespaces available: mnt, pid, net, ipc, uts, user and cgroup

API

three main system calls: clone, unshare and setns

proc

populates additional namespace related files

/proc/$PID/ns contains symbolic links to namespaces
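
As a small illustration of these symbolic links, the following Python sketch reads them for a given process. The helper name `list_namespaces` is an assumption and not part of any library. Two processes are in the same namespace when the link targets (for example `mnt:[...]`) are identical.

```python
import os


def list_namespaces(pid="self"):
    """Map each namespace name to its link target in /proc/<pid>/ns."""
    ns_dir = "/proc/{}/ns".format(pid)
    namespaces = {}
    for name in sorted(os.listdir(ns_dir)):
        # Each entry is a symbolic link such as "mnt:[<inode number>]".
        namespaces[name] = os.readlink(os.path.join(ns_dir, name))
    return namespaces


if __name__ == "__main__":
    for name, target in list_namespaces().items():
        print(name, "->", target)
```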

util-linux

https://github.com/karelzak/util-linux

contains dedicated wrapper programs for the mentioned syscalls, like lsns

Mount (mnt)

2002

isolate the set of mount points seen by a group of processes

CLONE_NEWNS

memory resides in Virtual File System (VFS)

when the namespace gets destroyed, the memory is unrecoverably lost

keep a file handle on /proc/$PID/ns/mnt

create flexible container filesystem trees

Great read:

shared subtree documentation of the Linux kernel

https://www.kernel.org/doc/Documentation/filesystems/sharedsubtree.txt

UNIX Time-Sharing System (uts)

2006

unshare the domain- and hostname from the current host system

CLONE_NEWUTS
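
A hedged Python sketch of unsharing the hostname: a forked child creates a new UTS namespace and changes its hostname, while the parent keeps the original one. Two assumptions go beyond the slide. The function name `hostname_in_new_uts_namespace` is made up, and `CLONE_NEWUSER` is added so the sketch can also work without root. Where the kernel or a sandbox refuses the `unshare` call, the function returns `None`.

```python
import ctypes
import os
import socket

CLONE_NEWUTS = 0x04000000   # value from <linux/sched.h>
CLONE_NEWUSER = 0x10000000  # value from <linux/sched.h>


def hostname_in_new_uts_namespace(name="demo-container"):
    """Return the hostname seen inside a fresh UTS namespace, or None on refusal."""
    libc = ctypes.CDLL(None, use_errno=True)
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: unshare, rename, and report the new hostname through the pipe.
        os.close(read_fd)
        try:
            if libc.unshare(CLONE_NEWUSER | CLONE_NEWUTS) == 0:
                socket.sethostname(name)
                os.write(write_fd, socket.gethostname().encode())
        except OSError:
            pass  # not permitted: report nothing
        os._exit(0)
    os.close(write_fd)
    data = os.read(read_fd, 256)
    os.close(read_fd)
    os.waitpid(pid, 0)
    return data.decode() or None


if __name__ == "__main__":
    print("host:", socket.gethostname())
    print("inside new namespace:", hostname_in_new_uts_namespace())
```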

Interprocess Communication (ipc)

2006

isolate interprocess communication resources:

System V IPC objects and POSIX message queues

CLONE_NEWIPC

Process ID (pid)

2008

gives processes an independent set of process identifiers (PIDs)

CLONE_NEWPID

Network (net)

2009

virtualize the network stack

CLONE_NEWNET

each namespace contains own resource properties within /proc/net

contains only a loopback interface on initial creation

interfaces can be moved between namespaces

private set of IP addresses, own routing table, socket listing, connection tracking table, firewall, …
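
To see how little a fresh network namespace contains, the Python sketch below lists the interfaces visible to a forked child after it has unshared its network stack. The name `interfaces_in_new_net_namespace` and the extra `CLONE_NEWUSER` flag (used so that root is not required) are assumptions for illustration. The function returns `None` if the namespace cannot be created.

```python
import ctypes
import os
import socket

CLONE_NEWUSER = 0x10000000  # value from <linux/sched.h>
CLONE_NEWNET = 0x40000000   # value from <linux/sched.h>


def interfaces_in_new_net_namespace():
    """Return interface names seen in a fresh network namespace, or None."""
    libc = ctypes.CDLL(None, use_errno=True)
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: unshare the network stack and report what is left.
        os.close(read_fd)
        try:
            if libc.unshare(CLONE_NEWUSER | CLONE_NEWNET) == 0:
                names = [name for _, name in socket.if_nameindex()]
                os.write(write_fd, ",".join(names).encode())
        except OSError:
            pass  # not permitted: report nothing
        os._exit(0)
    os.close(write_fd)
    data = os.read(read_fd, 4096)
    os.close(read_fd)
    os.waitpid(pid, 0)
    return data.decode().split(",") if data else None


if __name__ == "__main__":
    print(interfaces_in_new_net_namespace())
```

Depending on the loaded kernel modules, the list may contain additional fallback tunnel devices next to the loopback interface.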

User ID (user)

2012

isolation of user and group IDs

since Linux 3.8, user namespaces can be created without being fully privileged

CLONE_NEWUSER

Use case: unprivileged user outside a namespace while being fully privileged inside

/proc/$PID/{u,g}id_map expose mappings for user and group IDs
can be written once to define the mappings

Example mapping

inside  outside  length
0       1000     1
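
The example mapping (inside 0, outside 1000, length 1) can be read as: ID 0 inside the namespace corresponds to ID 1000 outside, and the range covers exactly one ID. A minimal Python sketch of that translation follows. The helper names `parse_id_map` and `to_outside` are made up for illustration.

```python
def parse_id_map(text):
    """Parse uid_map/gid_map content into (inside, outside, length) tuples."""
    entries = []
    for line in text.splitlines():
        if line.strip():
            inside, outside, length = (int(field) for field in line.split())
            entries.append((inside, outside, length))
    return entries


def to_outside(entries, inside_id):
    """Translate an ID inside the namespace to the ID outside, or None if unmapped."""
    for inside, outside, length in entries:
        if inside <= inside_id < inside + length:
            return outside + (inside_id - inside)
    return None


if __name__ == "__main__":
    mapping = parse_id_map("0 1000 1\n")
    print(to_outside(mapping, 0))  # 1000: root inside is user 1000 outside
    print(to_outside(mapping, 1))  # None: ID 1 is not covered by the mapping
```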

Control Group (cgroup)

2016

resource limiting, prioritization, accounting and controlling

initial implementation of cgroups in 2008

major redesign started in 2013

CLONE_NEWCGROUP

Composing Namespaces

makes “containers” possible
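
Composing namespaces boils down to OR-ing the CLONE_NEW* flags into one value for a single clone or unshare call. The Python sketch below shows this under stated assumptions. The flag values are copied from `<linux/sched.h>`, and the helper names `clone_flags` and `unshare` are hypothetical wrappers around the libc function, not an existing Python API.

```python
import ctypes
import os

# Flag values as defined in <linux/sched.h>
NAMESPACE_FLAGS = {
    "mnt": 0x00020000,     # CLONE_NEWNS
    "cgroup": 0x02000000,  # CLONE_NEWCGROUP
    "uts": 0x04000000,     # CLONE_NEWUTS
    "ipc": 0x08000000,     # CLONE_NEWIPC
    "user": 0x10000000,    # CLONE_NEWUSER
    "pid": 0x20000000,     # CLONE_NEWPID
    "net": 0x40000000,     # CLONE_NEWNET
}


def clone_flags(names):
    """Combine the flags of all requested namespaces into one bit mask."""
    flags = 0
    for name in names:
        flags |= NAMESPACE_FLAGS[name]
    return flags


def unshare(names):
    """Move the calling process into new namespaces of the given kinds."""
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.unshare(clone_flags(names)) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


if __name__ == "__main__":
    print(hex(clone_flags(["mnt", "uts", "ipc", "net"])))
```

A container runtime would typically request most of these namespaces at once. Calling `unshare` here affects the current process, so it is best tried in a forked child.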

Running a Container

use runc

https://github.com/opencontainers/runc

to run a container from the extracted rootfs

Conclusion

Linux has great isolation techniques built in and a container runtime uses all these isolation features

NAMESPACES(7)

http://man7.org/linux/man-pages/man7/namespaces.7.html

Future topics: runtimes, security, images and orchestration

That’s it.

https://github.com/saschagrunert/demystifying-containers

Demystifying Containers - Part I: Kernel Space

By Sascha Grunert

A series of blog posts and talks about the world of containers
