Demystifying Containers
About Me
saschagrunert
mail@
.de
About the Series
- series of blog posts and corresponding talks
- all about containers from a historic perspective
Part I: Kernel Space
First talk scoped to Linux kernel related topics
Isolated groups of processes running on a single host, which fulfill a set of “common” features.
chroot
1979
Change the root directory of the currently running process (and its children)
A jail is not a security solution
- current working directory is left unchanged when used via syscall
- relative paths can still refer to files outside of the new root
- changes only the root path and nothing else
- chroot does not stack
- only the CAP_SYS_CHROOT capability is needed
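The first pitfall in the list above (the working directory is left unchanged) is why chroot is normally followed by a chdir. A minimal Python sketch, assuming a process that holds CAP_SYS_CHROOT; the helper name enter_jail is made up for illustration:

```python
import os


def enter_jail(new_root: str) -> None:
    """Change the root directory, then move the working directory into it."""
    # Needs CAP_SYS_CHROOT; raises PermissionError without it.
    os.chroot(new_root)
    # chroot(2) does not touch the current working directory, so without
    # this chdir relative paths could still resolve outside the new root.
    os.chdir("/")
```

Even with the chdir this only changes the root path and nothing else, so it does not turn the jail into a security boundary.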
pivot_root
preferred over chroot nowadays
separates old mounts into dedicated directory
rootfs needed for more useful jails
can be extracted from existing container image
Linux Namespaces
2002
wrap certain global system resources in an abstraction layer
Linux version 3.8 in 2013 made namespaces “container ready”
Seven distinct namespaces available: mnt, pid, net, ipc, uts, user and cgroup
API
three main system calls: clone(2), unshare(2) and setns(2)
proc
/proc/$PID/ns
contains symbolic links to namespaces
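A small sketch of how these symbolic links can be inspected from Python. The function name list_namespaces is an illustrative choice, and the code assumes a Linux system with /proc mounted (it returns an empty mapping otherwise):

```python
import os


def list_namespaces(pid="self"):
    """Return {link name: link target} for the entries in /proc/<pid>/ns."""
    ns_dir = f"/proc/{pid}/ns"
    if not os.path.isdir(ns_dir):  # not Linux, no /proc, or no such process
        return {}
    result = {}
    for name in sorted(os.listdir(ns_dir)):
        try:
            # Targets look like "mnt:[<inode number>]".
            result[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            continue  # entry vanished or is not readable
    return result


if __name__ == "__main__":
    for name, target in list_namespaces().items():
        print(f"{name:20} -> {target}")
```

Two processes share a namespace when the link targets for that namespace type are identical.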
util-linux (https://github.com/karelzak/util-linux)
contains dedicated wrapper programs for the mentioned syscalls (unshare, nsenter) and helpers like lsns
Mount (mnt)
2002
isolate the set of mount points seen by a group of processes
CLONE_NEWNS
memory resides in Virtual File System (VFS)
when the namespace gets destroyed, that memory is unrecoverably lost
to keep the namespace alive, keep a file handle on /proc/$PID/ns/mnt
create flexible container filesystem trees
Great read: shared subtree documentation of the Linux kernel
https://www.kernel.org/doc/Documentation/filesystems/sharedsubtree.txt
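Keeping a file handle on /proc/$PID/ns/mnt could look like the following sketch. Both helper names are hypothetical; the only assumption about the kernel is the documented behaviour that a namespace stays alive while a file descriptor for it is open:

```python
import os


def hold_namespace(pid, kind="mnt"):
    """Open /proc/<pid>/ns/<kind>; the namespace lives while the fd is open."""
    return os.open(f"/proc/{pid}/ns/{kind}", os.O_RDONLY)


def namespace_inode(fd):
    """Inode number identifying the namespace behind a held descriptor."""
    return os.fstat(fd).st_ino


if __name__ == "__main__":
    fd = hold_namespace("self")
    print("holding mount namespace with inode", namespace_inode(fd))
    os.close(fd)  # dropping the last reference allows the kernel to free it
```

The descriptor can later be handed to setns(2) to re-enter the namespace, which needs additional privileges.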
UNIX Time-Sharing System (uts)
2006
unshare the domain- and hostname from the current host system
CLONE_NEWUTS
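As a rough illustration, the sketch below forks a child, lets it unshare the UTS namespace and change the hostname there. It calls unshare(2) through ctypes; the function name hostname_in_new_uts is made up, and the call usually needs root (CAP_SYS_ADMIN), so the function returns None when the kernel refuses:

```python
import ctypes
import os
import socket

CLONE_NEWUTS = 0x04000000  # value from <linux/sched.h>


def hostname_in_new_uts(name):
    """Return the hostname a child sees in a fresh UTS namespace, or None."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:  # child process
        os.close(read_fd)
        result = b""
        try:
            libc = ctypes.CDLL(None, use_errno=True)
            if libc.unshare(CLONE_NEWUTS) == 0:
                socket.sethostname(name)  # only affects the new namespace
                result = socket.gethostname().encode()
        except OSError:
            pass  # not permitted: report nothing
        finally:
            os.write(write_fd, result)
            os._exit(0)
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as pipe:
        data = pipe.read()
    os.waitpid(pid, 0)
    return data.decode() or None


if __name__ == "__main__":
    print("child saw:", hostname_in_new_uts("demo-container"))
    print("host still:", socket.gethostname())
```

The hostname of the parent stays untouched in either case, which is the isolation the UTS namespace provides.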
Interprocess Communication (ipc)
2006
isolate interprocess communication resources:
System V IPC objects and POSIX message queues
CLONE_NEWIPC
Process ID (pid)
2008
gives processes an independent set of process identifiers (PIDs)
CLONE_NEWPID
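One way to observe the independent PID sets is the NSpid field in /proc/$PID/status (available since Linux 4.1), which lists a process ID once per nested PID namespace. The parser below is a small sketch; the function name and the sample values are invented for illustration:

```python
def parse_nspid(status_text):
    """Return the PIDs from the NSpid line, outermost namespace first."""
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(field) for field in line.split()[1:]]
    return []  # field missing, e.g. on kernels older than 4.1


if __name__ == "__main__":
    # Invented sample: PID 4242 as seen from the host, PID 1 inside.
    sample = "Name:\tsh\nNSpid:\t4242\t1\n"
    print(parse_nspid(sample))
```

On a live system the input would come from reading /proc/self/status or the status file of another process.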
Network (net)
2009
virtualize the network stack
CLONE_NEWNET
each namespace contains own resource properties within /proc/net
contains only a loopback interface on initial creation
interfaces can be moved between namespaces
private set of IP addresses, own routing table, socket listing, connection tracking table, firewall, …
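To check which interfaces the current network namespace contains, the standard library is enough. A minimal sketch (the function name interface_names is an illustrative choice):

```python
import socket


def interface_names():
    """Names of the network interfaces visible in the current namespace."""
    return [name for _index, name in socket.if_nameindex()]


if __name__ == "__main__":
    print(interface_names())
```

Run inside a freshly created network namespace, this would be expected to list only the loopback interface, in line with the point above.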
User ID (user)
2012
isolation of user and group IDs
since Linux 3.8 creatable without being fully privileged
CLONE_NEWUSER
Use case: unprivileged user outside a namespace while being fully privileged inside
- /proc/$PID/{u,g}id_map expose the mappings for user and group IDs and can be written once to define them
Example mapping:
inside  outside  length
0       1000     1
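How such a mapping translates IDs can be sketched in a few lines of Python. The function names are hypothetical; the three columns follow the inside, outside, length layout of the example mapping:

```python
def parse_id_map(text):
    """Parse uid_map/gid_map content into (inside, outside, length) tuples."""
    entries = []
    for line in text.splitlines():
        if line.strip():
            inside, outside, length = (int(field) for field in line.split())
            entries.append((inside, outside, length))
    return entries


def to_outside(inside_id, entries):
    """Translate an ID inside the namespace to the ID outside, or None."""
    for inside, outside, length in entries:
        if inside <= inside_id < inside + length:
            return outside + (inside_id - inside)
    return None  # unmapped ID


if __name__ == "__main__":
    mapping = parse_id_map("0 1000 1\n")
    print(to_outside(0, mapping))  # root inside maps to user 1000 outside
    print(to_outside(1, mapping))  # not covered by the mapping
```

With the example mapping, only ID 0 inside the namespace is mapped, and it corresponds to the unprivileged ID 1000 outside.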
Control Group (cgroup)
2016
resource limiting, prioritization, accounting and controlling
initial implementation in 2008
major redesign started in 2013
CLONE_NEWCGROUP
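The cgroup membership of a process is visible in /proc/$PID/cgroup, one line per hierarchy in the form hierarchy-ID:controller-list:path. Inside a cgroup namespace the path is shown relative to the namespace root. A small parsing sketch, with an invented function name and invented sample lines:

```python
def parse_cgroup_line(line):
    """Split one /proc/<pid>/cgroup line into its three fields."""
    hierarchy_id, controllers, path = line.rstrip("\n").split(":", 2)
    return {
        "hierarchy": int(hierarchy_id),
        # cgroup v2 lines have an empty controller list ("0::/path").
        "controllers": controllers.split(",") if controllers else [],
        "path": path,
    }


if __name__ == "__main__":
    print(parse_cgroup_line("0::/user.slice"))
    print(parse_cgroup_line("4:cpu,cpuacct:/example"))
```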
Composing Namespaces
makes “containers” possible
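Composing namespaces comes down to OR-ing the CLONE_NEW* flags into a single value for clone(2) or unshare(2). The sketch below only builds that value; the constants are the ones from <linux/sched.h>, while the dictionary and function names are illustrative:

```python
CLONE_FLAGS = {
    "mnt": 0x00020000,     # CLONE_NEWNS
    "cgroup": 0x02000000,  # CLONE_NEWCGROUP
    "uts": 0x04000000,     # CLONE_NEWUTS
    "ipc": 0x08000000,     # CLONE_NEWIPC
    "user": 0x10000000,    # CLONE_NEWUSER
    "pid": 0x20000000,     # CLONE_NEWPID
    "net": 0x40000000,     # CLONE_NEWNET
}


def compose(names):
    """Combine namespace names into one flag value for clone/unshare."""
    flags = 0
    for name in names:
        flags |= CLONE_FLAGS[name]
    return flags


if __name__ == "__main__":
    print(hex(compose(["mnt", "pid", "net", "ipc", "uts"])))
```

A container runtime would pass such a combined value when creating the container process, so that it starts in a fresh instance of every selected namespace.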
Running a Container
use runc (https://github.com/opencontainers/runc) to run a container from the extracted rootfs
Conclusion
Linux has great isolation techniques built in and a container runtime uses all these isolation features
NAMESPACES(7)
http://man7.org/linux/man-pages/man7/namespaces.7.html
Future topics: runtimes, security, images and orchestration
That’s it.
https://github.com/saschagrunert/demystifying-containers
Demystifying Containers - Part I: Kernel Space
By Sascha Grunert
A series of blog posts and talks about the world of containers