Liz Rice’s containers-from-scratch talk is one of the best explanations of how containers work under the hood. You build a container runtime from ~100 lines of Go, and by the end you understand what Docker is actually doing.
The problem: the code was written when cgroups v1 was standard. Ubuntu 24.04 (Noble) ships with cgroups v2 only, and a few other things have changed since Go 1.16. This post walks through the fixes.
What a container actually is
Before the code, a quick mental model. A container is not a VM. There is no hypervisor, no separate kernel. It’s a regular Linux process — but with a carefully restricted view of the system. That restriction is built from three kernel features:
Namespaces
Namespaces control what a process can see. Linux has several:
| Namespace | Flag | Isolates |
|---|---|---|
| UTS | CLONE_NEWUTS | hostname, domain name |
| PID | CLONE_NEWPID | process IDs |
| Mount | CLONE_NEWNS | filesystem mounts |
| Network | CLONE_NEWNET | network interfaces, routes |
| IPC | CLONE_NEWIPC | shared memory, semaphores |
| User | CLONE_NEWUSER | user and group IDs |
When you create a new PID namespace, the first process in it becomes PID 1. It can’t see any processes outside its namespace. When you create a new UTS namespace, you can set a hostname without affecting the host. Mount namespaces let you give the container its own filesystem tree.
These are the primitives Docker, containerd, and every other container runtime are built on.
chroot
chroot changes the root directory (/) of a process to a different directory on the filesystem. Once you call chroot("/some/dir"), the process sees that directory as / and can’t navigate above it. Combined with a mount namespace, this gives filesystem isolation — the container sees only what you put in its root directory.
In practice, you populate that directory with a minimal Linux filesystem (a “rootfs”) — Ubuntu, Alpine, whatever. That’s what container images are: tarballs of a rootfs.
cgroups (control groups)
Namespaces control visibility. cgroups control resources. They let you limit and account for CPU, memory, PIDs, I/O, and more for a group of processes.
The key difference between cgroups v1 and v2:
- v1 has a separate hierarchy per controller: `/sys/fs/cgroup/memory/`, `/sys/fs/cgroup/pids/`, etc.
- v2 has a single unified hierarchy: everything lives under `/sys/fs/cgroup/`, with different file names.
Ubuntu 24.04 is v2 only. This is what breaks Liz’s original code.
The original code
This is Liz Rice’s original main.go:
```go
package main

import (
	"fmt"
	"io/ioutil"
	"os"
	"os/exec"
	"path/filepath"
	"strconv"
	"syscall"
)

// go run main.go run <cmd> <args>
func main() {
	switch os.Args[1] {
	case "run":
		run()
	case "child":
		child()
	default:
		panic("help")
	}
}

func run() {
	fmt.Printf("Running %v \n", os.Args[2:])

	cmd := exec.Command("/proc/self/exe", append([]string{"child"}, os.Args[2:]...)...)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags:   syscall.CLONE_NEWUTS | syscall.CLONE_NEWPID | syscall.CLONE_NEWNS,
		Unshareflags: syscall.CLONE_NEWNS,
	}

	must(cmd.Run())
}

func child() {
	fmt.Printf("Running %v \n", os.Args[2:])

	cmd := exec.Command(os.Args[2], os.Args[3:]...)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr

	must(syscall.Sethostname([]byte("container")))
	must(syscall.Chroot("/home/liz/ubuntufs"))
	must(os.Chdir("/"))
	must(syscall.Mount("proc", "proc", "proc", 0, ""))
	// must(syscall.Mount("thing", "mytemp", "tmpfs", 0, ""))

	cg()

	must(cmd.Run())

	must(syscall.Unmount("proc", 0))
	must(syscall.Unmount("thing", 0))
}

func cg() {
	cgroups := "/sys/fs/cgroup/"
	pids := filepath.Join(cgroups, "pids")
	os.MkdirAll(filepath.Join(pids, "liz"), 0755)
	must(ioutil.WriteFile(filepath.Join(pids, "liz/pids.max"), []byte("20"), 0700))
	// Removes the new cgroup in place after the container exits
	must(ioutil.WriteFile(filepath.Join(pids, "liz/notify_on_release"), []byte("1"), 0700))
	must(ioutil.WriteFile(filepath.Join(pids, "liz/cgroup.procs"), []byte(strconv.Itoa(os.Getpid())), 0700))
}

func must(err error) {
	if err != nil {
		panic(err)
	}
}
```
What the two-process trick is about
Notice run() re-executes the binary itself via /proc/self/exe with child as the first argument. This is necessary because Cloneflags (which creates the new namespaces) is set on the child process — you can’t retroactively put the current process into a new namespace after it’s started. So the flow is:
```
go run main.go run /bin/bash
└── forks /proc/self/exe child /bin/bash   ← new UTS + PID + mount namespaces
    └── sets hostname, chroots, mounts proc, runs /bin/bash
```
What breaks on Ubuntu 24.04
1. fork/exec /proc/self/exe: operation not permitted
Ubuntu 24.04 enables AppArmor restriction on unprivileged user namespaces by default:
```bash
cat /proc/sys/kernel/apparmor_restrict_unprivileged_userns
# 1
```
Fix — either run with sudo, or disable the restriction:
```bash
sudo sysctl -w kernel.apparmor_restrict_unprivileged_userns=0
```
2. cgroups v2 — the cg() function is completely broken
The original code targets /sys/fs/cgroup/pids/, which doesn’t exist on v2. The notify_on_release file doesn’t exist in v2 either. And ioutil.WriteFile has been deprecated since Go 1.16.
Additionally, cg() is called after Chroot() in the original — which means on a working system it would be writing to <rootfs>/sys/fs/cgroup/, not the real cgroup filesystem. The cgroup limits would never actually apply.
3. syscall.Unmount("thing", 0) panics
The tmpfs mount (thing) is commented out but the unmount call is not. This causes a panic when the container exits.
The updated code
```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"strconv"
	"syscall"
)

// go run main.go run <cmd> <args>
func main() {
	switch os.Args[1] {
	case "run":
		run()
	case "child":
		child()
	default:
		panic("help")
	}
}

func run() {
	fmt.Printf("Running %v \n", os.Args[2:])

	cmd := exec.Command("/proc/self/exe", append([]string{"child"}, os.Args[2:]...)...)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags:   syscall.CLONE_NEWUTS | syscall.CLONE_NEWPID | syscall.CLONE_NEWNS,
		Unshareflags: syscall.CLONE_NEWNS,
	}

	must(cmd.Run())
}

func child() {
	fmt.Printf("Running %v \n", os.Args[2:])

	cmd := exec.Command(os.Args[2], os.Args[3:]...)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr

	cg() // ← must happen before chroot, while real /sys is still accessible

	must(syscall.Sethostname([]byte("container")))
	must(syscall.Chroot("/home/antonis/ubuntufs"))
	must(os.Chdir("/"))
	must(syscall.Mount("proc", "proc", "proc", 0, ""))

	must(cmd.Run())

	must(syscall.Unmount("proc", 0))
}

func cg() {
	cgname := "/sys/fs/cgroup/antonis"
	must(os.MkdirAll(cgname, 0755))
	// limit max pids in container
	must(os.WriteFile(filepath.Join(cgname, "pids.max"), []byte("20"), 0700))
	// limit memory to 100MB
	must(os.WriteFile(filepath.Join(cgname, "memory.max"), []byte("104857600"), 0700))
	// add this process to the cgroup
	must(os.WriteFile(filepath.Join(cgname, "cgroup.procs"), []byte(strconv.Itoa(os.Getpid())), 0700))
}

func must(err error) {
	if err != nil {
		panic(err)
	}
}
```
What changed and why
cg() moved before Chroot() — the most important fix. After Chroot(), /sys/fs/cgroup resolves to inside the container rootfs, not the real cgroup filesystem. Any writes would create plain files with no effect on the kernel. Moving cg() before the chroot means we’re writing to the actual cgroup subsystem.
cgroups v2 paths — instead of /sys/fs/cgroup/pids/<name>/, we create a single directory at /sys/fs/cgroup/<name>/ and write v2 controller files:
| v1 | v2 |
|---|---|
| `/sys/fs/cgroup/pids/<n>/pids.max` | `/sys/fs/cgroup/<n>/pids.max` |
| `/sys/fs/cgroup/memory/<n>/memory.limit_in_bytes` | `/sys/fs/cgroup/<n>/memory.max` |
| `notify_on_release` | not needed — v2 cleans up automatically |
ioutil.WriteFile → os.WriteFile — same signature, ioutil is deprecated.
Removed syscall.Unmount("thing", 0) — the tmpfs mount is commented out; unmounting something that was never mounted panics.
Prerequisites
You need a rootfs to chroot into. The easiest way:
```bash
sudo apt install debootstrap
sudo debootstrap noble /home/antonis/ubuntufs
```
And enable the pids and memory controllers for child cgroups on the host (once per boot — this is a cgroupfs write, not a sysctl, so it doesn’t persist across reboots):

```bash
sudo sh -c 'echo "+pids +memory" > /sys/fs/cgroup/cgroup.subtree_control'
```
Running it
```bash
sudo env PATH=$PATH go run main.go run /bin/bash
```

`sudo env PATH=$PATH` is needed because sudo resets PATH; if Go is installed in your home directory (e.g. via mise), plain sudo won’t find it.
Verifying isolation
Inside the container:
```bash
hostname   # container
ls /       # ubuntufs contents only
ps aux     # only the container's own processes; PID 1 is its init
```
From the host (in a second terminal):
```bash
sudo readlink /proc/<pid>/root          # /home/antonis/ubuntufs
cat /sys/fs/cgroup/antonis/pids.max     # 20
cat /sys/fs/cgroup/antonis/memory.max   # 104857600
cat /sys/fs/cgroup/antonis/cgroup.procs # container PIDs
```
That’s it. Namespaces for isolation, chroot for the filesystem, cgroups for resource limits. Everything else Docker does is tooling built on top of these three primitives.