Liz Rice’s containers-from-scratch talk is one of the best explanations of how containers work under the hood. You build a container runtime from ~100 lines of Go, and by the end you understand what Docker is actually doing.
The problem: the code was written when cgroups v1 was standard. Ubuntu 24.04 (Noble) ships with cgroups v2 only, and a few other things have changed since Go 1.16. This post walks through the fixes.
What a container actually is
Before the code, a quick mental model. A container is not a VM. There is no hypervisor, no separate kernel. It’s a regular Linux process — but with a carefully restricted view of the system. That restriction is built from three kernel features:
Namespaces
Namespaces control what a process can see. Linux has several:
| Namespace | Flag | Isolates |
|---|---|---|
| UTS | CLONE_NEWUTS | hostname, domain name |
| PID | CLONE_NEWPID | process IDs |
| Mount | CLONE_NEWNS | filesystem mounts |
| Network | CLONE_NEWNET | network interfaces, routes |
| IPC | CLONE_NEWIPC | shared memory, semaphores |
| User | CLONE_NEWUSER | user and group IDs |
When you create a new PID namespace, the first process in it becomes PID 1. It can’t see any processes outside its namespace. When you create a new UTS namespace, you can set a hostname without affecting the host. Mount namespaces let you give the container its own filesystem tree.
These are the primitives Docker, containerd, and every other container runtime are built on.
chroot
chroot changes the root directory (/) of a process to a different directory on the filesystem. Once you call chroot("/some/dir"), the process sees that directory as / and can’t navigate above it. Combined with a mount namespace, this gives filesystem isolation — the container sees only what you put in its root directory.
In practice, you populate that directory with a minimal Linux filesystem (a “rootfs”) — Ubuntu, Alpine, whatever. That’s what container images are: tarballs of a rootfs.
cgroups (control groups)
Namespaces control visibility. cgroups control resources. They let you limit and account for CPU, memory, PIDs, I/O, and more for a group of processes.
The key difference between cgroups v1 and v2:
- v1 has a separate hierarchy per controller: `/sys/fs/cgroup/memory/`, `/sys/fs/cgroup/pids/`, etc.
- v2 has a single unified hierarchy: everything lives under `/sys/fs/cgroup/`, with different file names.
Ubuntu 24.04 is v2 only. This is what breaks Liz’s original code.
The original code
This is Liz Rice’s original main.go:
```go
package main

import (
	"fmt"
	"io/ioutil"
	"os"
	"os/exec"
	"path/filepath"
	"strconv"
	"syscall"
)

// go run main.go run <cmd> <args>
func main() {
	switch os.Args[1] {
	case "run":
		run()
	case "child":
		child()
	default:
		panic("help")
	}
}

func run() {
	fmt.Printf("Running %v \n", os.Args[2:])

	cmd := exec.Command("/proc/self/exe", append([]string{"child"}, os.Args[2:]...)...)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags:   syscall.CLONE_NEWUTS | syscall.CLONE_NEWPID | syscall.CLONE_NEWNS,
		Unshareflags: syscall.CLONE_NEWNS,
	}

	must(cmd.Run())
}

func child() {
	fmt.Printf("Running %v \n", os.Args[2:])

	cmd := exec.Command(os.Args[2], os.Args[3:]...)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr

	must(syscall.Sethostname([]byte("container")))
	must(syscall.Chroot("/home/liz/ubuntufs"))
	must(os.Chdir("/"))
	must(syscall.Mount("proc", "proc", "proc", 0, ""))
	// must(syscall.Mount("thing", "mytemp", "tmpfs", 0, ""))

	cg()

	must(cmd.Run())

	must(syscall.Unmount("proc", 0))
	must(syscall.Unmount("thing", 0))
}

func cg() {
	cgroups := "/sys/fs/cgroup/"
	pids := filepath.Join(cgroups, "pids")
	os.MkdirAll(filepath.Join(pids, "liz"), 0755)
	must(ioutil.WriteFile(filepath.Join(pids, "liz/pids.max"), []byte("20"), 0700))
	// Removes the new cgroup in place after the container exits
	must(ioutil.WriteFile(filepath.Join(pids, "liz/notify_on_release"), []byte("1"), 0700))
	must(ioutil.WriteFile(filepath.Join(pids, "liz/cgroup.procs"), []byte(strconv.Itoa(os.Getpid())), 0700))
}

func must(err error) {
	if err != nil {
		panic(err)
	}
}
```
What the two-process trick is about
Notice run() re-executes the binary itself via /proc/self/exe with child as the first argument. This is necessary because Cloneflags (which creates the new namespaces) is set on the child process — you can’t retroactively put the current process into a new namespace after it’s started. So the flow is:
```
go run main.go run /bin/bash
└── forks /proc/self/exe child /bin/bash   ← new UTS + PID + mount namespaces
    └── sets hostname, chroots, mounts proc, runs /bin/bash
```
What breaks on Ubuntu 24.04
1. fork/exec /proc/self/exe: operation not permitted
Ubuntu 24.04 enables AppArmor restriction on unprivileged user namespaces by default:
```bash
cat /proc/sys/kernel/apparmor_restrict_unprivileged_userns
# 1
```
Fix — either run with sudo, or disable the restriction:
```bash
sudo sysctl -w kernel.apparmor_restrict_unprivileged_userns=0
```
2. cgroups v2 — the cg() function is completely broken
The original code targets /sys/fs/cgroup/pids/, which doesn’t exist on v2. The notify_on_release file doesn’t exist in v2 either. And ioutil.WriteFile has been deprecated since Go 1.16.
Additionally, cg() is called after Chroot() in the original — which means on a working system it would be writing to <rootfs>/sys/fs/cgroup/, not the real cgroup filesystem. The cgroup limits would never actually apply.
3. syscall.Unmount("thing", 0) panics
The tmpfs mount (thing) is commented out but the unmount call is not. This causes a panic when the container exits.
The updated code
```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"strconv"
	"syscall"
)

// go run main.go run <cmd> <args>
func main() {
	switch os.Args[1] {
	case "run":
		run()
	case "child":
		child()
	default:
		panic("help")
	}
}

func run() {
	fmt.Printf("Running %v \n", os.Args[2:])

	cmd := exec.Command("/proc/self/exe", append([]string{"child"}, os.Args[2:]...)...)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags:   syscall.CLONE_NEWUTS | syscall.CLONE_NEWPID | syscall.CLONE_NEWNS,
		Unshareflags: syscall.CLONE_NEWNS,
	}

	must(cmd.Run())
}

func child() {
	fmt.Printf("Running %v \n", os.Args[2:])

	cmd := exec.Command(os.Args[2], os.Args[3:]...)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr

	cg() // ← must happen before chroot, while real /sys is still accessible

	must(syscall.Sethostname([]byte("container")))
	must(syscall.Chroot("/home/antonis/ubuntufs"))
	must(os.Chdir("/"))
	must(syscall.Mount("proc", "proc", "proc", 0, ""))

	must(cmd.Run())

	must(syscall.Unmount("proc", 0))
}

func cg() {
	cgname := "/sys/fs/cgroup/antonis"
	must(os.MkdirAll(cgname, 0755))
	// limit max pids in container
	must(os.WriteFile(filepath.Join(cgname, "pids.max"), []byte("20"), 0700))
	// limit memory to 100MB
	must(os.WriteFile(filepath.Join(cgname, "memory.max"), []byte("104857600"), 0700))
	// add this process to the cgroup
	must(os.WriteFile(filepath.Join(cgname, "cgroup.procs"), []byte(strconv.Itoa(os.Getpid())), 0700))
}

func must(err error) {
	if err != nil {
		panic(err)
	}
}
```
What changed and why
cg() moved before Chroot() — the most important fix. After Chroot(), /sys/fs/cgroup resolves to inside the container rootfs, not the real cgroup filesystem. Any writes would create plain files with no effect on the kernel. Moving cg() before the chroot means we’re writing to the actual cgroup subsystem.
cgroups v2 paths — instead of /sys/fs/cgroup/pids/<name>/, we create a single directory at /sys/fs/cgroup/<name>/ and write v2 controller files:
| v1 | v2 |
|---|---|
| `/sys/fs/cgroup/pids/<n>/pids.max` | `/sys/fs/cgroup/<n>/pids.max` |
| `/sys/fs/cgroup/memory/<n>/memory.limit_in_bytes` | `/sys/fs/cgroup/<n>/memory.max` |
| `notify_on_release` | not needed — v2 cleans up automatically |
ioutil.WriteFile → os.WriteFile — same signature, ioutil is deprecated.
Removed syscall.Unmount("thing", 0) — the tmpfs mount is commented out; unmounting something that was never mounted panics.
Prerequisites
You need a rootfs to chroot into. The easiest way:
```bash
sudo apt install debootstrap
sudo debootstrap noble /home/antonis/ubuntufs
```
And enable the pids and memory controllers for child cgroups on the host (once per boot — this is a cgroupfs write, not a sysctl, so it doesn’t persist across reboots):

```bash
sudo sh -c 'echo "+pids +memory" > /sys/fs/cgroup/cgroup.subtree_control'
```
Running it
```bash
sudo env PATH=$PATH go run main.go run /bin/bash
```

`sudo env PATH=$PATH` is needed because sudo resets PATH; if Go is installed in your home directory (e.g. via mise), plain sudo won’t find it.
Verifying isolation
Inside the container:
```bash
hostname   # container
ls /       # ubuntufs contents only
ps aux     # only the container's own processes; PID 1 is its init
```
From the host (in a second terminal):
```bash
sudo readlink /proc/<pid>/root          # /home/antonis/ubuntufs
cat /sys/fs/cgroup/antonis/pids.max     # 20
cat /sys/fs/cgroup/antonis/memory.max   # 104857600
cat /sys/fs/cgroup/antonis/cgroup.procs # container PIDs
```
That’s it. Namespaces for isolation, chroot for the filesystem, cgroups for resource limits. Everything else Docker does is tooling built on top of these three primitives.