cgroups: From Chaos to Control

Everything in Linux is a file … and every Linux container’s resource limits are no different. They’re just files that live in a special directory.

Inside that directory are files with names like memory.max and cpu.weight. When the kernel decides your container has used too much RAM, it isn’t consulting a daemon or asking Kubernetes. It’s reading a number out of a file you could have written yourself with echo. That’s the whole mechanism. The story of how Linux got here, and the decade-long detour it took along the way, is the story of cgroups.


See It For Yourself

If you’re on a Linux machine running a modern distribution, try this:

ls /sys/fs/cgroup/

You’ll see something like:

cgroup.controllers      cpu.stat        memory.pressure
cgroup.max.depth        cpuset.cpus     memory.stat
cgroup.procs            io.pressure     pids.current
cgroup.subtree_control  memory.current  pids.max
cpu.pressure            memory.max      system.slice/
user.slice/

This is a kernel pseudo-filesystem mounted under /sys. On a cgroupsv2 host, /sys/fs/cgroup itself is cgroup2fs, not ordinary disk-backed storage. Reading a file there asks the kernel a question. Writing to one issues a command. The kernel responds by changing its own state. There is no daemon in the middle.

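memory.max is the memory limit. cpu.pressure reports whether tasks are stalling on CPU. system.slice/ and user.slice/ are sub-folders: child groups, with their own limits inside. (The listing is a composite: on a real host the root directory omits limit files such as memory.max and pids.max, because the root cgroup cannot be limited. Look inside system.slice/ or any other child group to see them.)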


What a Cgroup Actually Is

A cgroup (short for “control group”) is a Linux kernel mechanism for organizing processes into named groups and applying resource constraints to those groups. That’s the entire definition.

Membership lives in a file: cgroup.procs contains the PIDs of processes in the group. To move a process in, you write its PID. To take it out, write that PID to another group’s cgroup.procs. The kernel handles the bookkeeping.

The mental model is this: a cgroup is a directory, process membership is a file entry, and every resource limit is a knob you turn by writing to a file.

The kernel reads these files and enforces the constraints. If a process in a cgroup pushes memory usage toward memory.max, the kernel first tries to reclaim pages from that cgroup; if usage crosses memory.high along the way, allocations are throttled with backpressure; if reclaim can’t keep up and usage hits memory.max, the kernel OOM-kills a process inside the cgroup. No daemon, no userspace agent. It’s the kernel itself doing the enforcement.

This design is elegant. What went wrong is what always goes wrong in systems that grow organically: nobody planned for containers.


Try It Yourself

Tested on: Ubuntu 25.10, kernel 7.0.5, systemd 257, runc 1.3.4. Anything cgroupsv2 with these or newer should behave identically.

The cleanest way to feel how cgroups work is to create one, cap it, and watch the kernel enforce the cap. Do this on a disposable Linux VM or lab host, not a production node. Any cgroupsv2 system will do.

First, make sure the kernel will expose the controllers we want in our new cgroup. On v2, a child cgroup only sees a controller if its parent has enabled it in cgroup.subtree_control:

# Should list at least: memory pids (and probably cpu, io, cpuset)
cat /sys/fs/cgroup/cgroup.subtree_control
# If memory or pids is missing, enable them at the root:
echo "+memory +pids" | sudo tee /sys/fs/cgroup/cgroup.subtree_control

Most modern distros enable these by default, but managed nodes and minimal images sometimes don’t. With that out of the way:

# Create the cgroup (it's literally a directory)
sudo mkdir /sys/fs/cgroup/demo
# Cap memory at 50 MiB, swap at 0 (otherwise overflow spills to swap
# on hosts where swap is enabled, masking the kill)
echo $((50 * 1024 * 1024)) | sudo tee /sys/fs/cgroup/demo/memory.max
echo 0 | sudo tee /sys/fs/cgroup/demo/memory.swap.max
# Run a memory hog inside the cgroup. The subshell joins the cgroup
# by writing its PID to cgroup.procs, then execs python which tries
# to allocate 200 MiB.
sudo sh -c '
echo $$ > /sys/fs/cgroup/demo/cgroup.procs
exec python3 -c "x = b\"A\" * (200 * 1024 * 1024); print(\"survived\")"
'

You’ll see Killed rather than survived. The kill is scoped to the cgroup we just made. On hosts where you can read dmesg:

sudo dmesg | tail -n 20
# ... Memory cgroup out of memory: Killed process 12345 (python3) ...
# ... oom-kill:constraint=CONSTRAINT_MEMCG,...,oom_memcg=/demo,...

On hosts where dmesg is restricted (managed nodes, hardened distros), read the cgroup’s own counter instead:

cat /sys/fs/cgroup/demo/memory.events
# low 0
# high 0
# max <n> # times memory.max was hit
# oom <n> # times the cgroup went OOM
# oom_kill <n> # processes the kernel killed
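If you want to check that counter from a script rather than by eye, a small parser is enough. This is a minimal Python sketch, not a library API: the helper name parse_flat_keyed is made up for illustration, the sample values are invented, and it assumes the one-pair-per-line format shown above.

```python
def parse_flat_keyed(text):
    """Parse a 'key value' per line cgroup file such as memory.events."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        key, value = parts
        counters[key] = int(value)
    return counters


# Invented sample contents; on a real host, read
# /sys/fs/cgroup/demo/memory.events instead.
sample = "low 0\nhigh 0\nmax 3\noom 1\noom_kill 1\n"
events = parse_flat_keyed(sample)
if events.get("oom_kill", 0) > 0:
    print(f"kernel killed {events['oom_kill']} process(es) in this cgroup")
```

The same helper should also read pids.events, which uses the same key-value layout.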

No daemon was involved. The kernel read memory.max (50 MiB), watched memory.current climb up to it, found nothing left to reclaim, and pulled the trigger. Clean up the cgroup once it has no live processes (an empty cgroup is the only kind rmdir will remove):

sudo rmdir /sys/fs/cgroup/demo

For real workloads on a systemd host, prefer systemd-run with unit properties (--property=MemoryMax=50M, etc.) over editing /sys/fs/cgroup/ by hand. systemd owns the cgroup tree, and writing directly to it from outside breaks its accounting. The raw interface above is the right shape for learning what the kernel is doing; it’s the wrong shape for production resource control.

Three primitives. A directory, a file write, a PID. That’s the kernel interface Linux container runtimes are built on.


Now Block a Fork Bomb

Danger

Look, don’t be stupid, do not trust my code, do not say I didn’t warn you. Only do this on a disposable Linux VM or lab host you don’t mind rebooting, not a production node or your laptop. If anything goes wrong (the cgroup isn’t set up correctly, your shell isn’t actually inside it, the limit isn’t applied), the spawn loop further down has nothing to stop it and will exhaust the kernel’s process table within seconds. Recovery means a hard reset. Read the whole section before running anything, and verify cat /proc/self/cgroup shows 0::/fork-demo before you launch the loop.

The memory demo killed a process after it allocated too much. The pids controller is more interesting: it refuses to create the process at all. Same files, same write-a-PID-and-go pattern, but the kernel says no at fork time rather than reclaim time.

# sudo doesn't apply to shell redirection, so `sudo echo X > /sys/fs/...`
# writes as your user and fails. Drop into a real root shell instead.
sudo -i
# Create the cgroup and cap it at 10 processes
mkdir /sys/fs/cgroup/fork-demo
echo 10 > /sys/fs/cgroup/fork-demo/pids.max
# Move the current shell into it
echo $$ > /sys/fs/cgroup/fork-demo/cgroup.procs
cat /proc/self/cgroup
# 0::/fork-demo

Now spawn background sleep processes in a loop and let the counter climb to the wall:

i=0
while true; do
  sleep 60 &
  i=$((i+1))
  echo "spawned #$i → PID $! (pids.current=$(cat /sys/fs/cgroup/fork-demo/pids.current))"
  sleep 0.2
done

spawned #1 → PID 367 (pids.current=3)
spawned #2 → PID 370 (pids.current=4)
spawned #3 → PID 373 (pids.current=5)
spawned #4 → PID 376 (pids.current=6)
spawned #5 → PID 379 (pids.current=7)
spawned #6 → PID 382 (pids.current=8)
spawned #7 → PID 385 (pids.current=9)
spawned #8 → PID 388 (pids.current=10)
bash: fork: retry: Resource temporarily unavailable
bash: fork: retry: Resource temporarily unavailable
bash: fork: retry: Resource temporarily unavailable

The eighth sleep pushes pids.current to exactly pids.max=10. (The shell counts as one process, and the cat inside the command substitution briefly shows up as another, which is why the visible count starts at 3.) Every subsequent fork() returns EAGAIN, bash prints the failure, and the cgroup is sealed. Ctrl+C to stop.

The kernel keeps a running tally of denials:

cat /sys/fs/cgroup/fork-demo/pids.events
# max 4823

That’s how many fork() calls the kernel refused since the cgroup was created. Every denied fork() is one the rest of the system never had to deal with. Point a real fork bomb at this cgroup and the counter climbs into the millions while the host stays responsive, but I’m not going to print the bomb here, because copy-pasting one into the wrong shell is exactly how lab hosts become bricks. The sleep loop above already proves the mechanism: the kernel returns EAGAIN to fork() the moment pids.current hits pids.max, and it will do that whether the caller is a polite shell loop or a self-replicating function.

Cleanup once the loop has exited. The root shell still belongs to fork-demo, and the kernel won’t let us rmdir a cgroup with live members, so exit the shell first and remove the directory from the parent shell that spawned it:

kill $(jobs -p) # terminate the sleeps from inside the root shell
exit # drop out of the root shell; its PID leaves the cgroup with it
sudo rmdir /sys/fs/cgroup/fork-demo

Two cgroups, two different <controller>.max files, two completely different kinds of refusal: memory.max reclaims-then-kills, pids.max returns EAGAIN to fork(). The kernel exposes both behaviours through identical-looking interfaces, which is exactly the point of cgroups as an abstraction.


Controllers: Who Fills the Folder

A cgroup is a folder. But who fills it with files like memory.max and cpu.weight?

The kernel does, but not as a single monolithic block. Each resource is governed by a controller: a piece of kernel code dedicated to one resource type. The memory controller watches RAM, exposes memory.max, memory.current, memory.pressure, and enforces the limits. The cpu controller weights and quota-caps CPU time. The io controller throttles block I/O. The pids controller caps the number of processes.

Most resource-control files in a cgroup folder belong to one controller. The naming convention makes this explicit: <controller>.<metric>. memory.current, cpu.stat, io.max. Core cgroup files use the cgroup.* prefix instead.

The set of controllers available on your system lives in one file:

cat /sys/fs/cgroup/cgroup.controllers
# cpuset cpu io memory pids

Five controllers on this system. Today.

It used to be many controllers spread across many possible trees, and that’s how the trouble started.


Where This Came From

In 2006, Paul Menage and Rohit Seth proposed adding “process containers” to the Linux kernel: a way to group processes and attach resource controllers to those groups. The name was quickly changed to avoid confusion with the other kind of container that was starting to gain traction. Solaris Zones, Linux-VServer, and OpenVZ all already used “container” to mean a fully isolated user-space (own filesystem, network, process tree, users), and this new patch did none of that. It just grouped processes and counted their resource use.

By the time the patch merged into Linux 2.6.24 in January 2008, it was called Task Control Groups (TCG). The release notes describe it as a framework that can “track and group processes into arbitrary ‘cgroups’ and assign arbitrary state to those groups, in order to control its behaviour”, with other subsystems hooking in to provide new attributes. The “Task” prefix quietly fell out of use over the next few years; everyone settled on the short form: control groups, or cgroups. From there, controllers were added one by one, year after year:

[Interactive timeline: cgroups, from process containers to universal v2 adoption. Tracks: cgroups v1, cgroups v2, Kubernetes, distro adoption.]

By 2013, more than a dozen controllers existed. The original design had made a decision that seemed perfectly reasonable at the time: controllers could live in independent trees. CPU could get one tree. Memory could get another. Block I/O could get another. The promise was flexibility: you could organize processes differently for different resources.

In practice, almost nobody used that flexibility. And the cost would dominate the next decade.


Hierarchies: Cgroups Inside Cgroups

Cgroups nest. A cgroup can contain child cgroups, and those children can have their own children. The result is a tree. The kernel calls it a hierarchy.

A typical cgroup tree: how systemd lays out cgroups on a modern Linux host. Limits set on any node cascade to every descendant.

/sys/fs/cgroup/ # root mount
├── system.slice/ # .slice (organisational)
│   ├── nginx.service/ # .service / .scope (processes live here)
│   └── postgres.service/
└── user.slice/
    └── user-1000.slice/
        └── session-2.scope/

Children inherit constraints from their parents. If system.slice/ caps memory at 8 GiB, no combination of nginx.service/ plus postgres.service/ may exceed 8 GiB together, even if neither child sets a limit of its own. Limits cascade down; usage rolls up.
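One way to picture the cascade: the cap that actually binds a cgroup is the tightest memory.max anywhere on its path up to the root. Here is a rough Python sketch over an in-memory dictionary. The helper names and the limit values are illustrative assumptions; on a real host you would read each directory's memory.max file instead.

```python
import math

GIB = 1024 ** 3


def parse_limit(raw):
    """Turn the contents of a memory.max file into a number ('max' = no limit)."""
    raw = raw.strip()
    return math.inf if raw == "max" else int(raw)


def effective_memory_max(limits, path):
    """Return the smallest memory.max among `path` and all of its ancestors."""
    parts = [p for p in path.split("/") if p]
    tightest = math.inf
    for depth in range(1, len(parts) + 1):
        node = "/".join(parts[:depth])
        # A cgroup with no entry is treated as unlimited ("max").
        tightest = min(tightest, parse_limit(limits.get(node, "max")))
    return tightest


# Hypothetical tree: only system.slice sets a cap that matters.
limits = {
    "system.slice": str(8 * GIB),
    "system.slice/nginx.service": "max",
    "system.slice/postgres.service": str(16 * GIB),
}
print(effective_memory_max(limits, "system.slice/nginx.service") // GIB, "GiB")
```

This only gives an upper bound for one cgroup. The parent's 8 GiB applies to the combined usage of the whole subtree, so siblings still compete with each other underneath it.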

One tree, one hierarchy. Hold on to that picture.

In cgroupsv1, there wasn’t one tree. There could be many.


cgroupsv1: The Wild West

You’re debugging a container that keeps getting OOM killed. You SSH into the node, find the PID, and ask the kernel which cgroups the process belongs to. That information lives in a per-process file under /proc, the other kernel pseudo-filesystem, where each running process has its own directory:

cat /proc/1234/cgroup

And you get… this:

12:blkio:/docker/a1b2c3d4e5f6
11:devices:/docker/a1b2c3d4e5f6
10:memory:/docker/a1b2c3d4e5f6
9:cpuacct:/docker/a1b2c3d4e5f6
8:cpu:/docker/a1b2c3d4e5f6
7:cpuset:/docker/a1b2c3d4e5f6
6:freezer:/docker/a1b2c3d4e5f6
5:net_cls,net_prio:/docker/a1b2c3d4e5f6
4:perf_event:/docker/a1b2c3d4e5f6
3:pids:/docker/a1b2c3d4e5f6
2:hugetlb:/docker/a1b2c3d4e5f6
1:name=systemd:/docker/a1b2c3d4e5f6

Twelve lines. The same container ID, repeated across many hierarchies. The exact list varied by distro and runtime because related controllers could be co-mounted, but the important part is the same: one process had multiple cgroup memberships. (A few of those names won’t survive into v2 in the same form. freezer, for instance, was v1’s way of pausing every process in a cgroup at once — useful for checkpoint/restore and as the prerequisite step to safely killing a process tree. v2 replaces it with cgroup.freeze and cgroup.kill on the unified hierarchy.)

This is cgroupsv1. And it’s a mess.

The process is in the memory tree, and the cpu tree, and the blkio tree, and several others. Each hierarchy has its own mount point under /sys/fs/cgroup/:

ls /sys/fs/cgroup/
# cpu/ cpuacct/ cpuset/ memory/ blkio/ devices/
# freezer/ net_cls/ net_prio/ perf_event/ hugetlb/ pids/

Many directories, many mount points, many independent trees. Here’s what that fragmentation actually looks like:

The Fragmented Hierarchy

/sys/fs/cgroup/cpu/
├── nginx [shares: 512]
├── redis [shares: 1024]
└── sshd.service
/sys/fs/cgroup/memory/
├── nginx [limit: 256M]
├── redis [limit: 512M]
└── sshd.service
/sys/fs/cgroup/blkio/
├── nginx [weight: 100]
├── redis [weight: 500]
└── sshd.service

The problem: nginx and redis appear in 3 separate trees. No unified view. No atomic moves. No combined resource accounting. That adds up to 12+ controllers, 12+ mount points, and manual hierarchy sync.

With a v1 cgroupfs layout, when a container runtime creates a container, it has to:

  1. Create a directory in /sys/fs/cgroup/cpu/docker/<container-id>/
  2. Create a directory in /sys/fs/cgroup/memory/docker/<container-id>/
  3. Create a directory in /sys/fs/cgroup/blkio/docker/<container-id>/
  4. … repeat for every mounted controller hierarchy
  5. Write the process PID to cgroup.procs in each directory
  6. Set the appropriate limits in each directory

Compare that to cgroupsv2, where the same process gets one line:

0::/kubepods.slice/kubepods-pod1a2b3c.slice/cri-containerd-a1b2c3d4e5f6.scope
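Both outputs follow the same three-field format, hierarchy-ID:controller-list:path, so one small parser can tell a v1 host from a v2 one. A minimal Python sketch (CgroupEntry, parse_proc_cgroup and is_pure_v2 are hypothetical names, not part of any library):

```python
from typing import NamedTuple


class CgroupEntry(NamedTuple):
    hierarchy_id: int
    controllers: tuple  # empty for the v2 unified hierarchy
    path: str


def parse_proc_cgroup(text):
    """Parse the contents of /proc/<pid>/cgroup into a list of entries."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the first two colons only, in case the path contains one.
        hierarchy_id, controllers, path = line.split(":", 2)
        names = tuple(c for c in controllers.split(",") if c)
        entries.append(CgroupEntry(int(hierarchy_id), names, path))
    return entries


def is_pure_v2(entries):
    """True when the process has a single membership on the unified hierarchy."""
    return (
        len(entries) == 1
        and entries[0].hierarchy_id == 0
        and not entries[0].controllers
    )


v2_text = "0::/kubepods.slice/kubepods-pod1a2b3c.slice/cri-containerd-a1b2c3d4e5f6.scope\n"
print(is_pure_v2(parse_proc_cgroup(v2_text)))
```

Fed the twelve-line v1 output from earlier, the same parser returns twelve entries, one per hierarchy, which is the fragmentation in data-structure form.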

It’s worth pausing here, because the original v1 design was textbook good engineering. Independent trees per controller is exactly how a senior engineer would propose this from scratch today: small, decoupled subsystems, each responsible for one resource, free to evolve on its own timeline. Every Cloud Native principle we now take for granted points at that shape. The decoupling is a feature when the consumer is a sysadmin grouping arbitrary processes for arbitrary policies, limiting CPU on one grouping while throttling I/O on another.

Containers turn that assumption inside out. A container is a bundle that wants every controller applied to the same set of processes, atomically, with one accounting view. The flexibility v1 offered became a tax: every controller wired up separately, every limit set separately, every monitoring query stitched back together by hand. v2 is the convergence that shape demanded.


Why It Was a Mess

The structural problem cascaded into operational pain.

Operations weren’t atomic. Moving a process between cgroups meant writing its PID multiple times, across multiple cgroup.procs files. There was no way to say “move this process across all controllers at once.” If your runtime crashed between step 3 and step 4, the process might be limited on CPU and memory but not on block I/O. Container runtimes had to build their own retry-and-synchronization logic on top.

Interfaces were inconsistent. CPU alone had two completely different paradigms. cpu.shares (values from 2 to 262144) gave you proportional weighting: a container with 1024 shares got twice the CPU time of one with 512, but only when both were competing. If you were alone, shares didn’t matter. cpu.cfs_quota_us (paired with cpu.cfs_period_us) gave you absolute caps: quota 50000, period 100000 meant 50% of one core, hard. Two files, two mental models, same controller.

Accounting was a separate controller from control. Want to know how much CPU your cgroup used? Check cpuacct.usage. A different controller, in a different hierarchy, with its own mount point. Want to limit it? Use cpu.cfs_quota_us over in the cpu controller. Nobody was happy about that decision.

Memory had two failed concepts. memory.soft_limit_in_bytes was supposedly advisory, but the kernel’s reclaim behavior around it was so unpredictable that most operators treated it as decorative. memory.memsw.limit_in_bytes capped memory plus swap combined. You couldn’t set a swap limit independently, so you did arithmetic to figure out the effective swap allowance.

No unified view. To answer “how much total resource is this container using?”, monitoring tools had to read CPU from paths like /sys/fs/cgroup/cpuacct/docker/<id>/cpuacct.usage, memory from /sys/fs/cgroup/memory/docker/<id>/memory.usage_in_bytes, block I/O from /sys/fs/cgroup/blkio/docker/<id>/blkio.throttle.io_service_bytes, and stitch the answers together by correlating paths across hierarchies. cAdvisor, Prometheus node_exporter, and every other monitoring tool had to implement this stitching. It worked. It was fragile, slow, and ugly.

(There were also smaller indignities. The cpuset controller’s cgroup.clone_children flag, where child cgroups didn’t inherit the parent’s CPU and memory node assignments unless you flipped it on, broke expectations everywhere else. v2 dropped the flag entirely: children of a cpuset cgroup inherit their parent’s cpus and mems by default, and that’s that.)

Tejun Heo saw the mess and proposed a fix in early 2014: a single unified hierarchy. That patchset would take two years to land and another four to stabilize.


cgroupsv2: One Tree

Tejun Heo’s rewrite kept the conceptual simplicity of “cgroups are directories with resource knobs” and fixed the structural mistake: there is exactly one hierarchy.

That decision ripples through every aspect of the design.

# cgroupsv2: one mount point
mount | grep cgroup
# cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime)
# One line in /proc/self/cgroup
cat /proc/self/cgroup
# 0::/user.slice/user-1000.slice/session-2.scope

Every controller lives on the same tree. When you move a process from one cgroup to another, it moves in all controllers atomically: one write to one cgroup.procs file. When you read resource usage, everything is in one place.

The “No Internal Processes” Rule

cgroupsv2 enforces a rule that confused people initially: a non-root cgroup that distributes domain resources to child cgroups cannot also have processes directly in it.

Why? Resource accounting. If a parent cgroup has both processes and child cgroups, who gets charged for the parent’s processes? Do they compete with the children? How do you account for their usage? In v1, the answer varied by controller and was often wrong. Exactly the inconsistent half-applied behavior we just saw.

v2 eliminates the ambiguity where it matters. Once a non-root cgroup enables domain controllers in cgroup.subtree_control, processes need to live in leaf cgroups below it. A cgroup can still have both processes and children before it starts distributing those controllers, and the root cgroup is special-cased. Clean accounting, fewer controller-specific edge cases.

Explicit Controller Delegation

In v1, every controller was implicitly available everywhere. In v2, parents explicitly delegate controllers to their children:

# See which controllers are available
cat /sys/fs/cgroup/cgroup.controllers
# cpuset cpu io memory pids
# Enable cpu and memory for child cgroups (run from a root shell;
# sudo does not apply to the redirection)
echo "+cpu +memory" > /sys/fs/cgroup/cgroup.subtree_control
# Now child cgroups can use cpu and memory controllers

This creates a clean permission model. A child cgroup can only use controllers its parent delegated. Container runtimes and systemd need exactly this when granting resource control to unprivileged processes (like rootless containers) without giving them access to controllers they shouldn’t touch.
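The delegation rule is easy to model as set arithmetic: a parent can only enable what appears in its own cgroup.controllers, and whatever it enables becomes the child's cgroup.controllers. The Python sketch below is a toy model of that rule, not the kernel's implementation. The function name is hypothetical, and it ignores other reasons the kernel may reject a write (the no-internal-processes rule, for instance).

```python
def apply_subtree_control(available, enabled, command):
    """Return the enabled set after writing `command` to cgroup.subtree_control.

    available: controllers listed in this cgroup's cgroup.controllers
    enabled:   controllers currently listed in cgroup.subtree_control
    command:   space-separated tokens such as "+cpu +memory -io"
    """
    result = set(enabled)
    for token in command.split():
        op, name = token[0], token[1:]
        if op not in "+-" or not name:
            raise ValueError(f"bad token: {token!r}")
        if op == "+":
            if name not in available:
                raise ValueError(f"{name} is not in cgroup.controllers")
            result.add(name)
        else:
            result.discard(name)
    # Nothing is committed unless every token was valid.
    return result


root_controllers = {"cpuset", "cpu", "io", "memory", "pids"}
child_controllers = apply_subtree_control(root_controllers, set(), "+cpu +memory")
print(sorted(child_controllers))
```

Applying the function again with child_controllers as the available set shows why a grandchild can never gain a controller that the parent withheld.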

Atomic Kill

v2 added a feature that v1 never had a clean answer for: killing every process in a cgroup, atomically, with one write:

echo 1 > /sys/fs/cgroup/some-group/cgroup.kill

The kernel walks the cgroup (and its descendants) and sends SIGKILL to every process in one pass. No racing the process tree, no chasing PIDs that fork while you’re trying to enumerate them, no freezer-controller dance to pause the tree before you swing. Container runtimes use this to terminate a container cleanly; Kubernetes uses it indirectly through the runtime for pod shutdown.

(Some workloads, like databases, JVMs, and game engines, need per-thread rather than per-process control. v2 supports this via cgroup.type = threaded, but it’s a niche we won’t dwell on here.)


The Delta, Controller by Controller

Here’s exactly what changed at the file level. Skim these tables if you’re migrating; the simulator below lets you feel how a v1 setting and its v2 equivalent translate to the actual file contents.

[Interactive simulator: sliders set cpu.max (5% to 100% of one CPU core), memory.max (0 to 1,024 MiB), and io.max (0 to 10,000 read IOPS) for a cgroup. Shown here at 50% of one core, 256 MiB, and 1,000 read IOPS:]

$ cat cpu.max
50000 100000
$ cat memory.max
268435456
$ cat memory.current
126164664
$ cat io.max
8:0 riops=1000 wiops=max rbps=max wbps=max

CPU

| cgroupsv1 | cgroupsv2 | What Changed |
| --- | --- | --- |
| cpu.shares (2 to 262144) | cpu.weight (1 to 10000, default 100) | Rescaled range, saner defaults |
| cpu.cfs_quota_us / cpu.cfs_period_us | cpu.max ($QUOTA $PERIOD) | One file instead of two |
| cpuacct.usage (separate controller!) | cpu.stat | Accounting merged into cpu controller |
| cpuset.cpus (separate hierarchy) | cpuset.cpus (same hierarchy) | No more separate mount point |

The cpu.weight rescaling is a quality-of-life improvement. Instead of the bizarre 2 to 262144 range, you get 1 to 10000 with a default of 100. A container with weight 200 gets twice the CPU time of one with weight 100. Simple.
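The arithmetic behind "twice the CPU time" is a plain ratio. A small Python sketch follows, assuming the cgroups are siblings that are all runnable and competing for the same CPUs (an idle sibling takes no share). The function name and the cgroup names are made up.

```python
def cpu_split(weights):
    """Fraction of contended CPU time each sibling gets, given its cpu.weight."""
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


shares = cpu_split({"web": 200, "batch": 100})
for name, fraction in shares.items():
    print(f"{name}: {fraction:.0%}")
```

Weights only matter under contention: if batch goes idle, web can use everything unless cpu.max caps it.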

cpu.max combines quota and period into a single file: "50000 100000" means 50 ms of CPU per 100 ms (50% of one core). Write "max 100000" for unlimited. No more reading two files and doing mental division to figure out the effective limit.
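Reading the file back is a one-line split. A minimal Python sketch (parse_cpu_max is a hypothetical helper) that turns the two fields into "how many cores' worth of time", returning None for the unlimited case:

```python
def parse_cpu_max(text):
    """Convert cpu.max contents ("$QUOTA $PERIOD") to a number of cores.

    Returns None when the quota is "max", meaning no cap.
    """
    quota, period = text.split()
    period_us = int(period)
    if quota == "max":
        return None
    return int(quota) / period_us


print(parse_cpu_max("50000 100000"))   # half of one core
print(parse_cpu_max("max 100000"))     # unlimited
```

A result above 1.0 is legal: "200000 100000" allows two full cores' worth of CPU time per period.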

Memory

| cgroupsv1 | cgroupsv2 | What Changed |
| --- | --- | --- |
| memory.limit_in_bytes | memory.max | Hard limit, cleaner name |
| memory.soft_limit_in_bytes | memory.high | Actually works now; applies throttling backpressure |
| memory.memsw.limit_in_bytes | memory.swap.max | Independent swap control |
| memory.usage_in_bytes | memory.current | Cleaner name |
| n/a | memory.low / memory.min | New: memory protection guarantees |
| n/a | memory.pressure | New: PSI stall metrics |

The big behavioural change is memory.high. In v1, memory.soft_limit_in_bytes was almost useless. The kernel’s reclaim behaviour around it was unpredictable and slow. In v2, memory.high actively throttles allocations and triggers direct reclaim when the cgroup hits this threshold. Most workloads feel this as backpressure rather than a kill, which is exactly the point. It is not a kill-prevention guarantee, though: if a process allocates faster than the kernel can reclaim, memory.max (or the global OOM killer) is still in play.

memory.low and memory.min are entirely new concepts: memory protection. memory.min is a hard reclaim barrier within its effective range: a child cannot protect more memory than its parents already protect, because the kernel caps the child’s effective memory.min to whatever the ancestor chain allows. memory.low is best-effort protection on the same model; it shields the cgroup during system-wide reclaim, but under local cgroup reclaim (the cgroup is at its own memory.high or memory.max) the kernel will still reclaim from it. Together they let memory-sensitive workloads coexist with bursty ones on the same node without one starving the other.

I/O

| cgroupsv1 | cgroupsv2 | What Changed |
| --- | --- | --- |
| blkio.weight | io.weight | Controller renamed entirely |
| blkio.throttle.read_bps_device | io.max | Unified format per device |
| n/a | io.latency | New: latency-targeted I/O control |
| n/a | io.pressure | New: PSI stall metrics |

The controller was renamed from blkio to io. The interface was cleaned up dramatically: instead of separate files for read bytes, write bytes, read IOPS, and write IOPS, io.max uses a unified per-device format:

8:0 rbps=1048576 wbps=max riops=1000 wiops=max

Device 8:0, read bandwidth limited to 1 MiB/s (1048576 bytes per second), write bandwidth unlimited, read IOPS limited to 1000, write IOPS unlimited. One file, one line per device, all limits together.
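Because every limit sits on one line as key=value pairs, parsing an io.max entry takes only a few lines. A Python sketch (the helper name is an assumption, and None stands in for "max"):

```python
def parse_io_max_line(line):
    """Split one io.max line into (device, {limit_name: value_or_None})."""
    device, *pairs = line.split()
    limits = {}
    for pair in pairs:
        key, value = pair.split("=", 1)
        limits[key] = None if value == "max" else int(value)
    return device, limits


device, limits = parse_io_max_line("8:0 rbps=1048576 wbps=max riops=1000 wiops=max")
print(device, limits)
```

The device field is the block device's major:minor number, so a real tool would still need to map 8:0 back to a device name.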

io.latency is new and powerful: instead of specifying throughput limits, you specify a target latency and the kernel throttles competing workloads to achieve it. This is much closer to what real applications actually care about.

Cleaned-Up Naming

v2 standardised the verbs across all controllers:

  • max: hard limits (exceed triggers enforcement)
  • high: soft limits (exceed triggers backpressure)
  • low / min: protection (guaranteed minimums)
  • current / stat: usage and metrics

Memory controller values are in bytes, not pages, so no more multiplying by the kernel’s page size to know what you wrote. "max" as a string means unlimited. No more _in_bytes suffixes everywhere. Consistent naming: <controller>.<metric> end to end.


PSI: The Feature That Justifies the Rewrite

If there’s one feature that pays for cgroupsv2’s existence on its own, it’s PSI, or Pressure Stall Information.

PSI itself isn’t new to cgroupsv2. It landed in Linux 4.20 (2018) and has always exposed system-wide files under /proc/pressure/cpu, /proc/pressure/memory, and /proc/pressure/io. What v2 adds is the same signal per cgroup: every cgroupv2 directory gets its own cpu.pressure, memory.pressure, and io.pressure file, scoped to the processes inside it. That’s the missing piece v1 simply couldn’t produce.

These tell you how much time tasks in the cgroup are stalling: waiting for a resource instead of doing useful work.

cat /sys/fs/cgroup/kubepods.slice/memory.pressure
# some avg10=2.50 avg60=1.30 avg300=0.85 total=48291738
# full avg10=0.00 avg60=0.00 avg300=0.00 total=0

some means at least one task is stalling. full means all tasks are stalling. The avg10, avg60, avg300 numbers are percentages over 10-, 60-, and 300-second windows.
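The format is regular enough to parse by hand. A minimal Python sketch (parse_pressure is a made-up name) that turns a pressure file into nested dictionaries, keeping the averages as floats and total, which the kernel reports in microseconds, as an integer:

```python
def parse_pressure(text):
    """Parse cpu.pressure / memory.pressure / io.pressure contents."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        kind, pairs = fields[0], fields[1:]   # kind is "some" or "full"
        values = {}
        for pair in pairs:
            key, raw = pair.split("=", 1)
            values[key] = int(raw) if key == "total" else float(raw)
        result[kind] = values
    return result


sample = (
    "some avg10=2.50 avg60=1.30 avg300=0.85 total=48291738\n"
    "full avg10=0.00 avg60=0.00 avg300=0.00 total=0\n"
)
psi = parse_pressure(sample)
print(psi["some"]["avg10"])
```

From here an alert is a comparison, for example flagging a cgroup when psi["some"]["avg10"] stays above whatever threshold suits the workload.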

This is transformative for monitoring. Instead of waiting for an OOM kill to tell you something went wrong (reactive), you can watch pressure metrics climb and take action before it happens (proactive). some avg10=5.20 means tasks were stalled on memory 5.2% of the last ten seconds. A yellow flag. Scale horizontally before pressure hits 20%. Throttle ingestion before I/O pressure spikes.

Combined with memory.events (which counts OOM kills and how often the cgroup hit its low, high, and max boundaries), you get much better visibility into the kernel’s reclaim, throttling, and OOM behaviour for a specific workload, without having to attribute system-wide signals back to a guilty cgroup yourself. memory.events isn’t a one-off: other v2 controllers expose counters in a similar line-oriented key-value shape (pids.events for refused forks, cpu.stat for usage and throttling counters, io.stat for per-device I/O), so once you’ve learned to read one you’ve learned to read most of them.


Kubernetes Maps Policy Onto Cgroups

Everything we’ve talked about so far is what Kubernetes builds on. When you set resources.requests and resources.limits in a pod spec, the kubelet and container runtime translate those settings into cgroup configuration. Nothing mystical.

But before we look at the mapping, two pieces of vocabulary the kubelet uses heavily: slices and scopes.

On most modern Kubernetes Linux nodes, systemd is the host’s cgroup manager: it creates, names, and manages cgroup directories on the kernel’s behalf, and everything else (the kubelet, the container runtime) integrates with it rather than writing to /sys/fs/cgroup/ directly. .slice and .scope are simply systemd’s naming conventions for those cgroups. A .slice is a cgroup used to organise other cgroups (an internal node in the tree). A .scope is a cgroup whose processes were started outside the unit itself (a leaf node containing externally-launched PIDs, like the ones a container runtime hands to systemd). That’s why cgroup paths on nodes using the systemd cgroup driver look like systemd named them. Because systemd did.

Pod spec (YAML):

resources:
  requests:
    cpu: "500m"
    memory: "256Mi"
  limits:
    cpu: "500m"
    memory: "256Mi"

Cgroup mapping (cgroups v2):

# cgroup path (systemd driver):
/kubepods.slice/
  kubepods-pod<uid>.slice/
    cri-containerd-<id>.scope
# cpu.max
50000 100000 # 500m = 50% of 1 core
# memory.max
268435456 # 256Mi
# memory.min (only when MemoryQoS alpha gate is on)
268435456 # 256Mi (protected, requests == limits)
# OOM score adjustment: -997
# QoS: Guaranteed (requests == limits)

Guaranteed: OOM score adjustment of -997 makes Guaranteed pods the last targets for the kernel OOM killer. memory.min is only written per-pod when the alpha MemoryQoS feature gate is enabled; with the default kubelet, requests influence scheduling and eviction but do not set memory.min on individual pod cgroups.

How Kubernetes Lays Out the Cgroup Tree

On a node using the systemd cgroup driver with cgroupsv2, the Kubernetes cgroup layout looks like this:

/sys/fs/cgroup/
└── kubepods.slice/
    ├── kubepods-pod<uid>.slice/ # Guaranteed pods
    │   └── cri-containerd-<container-id>.scope
    ├── kubepods-burstable.slice/ # Burstable pods
    │   └── kubepods-burstable-pod<uid>.slice/
    │       └── cri-containerd-<container-id>.scope
    └── kubepods-besteffort.slice/ # BestEffort pods
        └── kubepods-besteffort-pod<uid>.slice/
            └── cri-containerd-<container-id>.scope

The kubelet manages the pod and QoS cgroups, the container runtime manages the container cgroups, and systemd is the host cgroup manager they integrate with.
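Because the QoS class is encoded in the path, you can recover it from a cgroup path alone. The Python sketch below assumes exactly the systemd-driver layout drawn above (the cgroupfs driver names things differently), and the function name is hypothetical.

```python
def qos_from_cgroup_path(path):
    """Infer a pod's QoS class from its cgroup path (systemd driver layout)."""
    parts = [p for p in path.split("/") if p]
    if "kubepods.slice" not in parts:
        raise ValueError("not a kubepods cgroup path")
    if "kubepods-burstable.slice" in parts:
        return "Burstable"
    if "kubepods-besteffort.slice" in parts:
        return "BestEffort"
    # Guaranteed pods sit directly under kubepods.slice.
    return "Guaranteed"


print(qos_from_cgroup_path(
    "/kubepods.slice/kubepods-pod1a2b3c.slice/cri-containerd-a1b2c3d4e5f6.scope"
))
```

Paired with the contents of /proc/<pid>/cgroup, this is enough to answer "which QoS class does this PID run under?" from the node itself.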

QoS Classes Shape Cgroups and OOM Priority

Kubernetes Quality of Service classes shape the cgroup layout and the kernel OOM score adjustment. Node-pressure eviction is a little more nuanced: the kubelet sorts pods first by whether their usage exceeds their requests (over-request pods go before under-request ones), then by Pod priority, then by how far usage has climbed past the request. QoS is still the useful mental model: BestEffort tends to go first, Guaranteed tends to go last.

Guaranteed pods have CPU and memory requests equal to limits for every container (or the equivalent pod-level resources, Kubernetes 1.32+). They get their own slice directly under kubepods.slice. The kubelet sets their OOM score adjustment to -997, making them the last to be killed by the kernel OOM killer.

Burstable pods have at least one CPU or memory request or limit, but do not meet the Guaranteed rules. They go under kubepods-burstable.slice. Their OOM score is adjusted based on the ratio of memory request to node capacity: a larger request means a lower score, which means the pod is killed later.

BestEffort pods have no CPU or memory requests or limits at either container or pod level. They go under kubepods-besteffort.slice with an OOM score adjustment of 1000. When the node runs out of memory, these are the easiest targets.
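The three rules above can be sketched as a small shell function. This is a minimal illustration, not the kubelet's code: the function name oom_score_adj is hypothetical, and the clamp bounds for Burstable pods (3 and 999) are an assumption about how the kubelet keeps Burstable scores strictly between the Guaranteed and BestEffort values. They may differ between versions.

```shell
#!/usr/bin/env bash
# Hypothetical helper mirroring the per-class rules described above.
# Arguments: QoS class, container memory request (bytes), node memory capacity (bytes).
oom_score_adj() {
  local qos=$1 request=$2 capacity=$3 adj
  case "$qos" in
    Guaranteed) printf '%s\n' "-997"; return ;;
    BestEffort) printf '%s\n' "1000"; return ;;
  esac
  # Burstable: requesting a larger share of node memory gives a lower score.
  adj=$(( 1000 - (1000 * request) / capacity ))
  # Assumed clamps: stay above Guaranteed (1000 - 997 = 3) and below BestEffort (1000).
  if (( adj < 3 )); then adj=3; fi
  if (( adj > 999 )); then adj=999; fi
  printf '%s\n' "$adj"
}

# A Burstable container requesting 4 GiB on a 16 GiB node.
oom_score_adj Burstable 4294967296 17179869184   # prints: 750
```

A Burstable container with no memory request lands at 999, one step below BestEffort, which is why it still outlives BestEffort pods under the kernel OOM killer.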

With the optional MemoryQoS feature on cgroupsv2 nodes, Kubernetes can also write memory.min, memory.low, and memory.high. As of Kubernetes 1.36, tiered memory reservation is still alpha and opt-in: with memoryReservationPolicy: TieredReservation, Guaranteed pods get memory.min protection and Burstable pods get memory.low protection. With the default reservation policy, requests do not automatically become memory.min or memory.low.

From Pod Spec to Cgroup Files

The default translation is:

resources:
  requests:
    cpu: "500m"       # influences CPU shares / cpu.weight
    memory: "256Mi"   # scheduler placement and eviction accounting
  limits:
    cpu: "500m"       # becomes cpu.max = "50000 100000" (50% of one core)
    memory: "256Mi"   # becomes memory.max = 268435456 (bytes)

CPU requests become proportional CPU scheduling weight. Kubernetes still starts from the v1-era cpu.shares calculation, and the OCI runtime maps those shares to cgroupsv2 cpu.weight. The exact number depends on the runtime version:

Request   Newer runc/crun cpu.weight   Older linear conversion
100m      ~17                          ~4
1000m     ~100                         ~39
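The "older linear conversion" column can be reproduced with integer arithmetic. The sketch below assumes the kubelet's v1-era shares calculation (1024 shares per core, with a floor of 2) and a linear mapping from the v1 shares range [2, 262144] onto the v2 weight range [1, 10000]. The function names are hypothetical, and the non-linear mapping used by newer runc/crun releases is not reproduced here.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; the formulas are assumptions based on the
# v1-era shares math and the original linear shares-to-weight mapping.

# CPU request in millicores -> cgroup v1 cpu.shares (1024 per core, floor of 2).
shares_for_millicores() {
  local shares=$(( $1 * 1024 / 1000 ))
  if (( shares < 2 )); then shares=2; fi
  echo "$shares"
}

# cpu.shares in [2, 262144] -> cpu.weight in [1, 10000], linear mapping.
weight_for_shares() {
  echo $(( 1 + ( ($1 - 2) * 9999 ) / 262142 ))
}

for request in 100 1000; do
  shares=$(shares_for_millicores "$request")
  echo "${request}m -> shares=${shares} weight=$(weight_for_shares "$shares")"
done
# prints:
# 100m -> shares=102 weight=4
# 1000m -> shares=1024 weight=39
```

The linear mapping gives a one-core request a weight of 39, well below the cgroups v2 default weight of 100, which is the gap the newer conversion closes.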

The Linux CPU scheduler uses that weight for proportional CPU time when cores are contended. CPU limits become cpu.max: 500m translates to a quota of 50000 µs per 100000 µs period, which is 50% of one core. Memory limits map directly to memory.max in bytes. Memory requests influence scheduling, eviction, and optional MemoryQoS protection when that feature is enabled and configured.
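The limit arithmetic is simple enough to check by hand. A minimal sketch, assuming the 100000 microsecond period used in the example above; the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers illustrating the limit arithmetic described above.

# CPU limit in millicores -> "quota period" line for cpu.max,
# assuming a 100000 microsecond period.
cpu_max_for_millicores() {
  local period=100000
  echo "$(( $1 * period / 1000 )) ${period}"
}

# Memory limit in Mi (mebibytes) -> bytes for memory.max.
memory_max_for_mi() {
  echo $(( $1 * 1024 * 1024 ))
}

cpu_max_for_millicores 500   # prints: 50000 100000
memory_max_for_mi 256        # prints: 268435456
```

A limit above one core simply produces a quota larger than the period: 2000m comes out as 200000 100000.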

The Cgroup Driver Choice

The kubelet has two options for managing the cgroup tree.

cgroupfs driver: the kubelet directly creates directories and writes to cgroup files. Simple, but it creates a split-brain situation: systemd (which manages system services and their cgroups) doesn’t know about the kubelet’s cgroups, and the kubelet doesn’t know about systemd’s. Under memory pressure this can lead to instability because systemd and the kubelet make independent, potentially conflicting resource decisions.

systemd driver: the kubelet and runtime integrate with systemd, often through transient units, and systemd manages the cgroup tree. Single source of truth. Both the kubelet and system services are visible in the same hierarchy, managed by the same process.

The paths look noticeably different. With the cgroupfs driver, the kubelet writes plain, kubelet-shaped directories with no suffixes:

/sys/fs/cgroup/kubepods/
├── pod<uid>/
│   └── <container-id>/
├── burstable/
│   └── pod<uid>/
│       └── <container-id>/
└── besteffort/
    └── pod<uid>/
        └── <container-id>/

With the systemd driver, the same logical tree comes out with systemd’s .slice/.scope suffixes baked into every directory name — that’s the layout shown earlier under “How Kubernetes Lays Out the Cgroup Tree.” Same structure, different names; tools that grep by path need to handle both.
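For tooling that has to cope with both layouts, a path builder along these lines is one option. This is a sketch only: pod_cgroup_path is a hypothetical name, the UID in the demo is made up, and the dash-to-underscore rewrite of the pod UID under the systemd driver is an assumption you should confirm against a real node.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the pod-level cgroup path for either driver.
# Arguments: driver (systemd|cgroupfs), QoS class, pod UID.
pod_cgroup_path() {
  local driver=$1 qos=$2 uid=$3 tier
  # Guaranteed pods sit directly under the root; the other classes get a tier.
  case "$qos" in
    Guaranteed) tier="" ;;
    Burstable)  tier="burstable" ;;
    BestEffort) tier="besteffort" ;;
    *) echo "unknown QoS class: $qos" >&2; return 1 ;;
  esac
  if [[ $driver == systemd ]]; then
    # Assumption: dashes in the UID become underscores, because "-" is
    # the hierarchy separator in systemd slice names.
    uid=${uid//-/_}
    if [[ -z $tier ]]; then
      echo "/sys/fs/cgroup/kubepods.slice/kubepods-pod${uid}.slice"
    else
      echo "/sys/fs/cgroup/kubepods.slice/kubepods-${tier}.slice/kubepods-${tier}-pod${uid}.slice"
    fi
  else
    echo "/sys/fs/cgroup/kubepods/${tier:+$tier/}pod${uid}"
  fi
}

pod_cgroup_path systemd Burstable 1234-abcd
# prints: /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod1234_abcd.slice
pod_cgroup_path cgroupfs Burstable 1234-abcd
# prints: /sys/fs/cgroup/kubepods/burstable/pod1234-abcd
```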

For kubeadm clusters, Kubernetes 1.22 made systemd the recommended driver and switched kubeadm’s own default to it (we all remember this migration, and not fondly; right?). The kubelet’s standalone default is still cgroupfs, but Kubernetes recommends systemd whenever systemd is the init system, and requires the kubelet and runtime to use the systemd cgroup driver for cgroupsv2 support. If you’re still on cgroupfs with systemd, you’re running against the grain.

How Monitoring Reads Cgroups

When you run kubectl top pods, here’s the chain:

  1. kubectl top queries the metrics API.
  2. metrics-server scrapes the kubelet’s /metrics/resource endpoint.
  3. The kubelet reads cgroup stats such as cpu.stat and memory.current from each container’s cgroup directory.
  4. cAdvisor (embedded in the kubelet) continuously scrapes the cgroup filesystem for detailed metrics.

On cgroupsv2, cAdvisor reads from a single unified hierarchy. On v1, it had to stitch together data from multiple mount points and correlate by container ID or cgroup path. The v2 path is faster, simpler, and has fewer failure modes.
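To make the v2 read path concrete, here is a rough sketch of turning two cpu.stat readings into a CPU rate. It is not how cAdvisor is implemented; the helper names are hypothetical and the sample values are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; the sample values below are invented.

# Extract usage_usec from the contents of a cpu.stat file read on stdin.
usage_usec() {
  awk '$1 == "usage_usec" { print $2 }'
}

# Average CPU use in millicores between two usage_usec samples
# taken interval_usec microseconds apart.
cpu_millicores() {
  local before=$1 after=$2 interval_usec=$3
  echo $(( (after - before) * 1000 / interval_usec ))
}

sample='usage_usec 2500000
user_usec 2000000
system_usec 500000'
printf '%s\n' "$sample" | usage_usec     # prints: 2500000
cpu_millicores 2500000 3000000 1000000   # prints: 500
```

On a real node you would feed usage_usec from a container's cpu.stat file twice, a known interval apart, and pass both readings to cpu_millicores.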


Operating It

Which version am I running?

Terminal window
# Linux only. Returns "cgroup2fs" for v2, "tmpfs" for v1.
stat -fc %T /sys/fs/cgroup/
# Or look for the unified hierarchy marker
test -f /sys/fs/cgroup/cgroup.controllers && echo "v2" || echo "v1"

As of 2026, cgroupsv2 is the default on the mainstream modern Linux node images: Ubuntu 22.04+ (21.10+ upstream), Fedora 31+, RHEL 9+, Debian 11+, Flatcar Container Linux, Talos Linux, and other current container-focused distributions. Managed Kubernetes has mostly moved too: GKE defaults to v2 for 1.26+ node pools, AKS for 1.25+, and new EKS managed node groups on Kubernetes 1.30+ default to AL2023, which uses cgroup v2. Existing clusters and pinned node images can still be on v1, so check before migrating workloads.

Migration Gotchas

  • Monitoring tools with hardcoded v1 paths. Anything reading from /sys/fs/cgroup/memory/docker/ is v1-specific and will break on v2. Most modern stacks handle both, but custom scripts and older agents may need updates.
  • Container runtimes. containerd 1.4+ and CRI-O 1.20+ have v2 support. Docker Engine 20.10+ works with v2 when using the systemd cgroup driver. runc 1.0.0+ has full v2 support; older versions are incomplete.
  • JVM applications. The JVM’s cgroup-aware memory detection was rewritten for v2. JDKs older than 15, 11.0.16, or 8u372 do not detect v2 memory limits at all; they fall back to host memory and will quietly try to use far more memory than the cgroup allows. This is one of the most common “my Java app OOMs on new nodes” issues, and in 2026 there is no excuse for still being on those JVM versions — upgrade.
  • Hybrid mode. Booting with systemd.unified_cgroup_hierarchy=0 (or older distro defaults) mounts the v2 unified hierarchy alongside the legacy v1 per-controller hierarchies, so processes show up in both trees at once. It works, but it’s explicitly a transition mechanism, not a destination — accounting splits across two interfaces and tools have to guess which one is authoritative. Pick one and commit.
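For the custom-script case in the first bullet, one way to stop hardcoding a version is to probe for whichever limit file the directory exposes. A minimal sketch: memory_limit is a hypothetical helper, and the caller is still responsible for resolving the right directory, since v1 keeps memory files under a per-controller mount while v2 uses the unified tree.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a memory limit from a cgroup directory,
# whichever interface version that directory exposes.
memory_limit() {
  local dir=$1
  if [[ -r "$dir/memory.max" ]]; then
    cat "$dir/memory.max"              # cgroups v2 ("max" means unlimited)
  elif [[ -r "$dir/memory.limit_in_bytes" ]]; then
    cat "$dir/memory.limit_in_bytes"   # cgroups v1
  else
    echo "no memory limit file in $dir" >&2
    return 1
  fi
}

# Demo against a throwaway directory standing in for a real cgroup.
demo=$(mktemp -d)
echo 268435456 > "$demo/memory.max"
memory_limit "$demo"   # prints: 268435456
rm -rf "$demo"
```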

PSI in Practice

The PSI metrics from earlier aren’t only useful for ad-hoc cat commands. Kubernetes added kubelet PSI metrics as an alpha feature in 1.33, promoted them to beta in 1.34, and made them stable and enabled by default in 1.36. On cgroupsv2 nodes with PSI enabled in the kernel, the kubelet exposes node-, pod-, and container-level pressure data through the Summary API and /metrics/cadvisor.
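If you want the raw numbers without going through the kubelet, the pressure files are plain text and easy to parse. The sketch below assumes the usual two-line "some"/"full" layout with avg10, avg60, avg300, and total fields; psi_field is a hypothetical helper and the sample values are made up.

```shell
#!/usr/bin/env bash
# Hypothetical helper: pull one field out of PSI text read on stdin.
# Arguments: line kind (some|full), field name (avg10|avg60|avg300|total).
psi_field() {
  awk -v kind="$1" -v field="$2" '
    $1 == kind {
      for (i = 2; i <= NF; i++) {
        split($i, kv, "=")
        if (kv[1] == field) print kv[2]
      }
    }'
}

# Invented sample in the shape of a memory.pressure file.
sample='some avg10=1.25 avg60=0.40 avg300=0.10 total=123456
full avg10=0.00 avg60=0.00 avg300=0.00 total=0'
printf '%s\n' "$sample" | psi_field some avg10   # prints: 1.25
printf '%s\n' "$sample" | psi_field full total   # prints: 0
```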

eBPF

One bonus from v2’s unified hierarchy: BPF programs that attach to cgroup paths (for network filtering, socket operations, device policy) now have an unambiguous path to attach to. Cilium, Calico, and other CNI plugins that use eBPF for cgroup-aware policy get this for free.


Where We’re Going

cgroupsv1 is legacy. The kernel’s authoritative cgroup documentation is now the v2 interface, Kubernetes deprecated cgroup v1 in 1.35, and new feature work targets v2. v2’s explicit controller delegation is also what makes rootless containers viable: systemd hands an unprivileged user their own subtree via a user-level .slice, with only the controllers the admin chose to delegate, and tools like Podman and rootless Docker drive resource limits inside it without ever touching root-owned cgroup files. What’s ahead is incremental: better NUMA-aware memory placement, finer delegation primitives for those rootless trees. But the structural story is settled.

Much as Kubernetes is YAML and controllers all the way down, cgroups are directories and files on the surface, with kernel code doing the work behind every write.
