nice, ionice & cgroups

A backup job is eating the CPU your service needs. A batch import is hammering the disk during peak traffic. A container with healthy-looking average CPU keeps blowing its p99. All three are one question wearing different clothes: how do I make this process matter less — or more — and why is mine being throttled? Linux has answered that question three times: nice is a hint, ionice is a hint for the disk, and cgroups are a fence. This page covers the five invocations worth knowing, what a nice value really does to scheduler weight, the one cgroup counter that explains container latency spikes, and a drill you can run without breaking anything.

The question it answers

Every machine you run is a fight over shared resources, and most of the time the kernel referees that fight well enough that you never think about it. Then one day a cron job starts at 02:00, your service's latency triples, and you discover that the default rule — everybody gets a fair share — is exactly the wrong rule for this box. The service is the point of the machine. The backup is a guest. You want a way to say so.

Linux gives you three generations of the same idea, and they are not interchangeable. nice is the oldest, inherited from early Unix: a single number per process, -20 to 19, that tilts the CPU scheduler's idea of fairness. It is a hint. It promises nothing, costs nothing, and only means anything when two processes actually want the CPU at the same moment. ionice is the same thought applied to the disk: a class and a level that tell the IO scheduler whose reads and writes should go first when the queue is full. Also a hint, and — as we will see — a hint that some IO schedulers ignore entirely. cgroups are the modern answer and a different kind of thing: not a hint but a fence. A cgroup can say "this group of processes gets at most half a CPU, at most 2 GB of memory, this much IO bandwidth," and the kernel enforces it whether or not anyone else wants the resource.

The fence is what containers are made of. Every Kubernetes pod, every Docker container, every systemd service on a modern distribution lives inside a cgroup, and the limits you write in a pod spec become numbers in files under /sys/fs/cgroup. Which means the second half of this page's question — "why is my container being throttled?" — is not a container question at all. It is a cgroup question, it has a precise answer, and the answer is sitting in a file called cpu.stat waiting for you to read it. Most engineers never do, and spend an afternoon blaming the network instead.

Five invocations that cover the work

Between the three tools there is a lot of surface area, but day to day you need five moves: start something polite, make something already running polite, demote a disk hog, put a hard cap on a process without writing a config file, and read the throttling counters when a capped process misbehaves.

Invocation	What it does	When you reach for it
`nice -n 10 cmd`	Starts `cmd` with nice 10: lower CPU weight, yields under contention	Batch jobs, backups, anything that should lose every CPU argument
`renice -n 5 -p PID`	Changes the nice value of a running process (and `-g` for a group, `-u` for a user)	The job is already running and already hurting; you cannot restart it
`ionice -c2 -n7 cmd`	Best-effort IO class, lowest level; `-c3` is idle: disk only when nobody else wants it	Disk-heavy batch work next to a latency-sensitive service
`systemd-run --scope -p CPUQuota=50% cmd`	Runs `cmd` in a fresh cgroup capped at half a CPU, with no files to edit (run it with sudo)	You need a hard guarantee: the process cannot exceed the cap even on an idle box
`cat /sys/fs/cgroup/<path>/cpu.stat`	Throttling counters for a cgroup: `nr_throttled`, `throttled_usec`	A container with CPU limits has bad tail latency and a clean-looking average

One more line glues these together: cat /proc/PID/cgroup tells you which cgroup a process belongs to, which is how you find the right cpu.stat to read in the first place. On a cgroup v2 system the output is a single line starting with 0:: followed by a path; append that path to /sys/fs/cgroup and you are standing in the directory that controls the process's world.
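The path-gluing step can be wrapped in a few lines of shell. This is a minimal sketch that assumes a cgroup v2 system; cgroup_dir is a name made up for this example, and the optional second argument exists only so the function can be tried against a saved file.

```shell
# Resolve the cgroup v2 directory for a process.
# Usage: cgroup_dir PID          (reads /proc/PID/cgroup)
#        cgroup_dir PID FILE     (reads FILE instead, handy for experiments)
cgroup_dir() {
    pid=$1
    file=${2:-/proc/$pid/cgroup}
    # On cgroup v2 the relevant line is "0::<path>"; strip the prefix.
    path=$(sed -n 's/^0:://p' "$file" | head -n 1)
    [ -n "$path" ] || return 1    # no v2 entry found
    printf '%s\n' "/sys/fs/cgroup${path}"
}
```

With that in place, cat "$(cgroup_dir 51208)/cpu.stat" reads the counters for whatever group PID 51208 lives in. On a v1 or hybrid system the function returns non-zero instead of guessing.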

Stack the hints. For the classic "run this big job without hurting anything" move, the polite incantation combines both hints: nice -n 19 ionice -c3 tar czf /backup/snap.tgz /data. CPU weight at the floor, disk access only when the queue is otherwise empty. If you need a guarantee instead of a courtesy, wrap it in systemd-run with a quota — the three compose without conflict because they act on different mechanisms.

Reading the output

What top shows, and what nice 10 really buys

Start with the view you already have open during an incident. Pin two CPU-burning processes to the same core — contention is the whole point — give one of them nice 10, and look at top:

$ taskset -c 0 yes > /dev/null &
$ taskset -c 0 nice -n 10 yes > /dev/null &
$ top -b -n 1 -p 7011,7014
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 7011 deploy    20   0    5316    640    576 R  90.3   0.0   1:12.20 yes
 7014 deploy    30  10    5316    644    580 R   9.7   0.0   0:07.71 yes

Two columns matter. NI is the nice value you set. PR is the kernel's priority number, which for normal processes is just 20 + NI — so nice 0 shows PR 20 and nice 10 shows PR 30, and the column is telling you the same thing twice. The interesting number is %CPU: 90.3 against 9.7, not 50/50 and not 100/0. Nice 10 did not pause the second process or cap it at some percentage. It changed its weight, and the scheduler divided the contested core in proportion to the weights.

The weights themselves are worth seeing once because they explain the ratio. The kernel maps each nice level to a weight from a fixed table, scaled so that each step of one nice level changes a process's share by roughly 10% relative to its neighbours — a factor of about 1.25 per level. Nice 0 is weight 1024. Nice 10 is weight 110. Nice -5 is weight 3121. Put a 1024 next to a 110 on one core and the split is 1024/1134 against 110/1134 — the 90/10 you just watched. Put nice -5 next to nice 0 and the favoured process gets about 75%. The numbers feel arbitrary until you see them as one fixed ratio compounded: ten steps of 1.25 is roughly 9.3x, and 9.3-to-1 is your 90/10.

The same contested core under two weight assignments. Each nice step shifts the ratio by about 1.25x; ten steps compound to roughly 9-to-1. Nothing is capped — if the service goes idle, the backup takes the whole pie.
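The arithmetic is easy to replay. The sketch below uses only the three weights quoted above (1024, 110, 3121); share is a hypothetical helper name, and awk does the floating-point work.

```shell
# share A B: percentage of one contested core that weight A gets against weight B.
share() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", 100 * a / (a + b) }'
}

share 1024 110     # nice 0 against nice 10   -> 90.3
share 110 1024     # the nice 10 side         -> 9.7
share 3121 1024    # nice -5 against nice 0   -> 75.3

# Ten steps of 1.25, compounded:
awk 'BEGIN { printf "%.1f\n", 1.25 ^ 10 }'    # -> 9.3
```

The formula only describes two always-runnable tasks on one core. With more tasks, or tasks that sleep, the split follows the same weights but the percentages change.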

cpu.stat, line by line

Now the fence. When a cgroup has a CPU quota, the kernel accounts for time in fixed periods — 100 ms by default — and when the group has burned its quota for the current period, every runnable thread in it is taken off the CPU until the next period starts. That word "throttled" is not a metaphor. The processes are paused, mid-request, and the place where the kernel admits to doing it is cpu.stat:

$ cat /sys/fs/cgroup/system.slice/run-r4f8a1.scope/cpu.stat
usage_usec 48731022
user_usec 48214110
system_usec 516912
nr_periods 974
nr_throttled 951
throttled_usec 41216339
nr_bursts 0
burst_usec 0

Read it from the top. usage_usec is total CPU time the group has consumed, in microseconds, split into user_usec and system_usec below it: useful, but not the headline. nr_periods counts how many 100 ms accounting windows have elapsed while the group had runnable work. nr_throttled counts how many of those windows ended with the group forcibly paused because it had spent its quota. For a container with a CPU limit, this is the single most useful line in the file. Here it reads 951 out of 974: in almost 98% of periods, this workload hit its cap and got benched. throttled_usec is the total time spent in that benched state: 41 seconds during which threads were runnable, had work, and were not allowed to run. Every one of those microseconds came out of somebody's request latency.

A healthy capped workload shows nr_throttled at or near zero, or growing so slowly it does not matter. A workload that is throttled in most periods is misconfigured, under-provisioned, or both — and the trap is that its average CPU usage can look modest the whole time, because the average includes all the time it spent forcibly idle. The counters only ever increase, so what you actually watch is the rate: read the file twice, thirty seconds apart, and difference the numbers.
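One way to do the read-twice-and-subtract step is to save two copies of cpu.stat and let awk difference them. This is a sketch, not a monitoring tool; throttle_ratio and the snapshot file names are invented for the example.

```shell
# throttle_ratio BEFORE AFTER
# Percentage of the periods elapsed between two saved copies of cpu.stat
# in which the group was throttled.
throttle_ratio() {
    awk '
        FNR == NR { before[$1] = $2; next }    # first file: remember counters
        { after[$1] = $2 }                     # second file
        END {
            periods   = after["nr_periods"]   - before["nr_periods"]
            throttled = after["nr_throttled"] - before["nr_throttled"]
            if (periods <= 0) { print "no periods elapsed"; exit }
            printf "%.1f\n", 100 * throttled / periods
        }
    ' "$1" "$2"
}

# Typical use, with <path> filled in for the cgroup you care about:
#   cp /sys/fs/cgroup/<path>/cpu.stat before.stat
#   sleep 30
#   cp /sys/fs/cgroup/<path>/cpu.stat after.stat
#   throttle_ratio before.stat after.stat
```

A result near zero is the healthy case. A result near 100 means almost every recent period ended with the group benched, whatever the lifetime totals say.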

Three production scenarios

The 02:00 backup that starves the service

Latency alerts fire in the small hours, and the graphs show a clean square wave: bad from 02:00 to 03:40, fine before and after. On the box, top shows the backup pegging two cores and iowait climbing: the service is losing the CPU argument and the disk argument at once. (If the picture is murkier than that, the systematic walk lives in "what's eating my CPU?") The immediate fix does not need a restart:

$ renice -n 19 -p 51208
51208 (process ID) old priority 0, new priority 19
$ ionice -c3 -p 51208
$ ionice -p 51208
idle

renice drops the CPU weight to the floor while the job keeps running; ionice -c3 -p moves its disk access to the idle class, meaning it gets IO service only when nobody else is asking. The backup now finishes later — possibly much later — and that is the trade you are explicitly making: courtesy hints sacrifice the guest's completion time to protect the host's latency. The durable fix is to bake both into the cron line (nice -n 19 ionice -c3 backup.sh) or, better, into the systemd unit that runs it, with Nice=19 and IOSchedulingClass=idle. And note what the hints cannot do: if the backup also fills the page cache or saturates the network, nice and ionice are silent on both. They cover CPU and disk queueing, nothing else.
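For the systemd route, the two directives can live in a drop-in next to the unit rather than in the unit file itself. The sketch below writes one; backup.service and the file name polite.conf are assumptions for illustration, while Nice= and IOSchedulingClass= are the directives named above.

```shell
# write_polite_dropin DIR: write a drop-in that applies both courtesy hints.
write_polite_dropin() {
    dir=$1
    mkdir -p "$dir"
    cat > "$dir/polite.conf" <<'EOF'
[Service]
Nice=19
IOSchedulingClass=idle
EOF
}

# Real use, as root, for a hypothetical backup.service:
#   write_polite_dropin /etc/systemd/system/backup.service.d
#   systemctl daemon-reload
```

A drop-in keeps the hint separate from whatever package or config management owns the unit file, so it survives upgrades of the unit.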

The container with clean averages and ugly p99s

A service in Kubernetes has a CPU limit of 1. Average utilisation sits at 40%, the dashboards are green, and yet p99 latency spikes hard several times a minute. The team suspects the network, then garbage collection, then a noisy neighbour. The real culprit is arithmetic. A CPU limit of 1 becomes a cgroup quota of 100 ms per 100 ms period. The service runs 8 worker threads. When a burst of requests lands, 8 threads run in parallel and burn the entire period's quota in 100/8 = 12.5 ms — and then the whole container is throttled for the remaining 87.5 ms. Any request that arrives during the freeze waits for the next period before a single instruction of it runs. Averaged over a second, the container used 0.4 cores and looks healthy. Inside each period, it sprinted and then stood completely still.

CFS bandwidth control with quota 100 ms, period 100 ms, 8 runnable threads. Each green block is the only time the container runs; each dashed gap increments nr_throttled and adds to throttled_usec.
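The 12.5 ms figure is worth being able to recompute for your own limit and thread count. A minimal sketch follows, assuming the burst is fully parallel and the machine has at least as many free cores as threads; burst_window is a made-up name.

```shell
# burst_window QUOTA_MS PERIOD_MS THREADS
# How long a fully parallel burst runs before the quota is gone, and how
# long the group then sits throttled, both in milliseconds.
burst_window() {
    awk -v q="$1" -v p="$2" -v n="$3" 'BEGIN {
        run = q / n                  # n threads drain the budget n times faster
        if (run > p) run = p         # quota outlasts the period: no throttling
        printf "run %.1f ms, throttled %.1f ms\n", run, p - run
    }'
}

burst_window 100 100 8    # -> run 12.5 ms, throttled 87.5 ms
burst_window 400 100 8    # limit of 4 CPUs -> run 50.0 ms, throttled 50.0 ms
```

This is the worst case. Real bursts rarely keep every thread busy for the whole window, so treat the output as an upper bound on the freeze.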

The diagnosis takes one minute once you know where to look. Find the pod's cgroup, read cpu.stat, and check nr_throttled against nr_periods. If the ratio is high, you are not guessing anymore. The fixes are all reasonable and all have costs: raise the limit so the quota fits the burst, shrink the thread pool so the burst fits the quota, or remove the CPU limit entirely and rely on requests. That last option is a long-running argument in the Kubernetes world, and the honest version of it is short: requests are weights (they become cpu.weight, the cgroup cousin of nice) and already protect neighbours under contention, so for latency-sensitive services many operators drop CPU limits and accept burstable usage; but limits still earn their keep when you need predictable performance for capacity planning, multi-tenant fairness that holds even when the box is idle, or protection from a runaway workload that scales with whatever it is given. Both camps are right about different clusters. What is not defensible is setting a limit and never once reading cpu.stat.

memory.max and the two OOM killers

CPU quotas throttle; memory limits kill. When a cgroup's usage hits memory.max and the kernel cannot reclaim enough from it, the cgroup OOM killer picks a process inside that group and sends it an unblockable SIGKILL. This is a different event from the global OOM killer, which fires when the whole machine is out of memory and hunts across every process on the host using its badness score. The distinction matters during a postmortem: a cgroup OOM kill means your limit was the wall (the host may have had plenty of free memory at the time), while a global OOM kill means the machine itself was drowning and your process may simply have been the unlucky giant. The kernel log line tells you which you got ("Memory cgroup out of memory" versus a plain "Out of memory"), and the cgroup keeps its own tally:

$ cat /sys/fs/cgroup/system.slice/myapp.service/memory.events
low 0
high 0
max 1842
oom 7
oom_kill 7

oom_kill 7 means seven processes have died at this fence. In Kubernetes this surfaces as OOMKilled in the pod status, and the right response is rarely "raise the limit and hope": find out whether the workload's real footprint grew, whether a leak is compounding, or whether the limit was a guess somebody made before the service had ever seen production traffic. There is also a softer dial, memory.high, which throttles and reclaims aggressively instead of killing — useful as an early-warning fence set below memory.max.
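When you want one counter out of memory.events rather than the whole file, for example to log it on a schedule and see when it moves, a helper along these lines is enough. mem_event is a hypothetical name; the key names are the ones shown in the output above.

```shell
# mem_event FILE KEY: print one counter from a memory.events file.
# Exits non-zero if the key is not present.
mem_event() {
    awk -v key="$2" '$1 == key { print $2; found = 1 } END { exit !found }' "$1"
}

# Example:
#   mem_event /sys/fs/cgroup/system.slice/myapp.service/memory.events oom_kill
```

Matching on the whole first field matters here: a plain grep for oom would also hit oom_kill.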

What's underneath

None of these knobs is magic, and the mental model gets much firmer once you see what each one actually moves. Start with nice. The Linux scheduler does not maintain a priority queue where higher-priority tasks always run first — that is real-time scheduling, a different policy. For normal tasks it runs a fair scheduler (CFS for many years; EEVDF in recent kernels, same idea with sharper latency math) that tracks each task's virtual runtime: the CPU time the task has consumed, scaled by its weight. A heavy task's clock ticks slowly, so the scheduler — which always wants to run whoever is furthest behind — keeps coming back to it. A nice 19 task's clock races, so it always looks well-fed and rarely gets picked while anyone else is waiting. That is the whole trick: nice changes the exchange rate between real CPU time and virtual time, and fairness does the rest. The full machinery — vruntime, timeslices, the run queue — is covered in scheduling, and you can watch weights fight each other interactively in the scheduler simulator.

cgroups graft a hierarchy onto this. Under cgroup v2 — the unified hierarchy, one tree for all controllers, mounted at /sys/fs/cgroup — every process lives in exactly one node of one tree, and resources are divided down the tree. Two files do the CPU work. cpu.weight (default 100, range 1 to 10000) is nice for groups: a proportional share that only matters under contention, exactly the same vruntime trick applied to a whole subtree at once. cpu.max is the fence: two numbers, quota and period, both in microseconds. 50000 100000 means "50 ms of CPU time per 100 ms window," which is the file systemd-run -p CPUQuota=50% writes for you, and the file a Kubernetes CPU limit ultimately becomes. The quota is a budget shared by every thread in the group across every core, which is why thread count matters so much in the throttling scenario: the budget drains in parallel.

$ cat /proc/self/cgroup
0::/user.slice/user-1000.slice/session-3.scope
$ cat /sys/fs/cgroup/user.slice/user-1000.slice/session-3.scope/cpu.max
max 100000

That 0:: prefix is the cgroup v2 signature — one line, one tree. (A v1 system prints a stack of lines, one per controller, each with its own path; more on that in the pitfalls.) The value max 100000 means no quota: this group can use every core in the machine, limited only by weights. Replace max with a number and the fence goes up. Everything systemd does with CPUQuota=, MemoryMax=, and IOWeight=, and everything a container runtime does with pod limits, bottoms out in writes to files like this one. The file is the interface; the tools are stationery.
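Translating cpu.max back into "how many cores is that" is a single division. The sketch below takes the file's contents as one string; cpu_max_cores is a name invented for the example.

```shell
# cpu_max_cores "QUOTA PERIOD": turn the contents of cpu.max into a core count.
cpu_max_cores() {
    set -- $1                       # split "quota period" into $1 and $2
    if [ "$1" = "max" ]; then
        echo "unlimited"
    else
        awk -v q="$1" -v p="$2" 'BEGIN { printf "%.2f\n", q / p }'
    fi
}

cpu_max_cores "50000 100000"     # -> 0.50
cpu_max_cores "max 100000"       # -> unlimited
cpu_max_cores "200000 100000"    # -> 2.00
```

In practice you would feed it the real file, as in cpu_max_cores "$(cat /sys/fs/cgroup/<path>/cpu.max)".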

Pitfalls

Expecting nice to do anything on an idle box. Nice is purely about contention. A nice 19 process on an otherwise idle machine gets 100% of a core, full speed, no penalty — the weight only matters when someone else with a better weight is runnable on the same CPU. This cuts both ways: it is why nicing a batch job is free insurance (it costs the job nothing at night, protects the service during the day), and it is why nice can never be a cap. If you tested nice -n 19 on a quiet staging box and saw no effect, the test was wrong, not the tool.

Assuming ionice always works. The IO priority hint is consumed by the block-layer IO scheduler, and only some of them care. BFQ honours classes and levels fully; mq-deadline largely ignores them; none — common and often correct on fast NVMe drives — ignores them by definition, because there is no scheduler to consult. Check what a disk is using with cat /sys/block/nvme0n1/queue/scheduler before trusting an ionice to protect you. The idle class (-c3) degrades gracefully — where unsupported it just does little — but "the backup is ioniced" is not evidence of anything until you know the scheduler underneath.
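The scheduler file lists every available scheduler and marks the active one with square brackets. A small sketch for pulling out just the active name follows. active_scheduler is a hypothetical helper, and it prints nothing if the line has no bracketed entry.

```shell
# active_scheduler "LINE": print the bracketed (active) name from the
# contents of /sys/block/<dev>/queue/scheduler.
active_scheduler() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

active_scheduler "none [mq-deadline] kyber bfq"    # -> mq-deadline
active_scheduler "[none] mq-deadline"              # -> none

# Against a real disk:
#   active_scheduler "$(cat /sys/block/nvme0n1/queue/scheduler)"
```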

cgroup v1 paths on a v2 mental model, and vice versa. The two hierarchies have different filenames for the same ideas, and copy-pasting advice across them fails silently. v1: cpu.cfs_quota_us and cpu.cfs_period_us as separate files, cpu.shares (default 1024) for weight, a separate tree per controller. v2: cpu.max holding both numbers, cpu.weight (default 100), one tree. /proc/PID/cgroup tells you which world you are in: one 0:: line is v2; many numbered lines is v1 or a hybrid. Older Kubernetes nodes and older container images are the usual place this bites — a debugging runbook written for one hierarchy reads like nonsense inside the other.
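The "which world am I in" check can be mechanised with the rule just described: a single 0:: line means v2, anything else means v1 or a hybrid. The sketch below applies only that rule; cgroup_version is a made-up name.

```shell
# cgroup_version FILE: classify a /proc/PID/cgroup-style file.
cgroup_version() {
    awk '
        /^0::/ { v2++ }
        { lines++ }
        END {
            if (lines == 0)                 print "unknown"
            else if (lines == 1 && v2 == 1) print "v2"
            else                            print "v1 or hybrid"
        }
    ' "$1"
}

# Example:
#   cgroup_version /proc/self/cgroup
```

Putting this at the top of a debugging script is a cheap guard against reading cpu.max on a node that only has cpu.cfs_quota_us.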

renice down requires privilege — and "down" is anything lower. Raising a nice value (making a process more polite) is open to its owner. Lowering it — including back to where it was — needs root or CAP_SYS_NICE. Renice your own process from 0 to 10 and you cannot return it to 0; the courtesy is a one-way door for unprivileged users (unless RLIMIT_NICE has been raised for you, which on most fleets it has not). The practical consequence: experiment with nice on fresh throwaway processes, not by renicing something you care about.

Reading cpu.stat once and concluding nothing. The counters are cumulative since the cgroup was created. nr_throttled 951 might be three weeks of history or the last ninety seconds. Sample twice, subtract, divide by the interval — throttled periods per second is the signal. The same applies to memory.events: an oom_kill count means little without knowing when it last moved.

A drill you can run right now

Everything below is safe on any Linux machine with systemd: it burns a little CPU on purpose, creates nothing permanent, and every process dies with a Ctrl-C or a kill. Fifteen minutes, and the three ideas — weight, quota, hierarchy — become things you have watched happen rather than read about.

Step 1 — watch a weight fight. Pin two busy loops to one core, handicap one of them, and watch top divide the spoils:

$ taskset -c 0 yes > /dev/null &
[1] 7011
$ taskset -c 0 nice -n 10 yes > /dev/null &
[2] 7014
$ top -p 7011,7014   # expect ~90% / ~10%, then in another shell:
$ sudo renice -n -5 -p 7014
7014 (process ID) old priority 10, new priority -5
$ kill %1 %2   # back in the first shell, after quitting top with q

With nice 0 against nice 10 you should see the 90/10 split from the weights table. The renice to -5 flips the fight: the formerly polite process now holds weight 3121 against 1024 and takes roughly three quarters of the core, live, with no restart — watch the %CPU columns trade places over a few refreshes. Note the sudo: going to a negative nice value is the privileged direction. Kill both loops when the numbers stop being interesting.

Step 2 — build a fence and watch it throttle. Cap a scope at half a core, stuff two cores' worth of demand inside it, and read the damage:

$ sudo systemd-run --scope -p CPUQuota=50% -- sh -c 'yes > /dev/null & yes > /dev/null & wait'
Running scope as unit: run-r4f8a1.scope
# in another terminal:
$ grep -E 'nr_periods|nr_throttled|throttled_usec' /sys/fs/cgroup/system.slice/run-r4f8a1.scope/cpu.stat
nr_periods 118
nr_throttled 118
throttled_usec 8841205
$ sleep 30; grep nr_throttled /sys/fs/cgroup/system.slice/run-r4f8a1.scope/cpu.stat
nr_throttled 418

Two yes loops want two full cores; the quota allows half of one. Every single period ends in throttling — nr_throttled tracks nr_periods almost one for one, climbing by ten per second, which is the rate measurement from the pitfalls done by hand. Meanwhile top shows each yes at about 25%: the fence holding. This is your Kubernetes p99 scenario in a bottle, minus the pager. Ctrl-C the systemd-run in the first terminal and the scope and everything in it goes away.

Step 3 — find your own cage. You have been inside a cgroup this whole time:

$ cat /proc/$$/cgroup
0::/user.slice/user-1000.slice/session-3.scope
$ cat /sys/fs/cgroup/user.slice/user-1000.slice/session-3.scope/cpu.max
max 100000
$ cat /sys/fs/cgroup/user.slice/user-1000.slice/session-3.scope/cpu.stat | head -4
usage_usec 184220471
user_usec 121448210
system_usec 62772261
nr_periods 0

Your shell session is a scope under your user slice; cpu.max reads max, so no fence, and nr_periods 0 confirms quota accounting has never engaged. On a laptop this is unremarkable. The point of looking is that the next time you exec into a production container and run the same two commands, you will recognise the shape instantly — a real path, a real quota, and a cpu.stat with a story in it.

If you remember one line. nice -n 19 ionice -c3 cmd to make a job polite, systemd-run --scope -p CPUQuota=50% cmd to make a cap, and grep throttled /sys/fs/cgroup/<path>/cpu.stat when a limited container has clean averages and dirty tail latency.
