nice, ionice & cgroups
A backup job is eating the CPU your service needs. A batch import is hammering the disk
during peak traffic. A container with healthy-looking average CPU keeps blowing its p99.
All three are one question wearing different clothes: how do I make this process
matter less — or more — and why is mine being throttled? Linux has answered that
question three times: nice is a hint, ionice is a hint for the
disk, and cgroups are a fence. This page covers the five invocations worth knowing, what
a nice value really does to scheduler weight, the one cgroup counter that explains
container latency spikes, and a drill you can run without breaking anything.
The question it answers
Every machine you run is a fight over shared resources, and most of the time the kernel referees that fight well enough that you never think about it. Then one day a cron job starts at 02:00, your service's latency triples, and you discover that the default rule — everybody gets a fair share — is exactly the wrong rule for this box. The service is the point of the machine. The backup is a guest. You want a way to say so.
Linux gives you three generations of the same idea, and they are not interchangeable.
nice is the oldest, inherited from early Unix: a single number per process,
-20 to 19, that tilts the CPU scheduler's idea of fairness. It is a hint. It promises
nothing, costs nothing, and only means anything when two processes actually want the CPU
at the same moment. ionice is the same thought applied to the disk: a class
and a level that tell the IO scheduler whose reads and writes should go first when the
queue is full. Also a hint, and — as we will see — a hint that some IO schedulers ignore
entirely. cgroups are the modern answer and a different kind of thing: not a hint but a
fence. A cgroup can say "this group of processes gets at most half a CPU, at most 2 GB
of memory, this much IO bandwidth," and the kernel enforces it whether or not anyone else
wants the resource.
The fence is what containers are made of. Every Kubernetes pod, every Docker container,
every systemd service on a modern distribution lives inside a cgroup, and the limits you
write in a pod spec become numbers in files under /sys/fs/cgroup. Which means
the second half of this page's question — "why is my container being throttled?" — is not
a container question at all. It is a cgroup question, it has a precise answer, and the
answer is sitting in a file called cpu.stat waiting for you to read it. Most
engineers never do, and spend an afternoon blaming the network instead.
Five invocations that cover the work
Between the three tools there is a lot of surface area, but day to day you need five moves: start something polite, make something already running polite, demote a disk hog, put a hard cap on a process without writing a config file, and read the throttling counters when a capped process misbehaves.
| Invocation | What it does | When you reach for it |
|---|---|---|
| nice -n 10 cmd | Starts cmd with nice 10: lower CPU weight, yields under contention | Batch jobs, backups, anything that should lose every CPU argument |
| renice -n 5 -p PID | Changes the nice value of a running process (and -g for a group, -u for a user) | The job is already running and already hurting; you cannot restart it |
| ionice -c2 -n7 cmd | Best-effort IO class, lowest level; -c3 is idle: disk only when nobody else wants it | Disk-heavy batch work next to a latency-sensitive service |
| systemd-run --scope -p CPUQuota=50% cmd | Runs cmd in a fresh cgroup capped at half a CPU, with no files to edit | Hard guarantees: the process cannot exceed the cap even on an idle box |
| cat /sys/fs/cgroup/<path>/cpu.stat | Throttling counters for a cgroup: nr_throttled, throttled_usec | A container with CPU limits has bad tail latency and a clean-looking average |
One more line glues these together: cat /proc/PID/cgroup tells you which
cgroup a process belongs to, which is how you find the right cpu.stat to
read in the first place. On a cgroup v2 system the output is a single line starting with
0:: followed by a path; append that path to /sys/fs/cgroup and
you are standing in the directory that controls the process's world.
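That path-joining step can be sketched as a small helper. This is a minimal illustration, assuming a cgroup v2 system; the function name cgroup_dir is made up for this page, and it takes a file argument so it can be tried on a saved copy as well as on the live /proc entry.

```shell
# cgroup_dir FILE: turn the 0:: line of a /proc/PID/cgroup file into the
# matching directory under /sys/fs/cgroup (cgroup v2 only; hypothetical helper).
cgroup_dir() {
    sed -n 's|^0::|/sys/fs/cgroup|p' "$1"
}

# Usage, with PID standing in for a real process ID:
#   cat "$(cgroup_dir /proc/PID/cgroup)/cpu.stat"
```

On a v1 system there is no 0:: line carrying a useful path, so the helper prints nothing or only the root, which is itself a hint that you are in the other world.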
Stacked together: nice -n 19 ionice -c3 tar czf /backup/snap.tgz /data puts CPU weight at the
floor and allows disk access only when the queue is otherwise empty. If you need a guarantee
instead of a courtesy, wrap it in systemd-run with a quota; the three
compose without conflict because they act on different mechanisms.
Reading the output
What top shows, and what nice 10 really buys
Start with the view you already have open during an incident. Pin two CPU-burning processes to the same core — contention is the whole point — give one of them nice 10, and look at top:
    $ taskset -c 0 yes > /dev/null &
    $ taskset -c 0 nice -n 10 yes > /dev/null &
    $ top -b -n 1 -p 7011,7014
      PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
     7011 deploy    20   0    5316    640    576 R  90.3   0.0   1:12.20 yes
     7014 deploy    30  10    5316    644    580 R   9.7   0.0   0:07.71 yes
Two columns matter. NI is the nice value you set. PR is the kernel's priority number, which for normal processes is just 20 + NI — so nice 0 shows PR 20 and nice 10 shows PR 30, and the column is telling you the same thing twice. The interesting number is %CPU: 90.3 against 9.7, not 50/50 and not 100/0. Nice 10 did not pause the second process or cap it at some percentage. It changed its weight, and the scheduler divided the contested core in proportion to the weights.
The weights themselves are worth seeing once because they explain the ratio. The kernel maps each nice level to a weight from a fixed table, scaled so that each step of one nice level changes a process's share by roughly 10% relative to its neighbours — a factor of about 1.25 per level. Nice 0 is weight 1024. Nice 10 is weight 110. Nice -5 is weight 3121. Put a 1024 next to a 110 on one core and the split is 1024/1134 against 110/1134 — the 90/10 you just watched. Put nice -5 next to nice 0 and the favoured process gets about 75%. The numbers feel arbitrary until you see them as one fixed ratio compounded: ten steps of 1.25 is roughly 9.3x, and 9.3-to-1 is your 90/10.
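The split is easy to check by hand. A minimal sketch, assuming only the three weights quoted above; the helper name share_pct is invented for illustration and awk does the arithmetic.

```shell
# share_pct W1 W2: percentage of one contested core that each weight receives.
# Hypothetical helper; the weights passed below are the ones quoted in the text.
share_pct() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        printf "%.1f %.1f\n", 100 * a / (a + b), 100 * b / (a + b)
    }'
}

share_pct 1024 110     # nice 0 against nice 10  -> 90.3 9.7
share_pct 3121 1024    # nice -5 against nice 0  -> 75.3 24.7
```

The same division extends to any number of runnable tasks on a core: each one's share is its weight over the sum of the weights.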
cpu.stat, line by line
Now the fence. When a cgroup has a CPU quota, the kernel accounts for time in fixed
periods — 100 ms by default — and when the group has burned its quota for the current
period, every runnable thread in it is taken off the CPU until the next period starts.
That word "throttled" is not a metaphor. The processes are paused, mid-request, and the
place where the kernel admits to doing it is cpu.stat:
    $ cat /sys/fs/cgroup/system.slice/run-r4f8a1.scope/cpu.stat
    usage_usec 48731022
    user_usec 48214110
    system_usec 516912
    nr_periods 974
    nr_throttled 951
    throttled_usec 41216339
    nr_bursts 0
    burst_usec 0
Read it from the top. usage_usec is total CPU time the group has consumed, in
microseconds, split into user_usec and system_usec below it —
useful, but not the headline. nr_periods counts how many 100 ms
accounting windows have elapsed while the group had runnable work. nr_throttled
counts how many of those windows ended with the group forcibly paused because it had spent
its quota. This is the container number — the single most useful line in the file. Here it
reads 951 out of 974: in nearly 98% of periods, this workload hit its cap and got benched.
throttled_usec is the total time spent in that benched state — 41 seconds of
wall-clock time during which threads were runnable, had work, and were not allowed to run.
Every one of those microseconds came out of somebody's request latency.
A healthy capped workload shows nr_throttled at or near zero, or growing so
slowly it does not matter. A workload that is throttled in most periods is misconfigured,
under-provisioned, or both — and the trap is that its average CPU usage can look modest
the whole time, because the average includes all the time it spent forcibly idle. The
counters only ever increase, so what you actually watch is the rate: read the file twice,
thirty seconds apart, and difference the numbers.
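The read-twice-and-difference step might look like the sketch below. It is an illustration rather than a finished tool: throttle_ratio is a made-up name, and it works on two saved copies of cpu.stat so the arithmetic can be checked without a live cgroup.

```shell
# throttle_ratio BEFORE AFTER: percentage of accounting periods that ended in
# throttling between two saved copies of cpu.stat (hypothetical helper).
throttle_ratio() {
    awk '
        FNR == NR { before[$1] = $2; next }   # first file: remember every counter
        { delta[$1] = $2 - before[$1] }       # second file: difference against it
        END {
            if (delta["nr_periods"] > 0)
                printf "%.1f\n", 100 * delta["nr_throttled"] / delta["nr_periods"]
            else
                print "no periods elapsed"
        }
    ' "$1" "$2"
}

# Usage against a live cgroup (CG is a placeholder for the real directory):
#   cat "$CG/cpu.stat" > /tmp/a; sleep 30; cat "$CG/cpu.stat" > /tmp/b
#   throttle_ratio /tmp/a /tmp/b
```

A result near zero is the healthy case; a result near 100 means the group hit its quota in almost every window during the interval you sampled.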
Three production scenarios
The 02:00 backup that starves the service
Latency alerts fire in the small hours, and the graphs show a clean square wave: bad from
02:00 to 03:40, fine before and after. On the box, top shows the backup
pegging two cores and iowait climbing — the service is losing the CPU argument and the
disk argument at once. (If the picture is murkier than that, the systematic walk lives in
"what's eating my CPU?".) The
immediate fix does not need a restart:
    $ renice -n 19 -p 51208
    51208 (process ID) old priority 0, new priority 19
    $ ionice -c3 -p 51208
    $ ionice -p 51208
    idle
renice drops the CPU weight to the floor while the job keeps running;
ionice -c3 -p moves its disk access to the idle class, meaning it gets IO
service only when nobody else is asking. The backup now finishes later — possibly much
later — and that is the trade you are explicitly making: courtesy hints sacrifice the
guest's completion time to protect the host's latency. The durable fix is to bake both
into the cron line (nice -n 19 ionice -c3 backup.sh) or, better, into the
systemd unit that runs it, with Nice=19 and IOSchedulingClass=idle.
And note what the hints cannot do: if the backup also fills the page cache or saturates
the network, nice and ionice are silent on both. They cover CPU and disk queueing,
nothing else.
The container with clean averages and ugly p99s
A service in Kubernetes has a CPU limit of 1. Average utilisation sits at 40%, the dashboards are green, and yet p99 latency spikes hard several times a minute. The team suspects the network, then garbage collection, then a noisy neighbour. The real culprit is arithmetic. A CPU limit of 1 becomes a cgroup quota of 100 ms per 100 ms period. The service runs 8 worker threads. When a burst of requests lands, 8 threads run in parallel and burn the entire period's quota in 100/8 = 12.5 ms — and then the whole container is throttled for the remaining 87.5 ms. Any request that arrives during the freeze waits for the next period before a single instruction of it runs. Averaged over a second, the container used 0.4 cores and looks healthy. Inside each period, it sprinted and then stood completely still.
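The arithmetic generalises to other limits and pool sizes. A rough sketch, assuming every thread stays busy for the whole burst (real workloads are messier); burst_window is a hypothetical helper name.

```shell
# burst_window QUOTA_MS PERIOD_MS THREADS: how long the threads run before the
# shared quota is gone, and how long the group then sits throttled, per period.
# Assumes all THREADS are runnable for the entire burst.
burst_window() {
    awk -v q="$1" -v p="$2" -v n="$3" 'BEGIN {
        run = q / n              # n threads drain the shared budget n times faster
        if (run > p) run = p     # quota outlasts the period: never throttled
        printf "%.1f %.1f\n", run, p - run
    }'
}

burst_window 100 100 8    # limit of 1, 8 workers  -> 12.5 87.5
burst_window 400 100 8    # limit of 4, 8 workers  -> 50.0 50.0
```

The first number is how long the sprint lasts and the second is the freeze that follows, both in milliseconds.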
The diagnosis takes one minute once you know where to look. Find the pod's cgroup, read
cpu.stat, and check nr_throttled against nr_periods.
If the ratio is high, you are not guessing anymore. The fixes are all reasonable and all
have costs: raise the limit so the quota fits the burst, shrink the thread pool so the
burst fits the quota, or remove the CPU limit entirely and rely on requests. That last
option is a long-running argument in the Kubernetes world, and the honest version of it
is short: requests are weights (they become cpu.weight, the cgroup cousin of
nice) and already protect neighbours under contention, so for latency-sensitive services
many operators drop CPU limits and accept burstable usage; but limits still earn their
keep when you need predictable performance for capacity planning, multi-tenant fairness
that holds even when the box is idle, or protection from a runaway workload that scales
with whatever it is given. Both camps are right about different clusters. What is not
defensible is setting a limit and never once reading cpu.stat.
memory.max and the two OOM killers
CPU quotas throttle; memory limits kill. When a cgroup's usage hits memory.max
and the kernel cannot reclaim enough from it, the cgroup OOM killer picks a process
inside that group and sends it an unblockable
SIGKILL. This is a different event
from the global OOM killer, which fires when the whole machine is out of memory and hunts
across every process on the host using its badness score. The distinction matters during
a postmortem: a cgroup OOM kill means your limit was the wall — the host may have
had plenty of free memory at the time — while a global OOM kill means the machine itself
was drowning and your process may simply have been the unlucky giant. The kernel log
line tells you which you got (Memory cgroup out of memory, with the cgroup path in the accompanying oom-kill line, versus plain
Out of memory), and the cgroup keeps its own tally:
    $ cat /sys/fs/cgroup/system.slice/myapp.service/memory.events
    low 0
    high 0
    max 1842
    oom 7
    oom_kill 7
oom_kill 7 means seven processes have died at this fence. In Kubernetes this
surfaces as OOMKilled in the pod status, and the right response is rarely
"raise the limit and hope": find out whether the workload's real footprint grew, whether
a leak is compounding, or whether the limit was a guess somebody made before the
service had ever seen production traffic. There is also a softer dial,
memory.high, which throttles and reclaims aggressively instead of killing —
useful as an early-warning fence set below memory.max.
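Because oom_kill is a running total, the useful question during a postmortem is whether it moved. A small sketch under the same two-snapshot assumption used for CPU counters; oom_moved is an invented name and both arguments are saved copies of memory.events.

```shell
# oom_moved BEFORE AFTER: number of cgroup OOM kills that happened between two
# saved copies of memory.events (hypothetical helper).
oom_moved() {
    awk '
        $1 == "oom_kill" && NR == FNR { before = $2 }   # line from the first file
        $1 == "oom_kill" && NR != FNR { after = $2 }    # line from the second file
        END { print after - before }
    ' "$1" "$2"
}

# Usage (CG is a placeholder for the cgroup directory):
#   cat "$CG/memory.events" > /tmp/m1; sleep 60; cat "$CG/memory.events" > /tmp/m2
#   oom_moved /tmp/m1 /tmp/m2
```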
What's underneath
None of these knobs is magic, and the mental model gets much firmer once you see what each one actually moves. Start with nice. The Linux scheduler does not maintain a priority queue where higher-priority tasks always run first — that is real-time scheduling, a different policy. For normal tasks it runs a fair scheduler (CFS for many years; EEVDF in recent kernels, same idea with sharper latency math) that tracks each task's virtual runtime: the CPU time the task has consumed, scaled by its weight. A heavy task's clock ticks slowly, so the scheduler — which always wants to run whoever is furthest behind — keeps coming back to it. A nice 19 task's clock races, so it always looks well-fed and rarely gets picked while anyone else is waiting. That is the whole trick: nice changes the exchange rate between real CPU time and virtual time, and fairness does the rest. The full machinery — vruntime, timeslices, the run queue — is covered in scheduling, and you can watch weights fight each other interactively in the scheduler simulator.
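The exchange-rate idea fits in a dozen lines. What follows is a toy loop, not the kernel's algorithm: two always-runnable tasks share one core, each tick goes to whichever has the smaller virtual runtime, and that task's clock advances by 1024 divided by its weight. The name vruntime_sim and the tick model are assumptions made for illustration.

```shell
# vruntime_sim WEIGHT_A WEIGHT_B TICKS: toy fair-scheduler loop for two tasks
# on one core. Prints the percentage of ticks each task received.
vruntime_sim() {
    awk -v wa="$1" -v wb="$2" -v ticks="$3" 'BEGIN {
        va = 0; vb = 0; ra = 0
        for (t = 0; t < ticks; t++) {
            if (va <= vb) { va += 1024 / wa; ra++ }   # A is furthest behind: run it
            else          { vb += 1024 / wb }         # otherwise B gets the tick
        }
        printf "%.1f %.1f\n", 100 * ra / ticks, 100 * (ticks - ra) / ticks
    }'
}

vruntime_sim 1024 110 10000    # nice 0 against nice 10
```

Run with weights 1024 and 110, the loop lands close to the 90/10 split that top showed earlier, without any rule that mentions percentages: the low-weight task's clock simply races ahead and it stops being the one furthest behind.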
cgroups graft a hierarchy onto this. Under cgroup v2 — the unified hierarchy, one tree
for all controllers, mounted at /sys/fs/cgroup — every process lives in
exactly one node of one tree, and resources are divided down the tree. Two files do the
CPU work. cpu.weight (default 100, range 1 to 10000) is nice for groups: a
proportional share that only matters under contention, exactly the same vruntime trick
applied to a whole subtree at once. cpu.max is the fence: two numbers,
quota and period, both in microseconds. 50000 100000 means "50 ms of
CPU time per 100 ms window," which is what systemd-run -p CPUQuota=50%
writes for you, and what a Kubernetes CPU limit ultimately becomes. The quota is a
budget shared by every thread in the group across every core, which is why thread count
matters so much in the throttling scenario: the budget drains in parallel.
    $ cat /proc/self/cgroup
    0::/user.slice/user-1000.slice/session-3.scope
    $ cat /sys/fs/cgroup/user.slice/user-1000.slice/session-3.scope/cpu.max
    max 100000
That 0:: prefix is the cgroup v2 signature — one line, one tree. (A v1
system prints a stack of lines, one per controller, each with its own path; more on that
in the pitfalls.) The value max 100000 means no quota: this group can use
every core in the machine, limited only by weights. Replace max with a
number and the fence goes up. Everything systemd does with CPUQuota=,
MemoryMax=, and IOWeight=, and everything a container runtime
does with pod limits, bottoms out in writes to files like this one. The file is
the interface; the tools are stationery.
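To make the translation concrete, here is a sketch of the conversion from a CPU limit to the two numbers in cpu.max. It assumes the limit is given in millicores (Kubernetes notation, where 1000 is one full CPU) and that the period is the 100 ms default; cpu_max_for is a hypothetical helper, not something systemd or a container runtime ships.

```shell
# cpu_max_for MILLICORES [PERIOD_US]: the "quota period" pair that a CPU limit
# corresponds to. 1000 millicores is one CPU; the period defaults to 100000 us.
cpu_max_for() {
    millicores=$1
    period_us=${2:-100000}
    echo "$(( millicores * period_us / 1000 )) $period_us"
}

cpu_max_for 500     # half a CPU, as with CPUQuota=50%  -> 50000 100000
cpu_max_for 1000    # a limit of one CPU                -> 100000 100000
```

Comparing this output with what cpu.max actually contains is a quick way to confirm that the limit you wrote is the limit the kernel is enforcing.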
Pitfalls
Expecting nice to do anything on an idle box. Nice is purely about
contention. A nice 19 process on an otherwise idle machine gets 100% of a core, full
speed, no penalty. The weight only matters when someone else is
runnable on the same CPU. This cuts both ways: it is why nicing a batch job is free
insurance (it costs the job nothing at night, protects the service during the day), and
it is why nice can never be a cap. If you tested nice -n 19 on a quiet
staging box and saw no effect, the test was wrong, not the tool.
Assuming ionice always works. The IO priority hint is consumed by the
block-layer IO scheduler, and only some of them care. BFQ honours classes and levels
fully; mq-deadline ignores them on older kernels and, from 5.14 on, acts on the class but not the level; none (common and often correct on
fast NVMe drives) ignores them by definition, because there is no scheduler to consult.
Check what a disk is using with cat /sys/block/nvme0n1/queue/scheduler
before trusting an ionice to protect you. The idle class (-c3)
degrades gracefully — where unsupported it just does little — but "the backup is
ioniced" is not evidence of anything until you know the scheduler underneath.
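The scheduler file lists every available scheduler and marks the active one with square brackets. A small sketch for pulling out just the active name; active_scheduler is an invented helper, and it reads a file path so it works on the real sysfs entry or on a saved copy.

```shell
# active_scheduler FILE: print the bracketed (active) entry from a
# queue/scheduler file, e.g. "none [mq-deadline] kyber bfq" -> mq-deadline.
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p' "$1"
}

# Usage on a real machine, one line per block device:
#   for f in /sys/block/*/queue/scheduler; do
#       printf '%s %s\n' "$f" "$(active_scheduler "$f")"
#   done
```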
cgroup v1 paths on a v2 mental model, and vice versa. The two
hierarchies have different filenames for the same ideas, and copy-pasting advice across
them fails silently. v1: cpu.cfs_quota_us and cpu.cfs_period_us
as separate files, cpu.shares (default 1024) for weight, a separate tree per
controller. v2: cpu.max holding both numbers, cpu.weight
(default 100), one tree. /proc/PID/cgroup tells you which world you are in:
one 0:: line is v2; many numbered lines is v1 or a hybrid. Older
Kubernetes nodes and older container images are the usual place this bites — a debugging
runbook written for one hierarchy reads like nonsense inside the other.
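The one-line-versus-many test can be automated before a runbook is trusted. A minimal sketch, with cgroup_flavour as a made-up name, that applies exactly the rule stated above to a /proc/PID/cgroup file.

```shell
# cgroup_flavour FILE: classify a /proc/PID/cgroup file (or a saved copy).
# A single line starting with 0:: is v2; anything else is v1 or a hybrid.
cgroup_flavour() {
    awk '
        /^0::/ { unified = 1 }
        END {
            if (NR == 1 && unified) print "v2"
            else                    print "v1 or hybrid"
        }
    ' "$1"
}

# Usage:
#   cgroup_flavour /proc/self/cgroup
```

If the answer is "v1 or hybrid", look for cpu.cfs_quota_us and cpu.shares rather than cpu.max and cpu.weight.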
renice down requires privilege — and "down" is anything lower. Raising a
nice value (making a process more polite) is open to its owner. Lowering it — including
back to where it was — needs root or CAP_SYS_NICE. Renice your own process
from 0 to 10 and you cannot return it to 0; the courtesy is a one-way door for
unprivileged users (unless RLIMIT_NICE has been raised for you, which on
most fleets it has not). The practical consequence: experiment with nice on
fresh throwaway processes, not by renicing something you care about.
Reading cpu.stat once and concluding nothing. The counters are
cumulative since the cgroup was created. nr_throttled 951 might be three
weeks of history or the last ninety seconds. Sample twice, subtract, divide by the
interval — throttled periods per second is the signal. The same applies to
memory.events: an oom_kill count means little without knowing
when it last moved.
A drill you can run right now
Everything below is safe on any Linux machine with systemd: it burns a little CPU on
purpose, creates nothing permanent, and every process dies with a Ctrl-C or a
kill. Fifteen minutes, and the three ideas — weight, quota, hierarchy —
become things you have watched happen rather than read about.
Step 1 — watch a weight fight. Pin two busy loops to one core, handicap
one of them, and watch top divide the spoils:
    $ taskset -c 0 yes > /dev/null &
    [1] 7011
    $ taskset -c 0 nice -n 10 yes > /dev/null &
    [2] 7014
    $ top -p 7011,7014    # expect ~90% / ~10%, then in another shell:
    $ sudo renice -n -5 -p 7014
    7014 (process ID) old priority 10, new priority -5
    $ kill %1 %2          # back in the first shell, where the jobs were started
With nice 0 against nice 10 you should see the 90/10 split from the weights table. The
renice to -5 flips the fight: the formerly polite process now holds weight
3121 against 1024 and takes roughly three quarters of the core, live, with no restart —
watch the %CPU columns trade places over a few refreshes. Note the sudo:
going to a negative nice value is the privileged direction. Kill both loops when the
numbers stop being interesting.
Step 2 — build a fence and watch it throttle. Cap a scope at half a core, stuff two cores' worth of demand inside it, and read the damage:
    $ sudo systemd-run --scope -p CPUQuota=50% -- sh -c 'yes > /dev/null & yes > /dev/null & wait'
    Running scope as unit: run-r4f8a1.scope
    # in another terminal:
    $ grep -E 'nr_periods|nr_throttled|throttled_usec' /sys/fs/cgroup/system.slice/run-r4f8a1.scope/cpu.stat
    nr_periods 118
    nr_throttled 118
    throttled_usec 8841205
    $ sleep 30; grep nr_throttled /sys/fs/cgroup/system.slice/run-r4f8a1.scope/cpu.stat
    nr_throttled 418
Two yes loops want two full cores; the quota allows half of one. Every single
period ends in throttling — nr_throttled tracks nr_periods
almost one for one, climbing by ten per second, which is the rate measurement from the
pitfalls done by hand. Meanwhile top shows each yes at about
25%: the fence holding. This is your Kubernetes p99 scenario in a bottle, minus the
pager. Ctrl-C the systemd-run in the first terminal and the scope and
everything in it goes away.
Step 3 — find your own cage. You have been inside a cgroup this whole time:
    $ cat /proc/$$/cgroup
    0::/user.slice/user-1000.slice/session-3.scope
    $ cat /sys/fs/cgroup/user.slice/user-1000.slice/session-3.scope/cpu.max
    max 100000
    $ cat /sys/fs/cgroup/user.slice/user-1000.slice/session-3.scope/cpu.stat | head -4
    usage_usec 184220471
    user_usec 121448210
    system_usec 62772261
    nr_periods 0
Your shell session is a scope under your user slice; cpu.max reads
max, so no fence, and nr_periods 0 confirms quota accounting
has never engaged. On a laptop this is unremarkable. The point of looking is that the
next time you exec into a production container and run the same two commands, you will
recognise the shape instantly — a real path, a real quota, and a cpu.stat
with a story in it.
Three lines to keep: nice -n 19 ionice -c3 cmd to make
a job polite, systemd-run --scope -p CPUQuota=50% cmd to make a cap, and
grep throttled /sys/fs/cgroup/<path>/cpu.stat when a limited container
has clean averages and dirty tail latency.
Further reading
- sched(7) — scheduling policies, the nice value, and its interaction with autogrouping, all in one authoritative place.
- cgroups(7) — the v1/v2 split, the unified hierarchy, and the controller interface files this page reads.
- CFS bandwidth control — kernel documentation — quota, period, burst, and the exact semantics behind nr_throttled.
- systemd.resource-control(5) — CPUQuota, CPUWeight, MemoryMax, IOWeight: every cgroup knob as a unit directive.
- Dave Chiluk — "Throttling: New Developments in Application Performance with CPU Limits" — the talk that made Kubernetes CPU throttling a mainstream diagnosis, including the kernel bug hunt.