ImagePullBackOff
The kubelet can’t pull the image and is backing off between attempts. The actual reason — bad tag, auth, rate limit, or network — is spelled out verbatim in the pod’s events, one kubectl describe away.
The symptom
The pod never starts a container. ErrImagePull and ImagePullBackOff are the same failure at different ages: the first pull attempt fails with ErrImagePull, and after a few retries the kubelet enters back-off and the status changes name.
$ kubectl get pods
NAME                   READY   STATUS             RESTARTS   AGE
api-5c9b8f7d6b-mw8zr   0/1     ImagePullBackOff   0          8m
← RESTARTS stays 0: nothing has ever run, this is purely about fetching the image
The diagnosis
1 Read the event text — it names the cause
$ kubectl describe pod api-5c9b8f7d6b-mw8zr | tail -8
Events:
Normal Pulling 8m kubelet Pulling image "registry.corp.io/team/api:v1.4.2"
Warning Failed 8m kubelet Failed to pull image "registry.corp.io/team/api:v1.4.2":
rpc error: code = NotFound desc = manifest unknown
Warning Failed 8m kubelet Error: ErrImagePull
Normal BackOff 3m (x25 over 8m) kubelet Back-off pulling image "registry.corp.io/team/api:v1.4.2"
Branch on the exact wording:
- "manifest unknown" / "not found": the tag doesn’t exist at that path (cause 1).
- "unauthorized" / "authentication required": credentials (cause 2).
- "toomanyrequests": registry rate limit (cause 3).
- "no such host" / "i/o timeout" / "connection refused": the node can’t reach the registry (cause 4).
- "no match for platform in manifest": architecture mismatch (cause 5).
The registry wrote you a precise error; the only mistake is not reading it.
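The branching above can be sketched as a small shell function. This is a minimal sketch, not an official tool: classify_pull_error is a hypothetical helper name, and the patterns are only the strings quoted in this guide. The platform check comes first because that message also ends in "not found".

```shell
# classify_pull_error MESSAGE
# Maps the text of a "Failed to pull image" event to this guide's cause numbers.
# Hypothetical helper; patterns are limited to the strings quoted in the guide.
classify_pull_error() {
  case "$1" in
    *"no match for platform"*)
      echo "cause 5: architecture mismatch" ;;
    *toomanyrequests*)
      echo "cause 3: registry rate limit" ;;
    *unauthorized*|*"authentication required"*)
      echo "cause 2: credentials" ;;
    *"no such host"*|*"i/o timeout"*|*"connection refused"*)
      echo "cause 4: node cannot reach the registry" ;;
    *"manifest unknown"*|*"not found"*)
      echo "cause 1: tag does not exist" ;;
    *)
      echo "unrecognized: read the full event text" ;;
  esac
}

# Usage (needs cluster access; the pod name is the one from the example above):
#   msg=$(kubectl describe pod api-5c9b8f7d6b-mw8zr | grep -A1 'Failed to pull image')
#   classify_pull_error "$msg"
```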
2 Reproduce the pull where the kubelet does it
$ ssh node-3
node-3$ sudo crictl pull registry.corp.io/team/api:v1.4.2
E0609 ... pulling image: rpc error: code = NotFound desc = manifest unknown
← same error from the node’s own runtime: the cluster config is innocent,
the image reference or the registry itself is the problem
This splits the world in two. If the pull fails identically from the node with crictl, the problem is the image ref, the registry, or the node’s network; Kubernetes objects are irrelevant. If crictl pull works on the node but the pod still fails, the difference is credentials: the kubelet pulls with the pod’s imagePullSecrets, not with whatever login the node or your laptop has.
3 Check the secret exists where the pod lives
$ kubectl get pod api-5c9b8f7d6b-mw8zr -o jsonpath="{.spec.imagePullSecrets}"
[{"name":"regcred"}]
$ kubectl get secret regcred -n prod
Error from server (NotFound): secrets "regcred" not found
← referenced, but absent in THIS namespace: secrets do not cross namespaces
Three things must line up: the secret exists in the pod’s namespace, the pod (or its service account) references it, and the registry hostname inside .dockerconfigjson exactly matches the hostname in the image ref. Check the service-account path too (kubectl get sa default -o yaml), because secrets attached there apply to every pod using that account, and a working pull in one namespace often turns out to ride on the SA, not the pod spec.
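The hostname comparison is the easiest of the three to get wrong by eye, so it may help to extract the registry host from the image ref mechanically. A minimal sketch follows; registry_host is a hypothetical helper, and it assumes the common convention that the first path component is a registry only if it contains a dot or a colon, or is "localhost". Anything else is treated as a Docker Hub image.

```shell
# registry_host IMAGE_REF
# Prints the registry hostname (with port, if present) of an image reference.
# Assumption: refs without an explicit registry resolve to docker.io. Secrets
# for Docker Hub are often keyed differently (https://index.docker.io/v1/),
# so compare those by hand.
registry_host() {
  ref=$1
  first=${ref%%/*}                 # text before the first "/"
  if [ "$first" = "$ref" ]; then   # no "/" at all, e.g. "nginx:latest"
    echo "docker.io"
    return
  fi
  case "$first" in
    *.*|*:*|localhost) echo "$first" ;;
    *)                 echo "docker.io" ;;
  esac
}

# Usage: the printed host should appear verbatim as a key under "auths"
# in the decoded .dockerconfigjson of the secret.
#   registry_host registry.corp.io/team/api:v1.4.2
```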
The causes, ranked
- 1 The tag doesn’t exist: typo, or CI built but never pushed
  Confirm: "manifest unknown" or "not found" in the events; listing the repo’s tags in the registry shows the gap.
- 2 Private registry, missing or wrong imagePullSecret
  Confirm: "unauthorized: authentication required" in the events; crictl pull from the node fails the same way without creds.
- 3 Registry rate limiting
  Confirm: "toomanyrequests: You have reached your pull rate limit" (Docker Hub’s wording); it typically appears fleet-wide at once, worst right after a rollout or a node-pool scale-up.
- 4 The node can’t reach or resolve the registry
  Confirm: "no such host", "i/o timeout"; crictl pull fails on the affected node but works elsewhere.
- 5 Architecture mismatch
  Confirm: "no match for platform in manifest: not found"; classic on arm64 node pools pulling amd64-only images.
The fixes
Cause 1 (missing tag): fix the reference, or fix the pipeline ordering so the push completes before the deploy starts. Deploying by digest (image@sha256:…) makes "deployed something other than what CI built" impossible, at the cost of needing automation to bump the digest.
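The digest bump is mostly string handling: drop the tag, append the digest CI reported. A minimal sketch, assuming the digest arrives in a variable from the pipeline; pin_to_digest is a hypothetical helper, and the deployment and container names in the usage comment are illustrative.

```shell
# pin_to_digest IMAGE_REF DIGEST
# Rewrites an image reference to digest form: strips any tag or old digest
# and appends @DIGEST. A registry port such as :5000 is left alone.
pin_to_digest() {
  ref=$1
  digest=$2
  ref=${ref%%@*}            # drop an existing digest, if any
  last=${ref##*/}           # final path component, the only place a tag can sit
  case "$last" in
    *:*) ref=${ref%:*} ;;   # strip ":tag"
  esac
  echo "$ref@$digest"
}

# Usage (assumes $DIGEST was exported by the CI push step, and a deployment
# and container both named "api"):
#   kubectl set image deployment/api \
#     api="$(pin_to_digest registry.corp.io/team/api:v1.4.2 "$DIGEST")" -n prod
```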
Cause 2 (credentials): create the secret in the pod’s namespace (kubectl create secret docker-registry) and reference it from the pod or its service account. The auth entry’s hostname must match the image ref exactly: credentials for registry.corp.io do nothing for an image pulled as registry.corp.io:5000.
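Those two steps might look like the following wrappers. This is a sketch under assumptions: the function names are hypothetical, the argument values in the usage comment are placeholders, and KUBECTL can be set to echo to preview the commands instead of running them.

```shell
# Set KUBECTL=echo to print the commands without touching a cluster.
KUBECTL=${KUBECTL:-kubectl}

# create_pull_secret NAMESPACE SECRET SERVER USERNAME PASSWORD
# SERVER must match the registry host in the image ref, port included.
create_pull_secret() {
  $KUBECTL create secret docker-registry "$2" \
    --namespace "$1" \
    --docker-server="$3" \
    --docker-username="$4" \
    --docker-password="$5"
}

# attach_to_service_account NAMESPACE SERVICEACCOUNT SECRET
# Makes every pod using that service account pull with the secret.
attach_to_service_account() {
  $KUBECTL patch serviceaccount "$2" --namespace "$1" \
    -p "{\"imagePullSecrets\": [{\"name\": \"$3\"}]}"
}

# Usage (placeholder user; read the password from the environment):
#   create_pull_secret prod regcred registry.corp.io ci-bot "$REGISTRY_PASSWORD"
#   attach_to_service_account prod default regcred
```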
Cause 3 (rate limit): authenticate pulls (limits are per-account rather than per-IP once logged in), run a pull-through cache in your infrastructure, or copy base images into your own registry. A NAT gateway makes this worse: every node shares one anonymous IP’s quota.
Cause 4 (network): debug at node level. DNS first (resolv.conf on the node, then walk it with dig), then egress: proxies, firewall rules, and whether the registry needs to be on an allow-list. If only new nodes fail, the node image or its proxy config drifted.
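The DNS-then-egress order can be scripted for running on a suspect node. A rough sketch, assuming getent and curl are installed there and that the registry serves the standard /v2/ API root over HTTPS; both function names are hypothetical.

```shell
# interpret_status HTTP_CODE
# Turns the status curl saw at /v2/ into a verdict.
interpret_status() {
  case "$1" in
    200|401) echo "reachable" ;;            # 401 only means auth is required
    000)     echo "no connection" ;;        # curl prints 000 when no HTTP reply arrived
    *)       echo "unexpected HTTP $1" ;;   # e.g. a proxy or firewall answering instead
  esac
}

# check_registry HOST[:PORT]
# Step 1 resolves the name; step 2 probes the registry API root.
check_registry() {
  host=$1
  if ! getent hosts "${host%%:*}" >/dev/null; then
    echo "DNS: cannot resolve ${host%%:*}"
    return 1
  fi
  code=$(curl -s -o /dev/null -m 10 -w '%{http_code}' "https://$host/v2/")
  echo "egress: $(interpret_status "$code")"
}

# Usage, on the node:
#   check_registry registry.corp.io
```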
Cause 5 (architecture): build multi-arch images (docker buildx with --platform), or constrain the workload to matching nodes with a nodeSelector on kubernetes.io/arch until you do.
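Both options are short commands. The sketch below wraps the nodeSelector stopgap and shows the buildx invocation as a comment; function names are hypothetical, the deployment name is illustrative, and KUBECTL=echo previews the patch without applying it.

```shell
# Set KUBECTL=echo to print the command without touching a cluster.
KUBECTL=${KUBECTL:-kubectl}

# arch_selector_patch ARCH
# Prints a patch that adds nodeSelector kubernetes.io/arch=ARCH to a pod template.
arch_selector_patch() {
  printf '{"spec":{"template":{"spec":{"nodeSelector":{"kubernetes.io/arch":"%s"}}}}}\n' "$1"
}

# pin_arch NAMESPACE DEPLOYMENT ARCH
# Stopgap: keep the workload on nodes whose architecture the image supports.
pin_arch() {
  $KUBECTL patch deployment "$2" --namespace "$1" -p "$(arch_selector_patch "$3")"
}

# Usage:
#   pin_arch prod api amd64
# The longer-term fix, run from the build context:
#   docker buildx build --platform linux/amd64,linux/arm64 \
#     -t registry.corp.io/team/api:v1.4.2 --push .
```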
What people get wrong
- IfNotPresent plus a mutable tag is silent skew. Nodes that already cached :latest (or any reused tag) keep running the old bits and never pull; only fresh nodes fetch the new image — or fail trying. Half your fleet on the old build with no error anywhere. Unique tags or digests per release end this class of incident.
- Your laptop’s docker login proves nothing. The kubelet pulls with the pod’s imagePullSecrets (or the runtime’s node-level config), never with your ~/.docker/config.json. "Works on my machine" for image pulls is exactly the credential difference the diagnosis in step 2 isolates.
- The back-off keeps pods Pending long after you fixed it. Image pull back-off caps at five minutes, so a fixed registry or secret may not be retried for several minutes. Deleting the pod (or kubectl rollout restart) retries immediately — useful after you’ve actually fixed the cause, useless before.
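To see why a fix can sit unnoticed for minutes, it helps to print the retry delays. The sketch below assumes the delay starts at 10 seconds and doubles per failed attempt; only the five-minute (300-second) cap comes from the text above, so treat the starting value as an assumption.

```shell
# backoff_schedule N
# Prints the wait, in seconds, before each of the next N pull retries.
# Assumed: 10 s initial delay, doubling each time. The 300 s cap is the
# five-minute limit mentioned in the text.
backoff_schedule() {
  delay=10
  cap=300
  i=0
  out=""
  while [ "$i" -lt "$1" ]; do
    out="${out:+$out }$delay"
    delay=$((delay * 2))
    if [ "$delay" -gt "$cap" ]; then
      delay=$cap
    fi
    i=$((i + 1))
  done
  echo "$out"
}

backoff_schedule 7   # prints: 10 20 40 80 160 300 300
```

Under these assumptions a pod that has been failing for more than about five minutes retries only once every 300 seconds, which is why deleting it after the real fix gets a result sooner.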
Quick answers
What’s the difference between ErrImagePull and ImagePullBackOff?
Same failure, different age. ErrImagePull is the status during/right after a failed pull attempt; once the kubelet starts waiting between retries it reports ImagePullBackOff. The cause is whatever the Failed event says — the status name itself carries no extra information.
How do I test whether my image pull secret works?
Decode it (kubectl get secret regcred -n prod -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d) and check the registry hostname matches the image ref exactly. Then pull from a node with the same creds via crictl, or run a one-off pod that uses only that secret. If the node pulls fine with creds but the pod fails, the secret is missing from the pod’s namespace or not referenced.
Why do only new nodes fail to pull?
Old nodes have the image cached, so with imagePullPolicy IfNotPresent they never contact the registry. New nodes must pull, so they alone hit the broken auth, the deleted tag, or the rate limit. The fleet looks half-broken; the registry path was broken the whole time.
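The same caching behaviour causes the silent skew described earlier, and it can be spotted by counting the image IDs pods actually run. A minimal sketch: summarize_image_ids is a hypothetical helper that reads node and imageID pairs on stdin, and the app=api label in the usage comment is an assumption about how the pods are labelled.

```shell
# summarize_image_ids
# Reads "node<TAB>imageID" lines on stdin and prints "COUNT IMAGEID" for each
# distinct image ID, most common first. More than one line means the fleet is
# running different bits under the same tag.
summarize_image_ids() {
  awk -F'\t' '{ n[$2]++ } END { for (id in n) print n[id], id }' | sort -rn
}

# Usage (needs cluster access):
#   kubectl get pods -n prod -l app=api \
#     -o jsonpath='{range .items[*]}{.spec.nodeName}{"\t"}{.status.containerStatuses[0].imageID}{"\n"}{end}' \
#     | summarize_image_ids
```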