AKS

Azure Kubernetes Service draws one line through a cluster: Microsoft runs the control plane, you run the nodes. Everything interesting about AKS lives on either side of that line: which tier of control plane you pay for, how node pools map onto VM scale sets, which network plugin decides whether you run out of IP addresses, and how Entra ID replaces both your kubeconfig certificates and your pod credentials. This page walks the whole machine, compares it honestly with GKE and EKS, and ends with a lab you can run with nothing but az, kubectl, and kubelogin.

The control plane you never see

In a self-hosted cluster, the control plane — the API server, etcd, the scheduler, the controller manager — runs on machines you provision, patch, back up, and wake up for at 3am. (If those components are fuzzy, the Kubernetes architecture page covers what each one does.) In AKS, that entire layer runs inside a Microsoft-managed subscription. You cannot SSH into the API server, you cannot see the etcd VMs, and you do not pay for them as compute. What you get is an endpoint, a kubeconfig, and a promise that the thing on the other end stays up and gets patched. Your bill is for the worker nodes, their disks, and their load balancers — ordinary Azure resources living in your own subscription.

The promise comes in two strengths. The Free tier gives you a managed control plane with no financially backed SLA: an uptime objective, not a guarantee, and capacity limits that make it suitable for dev and test clusters or anything small. The Standard tier attaches a paid, financially backed SLA — 99.95% for the API server when you use availability zones, 99.9% without — and provisions the control plane to scale further, supporting clusters up to thousands of nodes. The price is per cluster per hour, modest next to the node bill of any real production cluster, and choosing Free for production is the kind of saving that looks clever until the API server throttles your controllers during an incident and you have no SLA to point at. There is also a Premium tier that adds long-term support for Kubernetes versions past their community end-of-life, which matters mostly to teams whose upgrade cadence is slower than upstream's release train.

One consequence of the managed control plane is worth internalising early: you configure it only through the AKS API. Flags on the API server, admission controllers, etcd tuning — these are exposed as AKS features or not at all. When AKS supports something, it arrives as a cluster property or an add-on you toggle with az aks update. When it does not, there is no escape hatch, because the machines are not yours. That trade is the whole product: less control, far less toil.

Node pools: where your half lives

Below the line, an AKS cluster is a set of node pools, and each node pool is backed by a Virtual Machine Scale Set (VMSS) in a resource group AKS creates for you, usually named MC_<rg>_<cluster>_<region>. This is not an implementation detail to ignore. When the cluster autoscaler adds a node, it is raising the VMSS instance count. When an upgrade replaces a node, it is reimaging or replacing a VMSS instance. When you wonder why a node pool can only contain one VM size, the answer is that a scale set has one VM size. The Kubernetes abstraction sits directly on the VMSS abstraction, and odd behaviour at the node level almost always makes sense once you look at the scale set underneath.
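To look at the scale sets behind a cluster, something like the sketch below may help. It assumes the names used in the lab at the end of this page (aks-lab, lab, westeurope); both helper function names are hypothetical. The first helper only reproduces the default MC_ naming pattern, so for a cluster created with a custom node resource group name, asking AKS directly (the second helper) is the safer route.

```shell
# Build the default node resource group name: MC_<rg>_<cluster>_<region>.
default_node_rg() {
  printf 'MC_%s_%s_%s\n' "$1" "$2" "$3"
}

# Ask AKS for the real node resource group, then list the scale sets in it.
# Needs a logged-in az CLI, so it is defined here but not called.
# $1 = resource group, $2 = cluster name
list_pool_scale_sets() {
  node_rg="$(az aks show --resource-group "$1" --name "$2" \
    --query nodeResourceGroup -o tsv)"
  az vmss list --resource-group "$node_rg" -o table \
    --query "[].{name:name, size:sku.name, instances:sku.capacity}"
}

default_node_rg aks-lab lab westeurope
# Against a live cluster: list_pool_scale_sets aks-lab lab
```

Each row that comes back from the second helper should correspond to one node pool, with the instance count matching the pool's current node count.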

The line through every AKS cluster. Above it, components you configure but never touch. Below it, VM scale sets in your subscription, one per node pool.

Pools come in two modes, and the distinction earns its keep. A system pool exists to host the cluster's own plumbing: CoreDNS, metrics-server, the CSI drivers, the pieces of kube-system that everything else depends on. A user pool hosts your applications. Every cluster needs at least one system pool, and the well-worn practice is to keep it small, stable, and boring — two or three modest on-demand nodes — and taint it with CriticalAddonsOnly=true:NoSchedule so your workloads cannot land there. The reasoning is plain self-defence. If a memory-hungry app pod gets scheduled next to CoreDNS and triggers node pressure, the eviction that follows can take cluster DNS down with it, and a cluster without DNS fails in ways that look like everything is broken at once. Isolating the system pods means your blast radius for a misbehaving app is the app's own pool.

From there, you add user pools by workload shape rather than cramming everything into one. A pool of general-purpose VMs for stateless services. A pool of memory-optimised VMs, tainted, for the JVM heap monsters. A GPU pool for inference, tainted hard so nothing else wastes a GPU node. And very commonly a spot pool: VMSS spot instances at a steep discount, evictable whenever Azure wants the capacity back. AKS applies the taint kubernetes.azure.com/scalesetpriority=spot:NoSchedule to spot nodes automatically, so only workloads that explicitly tolerate eviction will schedule there — batch jobs, queue workers, CI runners, anything stateless and retryable. Spot pools are the cheapest compute in the cluster and the easiest cost win available, provided you respect the contract: a spot node gets roughly thirty seconds of notice before it disappears.

Rule of thumb. One small tainted system pool, one or more user pools split by VM shape, and a spot pool for anything that can die and retry. Resist the single giant pool; it forces every workload onto one VM size and turns every node problem into a cluster problem.

The CNI decision, and the IP trap

Every pod needs an IP address, and how AKS hands them out is the decision most likely to bite you a year after cluster creation. There are three options, and the difference between them is where pod IPs come from. (For how the CNI fits into the rest of the stack — kube-proxy, services, the pod network — see Kubernetes internals; for Azure's network primitives, the VNet page.)

Kubenet, the legacy option, gives nodes real VNet IPs, while pods get addresses from a private overlay range. Traffic leaving a pod is NATed through the node, and a route table on the subnet tells Azure which node owns which pod range. It is frugal with VNet addresses, but the route table tops out around 400 nodes, pods are not directly addressable from the VNet, and Microsoft has marked it for retirement. Do not build anything new on it.

Azure CNI, the traditional flat mode, gives every pod a real IP from your VNet subnet. Pods are first-class network citizens — anything in the VNet, or peered to it, or on the other end of a VPN, can reach a pod directly with no NAT. The price is address consumption, and this is the classic AKS gotcha. Each node pre-reserves IPs for its maximum pod count at provision time, not as pods actually appear. With the default of 30 pods per node, every node claims 31 addresses the moment it joins. A /24 subnet, 251 usable addresses, supports about eight nodes — and the failure arrives later, when the cluster autoscaler tries to add node nine during a traffic spike and the subnet has nothing left to give. The error says the scale set could not allocate addresses; the real cause is a subnet sized for the cluster you launched, not the one you grew into. Resizing a subnet under a live cluster is painful enough that teams have rebuilt clusters to escape it.
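The /24 arithmetic above can be checked in a few lines of shell. This is only a sketch of the sum described in the paragraph; the variable names are illustrative, and the figure of five reserved addresses per subnet is what turns 256 into the 251 usable addresses quoted above.

```shell
# Flat Azure CNI: each node reserves max_pods + 1 addresses up front.
prefix=24        # subnet prefix length (/24)
max_pods=30      # default pods per node, per the text above

total=$(( 1 << (32 - prefix) ))      # 256 addresses in a /24
usable=$(( total - 5 ))              # Azure reserves 5 addresses in every subnet
per_node=$(( max_pods + 1 ))         # 30 pod IPs plus 1 for the node itself
max_nodes=$(( usable / per_node ))   # integer division: whole nodes only

echo "usable=$usable per_node=$per_node max_nodes=$max_nodes"
# prints: usable=251 per_node=31 max_nodes=8
```

Change prefix to 22 or max_pods to 110 and rerun to see how quickly the ceiling moves in either direction.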

Azure CNI Overlay is the current default answer and keeps the good parts of both. Nodes take VNet IPs, while pods draw from a private overlay CIDR (a /16 or larger, reused across clusters) that costs your VNet nothing. Each node gets a /24 carved from the overlay space, so pod density no longer touches subnet maths, and clusters scale to thousands of nodes without route-table tricks. The trade-off is that pod IPs are not directly routable from outside the cluster — traffic from the VNet reaches pods through services and load balancers instead. For the rare workload where something outside the cluster must dial a pod IP directly (some service meshes, some legacy discovery schemes), flat Azure CNI is still the tool. For everything else, Overlay removes the trap entirely.

Three answers to "where do pod IPs come from." Flat Azure CNI pre-reserves max-pods addresses per node at provision time, which is the trap; Overlay moves pods off the subnet's books.

Interview-grade detail. The flat-CNI reservation happens per node at scale-up, sized by --max-pods, regardless of how many pods actually run. IP exhaustion therefore appears as a node-provisioning failure during autoscale, often months after the cluster was built and usually during the traffic event that needed the capacity. Sizing the subnet means taking the maximum node count, adding the extra nodes a surge upgrade temporarily runs, multiplying the total by (max-pods + 1), and leaving headroom on top.
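Run in the other direction, the same sum gives a subnet size. The sketch below assumes a hypothetical ceiling of 50 nodes and 17 surge nodes (roughly a third of 50, rounded up); neither number comes from this page, and anything else sharing the subnet would need adding on top.

```shell
max_nodes=50     # hypothetical autoscaler ceiling, summed across pools
surge_nodes=17   # hypothetical extra nodes running during a surge upgrade
max_pods=30      # pods per node

# Addresses needed: every node (steady-state or surge) reserves max_pods + 1,
# plus the 5 addresses Azure reserves in every subnet.
needed=$(( (max_nodes + surge_nodes) * (max_pods + 1) + 5 ))

# Walk from /32 towards /0 until the subnet is big enough.
prefix=32
while [ $(( 1 << (32 - prefix) )) -lt "$needed" ]; do
  prefix=$(( prefix - 1 ))
done

echo "needed=$needed smallest_subnet=/$prefix"
# prints: needed=2082 smallest_subnet=/20
```

The jump from a /24 that holds eight nodes to a /20 for fifty is the reason this sum is worth doing before the cluster exists rather than after.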

Who are you? Entra ID and Azure RBAC

A stock Kubernetes cluster authenticates people with client certificates baked into kubeconfig files, which is to say: long-lived credentials with no MFA, no revocation story, and no audit trail tied to a human. AKS replaces this with Entra ID. With Entra integration enabled, az aks get-credentials writes a kubeconfig that contains no secrets at all — just an instruction to run kubelogin, which walks you through the standard Entra sign-in (browser, MFA, conditional access, all of it) and hands the resulting OIDC token to the API server. The API server validates the token against your tenant. Disable an employee's Entra account and their cluster access dies with it, the same hour, across every cluster in the tenant. That single property is most of the argument for the integration.

Authentication settled, authorisation has two flavours. You can keep native Kubernetes RBAC and write RoleBindings against Entra group object IDs, which works but leaves you maintaining YAML full of GUIDs. Or you can enable Azure RBAC for Kubernetes, which moves authorisation into Azure's own role system: built-in roles such as Azure Kubernetes Service RBAC Reader, Writer, Admin, and Cluster Admin, assignable at the scope of a cluster or a single namespace with ordinary az role assignment create commands. Access reviews, PIM just-in-time elevation, and the rest of the Entra governance machinery then apply to Kubernetes access the same way they apply to storage accounts. Most organisations already deep in Azure pick this second mode, pair it with --disable-local-accounts so the certificate-based admin backdoor stops existing, and treat cluster access as just another Azure role assignment.
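A namespace-scoped assignment might look like the sketch below. The function names are hypothetical, and the assignee is assumed to be the object ID of an Entra group or user you supply; the scope is simply the cluster's resource ID with /namespaces/<name> appended.

```shell
# Build the Azure RBAC scope for one namespace inside a cluster.
# $1 = cluster resource ID, $2 = namespace
namespace_scope() {
  printf '%s/namespaces/%s\n' "$1" "$2"
}

# Grant read-only access to a single namespace.
# Needs a logged-in az CLI, so it is defined here but not called.
# $1 = resource group, $2 = cluster, $3 = namespace, $4 = Entra object ID
grant_namespace_reader() {
  cluster_id="$(az aks show --resource-group "$1" --name "$2" --query id -o tsv)"
  az role assignment create \
    --assignee "$4" \
    --role "Azure Kubernetes Service RBAC Reader" \
    --scope "$(namespace_scope "$cluster_id" "$3")"
}

# Example: grant_namespace_reader aks-lab lab team-a <group-object-id>
```

Dropping the /namespaces/ suffix from the scope turns the same assignment into a cluster-wide one, which is what the lab below does for the admin role.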

Workload identity: pods without secrets

People were the easy half. The harder question is how a pod calls Azure — reads a secret from Key Vault, writes a blob, talks to a database with Entra auth — without anyone pasting a client secret into the cluster. The old answer, aad-pod-identity, intercepted calls to the VM metadata endpoint with custom resources and a node-level component; it was fragile under load and is deprecated. The current answer, workload identity federation, is cleaner because it leans on a standard: OIDC token exchange.

The pieces fit together like this. AKS can publish an OIDC issuer URL — a public endpoint where Entra can fetch the cluster's token-signing keys. You create a user-assigned managed identity and add a federated credential to it, which is a statement of trust: tokens from this cluster's issuer, for the service account system:serviceaccount:myapp:my-sa, may be exchanged for this identity. In the cluster, you annotate that service account with the identity's client ID and label the pod azure.workload.identity/use: "true". A mutating webhook then projects a short-lived service account token into the pod and sets the environment variables the Azure SDKs already know to look for.

Workload identity federation. The pod's projected service account token is exchanged at Entra for an Azure access token, scoped to one managed identity, validated against the cluster's OIDC issuer.

At runtime the SDK presents the projected Kubernetes token to Entra, Entra checks the signature against the cluster's published keys and the subject against the federated credential, and returns a short-lived Azure access token for the managed identity. The pod then calls Key Vault or storage like any other Azure client. Nothing long-lived is stored in the cluster, rotation is automatic because both tokens are short-lived and refreshed without anyone touching them, and the blast radius of a compromised pod is one identity's RBAC assignments. If you are starting an AKS project in 2026 and a tutorial mentions aad-pod-identity or client secrets in Kubernetes secrets, close the tab.
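The wiring described above can be sketched as three small helpers. All function names, the "-fed" credential suffix, and the placeholder client ID are assumptions made for illustration; the namespace and service account (myapp, my-sa) are the ones used in the text. The az calls sit inside a function that is not invoked, so the block is safe to run as a dry read-through.

```shell
# Subject string Entra matches against the federated credential.
# $1 = namespace, $2 = service account name
federated_subject() {
  printf 'system:serviceaccount:%s:%s\n' "$1" "$2"
}

# ServiceAccount manifest annotated with the managed identity's client ID.
# $1 = namespace, $2 = service account name, $3 = identity client ID
service_account_manifest() {
cat <<EOF
apiVersion: v1
kind: ServiceAccount
metadata:
  name: $2
  namespace: $1
  annotations:
    azure.workload.identity/client-id: "$3"
EOF
}

# Azure side of the trust. Needs a logged-in az CLI; defined but not called.
# $1 = resource group, $2 = cluster, $3 = identity name, $4 = namespace, $5 = service account
federate_identity() {
  az aks update --resource-group "$1" --name "$2" \
    --enable-oidc-issuer --enable-workload-identity
  issuer="$(az aks show --resource-group "$1" --name "$2" \
    --query oidcIssuerProfile.issuerUrl -o tsv)"
  az identity create --resource-group "$1" --name "$3"
  az identity federated-credential create \
    --name "${3}-fed" --identity-name "$3" --resource-group "$1" \
    --issuer "$issuer" \
    --subject "$(federated_subject "$4" "$5")" \
    --audience api://AzureADTokenExchange
}

federated_subject myapp my-sa
service_account_manifest myapp my-sa 00000000-0000-0000-0000-000000000000
# Review the manifest, then pipe the same call into: kubectl apply -f -
```

The managed identity still needs its own role assignments (on a Key Vault or storage account, say) before the pod can do anything with the token it receives.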

Scaling: the autoscaler and KEDA

Scaling in AKS happens at two levels, and they solve different problems. The cluster autoscaler works at the node level: when pods sit unschedulable because no node has room, it raises the backing VMSS capacity within the min and max you set per pool; when nodes sit underused for long enough, it drains and removes them. It is reactive by design — pods must already be pending before it acts — so a scale-up costs you the node boot time, a couple of minutes, and your pod resource requests must be honest or the autoscaler's arithmetic is fiction. Each pool scales independently, which is another argument for splitting pools by workload: the spot pool can swing from zero to forty while the system pool never moves.

Pod-level scaling is the Horizontal Pod Autoscaler's job, and the HPA's weakness is that it only natively understands CPU and memory. Most real scaling signals are not CPU: depth of a Service Bus queue, lag on an Event Hubs partition, rows in a backlog table. KEDA (Kubernetes Event-Driven Autoscaling) closes that gap, and AKS ships it as a managed add-on you enable with a flag. KEDA watches external sources through fifty-plus scalers, drives the HPA with what it finds, and — its best trick — scales a workload to zero when the source is idle, then back up when the first message lands. The pairing is the standard pattern: KEDA decides how many pods the queue justifies, the pile of pending pods triggers the cluster autoscaler, the cluster autoscaler buys spot nodes to hold them, and the whole chain unwinds to zero pods and zero spot nodes when the queue drains. Batch processing that costs nothing at rest is an entirely realistic AKS architecture.
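A queue-driven scaler could be described roughly as follows. This is a sketch, not a tested production manifest: the function name, the deployment, Service Bus namespace, and queue names are hypothetical, the ceiling of 40 replicas and the target of five messages per replica are arbitrary, and a KEDA TriggerAuthentication named <deployment>-auth is assumed to exist already.

```shell
# ScaledObject: scale a deployment between 0 and 40 replicas on queue depth.
# $1 = deployment name, $2 = Service Bus namespace, $3 = queue name
scaled_object_manifest() {
cat <<EOF
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: $1-scaler
spec:
  scaleTargetRef:
    name: $1
  minReplicaCount: 0
  maxReplicaCount: 40
  triggers:
    - type: azure-servicebus
      metadata:
        namespace: $2
        queueName: $3
        messageCount: "5"
      authenticationRef:
        name: $1-auth
EOF
}

scaled_object_manifest worker example-sb orders
# Enable the add-on first:  az aks update -g aks-lab -n lab --enable-keda
# Then apply:               scaled_object_manifest worker example-sb orders | kubectl apply -f -
```

With minReplicaCount at 0, an empty queue leaves no pods running, and the cluster autoscaler is then free to remove the nodes that held them.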

Upgrades without drama

AKS upgrades are two separate motions, and conflating them causes most upgrade anxiety. A Kubernetes version upgrade moves the control plane (Microsoft's side, invisible to you beyond a brief API server blip) and then, if you ask, the node pools, which must stay within two minor versions of the control plane. A node image upgrade changes nothing about Kubernetes: it replaces each node's OS image with a newer one carrying kernel patches and CVE fixes. Node image upgrades are the routine one — weekly images are published — and you can automate them with a maintenance window and the node-image auto-upgrade channel, which is the closest thing AKS has to "unattended security patching done right."

Both motions replace nodes the same way: add a surge node, cordon and drain an old node onto it, then delete or reimage the old one. The max surge setting on each pool controls how aggressively. With surge at the default of one, AKS adds one extra node, drains one old node into it, and walks the pool one node at a time: gentle, slow, and fine for small pools. Set surge to 33% on a big pool and a third of the pool upgrades in parallel; production guidance sits around there, with the reminder that surged nodes are real VMs needing real subnet IPs, which is one more reason the flat-CNI subnet maths above included headroom. Pod disruption budgets are honoured during drains, so a PDB that permits zero disruption will stall an upgrade: the drain cannot make progress, and AKS eventually gives up and reports the drain failure rather than violating the budget. Maintenance windows (az aks maintenanceconfiguration) pin all of this to hours you choose, and auto-upgrade channels (patch, stable, rapid, node-image) decide how much of the treadmill runs itself. The practical floor for a production cluster: an auto-upgrade channel for node images, planned minor-version upgrades a couple of times a year, and PDBs that are tested, not aspirational.
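A budget that a drain can actually satisfy, plus the surge setting, might be sketched like this. The function names are hypothetical; the app label hello matches the deployment created in the lab below, and the budget only works if minAvailable is lower than the replica count.

```shell
# PodDisruptionBudget that leaves a drain room to make progress.
# $1 = value of the pods' app label, $2 = minAvailable (keep it below replicas)
pdb_manifest() {
cat <<EOF
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: $1-pdb
spec:
  minAvailable: $2
  selector:
    matchLabels:
      app: $1
EOF
}

# Change max surge on one pool. Needs a logged-in az CLI; defined but not called.
# $1 = resource group, $2 = cluster, $3 = pool name, $4 = surge (a count or a percentage)
set_max_surge() {
  az aks nodepool update --resource-group "$1" --cluster-name "$2" \
    --name "$3" --max-surge "$4"
}

pdb_manifest hello 1
# Apply with:   pdb_manifest hello 1 | kubectl apply -f -
# For example:  set_max_surge aks-lab lab system 33%
```

A quick way to test the budget before an upgrade does it for you is to run kubectl drain against one node and watch whether the evictions complete.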

Seeing inside: managed Prometheus and Container insights

Observability on AKS has converged on two managed pipelines. Managed Prometheus (Azure Monitor workspace under the hood) scrapes the standard Prometheus targets — kubelet, cAdvisor, kube-state-metrics, node exporter, plus your own pods via pod annotations or scrape configs — without you running or storing Prometheus yourself, and feeds Azure Managed Grafana, where the usual Kubernetes dashboards arrive prebuilt. Container insights is the log half: an agent on each node ships container stdout and stderr plus Kubernetes events into a Log Analytics workspace, where KQL queries can answer "what did this pod log in the minute before it restarted" months after the fact. Metrics in Prometheus format, logs in Log Analytics, dashboards in Grafana, alerts in Azure Monitor — none of it is exotic, which is the point. The one decision worth making early is log volume: Container insights priced over a chatty cluster's full stdout is a real line item, and the agent's data collection settings (namespaces to exclude, collection interval) deserve five minutes of thought before the first invoice rather than after.

Private clusters

By default the API server has a public address, protected by Entra auth and optionally an IP allowlist. A private cluster removes the public address entirely: the API server is reachable only through a private endpoint inside your VNet, with a private DNS zone resolving the cluster FQDN to it. Nodes talk to the control plane over that link, and so must everything else — which is the operational catch. Your laptop cannot run kubectl against a private cluster without being network-adjacent: VPN, ExpressRoute, a jump box in a peered VNet, or az aks command invoke, which tunnels single commands through the AKS API as a managed escape hatch. CI/CD needs the same treatment, usually self-hosted runners inside the VNet. The middle path, API server VNet integration with a public endpoint plus authorised IP ranges, keeps casual internet traffic out without rebuilding your tooling, and is honestly where many teams should stop. Choose full privacy when compliance demands it, and budget for the plumbing it drags in.
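Both escape routes mentioned above fit in a few lines. This is a sketch with hypothetical function names; the az calls need a logged-in CLI and are wrapped in functions that are not invoked here.

```shell
# Turn a single address into a one-host CIDR for the allowlist.
single_host_cidr() {
  printf '%s/32\n' "$1"
}

# Middle path: keep the public endpoint but restrict it to one address.
# $1 = resource group, $2 = cluster, $3 = public IP to allow
allow_only() {
  az aks update --resource-group "$1" --name "$2" \
    --api-server-authorized-ip-ranges "$(single_host_cidr "$3")"
}

# Private cluster: run one command through the AKS API, no network path needed.
# $1 = resource group, $2 = cluster, remaining arguments = the command to run
invoke_in_cluster() {
  rg="$1"; cluster="$2"; shift 2
  az aks command invoke --resource-group "$rg" --name "$cluster" --command "$*"
}

# Example: invoke_in_cluster aks-lab lab kubectl get nodes -o wide
```

The invoke route suits one-off checks; anything interactive or long-running still wants a real network path to the API server.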

AKS next to GKE and EKS

An honest placement, since everyone asks. GKE is the most automated of the three: Google ran Kubernetes-shaped infrastructure before Kubernetes existed, Autopilot mode removes node management entirely, and features tend to land there first and most polished. EKS is the most assembly-required: AWS gives you a sturdy control plane and expects you to bolt on networking add-ons, IAM plumbing, and (until Auto Mode's recent arrival) your own node lifecycle answers — fine for teams that want the control, slow for teams that do not. AKS sits between them: more managed than EKS out of the box, with node pools, autoscaling, KEDA, and monitoring arriving as flags rather than projects, while stopping short of GKE's hands-off Autopilot. Its differentiating edge is identity — no other provider's directory integration goes as deep as Entra ID for cluster auth, Azure RBAC down to namespace scope, and workload identity federation wired through the same tenant — which is why organisations already living in Entra usually find AKS the path of least resistance, and organisations that are not rarely choose Azure for Kubernetes alone. For the AWS side of this comparison, see containers on AWS.

CLI lab: a real cluster, start to teardown

Twenty minutes, one resource group, and everything above made concrete. You need the az CLI logged in, kubectl, and kubelogin (az aks install-cli installs both). Costs are a few cents if you tear down at the end; the VM sizes here are deliberately small.

1. Create the cluster with a tainted system pool. Standard tier, CNI Overlay, Entra auth with Azure RBAC, autoscaler on, local accounts off:

az group create --name aks-lab --location westeurope

az aks create \
  --resource-group aks-lab \
  --name lab \
  --tier standard \
  --network-plugin azure \
  --network-plugin-mode overlay \
  --nodepool-name system \
  --node-count 1 \
  --node-vm-size Standard_B2s \
  --nodepool-taints CriticalAddonsOnly=true:NoSchedule \
  --enable-cluster-autoscaler --min-count 1 --max-count 2 \
  --enable-aad --enable-azure-rbac \
  --disable-local-accounts \
  --generate-ssh-keys

While it provisions (about five minutes), look at az group list -o table and find the MC_aks-lab_lab_westeurope group AKS created. The VMSS for your system pool lives there; the control plane appears nowhere, because it is not in your subscription.

2. Add a spot user pool. Different VM size, spot priority, scale-to-zero minimum:

az aks nodepool add \
  --resource-group aks-lab \
  --cluster-name lab \
  --name spotpool \
  --mode User \
  --priority Spot --eviction-policy Delete --spot-max-price -1 \
  --node-vm-size Standard_D2s_v3 \
  --enable-cluster-autoscaler --min-count 0 --max-count 3 \
  --node-count 1

az aks nodepool list -g aks-lab --cluster-name lab \
  -o table --query "[].[name,mode,scaleSetPriority,count]"

3. Get credentials through Entra. Note what lands in your kubeconfig — an exec plugin, not a certificate:

az aks get-credentials --resource-group aks-lab --name lab
kubelogin convert-kubeconfig -l azurecli

kubectl get nodes -o wide

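Because the kubeconfig was converted to the azurecli login mode, kubelogin reuses your existing az session rather than opening a browser. If the first kubectl call returns a permissions error, that is Azure RBAC doing its job: grant yourself a role, allow a minute or two for the assignment to propagate, and retry: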

az role assignment create \
  --assignee "$(az ad signed-in-user show --query id -o tsv)" \
  --role "Azure Kubernetes Service RBAC Cluster Admin" \
  --scope "$(az aks show -g aks-lab -n lab --query id -o tsv)"

Inspect the taints: kubectl describe node | grep -i taint shows CriticalAddonsOnly on the system node and the scalesetpriority=spot taint AKS added to the spot node for you.

4. Deploy onto the spot pool and scale it. The toleration and selector steer the workload to spot:

kubectl create deployment hello \
  --image=mcr.microsoft.com/azuredocs/aks-helloworld:v1 --replicas=2

kubectl patch deployment hello --type=json -p '[
  {"op":"add","path":"/spec/template/spec/tolerations","value":
    [{"key":"kubernetes.azure.com/scalesetpriority","operator":"Equal",
      "value":"spot","effect":"NoSchedule"}]},
  {"op":"add","path":"/spec/template/spec/nodeSelector","value":
    {"kubernetes.azure.com/scalesetpriority":"spot"}}]'

kubectl scale deployment hello --replicas=12
kubectl get pods -o wide -w

Watch the pending pods pile up, then watch the cluster autoscaler raise the spot pool — kubectl get nodes -w shows new nodes joining within a few minutes. Scale back to 1 replica afterwards and, given ten quiet minutes, the autoscaler walks the spot pool back down. That round trip is the entire economics of spot pools in one terminal.

5. Tear it down. One group deletion removes the cluster, and deleting the cluster removes its MC_ group:

az group delete --name aks-lab --yes --no-wait

If you want one more experiment before deleting: az aks nodepool upgrade with --node-image-only on the spot pool shows the surge-and-drain choreography from the upgrades section live, node by node.
