GKE

Kubernetes came out of Google, and GKE is the version Google runs for you. That sentence carries more weight than it sounds: the control plane is operated by the people who wrote the scheduler, the networking maps onto the same VPC fabric the rest of the cloud uses, and with Autopilot you can stop thinking about nodes entirely. This page covers what the managed control plane actually does, the Autopilot versus Standard decision, node pools, VPC-native pod networking, Workload Identity, release channels, the three autoscalers, and what all of it costs. It ends with a lab you can run in twenty minutes.

The Borg lineage, and why it matters

Kubernetes did not appear from nowhere. It was designed by engineers who had spent a decade running Borg, the internal cluster manager that has scheduled essentially all of Google's production workloads since the mid-2000s, and its successor research system Omega. When Kubernetes was open-sourced in 2014, it was the third draft of ideas Google had already operated at planetary scale: declarative desired state, a reconciling control loop, labels instead of hierarchies, pods as the unit of scheduling. If you want the mechanics of those ideas, the Kubernetes internals series walks through them in detail; this page is about what changes when Google hosts them for you.

GKE shipped in 2015 as the first managed Kubernetes service, a full two years before EKS and AKS existed, and that head start shows in the details. Features tend to land on GKE first because many of the upstream maintainers work at Google. The control plane is operated by an SRE organisation that has been running this exact software, in one form or another, longer than anyone else. And the integration with the rest of the platform is unusually deep: pod IPs are real VPC addresses, Kubernetes Services map onto Google's production load balancers, and pod identity plugs straight into IAM. None of this makes GKE magic, but it is why "GKE is the best managed Kubernetes" is a common opinion even among engineers who otherwise prefer AWS. If you want the comparison from the other side, the AWS containers page covers EKS and the trade-offs teams weigh there.

The practical consequence of the lineage is a particular attitude: GKE assumes you want the platform to do more, not less. Node auto-repair and auto-upgrade default to on. Release channels exist so you never hand-pick versions. Autopilot goes furthest and removes nodes from your mental model entirely. The rest of this page is structured around the layers of that hand-off: first what Google always runs, then the big decision about how much of the rest you want to keep.

The managed control plane

Every Kubernetes cluster has two halves: a control plane that decides what should run where, and worker nodes that run it. The control plane is the API server your kubectl talks to, the etcd database holding all cluster state, the scheduler assigning pods to nodes, and the controller managers reconciling actual state toward desired state. The architecture page dissects each of these; the point here is that in GKE you never see any of them as machines. They run in a Google-owned project on Google-managed VMs, and your only interface to them is the Kubernetes API endpoint and the GKE API itself.

That hand-off is bigger than it sounds. Operating etcd well is the hardest part of running Kubernetes yourself: it needs fast disks, careful compaction, regular defragmentation, backups, and a quorum that survives zone failures. GKE does all of that invisibly, encrypts etcd at rest, scales the control plane VMs vertically as your cluster grows, patches CVEs, and upgrades the masters with no action from you. When people say managed Kubernetes is worth the fee, this is mostly what they mean.

A regional cluster: control plane replicas in three zones, run by Google. Node pools are the only machines that live in your project.

The one control-plane decision you do make is its shape. A zonal cluster has a single control plane in a single zone. It is eligible for the free-tier credit and fine for experiments, but when that zone has a bad day, or the control plane is being upgraded, the API becomes unavailable. Your pods keep serving traffic, because running containers do not need the API to stay alive, but nothing can be deployed, scaled, or rescheduled until it returns. A regional cluster replicates the control plane across three zones behind one endpoint, survives a zone outage, upgrades without API downtime, and carries the higher SLA. For anything you would page someone about, choose regional. Autopilot makes the choice for you: every Autopilot cluster is regional.
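As a rough sketch of how that choice looks on the command line for a Standard cluster, the location flag is what separates the two shapes. The cluster names and the zone are placeholders chosen for illustration, and both commands create billable resources.

```shell
# Zonal: a single control plane in one zone (placeholder name and zone).
gcloud container clusters create lab-zonal \
  --zone us-central1-a \
  --num-nodes 1

# Regional: the control plane is replicated across zones in the region.
# --num-nodes is counted per zone, so this asks for one node in each zone.
gcloud container clusters create lab-regional \
  --region us-central1 \
  --num-nodes 1
```

The per-zone node count is the detail that tends to surprise people: the same `--num-nodes` value yields more machines in a regional cluster than in a zonal one.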

Autopilot or Standard: the big decision

Once Google runs the control plane, the remaining question is who runs the nodes, and this is the first real choice you make when creating a cluster. Standard mode is classic managed Kubernetes: you define node pools, pick machine types, set autoscaler bounds, and pay for the VMs whether pods fill them or not. Autopilot removes the node layer from your view. You submit pods with resource requests; GKE provisions whatever compute they need, schedules them, and bills you for the requests themselves, per pod, per second. There are nodes underneath, you can even list them with kubectl, but they are Google's problem: their size, their lifecycle, their patching, their bin-packing.

The same stack, two splits. Autopilot moves every layer below your manifests onto Google's side of the line.

The constraints are the price of the abstraction. Autopilot rejects privileged pods, host network, and hostPath volumes except for a vetted set of partner agents, because anything that touches the node breaks the model where the node belongs to Google. You cannot SSH to nodes. Resource requests are subject to minimums and ratios, and if your container declares no requests, defaults are imposed, which means a sloppy manifest costs real money instead of silently squatting on shared capacity. DaemonSets are allowed but billed like everything else, so a heavyweight per-node logging agent shows up on the invoice. GPU and spot workloads are supported, but through specific workload classes rather than hand-built node pools.

The honest decision rule: start with Autopilot unless you can name the thing it will not let you do. Teams that need Standard usually know exactly why — a kernel module, a custom CNI, a per-node cache that wants hostPath, very specific machine shapes for license reasons, or fleet-wide bin-packing they believe they can do more cheaply than the per-request premium. Everyone else is buying back operations time, and the failure mode of choosing Standard "to keep options open" is a year of node pool upkeep nobody enjoyed. It is also worth asking the question one level up: if your service is a stateless container behind HTTP, Cloud Run may make the whole cluster unnecessary. Autopilot is for when you want Kubernetes semantics without node operations; Cloud Run is for when you do not want Kubernetes at all.

Node pools, if you choose Standard

A node pool is a group of identical VMs managed as one unit: same machine type, same disk, same labels and taints, same autoscaler bounds. A cluster can hold many pools, and that is how you express heterogeneity. A typical production cluster might run a general pool of e2-standard-4 machines for stateless services, a high-memory pool for the JVM workloads, a GPU pool that scales from zero for inference jobs, and a spot-VM pool for batch work that tolerates eviction. Pods steer themselves with node selectors and tolerations: the GPU pool carries a taint so only pods that explicitly tolerate it land there, which keeps a stray web replica from occupying a machine that costs ten times as much.
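A minimal sketch of the taint-and-toleration pattern, using a spot pool for batch work rather than a GPU pool to keep it short. It assumes a Standard cluster named `prod` in `us-central1`; the pool name, the `workload=batch` taint, and the pod are all hypothetical.

```shell
# Create a spot pool that scales from zero and repels pods by default.
# In a regional cluster the min/max bounds apply per zone.
gcloud container node-pools create batch-spot \
  --cluster prod \
  --region us-central1 \
  --machine-type e2-standard-4 \
  --spot \
  --node-taints workload=batch:NoSchedule \
  --enable-autoscaling --min-nodes 0 --max-nodes 5

# A pod that opts in: it selects spot nodes and tolerates the pool's taint.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: batch-example
spec:
  restartPolicy: Never
  nodeSelector:
    cloud.google.com/gke-spot: "true"
  tolerations:
    - key: workload
      operator: Equal
      value: batch
      effect: NoSchedule
  containers:
    - name: worker
      image: busybox:1.36
      command: ["sh", "-c", "echo working; sleep 30"]
      resources:
        requests:
          cpu: 250m
          memory: 256Mi
EOF
```

The taint keeps everything else off the pool; the node selector keeps this pod off everything else. Both halves are needed, since a toleration alone only permits placement and does not require it.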

Each pool autoscales independently. You set a minimum and maximum node count, and the cluster autoscaler grows the pool when pods are pending and shrinks it when nodes sit underused. Scaling to zero is allowed and is the normal pattern for expensive accelerators. Two per-pool settings deserve respect rather than reflexive disabling. Auto-repair watches node health and recreates nodes that fail their checks, which quietly absorbs the steady background rate of VM failures any large fleet has. Auto-upgrade keeps node kubelets in step with the control plane version, draining and replacing nodes one at a time within maintenance windows you define. Both default to on, and both exist because the alternative, a human doing the same dance by hand across hundreds of nodes, is where self-managed clusters go to rot. If your workloads have a correct PodDisruptionBudget, upgrades are a non-event; if they do not, the budget is the thing to fix, not the automation.
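For reference, a PodDisruptionBudget is a small object. The sketch below assumes a Deployment whose pods carry the label `app: hello-web` and run at least three replicas; the name and the `minAvailable` value are illustrative.

```shell
kubectl apply -f - <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: hello-web-pdb
spec:
  minAvailable: 2          # a drain may never take the workload below two pods
  selector:
    matchLabels:
      app: hello-web
EOF

# The ALLOWED DISRUPTIONS column shows how many pods a drain may evict now.
kubectl get pdb hello-web-pdb
```

A budget that allows zero disruptions, for example `minAvailable` equal to the replica count, blocks node drains rather than protecting them, so it is worth checking that column before an upgrade window.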

The operational sharp edge in Standard is request hygiene. The scheduler places pods by their declared requests, the autoscaler buys nodes by the same numbers, and the bill follows the VMs. Requests set far above real usage inflate the fleet; requests set too low overcommit nodes and produce mysterious evictions under pressure. Standard gives you the controls and therefore the consequences; this single feedback loop is most of what Autopilot is selling to make go away.
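One way to look at that feedback loop, sketched with stock kubectl commands: print what pods declare, then what they actually use. This assumes the cluster's metrics pipeline is available so that `kubectl top` returns data.

```shell
# Declared requests per pod: the numbers the scheduler and autoscaler count.
kubectl get pods -o custom-columns='NAME:.metadata.name,CPU_REQ:.spec.containers[*].resources.requests.cpu,MEM_REQ:.spec.containers[*].resources.requests.memory'

# Measured usage right now: what the containers actually consume.
kubectl top pods --containers
```

A single snapshot is only a hint; requests should be judged against usage over a representative period, not one quiet moment.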

VPC-native networking: pods get real IPs

Most Kubernetes installations run an overlay: pod IPs live in a private range the underlying network knows nothing about, and traffic between nodes is encapsulated or NATed at the edge. GKE clusters are VPC-native instead. Pod IPs are allocated out of your VPC subnet and are first-class addresses on the network fabric, reachable from other VMs in the VPC, across peering, and over VPN or Interconnect, with no encapsulation and no NAT in the path. How subnets, routes, and peering behave generally is covered in VPC networking; the GKE-specific part is how the addresses are carved up.

A VPC-native cluster uses one subnet with three ranges. The primary range supplies node IPs, exactly as it would for any VM. Two secondary ranges supply pod IPs and Service ClusterIPs. When a node joins the cluster, GKE assigns it a slice of the pod range, by default a /24 per node to comfortably host the default ceiling of 110 pods, and attaches that slice to the node's network interface as an alias IP range. Alias ranges are the mechanism that makes this work: the VPC's SDN knows that this whole /24 lives behind this specific VM, so packets for any pod on the node route straight to it. No per-node custom routes, no route table limits, no overlay.

VPC-native allocation: nodes from the primary range, each node holding a /24 alias slice of the pod secondary range.
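The three-range layout can be sketched as a subnet with two named secondary ranges, followed by a cluster that points at them. The network, subnet, and range names and every CIDR below are assumptions chosen for illustration, and the VPC `my-vpc` is assumed to exist already.

```shell
# One subnet: primary range for nodes, secondary ranges for pods and Services.
gcloud compute networks subnets create gke-subnet \
  --network my-vpc \
  --region us-central1 \
  --range 10.0.0.0/20 \
  --secondary-range pods=10.4.0.0/14,services=10.8.0.0/20

# A VPC-native cluster that draws pod and Service IPs from those ranges.
gcloud container clusters create vpc-native-lab \
  --region us-central1 \
  --network my-vpc \
  --subnetwork gke-subnet \
  --enable-ip-alias \
  --cluster-secondary-range-name pods \
  --services-secondary-range-name services
```

Naming the ranges yourself, rather than letting GKE pick them, is what makes the address plan reviewable before the cluster exists.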

The planning consequence is that IP space becomes a capacity decision you make on day one. A /24 per node means the pod secondary range caps your node count: a /16 pod range supports about 256 nodes at default density, and resizing ranges after the fact is painful. Teams carving many clusters out of a shared corporate address plan often lower max pods per node to make each node consume a /25 or /26 instead. It feels like over-planning until the day a cluster cannot scale because its pod range is exhausted, which is an unpleasant place to be.
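The arithmetic behind that ceiling is a power of two, which a small shell function makes explicit. This is a sketch: the function name is made up here, and it ignores any addresses GKE may reserve.

```shell
# nodes_for_pod_range RANGE_PREFIX PER_NODE_PREFIX
# Each node takes one block of size 2^(32 - PER_NODE_PREFIX) from the pod
# range, so the node ceiling is 2^(PER_NODE_PREFIX - RANGE_PREFIX).
nodes_for_pod_range() {
  echo $(( 1 << ($2 - $1) ))
}

nodes_for_pod_range 16 24   # /16 pod range, /24 per node -> 256
nodes_for_pod_range 16 26   # same range, /26 per node    -> 1024
nodes_for_pod_range 14 24   # /14 pod range, /24 per node -> 1024
```

Shrinking the per-node slice by two bits buys the same headroom as growing the pod range by two bits. The per-node slice is governed by the max-pods-per-node setting, which gcloud exposes as `--default-max-pods-per-node` at cluster creation.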

From Service to Google load balancer

Because pods have VPC addresses, Kubernetes networking objects can map onto Google's real load balancers instead of emulating one. A Service of type LoadBalancer provisions a passthrough Network Load Balancer: regional, layer 4, forwarding TCP or UDP to your nodes or pods. An Ingress (or, on current clusters, the Gateway API, which GKE backs with the same infrastructure and is where new work should go) provisions an external Application Load Balancer: the same global anycast HTTP(S) front end that serves google.com, with TLS termination, URL routing, Cloud Armor policies, and CDN integration available by annotation. You write portable Kubernetes YAML; the controller translates it into forwarding rules, backend services, and health checks in your project.

The detail worth knowing is container-native load balancing through network endpoint groups, or NEGs. The naive path sends load balancer traffic to a node port, where kube-proxy forwards it to some pod, often on a different node, adding a hop and hiding from the load balancer which pod is actually healthy. With NEGs, the load balancer holds the pod IPs themselves as its backends, health-checks each pod directly, and delivers traffic in one hop. On VPC-native clusters this is the default for new Services behind an Application Load Balancer, and it is one of those features that quietly explains why tail latencies on GKE often look better than an equivalent setup elsewhere.
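Where container-native load balancing is not already on, it is requested with an annotation on the Service. The sketch below assumes pods labelled `app: hello-web` listening on port 8080; the Service name is hypothetical.

```shell
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: hello-web-neg
  annotations:
    cloud.google.com/neg: '{"ingress": true}'   # use pod IPs as LB backends
spec:
  type: ClusterIP
  selector:
    app: hello-web
  ports:
    - port: 80
      targetPort: 8080
EOF

# Once an Ingress references the Service, its NEGs are listed here.
gcloud compute network-endpoint-groups list
```

The Service can stay `ClusterIP` because the load balancer no longer needs a node port to reach the pods.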

Workload Identity: pods get IAM identities, not key files

Sooner or later a pod needs to call a Google API: read a bucket, publish to Pub/Sub, query BigQuery. The wrong answer is exporting a service account JSON key and mounting it as a Kubernetes secret. Keys are long-lived bearer credentials; they leak through CI logs, laptop backups, and copied manifests, and revoking one is a fire drill. Workload Identity exists so you never create the file at all.

The mechanism: enabling Workload Identity gives the cluster a workload identity pool of the form PROJECT_ID.svc.id.goog, and every Kubernetes service account in the cluster becomes a principal Google IAM can reason about. You bind a Kubernetes service account to a Google service account with a single IAM policy binding, annotate the Kubernetes account with its Google counterpart, and run your pods under it. Inside the pod, a GKE-managed metadata server intercepts the SDK's routine credential lookup and exchanges the pod's Kubernetes token for a short-lived OAuth token belonging to the Google service account. Your application code is unchanged; the default credential chain simply works. Recent GKE goes a step further and lets IAM bind roles directly to the Kubernetes service account principal, skipping the intermediate Google service account entirely.
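The bind-and-annotate flow can be sketched in four commands. Every name below is a placeholder, and the cluster is assumed to have Workload Identity enabled already, as Autopilot clusters do.

```shell
# Placeholders: set these to your own values. All names are hypothetical.
PROJECT_ID=your-project-id
NAMESPACE=default
KSA=app-ksa   # Kubernetes service account
GSA=app-gsa   # Google service account

kubectl create serviceaccount "${KSA}" --namespace "${NAMESPACE}"
gcloud iam service-accounts create "${GSA}" --project "${PROJECT_ID}"

# Allow the Kubernetes service account to act as the Google one.
gcloud iam service-accounts add-iam-policy-binding \
  "${GSA}@${PROJECT_ID}.iam.gserviceaccount.com" \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:${PROJECT_ID}.svc.id.goog[${NAMESPACE}/${KSA}]"

# Point the Kubernetes service account at its Google counterpart.
kubectl annotate serviceaccount "${KSA}" --namespace "${NAMESPACE}" \
  "iam.gke.io/gcp-service-account=${GSA}@${PROJECT_ID}.iam.gserviceaccount.com"
```

Pods pick the identity up by setting `serviceAccountName` to the Kubernetes account. The Google service account still needs its own role grants on whatever the pod is meant to reach, such as a bucket or a topic.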

The rule of thumb. If a manifest in your repo contains a base64 blob that starts with ewog, that is a JSON key pretending to be configuration. Workload Identity is the supported replacement, it costs nothing, and on Autopilot it is on by default. There is no remaining good reason to mount key files in GKE.
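The `ewog` marker is easy to verify: it is the base64 encoding of the three bytes that open a pretty-printed JSON file. The helper below is a hypothetical heuristic, not a complete secret scanner; a key that was minified before encoding would start with different characters.

```shell
# The first three bytes of a pretty-printed JSON key: "{", newline, space.
printf '{\n ' | base64    # prints ewog

# is_json_key_blob VALUE: succeeds if VALUE starts with that marker.
is_json_key_blob() {
  case "$1" in
    ewog*) return 0 ;;
    *)     return 1 ;;
  esac
}

# Hypothetical usage: list files under ./manifests that contain the marker.
# grep -rl 'ewog' ./manifests
```

A match is a reason to look, not proof: any pretty-printed JSON encodes the same way, so the check flags candidates for a human to inspect.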

Release channels: someone else picks the version

Kubernetes releases three minor versions a year and supports each for about fourteen months, which means version management is a treadmill, not a task. GKE's answer is release channels. You enrol a cluster in Rapid, Regular, or Stable, and Google qualifies versions, promotes them through the channels, and upgrades both your control plane and your nodes automatically within maintenance windows you set. Rapid tracks new minors within weeks of upstream and is for teams who need a new feature or like the bleeding edge. Regular, the default, lags by a couple of months of soak time. Stable waits longest and moves slowest. There is also an Extended channel that holds old minors past normal support for a premium, for the workloads with awkward dependencies.

You can pin exact versions and exclude windows around peak events, but the channel model is deliberately opinionated: clusters should never fall far enough behind that upgrading becomes a project. The practical guidance is dull and correct. Run Regular in production, define maintenance windows that match your traffic, keep PodDisruptionBudgets accurate so node drains are routine, and read the release notes for deprecated APIs before each minor lands. Teams that treat upgrades as continuous background noise spend dramatically less time on them than teams that batch three minors into one weekend.
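A sketch of that guidance as commands, assuming an existing cluster named `prod` in `us-central1`. The cluster name, dates, and times are illustrative; the window is given in UTC and repeats on the recurrence rule.

```shell
# See which versions each channel currently offers in the region.
gcloud container get-server-config --region us-central1 \
  --format "yaml(channels)"

# Enrol the cluster in the Regular channel.
gcloud container clusters update prod --region us-central1 \
  --release-channel regular

# Allow automatic upgrades only in a six-hour window on weekend nights.
gcloud container clusters update prod --region us-central1 \
  --maintenance-window-start 2025-01-04T02:00:00Z \
  --maintenance-window-end 2025-01-04T08:00:00Z \
  --maintenance-window-recurrence 'FREQ=WEEKLY;BYDAY=SA,SU'
```

The start and end timestamps define the length of each window; the recurrence decides which days it repeats on.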

Three autoscalers, three different questions

GKE discussions blur three scaling mechanisms together, and they answer different questions. The Horizontal Pod Autoscaler answers "how many replicas of this workload?" It watches CPU, memory, or custom metrics and adjusts the replica count of a Deployment. The Vertical Pod Autoscaler answers "how big should each replica be?" It observes actual usage and recommends, or rewrites, the resource requests, which matters because requests drive both scheduling and, on Autopilot, the bill directly. The cluster autoscaler answers "how many nodes does all of this need?" It exists to serve the first two: when the HPA adds replicas and no node has room, pods go pending, and pending pods are the signal that grows a node pool.

They compose into one chain. Traffic rises, the HPA scales the Deployment from ten replicas to thirty, the scheduler places what fits, the remainder sit pending, the cluster autoscaler adds nodes from whichever pool matches, and the pending pods land. On the way down the chain runs in reverse, with the autoscaler draining and deleting underused nodes. The classic failure is a fight between layers: a VPA inflating requests while the HPA is scaling on utilisation of those same requests can oscillate, which is why the standard advice is HPA for replica count and VPA in recommendation mode until you understand your workload's real shape. On Autopilot the third question disappears, since capacity tracks pending pods automatically, and Standard offers node auto-provisioning, which lets the autoscaler create whole new node pools with machine shapes inferred from pending pods, rather than only resizing pools you defined.
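The standard advice can be sketched as two objects: an HPA that owns the replica count and a VPA that only recommends. The target name `hello-web` and the thresholds are assumptions. The VPA object requires vertical pod autoscaling to be available on the cluster, which on Standard is a cluster-level setting.

```shell
kubectl apply -f - <<'EOF'
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: hello-web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: hello-web
  minReplicas: 10
  maxReplicas: 30
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60   # percent of the CPU request
---
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: hello-web
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: hello-web
  updatePolicy:
    updateMode: "Off"   # recommend only; never rewrite running pods
EOF

# Read the recommendation without letting the VPA act on it.
kubectl describe vpa hello-web
```

With the VPA in `Off` mode the two cannot fight: the HPA's utilisation target is measured against requests that only a human changes.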

What it costs

GKE pricing has two layers: a flat management fee and the compute underneath. The management fee is $0.10 per cluster per hour, about $73 a month, regardless of mode or size, and every billing account gets a free-tier credit that covers the fee for one zonal or one Autopilot cluster. The fee is trivial against any real workload, but it does mean cluster sprawl has a floor cost, and it is one of several reasons to prefer fewer, larger, multi-tenant clusters over a cluster per team.

Below the fee, the modes diverge. Standard bills you for the node VMs at ordinary Compute Engine prices, including sustained and committed use discounts and spot rates. Utilisation is your problem: if your pods request half a node and the other half sits idle, you pay for the idle half. Autopilot bills per pod, on the vCPU, memory, and ephemeral storage your pods request, per second. The unit prices are higher than the equivalent raw VM, which is the premium for Google absorbing the bin-packing and node operations, but there is no idle capacity on your bill at all. The crossover is a utilisation question: a team that keeps Standard nodes consistently busy can beat Autopilot on raw cost, while a team running spiky or modest workloads on half-empty nodes is usually paying more for Standard plus the engineer-hours of tending it. Be honest about the engineer-hours; they are the part that never shows up on the invoice but always shows up somewhere.
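The two numbers in this section reduce to simple arithmetic. In the sketch below the management fee uses the figures quoted above, while the per-vCPU prices are hypothetical values invented for illustration, not real GKE rates. The helper assumes cost scales linearly with vCPU and ignores memory, discounts, and engineer-hours.

```shell
# Management fee: $0.10 per cluster-hour over an average 730-hour month.
awk 'BEGIN { printf "%.2f\n", 0.10 * 730 }'    # prints 73.00

# breakeven_utilisation STANDARD_PRICE AUTOPILOT_PRICE
# Prices are per vCPU-hour. Standard costs STANDARD_PRICE / utilisation per
# vCPU actually requested, so it matches Autopilot when utilisation equals
# STANDARD_PRICE / AUTOPILOT_PRICE. The result is printed as a percentage.
breakeven_utilisation() {
  awk -v s="$1" -v a="$2" 'BEGIN { printf "%.1f\n", 100 * s / a }'
}

# HYPOTHETICAL prices: 0.030 (Standard VM) vs 0.045 (Autopilot request).
breakeven_utilisation 0.030 0.045    # prints 66.7
```

Under those made-up prices, Standard wins on raw compute only if nodes stay more than about two-thirds requested. Substituting current list prices for your region gives the real threshold.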

Lab: an Autopilot cluster from zero

Twenty minutes, one region, and everything torn down at the end. You need the gcloud CLI authenticated against a project with billing enabled, and kubectl, which gcloud can install as a component. The cluster creation step is the slow one; expect five to ten minutes.

Enable the API and create the cluster. One subcommand separates Autopilot from Standard: create-auto. No machine types, no node counts, no pools.

    gcloud config set project YOUR_PROJECT_ID
    gcloud services enable container.googleapis.com
    gcloud container clusters create-auto gke-lab \
      --region us-central1

    # Wire kubectl to the new cluster
    gcloud container clusters get-credentials gke-lab --region us-central1
    kubectl get nodes

That last command is worth a pause: Autopilot has already provisioned a small set of nodes you never asked for, sized for system pods. You can see them, but you cannot SSH to them and you are not billed for them as VMs.
Deploy a sample workload. Google publishes a tiny hello-world server for exactly this purpose. Give it explicit requests, because on Autopilot requests are the bill.

    kubectl create deployment hello-web \
      --image=us-docker.pkg.dev/google-samples/containers/gke/hello-app:1.0
    kubectl set resources deployment hello-web \
      --requests=cpu=250m,memory=256Mi
    kubectl scale deployment hello-web --replicas=3
    kubectl get pods --watch

The first pods may sit Pending for a minute or two. That pause is the entire Autopilot model in one observation: pending pods triggered node provisioning, and you did nothing.
Expose it through a load balancer. A Service of type LoadBalancer makes GKE provision a passthrough Network Load Balancer with a public IP.

    kubectl expose deployment hello-web \
      --type=LoadBalancer --port=80 --target-port=8080

    # Wait for EXTERNAL-IP to move from pending to a real address
    kubectl get service hello-web --watch

When the address appears, hit it:

    curl http://EXTERNAL_IP
    # Hello, world!
    # Version: 1.0.0
    # Hostname: hello-web-...

Run curl a few times and watch the hostname change as the load balancer spreads requests across your three pods.
Observe what got built. Look at the pods' addresses and the nodes that appeared to host them, then peek at the load balancer GKE created on your behalf.

    kubectl get pods -o wide                # note pod IPs from the secondary range
    kubectl get nodes                       # nodes Autopilot provisioned for your pods
    kubectl describe service hello-web
    gcloud compute forwarding-rules list    # the NLB's forwarding rule, made for you

The pod IPs are real VPC addresses, the nodes appeared in response to your replicas, and the forwarding rule is a normal Google Cloud load balancer object you could inspect like any other. Nothing here is emulated.
Tear it all down. Delete the Service first so the load balancer and its external IP are released, then the cluster. Leaving either running is the classic way a lab becomes a line item.

    kubectl delete service hello-web
    kubectl delete deployment hello-web
    gcloud container clusters delete gke-lab \
      --region us-central1 --quiet

What you just did by hand, create, deploy, expose, observe, destroy, is the loop every GKE deployment pipeline automates. The parts that were invisible, etcd, the scheduler, node provisioning, the load balancer plumbing, are the parts this page was about.
