Cloud Run

Cloud Run makes one promise: hand it any container that listens on a port, and it will run that container for you, scale it from zero to thousands of copies based on incoming requests, and bill you only while requests are being handled. No language runtime list, no function signature, no cluster to manage. The interesting engineering lives in how it scales — on concurrent requests, with each instance serving many at once — and that single design choice changes the cost maths, the cold-start maths, and the mental model you should carry, especially if you arrive from Lambda.

A container, not a function

Most serverless platforms started from the function. You write a handler with a prescribed signature, in one of a short list of supported runtimes, package it the platform's way, and the platform owns everything around it: the process model, the web server, the request parsing. AWS Lambda is the canonical example, and the Lambda execution model page covers what that shape buys and costs you. Cloud Run started from the other end. Its unit of deployment is a container image, and the contract is almost embarrassingly small: your container must start an HTTP server that listens on the port given in the PORT environment variable (8080 by default), and it must be prepared to be killed at any time. That is the whole interface.
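The contract is small enough to sketch as a container entrypoint. What follows is a minimal sketch under stated assumptions, not a recommended production setup: the function names listen_port and start_server are hypothetical, and python3 -m http.server only stands in for whatever server your image really runs.

```shell
# Resolve the port to listen on. Cloud Run supplies PORT; falling back to
# 8080 (the default named above) keeps plain local runs working.
listen_port() {
  printf '%s\n' "${PORT:-8080}"
}

# Hypothetical entrypoint body. exec replaces the shell with the server, so
# the server process itself receives signals such as SIGTERM.
# python3 -m http.server is only a stand-in for your real server.
start_server() {
  exec python3 -m http.server "$(listen_port)" --bind 0.0.0.0
}

# A real entrypoint script would end by calling: start_server
```

Reading the port from the environment rather than hard-coding it is the only platform-specific line; everything else is an ordinary server start.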

The consequences of that contract are larger than they look. Any language works, including ones Google has never heard of, because the platform never calls into your code — it just sends HTTP to a port. A twenty-year-old Java service, a Rust binary, a Flask app, an nginx reverse proxy, a compiled Fortran model behind a tiny Go shim: if it can be put in a container and answer HTTP, it can run. Your web framework runs as itself, with its own routing and middleware, rather than being dismembered into per-route functions. Local development is honest, because the thing you run on your laptop with docker run is byte-for-byte the thing that runs in production. And there is no platform lock-in at the code level: the same image runs on Cloud Run today and on a Kubernetes cluster tomorrow.

The trade against function-shaped serverless is mostly about who owns the server. With a function platform, you cannot get the web server wrong because you never see it. With Cloud Run, you bring your own, which means you can also bring your own mistakes: a server that takes thirty seconds to boot, a framework configured for four worker threads when the instance will receive eighty concurrent requests, a process that buffers uploads into memory. Cloud Run gives you the freedom of a real process and quietly expects you to know how to run one.

The Knative heritage

Cloud Run did not appear from nowhere. It is Google's managed implementation of the Knative serving model, an open-source layer that Google and others built on top of Kubernetes to describe request-driven, scale-to-zero workloads. Knative defines the vocabulary Cloud Run still speaks: a service is the named thing you deploy, every deployment produces an immutable revision, and a traffic block on the service says what percentage of requests each revision receives. If you export a Cloud Run service definition with gcloud run services describe SERVICE --format=yaml, what comes back is a Knative-shaped resource, complete with apiVersion: serving.knative.dev/v1.

This heritage matters for two practical reasons. First, portability: a workload described as a Knative service can run on the fully managed Cloud Run, or on your own GKE cluster with Knative installed, with the same YAML and largely the same behaviour. Teams that outgrow the managed platform's constraints have a real exit that does not involve rewriting anything. Second, the concepts are not proprietary inventions you have to memorise per vendor — revisions, traffic splitting, and concurrency targets are ideas from an open spec, and understanding them once pays off across platforms.

It is worth being precise about what the managed product actually is, though. Fully managed Cloud Run does not run your container on a Kubernetes cluster you could ever see. It runs on Google's internal serving infrastructure (sandboxed with gVisor on first-generation instances, a full microVM on second-generation), and Knative is the API shape, not the implementation. You get the Kubernetes-adjacent model without the cluster, the node pools, the upgrades, or the bill for idle nodes.

The request-driven model: scaling on concurrency, not CPU

Classic autoscaling watches resource signals. A VM group or a Kubernetes horizontal pod autoscaler looks at CPU load, decides the fleet is running hot, and adds capacity — a signal that lags the actual demand and says nothing about whether requests are queueing. Cloud Run scales on a much more direct signal: the number of requests currently in flight. Every service has a concurrency setting — the maximum number of requests one instance may handle simultaneously — and the scheduler's job is simple arithmetic: keep enough instances running that in-flight requests divided by instances stays at or below the target. Sixty concurrent requests against a concurrency of 80 needs one instance. Eight hundred needs ten. Zero needs zero.
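The scheduler arithmetic described above is a ceiling division, sketched here as a small shell function. The name instances_needed is made up for illustration, and the real autoscaler smooths its decisions over time rather than recomputing this on every request.

```shell
# instances_needed IN_FLIGHT CONCURRENCY
# Smallest instance count that keeps in-flight requests per instance at or
# below the concurrency setting (integer ceiling division).
instances_needed() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

instances_needed 60 80    # prints 1
instances_needed 800 80   # prints 10
instances_needed 0 80     # prints 0
```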

That last case is the headline feature. When no requests have arrived for a while, Cloud Run stops the last instance, and the service costs nothing at rest beyond storage for the image. The first request after that idle period triggers a cold start — the platform pulls and starts a container, waits for it to listen on the port, and then delivers the request. Between zero and the maximum (configurable, with a default ceiling of 100 instances and quota above that), the instance count tracks load with a granularity of one container, scaling out in seconds rather than the minutes a VM-based group needs.

Figure: instances (boxes, bottom) track in-flight requests (curve), one new instance per 80 concurrent requests at the default setting. A quiet service means zero instances and zero compute cost.

Notice what the signal is not. It is not requests per second, and it is not CPU. A service receiving 1,000 requests per second that each finish in 10 ms has only about ten requests in flight at any moment — one instance, comfortably. A service receiving ten requests per second that each take 30 seconds has 300 in flight and needs four instances at the default concurrency. In-flight count is throughput multiplied by latency (Little's law, doing quiet work in the background), and it is the only signal that directly answers the question the scaler cares about: is there a request right now with nowhere to go?
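The two examples above can be checked with a few lines of integer arithmetic. This is a sketch of Little's law only: the function names are hypothetical, latency is given in whole milliseconds, and in_flight rounds fractional results down.

```shell
# in_flight RPS LATENCY_MS
# Little's law: requests in flight = arrival rate x time each request spends
# in the system. Integer maths, so fractions are truncated.
in_flight() {
  echo $(( $1 * $2 / 1000 ))
}

# instances_for RPS LATENCY_MS CONCURRENCY
# Ceiling of (in-flight requests / concurrency), computed in one step so the
# truncation in in_flight does not hide a partially filled instance.
instances_for() {
  echo $(( ($1 * $2 + 1000 * $3 - 1) / (1000 * $3) ))
}

in_flight 1000 10          # prints 10
instances_for 1000 10 80   # prints 1
in_flight 10 30000         # prints 300
instances_for 10 30000 80  # prints 4
```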

Concurrency: the setting that changes everything

Here is the single biggest mental-model difference from Lambda, and it deserves to be stated plainly. A Lambda instance handles exactly one request at a time. A Cloud Run instance handles up to 80 by default, configurable from 1 to 1,000. If 80 requests arrive in the same moment, Lambda materialises 80 execution environments; Cloud Run starts one container and routes all 80 into it, exactly the way a normal web server on a normal machine would take them. Cloud Run is, in this sense, a platform for running ordinary multi-threaded, event-looped, connection-pooled servers — it just turns the number of them up and down for you.

Figure: the same burst of eight simultaneous requests. Cloud Run routes them into one process; Lambda gives each its own execution environment.

Work through what this does to cost. Suppose each request takes 100 ms of wall time but only 5 ms of actual CPU, because most of it is waiting on a database. On a one-request-per-instance platform you pay for 100 ms of an instance per request — the 95 ms of waiting is billed as if it were work, multiplied across every concurrent request. On Cloud Run, those waiting requests overlap inside one instance: 80 concurrent requests cost one instance-second per second, not eighty. For I/O-heavy traffic — which is most web traffic — the difference is not a few percent, it can be an order of magnitude. The flip side: if your requests are CPU-bound, the 80 requests compete for the instance's one or two vCPUs, latency climbs, and you should turn concurrency down until each request gets the CPU it needs.
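The cost comparison above reduces to one division. The sketch below assumes an idealised instance that stays fully packed at the given concurrency, and it ignores CPU contention; the function name is hypothetical. Note that 80 overlapping requests each using 5 ms of CPU per 100 ms would want roughly four vCPUs' worth of compute, so the concurrency you can really sustain depends on instance size.

```shell
# billed_ms_per_request WALL_MS CONCURRENCY
# Instance time billed per request when CONCURRENCY requests overlap on one
# instance for their whole wall time (an idealised, fully packed instance).
billed_ms_per_request() {
  awk -v wall="$1" -v conc="$2" 'BEGIN { printf "%.2f\n", wall / conc }'
}

billed_ms_per_request 100 1    # prints 100.00 (one request per instance)
billed_ms_per_request 100 80   # prints 1.25 (80 requests share the instance)
```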

Concurrency also rewrites the cold-start maths. On a 1:1 platform, a burst of N simultaneous requests against a cold service means N cold starts; on Cloud Run it means roughly N divided by 80, rounded up. One container boot absorbs the whole first wave. The same logic applies to everything expensive that happens once per process: loading a model into memory, JIT warm-up, opening a database connection pool, populating an in-process cache. With concurrency, these costs are amortised over many overlapping requests per instance, which is why connection pools and in-memory caches work normally in Cloud Run. On Lambda they are far less effective: an execution environment can reuse a connection or a cached value across consecutive invocations, but it never shares one between concurrent requests, so a burst of 80 requests opens 80 connections and warms 80 caches.

What concurrency demands of your code. Your server will receive overlapping requests in one process, so it must be safe under concurrency: no mutable global state shared across requests without care, no framework configured for a single worker. Memory is also shared: 80 requests each holding a 10 MB buffer is 800 MB inside one instance. If your code cannot tolerate that (a Puppeteer renderer, an FFmpeg job, a non-thread-safe library), set --concurrency=1 and Cloud Run behaves much like Lambda, one request per instance, with a cost profile to match.

Cold starts and min-instances

A cold start on Cloud Run is the time between "a request arrived and no instance can take it" and "your server accepted the connection." The platform's share — provisioning a sandbox, fetching the image — is well optimised: images are streamed from Artifact Registry so the container can start before every layer has arrived, and the infrastructure part typically lands in the hundreds of milliseconds. The slow part is usually yours. A Go binary that listens on its port in 50 ms cold-starts almost invisibly. A Spring Boot application that spends twelve seconds wiring beans makes every cold start a twelve-second stall, and no platform setting will hide that. The first lever is therefore always the same: make the container boot fast — smaller images, less work before listen(), and the --cpu-boost flag, which grants extra CPU during startup precisely so frameworks like Spring and Rails boot in a fraction of the time.

The second lever is minimum instances. Setting --min-instances=1 (or more) tells Cloud Run to keep that many instances warm even when traffic is zero. Requests after an idle night hit a live process instead of a cold one. The trade is exactly what you would guess: you are now paying for idle capacity, which is the thing scale-to-zero existed to avoid — though idle instances are billed at a reduced rate when CPU is not allocated, so a single warm instance is cheap insurance for a latency-sensitive service. The honest framing: min-instances converts Cloud Run from "pay only for use" to "pay a small floor for predictable latency," and for anything user-facing that floor is usually worth it.
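Both levers are plain flags on an existing service. As a sketch, the hypothetical wrapper below applies them together; the function name is invented, and the service name and region in the usage line are borrowed from the lab later on this page.

```shell
# keep_warm SERVICE REGION
# Hold one instance warm and grant extra CPU during container startup.
keep_warm() {
  gcloud run services update "$1" \
    --region="$2" \
    --min-instances=1 \
    --cpu-boost
}

# Usage: keep_warm hello us-central1
```

Setting --min-instances=0 with the same command returns the service to full scale-to-zero.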

And remember the structural advantage from the previous section: because one instance absorbs up to 80 concurrent requests, cold starts are rarer per request than on a 1:1 platform to begin with. A steady trickle of traffic keeps one instance alive indefinitely, and that one instance serves the trickle entirely warm. Cold starts on Cloud Run cluster at two moments — the first request after real idleness, and the leading edge of a sharp traffic spike — and min-instances addresses the first while fast boot addresses the second.

CPU during requests, or CPU always

The default billing mode has a sharp edge worth knowing before it cuts you. With request-based CPU allocation (the default), your container only has real CPU while it is handling at least one request. The moment the last in-flight response is sent, the instance keeps existing — warm, memory intact — but its CPU is throttled to almost nothing. Any background thread you spawned, any work you queued to finish "after" the response, any metrics flush on a timer: all of it freezes mid-air until the next request wakes the instance. This is the most common Cloud Run bug in the wild — fire-and-forget work that silently never runs — and the platform is behaving exactly as designed: you are paying for request handling, so that is when you get a CPU.

The alternative is instance-based allocation (--no-cpu-throttling), where the CPU stays allocated for the instance's whole lifetime. Background work runs normally, timers fire, goroutines finish. You pay for the full lifetime of each instance rather than per request — a different, flatter price curve that ends up cheaper for services that are busy most of the time and more expensive for spiky ones. Pick request-based when your work begins and ends with the HTTP exchange; pick instance-based when the process has a life of its own. And if the "background work" is substantial — a queue consumer, a nightly batch — the better answer is usually not a mode switch but a different shape of workload entirely, which is the next section.
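Switching between the two modes is a one-flag update in either direction. The wrappers below are a sketch with invented function names; the flags themselves are the ones named above and their documented inverse.

```shell
# cpu_always_on SERVICE REGION
# Instance-based allocation: CPU stays allocated for the instance lifetime.
cpu_always_on() {
  gcloud run services update "$1" --region="$2" --no-cpu-throttling
}

# cpu_during_requests SERVICE REGION
# Back to the request-based default: CPU is throttled between requests.
cpu_during_requests() {
  gcloud run services update "$1" --region="$2" --cpu-throttling
}

# Usage: cpu_always_on hello us-central1
```

Like any settings change, each call produces a new revision, so the switch can be rolled back the same way as a code deploy.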

Revisions and traffic splitting

Every deploy to a Cloud Run service creates a new revision: an immutable snapshot of the image plus every setting — environment variables, memory, concurrency, all of it. Revisions are never edited; changing anything creates another one. The service then holds a traffic table mapping revisions to percentages, and the routing layer enforces it. By default each deploy sends 100% of traffic to the new revision, but nothing forces that: deploy with --no-traffic and the new revision sits live but unvisited, reachable through a tag URL for smoke testing, while production traffic continues to the old one untouched.

Figure: one service, two revisions, traffic split by percentage at the routing layer. Rollback is moving the numbers, not redeploying.

This turns canary releases from an infrastructure project into a flag. Ship the new revision with no traffic, verify it against its tag URL, move 10% of real traffic to it with gcloud run services update-traffic, watch error rates and latency, then walk it to 50% and 100% — or snap everything back to the previous revision in one command, because that revision still exists, unmodified, ready to serve. Rollback on Cloud Run is not "deploy the old version again and hope the build reproduces"; it is a routing change that takes effect in seconds. Old revisions that receive no traffic and have no min-instances cost nothing, so the history is effectively free to keep around.

One subtlety: a traffic percentage is evaluated per request, not per user. A browser making five requests may land on both revisions, so the canary revision must be compatible with the stable one wherever they share state — same database schema, same session format, same cookie expectations. Percentage splits are a release-safety tool; they are not sticky A/B testing, though tag URLs plus your own routing can be bent into that shape when needed.

Jobs: the other half of the platform

Everything above describes Cloud Run services: things that answer requests and live as long as requests keep coming. Cloud Run jobs are for the other shape of work — a container that starts, does something, exits, and is done. No port, no HTTP, no listening. A job runs your container's entrypoint to completion, records success or failure based on the exit code, and retries failures up to a limit you set. Jobs can fan out: ask for 50 tasks and Cloud Run starts 50 copies of your container, each with a CLOUD_RUN_TASK_INDEX environment variable so it can pick its slice of the work — a cheap way to parallelise a backfill or a batch transform without standing up any orchestration.
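One way a task can pick its slice is to stride through the work by task index. The sketch below assumes the work items are simply numbered from zero; my_slice is a hypothetical helper, and it also reads CLOUD_RUN_TASK_COUNT, the companion variable Cloud Run sets to the total number of tasks.

```shell
# my_slice TOTAL_ITEMS
# Print the item numbers this task should process, one per line.
# Task k takes items k, k + count, k + 2*count, and so on. The defaults let
# the script run locally as a single task that handles everything.
my_slice() {
  total=$1
  index=${CLOUD_RUN_TASK_INDEX:-0}
  count=${CLOUD_RUN_TASK_COUNT:-1}
  i=$index
  while [ "$i" -lt "$total" ]; do
    echo "$i"
    i=$(( $i + $count ))
  done
}

# With CLOUD_RUN_TASK_INDEX=1 and CLOUD_RUN_TASK_COUNT=3,
# my_slice 8 prints 1, 4 and 7 on separate lines.
```

Striding keeps the slices balanced without any coordination between tasks, at the price of each task touching non-contiguous items.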

Jobs exist because people kept abusing services for batch work — wiring up an HTTP endpoint whose handler ran for an hour, fighting request timeouts and CPU throttling the whole way. If the work is triggered by a clock or a human rather than a request, it is a job. Cloud Scheduler can execute a job on a cron schedule, which covers the "nightly report" category that otherwise tempts people into keeping a VM around. Same images, same billing model, same deploy tooling; the only difference is that nothing needs to listen on a port.
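Creating and running a job looks much like deploying a service. The wrappers below are a sketch: the function names are invented, the job name and image are whatever you pass in, and the task count of 50 simply mirrors the fan-out example above.

```shell
# create_batch_job NAME IMAGE REGION
# Define a job that fans out to 50 tasks and retries each failed task up to
# three times.
create_batch_job() {
  gcloud run jobs create "$1" \
    --image="$2" \
    --region="$3" \
    --tasks=50 \
    --max-retries=3
}

# run_batch_job NAME REGION
# Start one execution of an existing job.
run_batch_job() {
  gcloud run jobs execute "$1" --region="$2"
}
```

Creating the job only records its definition; nothing runs, and nothing is billed, until an execution starts.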

Limits worth knowing before you commit

Cloud Run is a managed platform, and managed platforms have walls. The ones that change designs in practice: requests have a timeout, five minutes by default and one hour at maximum — anything longer must become a job or be split. The filesystem is in-memory: writable, but every byte written counts against the instance's memory limit and vanishes when the instance dies, so it is scratch space, not storage (volume mounts can attach Cloud Storage buckets or NFS when you need a filesystem). Instances top out at 8 vCPU and 32 GiB of memory on second-generation, which is plenty for a web server and not enough for some data work. Instances are stateless and disposable by contract — the platform will kill them during scale-down or maintenance with only a SIGTERM and a ten-second grace period, so anything in memory that matters must already be elsewhere. WebSockets and HTTP/2 streaming work, but a connection still lives inside the request timeout, so clients must reconnect at least hourly. None of these are gotchas once known; all of them are gotchas the first time.
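The SIGTERM contract is the limit most worth rehearsing before production. Here is a minimal sketch of a process that persists its state when the signal arrives; graceful_worker and the state file are hypothetical, and real cleanup has to fit inside the ten-second grace period mentioned above.

```shell
# graceful_worker STATE_FILE
# Loop until SIGTERM arrives, then write state to STATE_FILE and exit 0.
graceful_worker() {
  state_file=$1
  # The handler must finish quickly: the platform allows only a short grace
  # period before the instance is stopped for good.
  trap 'echo "flushed" > "$state_file"; exit 0' TERM
  while :; do
    sleep 1 &   # sleep in the background so the trap fires promptly
    wait $!
  done
}

# Usage: graceful_worker /tmp/state &   then send SIGTERM to that process
```

The same pattern applies in any language: register the handler at startup, and keep anything that matters small enough to flush in seconds.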

Cloud Run, GKE, or Cloud Functions?

The honest decision is narrower than vendors make it sound. Within GCP, three compute options cover most stateless workloads, and the choice follows from how much of the machine you need to see.

Pick	When	You give up
Cloud Functions	Small event glue: a bucket trigger, a Pub/Sub handler, a webhook. Code measured in one file.	Control of the runtime, the process model, and the packaging; multi-route services get awkward
Cloud Run	Almost any stateless HTTP service or batch container. The sensible default for new services on GCP.	Daemons and agents, GPUs in most regions, kernel access, anything that must outlive a request
GKE	Stateful workloads, sidecars and operators, custom networking, GPU fleets, or dozens of services where cluster economics win	The zero-ops model: you now own upgrades, node pools, and capacity planning

A reasonable rule: start with Cloud Run and let a concrete limit push you off it. If your service is stateless HTTP and fits in 32 GiB, Cloud Run does what GKE would do with a fraction of the operational surface. Move to GKE when you need things the request model cannot express — long-lived daemons, StatefulSets, DaemonSets, custom schedulers — or when you are running enough steady workloads that paying for a cluster beats paying per request. Cloud Functions, after its second generation, actually runs on Cloud Run's infrastructure anyway; it survives as the ergonomic wrapper for tiny event handlers where even writing a Dockerfile is ceremony. The platform-level view of where each service sits in GCP's stack is in the GCP foundations page.

What it costs, and why concurrency is the lever

Cloud Run bills three meters: vCPU-seconds, GiB-seconds of memory, and a flat per-request fee (around 40 cents per million). In the default request-based mode, the CPU and memory meters run only while an instance has requests in flight, rounded up to the nearest 100 ms, and a monthly free tier absorbs small workloads entirely. The numbers per second are unremarkable; what makes bills small or large is the concurrency arithmetic from earlier. Eighty overlapping requests on one instance bill one instance's meter, so the effective cost per request falls almost linearly as concurrency rises — provided your requests spend their time waiting rather than computing. A service doing 100 ms of pure CPU per request gains nothing from concurrency and prices out like any other compute; a service doing 5 ms of CPU inside 100 ms of wall time is nearly free at scale.
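Two pieces of that billing arithmetic are easy to sketch. The function names are hypothetical, the 0.40 default is the approximate per-million figure quoted above rather than an authoritative price, and the free tier is ignored.

```shell
# round_up_100ms ACTIVE_MS
# Billable time for a stretch of active instance time, rounded up to the
# nearest 100 ms.
round_up_100ms() {
  echo $(( ($1 + 99) / 100 * 100 ))
}

# request_fee_usd REQUESTS [USD_PER_MILLION]
# Flat per-request charge. The default rate is only the rough figure from
# the text; pass your own rate as the second argument.
request_fee_usd() {
  awk -v n="$1" -v rate="${2:-0.40}" 'BEGIN { printf "%.2f\n", n * rate / 1000000 }'
}

round_up_100ms 130        # prints 200
round_up_100ms 100        # prints 100
request_fee_usd 5000000   # prints 2.00
```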

Two settings move the bill in the other direction, both deliberately. Min-instances buy latency with a steady idle charge. Instance-based CPU allocation buys background execution by paying for instance lifetime instead of request time — and for instances that are busy nearly all the time it is actually the cheaper mode, with committed-use discounts available on top. The mental model that survives contact with a real invoice: you are renting fractions of machines, the fraction is determined by how well requests pack into instances, and concurrency is the packing knob.

Lab: deploy, split traffic, watch it scale to zero

Twenty minutes, one terminal, real behaviour. You will deploy Google's public hello container, ship a second revision alongside it, split traffic between the two, watch the split happen, and confirm scale to zero. Everything fits in the free tier; the teardown at the end removes all of it. You need a GCP project with billing enabled and the gcloud CLI authenticated.

1. Set up and deploy. Enable the API and deploy the public image. The deploy creates the service, builds nothing, and gives you a URL.

gcloud config set project YOUR_PROJECT_ID
gcloud services enable run.googleapis.com

gcloud run deploy hello \
  --image=us-docker.pkg.dev/cloudrun/container/hello \
  --region=us-central1 \
  --allow-unauthenticated

When it finishes, note the service URL it prints, then confirm the service answers and look at what one revision means.

URL=$(gcloud run services describe hello \
  --region=us-central1 --format='value(status.url)')
curl -s "$URL" | head -5

gcloud run revisions list --service=hello --region=us-central1

You should see a single revision, hello-00001-something, taking 100% of traffic.

2. Ship a second revision, with no traffic. Change an environment variable — any config change makes a new revision — and hold it out of rotation with --no-traffic. The hello container prints its environment, so the two revisions are easy to tell apart.

gcloud run deploy hello \
  --image=us-docker.pkg.dev/cloudrun/container/hello \
  --region=us-central1 \
  --update-env-vars=COLOR=green \
  --no-traffic --tag=candidate

List the revisions again: two now exist, and the old one still owns 100% of traffic. The new one is reachable only on its tag URL — the printed URL beginning with candidate---. Hit that URL to smoke-test the candidate while production sees nothing.

3. Canary by percentage. Move 20% of live traffic to the new revision, then prove the split with repeated requests.

gcloud run services update-traffic hello \
  --region=us-central1 --to-tags=candidate=20

for i in $(seq 1 20); do curl -s "$URL" | grep -q 'COLOR=green' && echo green; done | wc -l

Out of 20 requests you should see roughly four hit the green revision. The split is per request, not per client, which is the subtlety from the traffic-splitting section made visible. Promote it fully, and notice that rollback would be the same update-traffic command pointed back at the old revision (for example with --to-revisions=OLD_REVISION=100).

gcloud run services update-traffic hello \
  --region=us-central1 --to-latest

gcloud run services describe hello --region=us-central1 \
  --format='value(status.traffic)'

4. Watch scale to zero. Generate a small burst, then go quiet. The instance count is visible in the metrics, but the cheapest observation is latency: a request against a warm instance returns in tens of milliseconds, and the first request after ten to fifteen idle minutes pays the cold start.

for i in $(seq 1 50); do curl -s -o /dev/null "$URL" & done; wait

time curl -s -o /dev/null "$URL"
# wait ~15 minutes with no traffic, then:
time curl -s -o /dev/null "$URL"

The second timing should be visibly slower — that gap is the cold start, and during the quiet period the service cost you nothing. In the console, the Metrics tab for the service shows "container instance count" rising for the burst and falling back to zero.

5. Teardown. One service, one command. Deleting the service removes all its revisions; nothing else was created.

gcloud run services delete hello --region=us-central1 --quiet

If you want to keep going. Redeploy with --concurrency=1 and repeat the burst — watch the instance count jump to dozens instead of one or two. Then try --min-instances=1 and confirm the cold start disappears. Both are one-flag experiments, and both make the cost model tangible in a way no pricing page does.
