Rate limiting.
A bouncer at the door letting people in at a steady pace, not all at once.
A service can only handle so many requests at once before it buckles. Rate limiting is the bouncer on the door: it caps how many requests any one caller can make in a given window of time.
Go over the limit and you are turned away — usually with a "429 Too Many Requests" — until the window resets and you are let back in.
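As a rough illustration of the tally described above, here is a minimal fixed-window counter in Python. The class name `FixedWindowLimiter`, its parameters, and the injectable `clock` are all assumptions made for this sketch, not part of any particular library.

```python
import time


class FixedWindowLimiter:
    """Allow at most `limit` requests per caller in each fixed time window.

    Hypothetical sketch: names and the (status, retry_after) return shape
    are illustrative assumptions.
    """

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested
        self.tallies = {}   # caller -> (window index, requests counted)

    def check(self, caller):
        """Return (status_code, seconds_until_window_resets)."""
        now = self.clock()
        window = int(now // self.window_seconds)
        seen_window, count = self.tallies.get(caller, (window, 0))
        if seen_window != window:
            count = 0  # a new window has started, so the tally resets
        if count >= self.limit:
            retry_after = (window + 1) * self.window_seconds - now
            return 429, retry_after
        self.tallies[caller] = (window, count + 1)
        return 200, 0.0
```

Each caller gets its own tally, so one caller hitting the cap does not affect anyone else. A real service would also need to evict stale tallies and share state across servers, which this sketch leaves out.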
- In! In! In!
A service buckles when everyone shows up at once and pounds on it.
- One at a time, folks.
So you put a bouncer on the door who admits people at a steady pace.
For each caller, he keeps a tally: how many requests this minute, against the cap.
- Come back in a minute.
Over the cap, you are turned away with a 429 until the window resets.
The token-bucket trick: each request spends a token, and tokens drip back at a fixed rate.
- Saved these up.
A quiet caller banks tokens, so it can briefly burst — generous, but never over the average.
What it protects against
Rate limits guard a service from being overwhelmed, whether by accident or on purpose: a buggy client stuck in a loop, a sudden surge of real users, or an attacker hammering an endpoint to knock it over. They also keep shared resources fair, so one heavy user cannot starve everyone else.
A polite limiter tells you how long to wait by sending a Retry-After header along with the 429, and well-behaved clients back off for that long instead of pounding the door.
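A well-behaved client might look something like the sketch below. The function `call_with_backoff` and the `send` callable (assumed to return a status code and a dict of headers) are hypothetical names for illustration. The sketch only handles a Retry-After value given in whole seconds; the header can also carry an HTTP date, which is not handled here, and in that case it falls back to exponential backoff.

```python
import time


def parse_retry_after(value):
    """Return the wait in seconds, or None if the value is not whole seconds."""
    try:
        return max(0.0, float(int(value)))
    except (TypeError, ValueError):
        return None


def call_with_backoff(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send() until it stops returning 429 or attempts run out.

    `send` is an assumed callable returning (status_code, headers_dict).
    """
    for attempt in range(max_attempts):
        status, headers = send()
        if status != 429:
            return status
        if attempt == max_attempts - 1:
            break  # out of attempts, do not sleep again
        delay = parse_retry_after(headers.get("Retry-After"))
        if delay is None:
            # No usable hint from the server: double the wait each attempt.
            delay = base_delay * 2 ** attempt
        sleep(delay)
    return 429
```

Production clients usually add random jitter to the delay so that many clients turned away together do not all come back at the same instant.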
Allowing bursts without losing control
The cleverness is in being fair to real, bursty traffic. The popular "token bucket" gives each caller a bucket that refills at a steady rate up to a fixed capacity; each request spends a token, and a request that finds the bucket empty is refused. A quiet caller builds up tokens and can briefly burst, but the burst is bounded by the bucket's capacity and the sustained rate can never exceed the refill rate. It feels generous to humans while still capping the total load.
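The token bucket can be sketched in a few lines of Python. The class name `TokenBucket`, its parameters, and the choice to start with a full bucket are assumptions for this illustration; real limiters differ in details such as locking and per-caller storage.

```python
import time


class TokenBucket:
    """One caller's bucket: holds up to `capacity` tokens, refilled steadily.

    Hypothetical sketch, not a specific library's API.
    """

    def __init__(self, capacity, refill_per_second, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock                # injectable so it can be tested
        self.tokens = float(capacity)     # assumption: the bucket starts full
        self.last_refill = clock()

    def allow(self):
        """Spend one token if available; return True if the request may pass."""
        now = self.clock()
        elapsed = now - self.last_refill
        # Drip tokens back in for the time that passed, never above capacity.
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.refill_per_second)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

With `capacity=3` and `refill_per_second=1`, a caller that has been quiet can send three requests back to back, then is held to one request per second. A service would typically keep one bucket per caller, for example in a dictionary keyed by API key.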