Walkthrough: Designing a Rate Limiter

A full candidate's-eye walkthrough of the rate limiter system design question — from single-node DB-based counting through distributed Redis counters, failure handling, multi-tenant rule engines, and cross-region extension.

The problem

You’re asked to design a distributed rate limiting service — a standalone microservice that other services call to enforce usage quotas. The interviewer says something like: “Design a rate limiter service that multiple teams can use to protect their APIs from abuse and enforce per-client quotas.”

Sounds like a counter and a threshold check. It isn’t. This is the canonical “distributed counting under latency constraints” problem, and the traps are in three places: race conditions when multiple gateway instances check the same counter simultaneously, the fail-open vs fail-closed decision when your counter store is unreachable, and the tension between global accuracy and per-request latency when your service spans regions. The interviewer is watching for whether you treat this as a simple if count > limit check or recognize it as a coordination problem.

Below is how I’d walk through this, roughly in the order I’d speak the words.

1. Clarify before you design

First 3–5 minutes. Resist the urge to start drawing.

Questions I’d ask:

  • Architecture. Is this an embedded library, a sidecar, or a standalone service? (Standalone microservice — other teams call it as a dependency. This means we own the API contract, storage, and availability SLA.)
  • Multi-tenant. How many client services will use this? Tens? Hundreds? (Hundreds — the service must isolate tenants and let each configure their own rules.)
  • Limiting dimensions. What do we limit by — IP, user ID, API key, endpoint, some combination? (Multiple dimensions simultaneously. A single request might be checked against per-IP and per-user limits.)
  • Scale. How many requests per second flow through the limiter? (500K+ QPS across all tenants.)
  • Scope. Single data center to start, or globally distributed? (Start single DC, extend to multi-region later.)
  • Hard vs soft limits. Do we reject immediately or allow grace? (Hard reject with 429 Too Many Requests.)
  • Rule management. Where do rules live? (Part of the service — teams configure rules through our API.)

The multi-tenancy and multi-dimension requirements matter most here. Multi-tenancy means rule isolation and fair resource allocation. Multi-dimension means a single request may trigger multiple counter lookups — latency multiplies.

Say the interviewer confirms: standalone multi-tenant service, 500K+ QPS, multi-dimension limiting (IP, user, endpoint), single DC first, hard 429 reject, rules managed within the service.

2. Capacity estimate

Brief. The point is to size the problem.

  • 500K req/sec across all tenants → ~43B requests/day.
  • Each request checks 2–3 dimensions on average → 1–1.5M counter lookups/sec.
  • Counter storage: with token bucket, each counter is ~20 bytes (tokens remaining + last refill timestamp). At 100M active users × 3 dimensions = 300M keys × 20 bytes ≈ 6 GB of counter state. Fits comfortably in memory.
  • Rule storage: thousands of rules across all tenants, each ~200 bytes. Trivial — fits in a single-node cache.
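
A quick sanity check of the arithmetic (Python; the inputs are the assumptions above):

# Back-of-envelope for the capacity numbers above.
qps = 500_000                             # requests/sec across all tenants
print(f"{qps * 86_400:.1e} req/day")      # ~4.3e+10 ≈ 43B requests/day
print(f"{qps * 3 / 1e6:.1f}M lookups/s")  # worst case 3 dimensions → 1.5M/sec
keys = 100_000_000 * 3                    # active users × dimensions
print(f"{keys * 20 / 1e9:.1f} GB")        # 20-byte counters → 6.0 GB of state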

I’d say out loud: “This tells me two things. One — counter state fits in memory, so an in-memory store like Redis is viable. Two — the bottleneck is counter lookup throughput, not storage. At 1M+ lookups/sec we need either a fast store or local caching with accuracy trade-offs.”

3. API design

Two surfaces: the check endpoint (called on every request, latency-critical) and the rule management API (low-volume, admin).

# Rate limit check (called by client services on every inbound request)
POST /v1/check
  body:    { service_id, endpoint, identifiers: { ip, user_id, api_key } }
  returns: { allowed: bool, remaining: int, retry_after_ms?: int }
  errors:  429 — rejected (returned to the original caller by the client service)

# Rule management
PUT /v1/rules/{rule_id}
  body:    { service_id, dimension, endpoint_pattern, limit, window_sec }
  returns: 200 OK

GET /v1/rules?service_id=X
  returns: [{ rule_id, dimension, endpoint_pattern, limit, window_sec }]

DELETE /v1/rules/{rule_id}
  returns: 204 No Content

Example rule: “For service payments-api, on endpoint /transfer, allow 10 requests per second per IP and 100 requests per second per user.”
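
Expressed against the rule API above (illustrative rule IDs and bodies, following the PUT contract):

PUT /v1/rules/transfer-per-ip
  { "service_id": "payments-api", "dimension": "ip",
    "endpoint_pattern": "/transfer", "limit": 10, "window_sec": 1 }

PUT /v1/rules/transfer-per-user
  { "service_id": "payments-api", "dimension": "user_id",
    "endpoint_pattern": "/transfer", "limit": 100, "window_sec": 1 }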

Decisions worth calling out:

  • Atomic check-and-decrement. The /check endpoint both evaluates and decrements in one call. Splitting into “check then decrement” introduces a TOCTOU race where two requests both read “1 remaining” and both proceed.
  • Multi-dimension in one call. The client passes all identifiers; we evaluate all matching rules and return the most restrictive result. One network round-trip, not one per dimension.
  • service_id for tenant isolation. Every rule and counter is scoped to a tenant. One team’s misconfiguration can’t affect another’s.

4. Algorithm choice

There are four canonical rate limiting algorithms. I’ll compare briefly and pick one — for full details on each, see the Rate Limiting Algorithms reference.

| Algorithm | Memory per key | Burst handling | Boundary spike? | Complexity |
|---|---|---|---|---|
| Fixed window counter | 8 bytes | None | Yes — 2x burst at boundary | Trivial |
| Sliding window log | O(n) timestamps | Precise | No | High memory |
| Sliding window counter | 16 bytes | Approximated | Minimal | Low |
| Token bucket | 16 bytes | Natural burst allowance | No | Low |

I’d pick token bucket for this service. Here’s why:

  • Natural burst tolerance — a client can spend accumulated tokens in a burst, which matches real API usage patterns (batch calls followed by silence).
  • O(1) memory per counter — just tokens_remaining and last_refill_ts.
  • No background refill thread — tokens are calculated lazily on each check: tokens = min(capacity, tokens + (now - last_refill) × rate).
  • Well-understood, easy to explain to teams configuring rules (“you get 100 tokens, refilled at 10/sec, bucket holds max 100”).

For the rest of this walkthrough, the algorithm is token bucket. Each counter stores two values: how many tokens remain, and when we last refilled.
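
As a minimal single-process sketch of that behavior (Python; the two stored fields mirror the counter state above, the class and names are illustrative):

import time

class TokenBucket:
    """Token bucket with lazy refill — state is just (tokens, last_refill)."""

    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity        # max tokens (burst size)
        self.refill_rate = refill_rate  # tokens per second
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Lazy refill: credit elapsed time, capped at capacity. No background thread.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# “You get 100 tokens, refilled at 10/sec, bucket holds max 100.”
bucket = TokenBucket(capacity=100, refill_rate=10)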

5. Design — Phase 1: Database-backed (the simple version)

Start simple. This is what I’d draw first in an interview — a correct design that obviously works but won’t scale. Then we evolve it.

flowchart LR
  Client[Client Service] --> RLS[Rate Limiter Service]
  RLS --> DB[(Database)]
  RLS --> Rules[Rules Table]

How it works

  1. Client service calls POST /v1/check with identifiers.
  2. Rate limiter loads matching rules from the rules table.
  3. For each rule, looks up the counter row in the database: SELECT tokens, last_refill FROM counters WHERE counter_key = ?
  4. Calculates refilled tokens: tokens + elapsed × rate, capped at capacity.
  5. If tokens >= 1: decrement and update the row. Return allowed: true.
  6. If tokens < 1: return allowed: false, retry_after_ms: ....

Data model

CREATE TABLE rules (
  rule_id     UUID PRIMARY KEY,
  service_id  VARCHAR NOT NULL,
  dimension   VARCHAR NOT NULL,  -- 'ip', 'user_id', 'api_key'
  endpoint    VARCHAR NOT NULL,  -- pattern like '/transfer' or '*'
  capacity    INT NOT NULL,      -- max tokens (burst size)
  refill_rate FLOAT NOT NULL,    -- tokens per second
  window_sec  INT NOT NULL
);

CREATE TABLE counters (
  counter_key   VARCHAR PRIMARY KEY,  -- e.g. 'payments-api:/transfer:ip:1.2.3.4'
  tokens        FLOAT NOT NULL,
  last_refill   TIMESTAMP NOT NULL
);

The race condition

Two requests arrive simultaneously for the same counter. Both read tokens = 1. Both compute “1 >= 1, allow.” Both decrement to 0. Actual admission: 2 requests when we should have allowed 1.

Fix: wrap the read-compute-write in a transaction with row-level locking:

BEGIN;
SELECT tokens, last_refill FROM counters WHERE counter_key = ? FOR UPDATE;
-- compute new token count
UPDATE counters SET tokens = ?, last_refill = ? WHERE counter_key = ?;
COMMIT;

This is correct. It is also slow — row-level locks under 500K QPS will crush any relational database, and contention on hot keys (popular IPs, heavy users) is the worst of it: every request for a hot key serializes on the same row lock, so throughput degrades most exactly where load is highest.

I’d say: “This works at low scale — maybe 1K QPS. But our requirement is 500K+. The database becomes the bottleneck because every check is a transactional write to a shared row. We need a faster counter store.”

This motivates Phase 2.

6. Design — Phase 2: Redis-backed with failure handling

flowchart LR
  Client[Client Service] --> RLS[Rate Limiter Service]
  RLS --> Redis[(Redis Cluster)]
  RLS --> RulesCache[Rules Cache]
  Admin[Admin API] --> RulesDB[(Rules DB)]
  RulesDB -.->|push/poll| RulesCache

Why Redis

  • Atomic operations. Redis executes Lua scripts atomically — the read-compute-write for token bucket becomes a single command with no race condition and no locking overhead.
  • In-memory speed. Sub-millisecond latency per operation. At 1M+ ops/sec with pipelining, a modest Redis cluster handles our load.
  • TTL for automatic cleanup. Counters that go idle expire naturally. No GC job needed.
  • Cluster mode. Shard by counter key across nodes for horizontal scale.

The Lua script (atomic token bucket)

-- KEYS[1] = counter key
-- ARGV[1] = capacity, ARGV[2] = refill_rate
-- ARGV[3] = now_ms (take from Redis TIME rather than instance clocks to avoid skew)

local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

local data = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(data[1]) or capacity
local last_refill = tonumber(data[2]) or now

-- Calculate refill
local elapsed = (now - last_refill) / 1000
local new_tokens = math.min(capacity, tokens + elapsed * refill_rate)

if new_tokens >= 1 then
  new_tokens = new_tokens - 1
  redis.call('HSET', key, 'tokens', new_tokens, 'last_refill', now)
  redis.call('PEXPIRE', key, math.ceil(capacity / refill_rate * 1000) + 1000)
  return {1, math.floor(new_tokens)}  -- allowed, remaining
else
  local retry_after = math.ceil((1 - new_tokens) / refill_rate * 1000)
  return {0, 0, retry_after}  -- rejected, remaining, retry_after_ms
end

This runs atomically on a single Redis node — no race condition, no distributed lock. The counter key is {service_id}:{endpoint}:{dimension}:{value}; the braces around service_id form a Redis Cluster hash tag, so all of a tenant’s counters map to the same slot and stay co-located, which also lets us pipeline them in one round-trip.

Multi-dimension evaluation

A single /check call might match multiple rules (per-IP and per-user). The rate limiter service:

  1. Loads matching rules from the local cache.
  2. Fires all Lua scripts in a Redis pipeline — one round-trip, multiple evaluations.
  3. Returns the most restrictive result. If any dimension says “reject,” the request is rejected.

Pipeline keeps latency at one round-trip regardless of how many dimensions we check.
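
In code, a sketch of that flow with redis-py (TOKEN_BUCKET_LUA holds the script above; the rule and identifier shapes are assumptions, not a fixed API):

import time
import redis

r = redis.Redis()
TOKEN_BUCKET_LUA = open("token_bucket.lua").read()  # the Lua script above
check_bucket = r.register_script(TOKEN_BUCKET_LUA)

def check(rules, identifiers):
    """Evaluate every matching rule in one pipelined round-trip."""
    now_ms = int(time.time() * 1000)
    pipe = r.pipeline(transaction=False)
    for rule in rules:
        # {service_id} hash tag keeps a tenant's counters in one cluster slot.
        key = (f"{{{rule.service_id}}}:{rule.endpoint}:"
               f"{rule.dimension}:{identifiers[rule.dimension]}")
        check_bucket(keys=[key],
                     args=[rule.capacity, rule.refill_rate, now_ms],
                     client=pipe)
    results = pipe.execute()
    # Most restrictive result wins: any rejection rejects the request.
    allowed = all(res[0] == 1 for res in results)
    remaining = min(res[1] for res in results)
    return allowed, remaining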

Rules cache

Rules change infrequently (teams update them maybe a few times a day). The rate limiter keeps rules in a local in-memory cache, refreshed every few seconds from the rules database. This keeps the hot path — the /check endpoint — free of database calls.

Failure handling

Redis is now a critical dependency. What happens when it’s unavailable?

Fail-open (default). If Redis is unreachable, allow the request. The rate limiter is a protective layer, not a gatekeeper — service availability matters more than enforcement during a brief outage. The system is “unprotected” for the duration of the failure, but no legitimate traffic is blocked.

Fail-closed (opt-in per rule). For sensitive endpoints — payment processing, expensive operations — teams can mark rules as fail_closed: true. If Redis is down and the rule is fail-closed, reject with 429. This is appropriate when over-admission has direct financial cost.

Local fallback limiter. During Redis failure, the rate limiter falls back to a per-instance in-memory token bucket with conservative limits (e.g., global limit / number of instances). This is imprecise — each instance limits independently, so aggregate admission is higher than intended — but it provides a safety floor.

Detection and recovery:

  • Health check pings Redis every second. Three consecutive failures trigger the failover.
  • Once Redis recovers, counters resume from wherever they are. Since token bucket state is self-correcting (tokens refill over time), there’s no need to reconcile — stale counters just have fewer tokens than they should, which is conservative and safe.
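
Sketched as a decision wrapper (Python, reusing the TokenBucket from section 4; NUM_INSTANCES and the redis_check callable are assumptions):

import redis

NUM_INSTANCES = 10   # assumed fleet size; discovered from a service registry in practice
local_buckets = {}   # per-instance fallback state, keyed by counter key

def check_with_fallback(rule, key, redis_check):
    """redis_check runs the Lua evaluation; on failure, apply the rule's policy."""
    try:
        return redis_check(rule, key)
    except redis.ConnectionError:
        if rule.fail_closed:
            return False  # opt-in fail-closed: reject (429) while Redis is down
        # Fail-open with a safety floor: a conservative per-instance bucket
        # sized at roughly (global limit / number of instances).
        bucket = local_buckets.setdefault(key, TokenBucket(
            capacity=max(1.0, rule.capacity / NUM_INSTANCES),
            refill_rate=rule.refill_rate / NUM_INSTANCES))
        return bucket.allow()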

Hot-key mitigation

Some counters are extremely hot — a global per-service limit, or a popular API’s per-endpoint counter. A single Redis node serving millions of reads/sec for one key becomes a bottleneck.

Mitigation: local token bucket with periodic sync. For counters exceeding a QPS threshold, the rate limiter instance keeps a local token bucket and periodically synchronizes with Redis (every 100ms or every N requests). This trades precision for throughput — the counter may over-admit by up to (sync_interval × rate × num_instances) tokens during the sync gap.

I’d say explicitly: “For most counters, the Redis round-trip is fine — sub-millisecond. For the top 0.01% of hot keys, we switch to local + sync. This is a deliberate accuracy-latency trade-off that I’d want to make configurable per rule.”
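
One way the local-plus-sync scheme could look for a single hot key (sketch, again reusing TokenBucket; the redis_spend helper — report tokens spent locally, receive the authoritative remainder — is hypothetical):

class HotKeyLimiter:
    """Local bucket for a hot counter, reconciled with Redis on a timer."""

    def __init__(self, key: str, capacity: float, refill_rate: float):
        self.key = key
        self.local = TokenBucket(capacity, refill_rate)  # from section 4
        self.spent_since_sync = 0

    def allow(self) -> bool:
        if self.local.allow():
            self.spent_since_sync += 1
            return True
        return False

    def sync(self, redis_client) -> None:
        # Called every ~100ms. Push local spending to the shared counter, then
        # adopt the authoritative balance. Over-admission during the gap is
        # bounded by sync_interval × rate × num_instances.
        spent, self.spent_since_sync = self.spent_since_sync, 0
        remaining = redis_spend(redis_client, self.key, spent)  # hypothetical helper
        self.local.tokens = min(self.local.tokens, remaining)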

7. Design — Phase 3: Multi-region / global rate limiting

flowchart TB
  subgraph US-East
    Client1[Client] --> RLS1[Rate Limiter]
    RLS1 --> Redis1[(Redis)]
  end
  subgraph EU-West
    Client2[Client] --> RLS2[Rate Limiter]
    RLS2 --> Redis2[(Redis)]
  end
  Redis1 <-->|async sync| Redis2

When the service goes multi-region, the question becomes: does “100 requests per second per user” mean 100 globally, or 100 per region?

Option A: Per-region budgets (simple, fast).

Split the global limit across regions proportionally. If US-East handles 60% of traffic, it gets 60 of the 100 tokens. Each region enforces independently against its local Redis — no cross-region coordination, no added latency.

Downside: if traffic shifts (a region goes down, users travel), the budget split is wrong. Requires periodic rebalancing based on observed traffic distribution.

Option B: Async cross-region sync (accurate, eventually consistent).

Each region maintains its own Redis and syncs counter deltas to other regions asynchronously (every 1–5 seconds). Global token count is the sum across regions. Over-admission during the sync window is bounded by sync_interval × rate × (num_regions − 1) — the spending other regions haven’t yet seen.

For a limit of 100 req/sec with 3 regions syncing every second, the worst case is all three regions independently admitting the full 100 in the same window — up to ~200 extra requests for one sync interval, after which the exchanged deltas correct the drift. Typical traffic sits far below this adversarial bound, and for most use cases a transient, bounded error is acceptable.

Option C: Single global Redis with cross-region latency.

All regions talk to one Redis cluster. Correct, but adds 50–150ms of cross-region latency to every rate limit check. At 500K QPS, this is usually unacceptable.

I’d recommend Option B — async sync gives us global accuracy within a bounded error margin, without adding latency to the hot path. For teams that need strict global limits (billing, compliance), offer Option C as an opt-in with the latency caveat documented.

Conflict resolution

When syncing counters across regions, we need a merge strategy. Token bucket counters are monotonically decreasing between refills, so the merge is: take the minimum tokens across regions (most conservative), and the maximum last_refill timestamp. This ensures we never over-count available tokens.
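
The merge itself is two lines (sketch; counters is the list of per-region (tokens, last_refill) states):

def merge_counters(counters):
    """Cross-region merge that never over-counts available tokens."""
    tokens = min(c.tokens for c in counters)            # most conservative balance
    last_refill = max(c.last_refill for c in counters)  # newest refill wins
    return tokens, last_refill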

8. Tradeoffs and alternatives

| Decision | We chose | Alternative | Why not |
|---|---|---|---|
| Architecture | Standalone service | Embedded library | Can’t enforce global limits; no centralized rules |
| Counter store | Redis | Database | Row locking collapses at 500K QPS |
| Algorithm | Token bucket | Sliding window counter | Token bucket handles bursts naturally; simpler mental model for teams |
| Failure mode | Fail-open (default) | Fail-closed | Availability > enforcement for most use cases |
| Multi-region | Async sync (Option B) | Single global store | Cross-region latency unacceptable for hot path |
| Hot keys | Local bucket + periodic sync | Always hit Redis | Sub-ms latency required for top 0.01% keys |

9. Failure modes

| What breaks | Detection | Recovery |
|---|---|---|
| Redis node failure | Health check, cluster MOVED responses | Cluster failover to replica; local fallback during transition |
| Full Redis cluster down | Health check timeout | Fail-open with local fallback limiter; alert |
| Rate limiter service crash | Load balancer health check | Stateless — new instance picks up immediately from Redis |
| Rule misconfiguration (limit=0) | Spike in 429 rate across a service | Canary rule deployment; automatic rollback on 429 spike |
| Clock skew between instances | Token refill divergence | Use Redis TIME command, not local clock |
| Network partition (multi-region) | Sync lag exceeds threshold | Fall back to per-region budget; alert for manual rebalancing |

10. Extensions

  • Adaptive rate limiting. Adjust limits dynamically based on backend health — when downstream services are degraded, tighten limits automatically to shed load.
  • Priority-based limiting. Premium tenants get higher limits; during contention, degrade free-tier clients first.
  • Rate limit headers. Return X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset headers so clients can self-throttle before hitting 429.
  • Analytics and observability. Track rejection rates per service, per endpoint, per dimension. Surface dashboards so teams can tune their rules.
  • Quota management. Extend from rate (requests/sec) to quota (requests/month) for usage-based billing.

11. Summary

The key decisions and why:

  • Standalone multi-tenant service — centralized enforcement, shared infrastructure, teams self-serve via rule API.
  • Token bucket algorithm — burst-friendly, O(1) per check, lazy refill means no background threads.
  • Redis with Lua scripts — atomic check-and-decrement eliminates race conditions without distributed locks.
  • Progressive evolution — DB-backed (correct but slow) → Redis-backed (fast and correct) → multi-region (global with bounded error).
  • Fail-open default — availability over enforcement, with opt-in fail-closed for sensitive endpoints.
  • Async cross-region sync — global accuracy within bounded error, no cross-region latency on the hot path.