The problem
You’re asked to design a system that returns the top-K most frequent items over a time window at high throughput — trending hashtags on Twitter, top-K viewed products on Amazon, live leaderboards in a game. The interviewer says something like: “Design a system that returns the top 100 trending hashtags over the last hour, with freshness in seconds.”
Sounds like sort-and-take-top-K. It isn’t. This is the canonical “approximation at scale” problem, and the trap is in the first thirty seconds — candidates who reach for a hash map + sort will be out of memory before the second follow-up question. Candidates who reach for Count-Min Sketch and a heavy-hitters algorithm pass the first gate. The interviewer is testing whether you recognize that exact counting at event-stream scale is infeasible and that the right answer is a bounded-error approximation with explicit accuracy guarantees.
Below is how I’d walk through this, roughly in the order I’d speak the words.
1. Clarify before you design
First 3–5 minutes. Resist the urge to start drawing.
Questions I’d ask:
- What are we counting? Hashtag mentions, product views, video plays, search queries? The cardinality of the key space matters — hashtags are tens of millions; search queries are billions.
- K value. Top 10, top 100, top 1000? K determines how much we need to track precisely versus approximate.
- Time window. Last hour sliding? Last 24 hours? All-time? Window semantics drive the entire state model.
- Freshness SLA. How stale can results be — one second, one minute, five minutes? Sub-second is a different system than every-minute.
- Exact or approximate. Is “approximately correct” acceptable? For trending surfaces (social feeds, dashboards), yes. For billing or compliance, no.
- Global or per-region / per-segment. One global top-K or sliced by region, language, user segment? Segmentation multiplies the state.
- Event volume. Throughput of the incoming stream? The trending question at Twitter scale is ~1M events/sec.
Exact vs approximate is the single biggest question. If the interviewer says “must be exact,” the design shifts toward durable storage and batch reconciliation, and the problem becomes much less interesting.
Say the interviewer confirms: counting hashtag mentions, K=100, sliding 1-hour window, freshness within 10 seconds, approximate acceptable, global top-K, event volume ~1M/sec peak.
2. Capacity estimate
Brief. The numbers decide the architecture.
- 1M events/sec × 3600 seconds = ~3.6B events per hour in the window.
- Unique keys per hour: active-hashtag cardinality is in the millions, but counting which specific hashtags are active per hour is itself a problem. Assume ~10M unique keys per hour, ~100M all-time.
- Exact per-key count in a hash map: {key → count}. 10M keys × (~50 bytes key + 8 bytes count) = ~580 MB. Fits in memory on one machine for one hour.
- But: cross-sharded, across regions, with a sliding window and hot keys under bursts? Exact counting at 1M events/sec with a single hash map collapses under write contention. Even sharded exactly, the hot-key problem will kill it.
I’d say out loud: “Exact counting is technically feasible in memory for a bounded window, but it melts under the write rate and the hot-key distribution. More importantly, the product doesn’t need exactness — nobody cares if a hashtag is ranked 47th or 49th. The right answer is an approximation with bounded error.”
3. API design
Two very different surfaces. Ingestion is the firehose. Query is dashboard-style, low-QPS, sensitive to latency.
POST /events
body: { key, timestamp, metadata{region?, segment?, ...} }
returns: 202 Accepted
GET /top-k
params: window=1h, k=100, segment?
returns: [{ key, approx_count, rank }, ...]
Two decisions worth calling out:
- Query is cached aggressively. The top-K result for the current window changes slowly (second-to-second). A 1–5 second TTL at the edge is fine and eliminates 99%+ of read traffic; a minimal read-through sketch follows after this list.
- No exact counts in the response. approx_count is labeled as such. This is a correctness contract with the consumer: the counts are bounded-error estimates, not ground truth.
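To make the caching decision concrete, here is a minimal read-through sketch for the query path, assuming a Redis cache via the redis-py client; recompute_top_k is a hypothetical placeholder for the sketch-merge path described in §7–§8.

```python
import json

import redis  # assumes the redis-py client; any cache with GET/SETEX works

CACHE_TTL_SECONDS = 5  # the current-window top-K changes slowly second to second

cache = redis.Redis(host="localhost", port=6379)

def recompute_top_k(window: str, k: int) -> list[dict]:
    # Hypothetical placeholder for the sketch-merge path described in §7–§8.
    return []

def get_top_k(window: str = "1h", k: int = 100) -> list[dict]:
    """Read-through handler for GET /top-k with a short edge TTL."""
    cache_key = f"topk:{window}:{k}"
    cached = cache.get(cache_key)
    if cached is not None:
        return json.loads(cached)
    result = recompute_top_k(window, k)  # expensive path: merge sketches
    cache.setex(cache_key, CACHE_TTL_SECONDS, json.dumps(result))
    return result
```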
4. Core design: the counting algorithm
This is the question. It’s where SDE II and SDE III answers diverge.
Three common approaches:
(a) Exact counting with a sharded hash map. Shard by key, each
shard maintains {key → count}, query-time fan-out and merge.
- Pros: Exact.
- Cons: Hot-key skew — a trending hashtag can send 10% of traffic to one shard. Memory scales with unique key count, which is unbounded for things like search queries. Doesn’t handle sliding windows elegantly.
(b) Approximate counting with Count-Min Sketch alone. CMS over all events, periodically extract top-K by scanning candidate keys.
- Pros: Sublinear memory, bounded error.
- Cons: You still need a candidate set to query. “Find the top-K” over a sketch requires knowing which keys to query. CMS alone doesn’t solve the top-K problem, only the “count lookup given a key” problem.
(c) Hybrid: Count-Min Sketch + Heavy Hitters. CMS for general count queries, plus a bounded-size data structure (Space-Saving or Misra-Gries) that tracks the candidate top-K directly as events stream in.
- Pros: Sublinear memory, bounded error, directly maintains the top-K answer — no full scan needed.
- Cons: More complex to explain; two data structures to maintain per shard per window.
I’d pick (c), and I’d walk through the mechanisms — this is the section that separates candidates who recognize the algorithms by name from candidates who can implement them.
How Count-Min Sketch works
A CMS is a 2D array of counters, d rows by w columns, plus d
independent hash functions h_1, h_2, ..., h_d, each mapping keys to
columns.
On increment(key): for each row i in 1..d, compute h_i(key)
and increment counters[i][h_i(key)] by one.
On estimate(key): for each row i, read counters[i][h_i(key)].
Return the minimum of the d values.
The minimum is the tightest upper bound, because every row counts at least as much as the true count (incrementing only ever adds; hash collisions only inflate), so taking the min across rows reduces collision error.
Error bounds: with w = ⌈e/ε⌉ columns and d = ⌈ln(1/δ)⌉ rows, the
estimate is within εN of the true count (where N is total events)
with probability 1-δ.
Worked example. Set ε=0.001 (0.1% error), δ=0.01 (99% confidence).
Then w ≈ 2718, d ≈ 5. Memory: 5 × 2718 × 4 bytes = ~54 KB per
sketch. One hour at 1M events/sec is 3.6B events total, so the
per-key error is at most ~3.6M — meaning two keys whose true counts differ
by less than ~3.6M can't be reliably ordered. For trending hashtags where
the top-K all have counts in the tens of millions, that's plenty of
resolution.
The state we keep is d × w integers. 54 KB. Not 580 MB.
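If the interviewer pushes for pseudocode, a minimal Python sketch of the structure described above looks like this. The row-seeded hashing is a stand-in for d independent hash functions, not a recommendation for a production hash family.

```python
import hashlib
import math

class CountMinSketch:
    """Minimal Count-Min Sketch, sized from (epsilon, delta) as in the text:
    w = ceil(e / epsilon) columns, d = ceil(ln(1 / delta)) rows."""

    def __init__(self, epsilon: float = 0.001, delta: float = 0.01):
        self.w = math.ceil(math.e / epsilon)
        self.d = math.ceil(math.log(1.0 / delta))
        self.counters = [[0] * self.w for _ in range(self.d)]

    def _column(self, key: str, row: int) -> int:
        # Row-seeded hash; a stand-in for d independent hash functions.
        digest = hashlib.blake2b(f"{row}:{key}".encode()).digest()
        return int.from_bytes(digest[:8], "little") % self.w

    def increment(self, key: str, count: int = 1) -> None:
        for row in range(self.d):
            self.counters[row][self._column(key, row)] += count

    def estimate(self, key: str) -> int:
        # Minimum across rows: the tightest upper bound on the true count.
        return min(self.counters[row][self._column(key, row)]
                   for row in range(self.d))
```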
How Space-Saving (Heavy Hitters) works
Maintain a fixed-size map of m candidate keys with counts, where
m > K (typically m = 10,000 for a top-100).
On increment(key):
- If key is in the map, increment its count.
- Else if the map has fewer than m entries, add key with count 1.
- Else find the minimum-count entry in the map, replace its key with the new key, and set the new key’s count to min_count + 1. The new key inherits the displaced entry’s count rather than starting from zero — this is the “saving” in Space-Saving.
On query: return the top-K entries by count.
The critical property: Space-Saving sketches are mergeable. Two
Space-Saving structures covering disjoint event streams can be merged
by summing counts for shared keys and taking the top m by count. The
error bound of the merged sketch is the sum of the input error bounds
— tight enough to be useful.
This is the property that lets us shard.
Worked example: m = 10,000 for top-100. At 50 bytes per entry
(key + count + overhead), that’s 500 KB per heavy-hitters structure.
Per shard, per window. Trivial memory.
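A minimal Python sketch of the structure as just described. The linear scan for the minimum entry is for clarity only; the paper's stream-summary structure finds the minimum in O(1).

```python
class SpaceSaving:
    """Minimal Space-Saving structure: at most m tracked keys."""

    def __init__(self, m: int = 10_000):
        self.m = m
        self.counts: dict[str, int] = {}

    def increment(self, key: str) -> None:
        if key in self.counts:
            self.counts[key] += 1
        elif len(self.counts) < self.m:
            self.counts[key] = 1
        else:
            # Evict the minimum-count entry; the new key inherits that count + 1.
            # Inheriting, rather than resetting to 1, is the "saving".
            min_key = min(self.counts, key=self.counts.get)
            min_count = self.counts.pop(min_key)
            self.counts[key] = min_count + 1

    def top_k(self, k: int = 100) -> list[tuple[str, int]]:
        return sorted(self.counts.items(), key=lambda kv: kv[1], reverse=True)[:k]
```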
5. Data model and storage
Most of the state is in-memory in the stream processor. The persistent stores are thin.
sketches_live (per-shard, per-minute, in-memory)
shard_id, minute_bucket → CMS + Space-Saving
sketches_persistent (periodic flush to durable storage)
shard_id, minute_bucket, type → serialized sketch bytes
topk_cache (query-serving)
window_spec → top-K result (5-second TTL)
Storage choices:
- sketches_live — in-memory in the Flink / stream processor task. Checkpointed periodically to the processor’s state backend (RocksDB or similar).
- sketches_persistent — a KV store (Redis, Cassandra, or object storage). One serialized sketch per shard per minute bucket. Small, fast to read at query time.
- topk_cache — Redis. Query endpoint reads here first; recomputes from sketches only on cache miss.
I’d say explicitly: “The expensive part of this system is counting, which lives in-memory. The persistent stores are just archives and result caches. That’s different from most systems — usually storage dominates, but here the data structure choice collapses storage to near-free.”
6. The write path
End to end:
flowchart LR
Events[Event sources] --> Gateway[Ingestion gateway]
Gateway --> Kafka[(Kafka · events topic, partitioned by key)]
Kafka --> Flink[Flink · per-shard CMS + Heavy Hitters]
Flink --> State[(RocksDB checkpoint state)]
Flink --> Persist[(Sketch archive)]
Flink --> Merge[Periodic global merge]
Merge --> Cache[(Top-K cache)]
Dashboard[Dashboard / API] --> Cache
The key structural choice: partition Kafka by the counted key.
Each Flink shard owns a deterministic subset of keys; within a shard,
increments are strictly serial, so the sketches update without lock
contention. Per window bucket, the shard’s state is d × w integers plus an
m-entry heavy-hitters map — hundreds of KB, not hundreds of MB.
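A sketch of the key-to-shard routing under these assumptions. Kafka's keyed partitioner does the equivalent on the key bytes, so this is purely illustrative; it reuses the CountMinSketch and SpaceSaving sketches from above.

```python
import hashlib

NUM_SHARDS = 64  # illustrative; matches the 64-shard figure used later

def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Deterministic key -> shard mapping. Kafka's keyed partitioner does the
    equivalent on the key bytes; the exact hash function differs."""
    digest = hashlib.sha1(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

def handle_event(shards: list[tuple[CountMinSketch, SpaceSaving]], key: str) -> None:
    """All increments for one key land on one shard, so that shard's sketches
    update strictly serially, with no lock contention."""
    cms, heavy_hitters = shards[shard_for(key)]
    cms.increment(key)
    heavy_hitters.increment(key)
```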
7. Deep dive: sliding windows
The window semantics are harder than the counting. Three options:
(a) Tumbling windows. Reset at every hour boundary. Simple, but the “last hour” view jumps at every boundary — at 10:59 the top-K reflects 59 minutes; at 11:00 it reflects 0 minutes. The user experience is jarring.
(b) Bucketed sliding window. Maintain one sketch per minute (or per-10-second bucket) for the last 60 minutes. “Last hour” is the merge of the last 60 per-minute sketches. Each minute bucket is finalized once and never updated; old buckets age out. This is the standard answer.
(c) Exponential decay. Each event’s contribution decays over time with a decay factor. No explicit window — recent events count more. Elegant, but harder to explain and harder to reason about exact time boundaries.
I’d pick (b) — bucketed sliding windows — and walk through the mechanics:
- 60 buckets per shard, one per minute.
- Each bucket is its own CMS + Space-Saving pair.
- At query time, merge the 60 buckets for a shard into a combined sketch for the last hour. Merging CMS is column-wise addition; merging Space-Saving is the property named above.
- Once merged across all shards, extract top-K.
Merge cost per query: 60 buckets × number of shards. At 64 shards and 60 buckets, that’s 3,840 sketch merges per query — milliseconds at most. Cached for 5 seconds, so query cost amortizes to near-zero.
Age-out: at the start of each minute, the oldest bucket is dropped. New events go to a new bucket. Old buckets are optionally archived to durable storage for historical queries.
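A minimal sketch of the per-shard bucket ring, reusing the CountMinSketch and SpaceSaving classes from above. Bucket archival is elided, and the query-time merge itself is sketched in §8.

```python
WINDOW_MINUTES = 60

class ShardWindow:
    """Per-shard bucketed sliding window: one (CMS, Space-Saving) pair per
    minute; the oldest bucket ages out as new minutes arrive."""

    def __init__(self):
        self.buckets: dict[int, tuple[CountMinSketch, SpaceSaving]] = {}

    def record(self, key: str, server_ts: float) -> None:
        minute = int(server_ts // 60)  # gateway-stamped time, not client time
        if minute not in self.buckets:
            self.buckets[minute] = (CountMinSketch(), SpaceSaving())
            while len(self.buckets) > WINDOW_MINUTES:
                self.buckets.pop(min(self.buckets))  # drop (or archive) the oldest
        cms, hh = self.buckets[minute]
        cms.increment(key)
        hh.increment(key)

    def last_hour_buckets(self) -> list[tuple[CountMinSketch, SpaceSaving]]:
        # Input to the query-time merge sketched in §8.
        return list(self.buckets.values())
```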
8. Deep dive: distributed counting and merge correctness
For 1M events/sec peak, one machine can’t maintain the live sketches — the partitioning by key spreads the load across N Flink tasks.
Extending the architecture:
flowchart LR
Events[Event sources] --> Gateway[Ingestion gateway]
Gateway --> Kafka[(Kafka · N partitions, partitioned by key)]
Kafka --> Flink1[Flink shard 1]
Kafka --> Flink2[Flink shard 2]
Kafka --> FlinkN[Flink shard N]
Flink1 --> State1[(Per-shard buckets)]
Flink2 --> State2[(Per-shard buckets)]
FlinkN --> StateN[(Per-shard buckets)]
State1 --> Merger[Query-time merger]
State2 --> Merger
StateN --> Merger
Merger --> Cache[(Top-K cache)]
Dashboard[Dashboard] --> Cache
classDef new fill:#eef2f1,stroke:#2c5f5d,stroke-width:2px,color:#1f2937;
class Flink2,FlinkN,State2,StateN,Merger new
The merge correctness property is where this design earns its keep:
- CMS merge. Given two sketches with identical d, w, and hash functions, merge is element-wise addition of their counter arrays. The merged sketch is mathematically equivalent to a single sketch that observed both event streams. Error bounds compose additively.
- Space-Saving merge. Sum counts for shared keys, keep the top m by count. The error bound of the merged structure is bounded by the sum of the per-structure error bounds. The Mergeable Summaries paper by Agarwal et al. is the formal treatment and worth linking.
Because both structures merge correctly, sharding is free at the algorithm level. The only cost is the query-time merge, which is bounded and cached.
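A minimal sketch of both merges plus the query-time fan-out, built on the CountMinSketch, SpaceSaving, and ShardWindow sketches above. This simplified Space-Saving merge carries only the summed counts; the formal version in the Mergeable Summaries paper also tracks per-entry error terms.

```python
def merge_cms(a: CountMinSketch, b: CountMinSketch) -> CountMinSketch:
    """Element-wise addition of counter arrays. Requires identical d, w, and
    hash functions, which holds here because every sketch shares one config."""
    assert a.d == b.d and a.w == b.w
    merged = CountMinSketch()
    merged.d, merged.w = a.d, a.w
    merged.counters = [[x + y for x, y in zip(row_a, row_b)]
                       for row_a, row_b in zip(a.counters, b.counters)]
    return merged

def merge_space_saving(a: SpaceSaving, b: SpaceSaving) -> SpaceSaving:
    """Sum counts for shared keys, keep the top m by count; error bounds add."""
    combined = dict(a.counts)
    for key, count in b.counts.items():
        combined[key] = combined.get(key, 0) + count
    merged = SpaceSaving(a.m)
    merged.counts = dict(sorted(combined.items(),
                                key=lambda kv: kv[1], reverse=True)[:a.m])
    return merged

def global_top_k(shard_windows: list[ShardWindow], k: int = 100) -> list[tuple[str, int]]:
    """Query-time fan-out: merge every shard's last-hour buckets, extract top-K."""
    merged = SpaceSaving()
    for window in shard_windows:
        for _, hh in window.last_hour_buckets():
            merged = merge_space_saving(merged, hh)
    return merged.top_k(k)
```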
Hot-key mitigation
A single hashtag going viral during the Super Bowl sends 10% of all
events to one Kafka partition. The Flink task on that partition
becomes the bottleneck. Same problem as the
ad click aggregation walkthrough —
use the same solution: salt the hot keys. Detect keys in the
top 0.1% of recent volume (count-min sketch in the gateway is enough)
and append a random suffix 0..N to their partition key. A second
merge stage keyed on the unsalted key combines the N salted buckets at
query time.
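A sketch of the gateway-side salting decision. The fanout, the separator, and the shape of the hot-key set are illustrative assumptions, not part of the original design.

```python
import random

SALT_FANOUT = 8  # illustrative: how many sub-keys a hot key is spread across

def salted_partition_key(key: str, hot_keys: set[str]) -> str:
    """Gateway-side salting. hot_keys would be fed by a small CMS over recent
    gateway traffic; the '|' separator and fanout are illustrative choices."""
    if key in hot_keys:
        return f"{key}|{random.randrange(SALT_FANOUT)}"
    return key

def unsalted_key(partition_key: str) -> str:
    """The second merge stage groups salted buckets back under the raw key."""
    return partition_key.split("|", 1)[0]
```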
9. Failure modes
I’d proactively walk through what breaks.
- Flink task crash. Restart from the last checkpoint (every 30 seconds). RocksDB state backend persists the sketches; replay of ~30 seconds of Kafka reconstructs the in-memory state. Small count drift possible during the replay window; for a trending system this is well within error bars.
- Kafka partition unavailable. With acks=all and min.insync.replicas=2, a single broker loss is invisible. Two broker losses on the same partition stall writes for events keyed to that partition. Gateway rejects; SDK retries with backoff.
- Sketch memory overflow. Can’t happen by construction — sketches are fixed-size. This is exactly the property we chose them for. Worth saying out loud: the approximation is not “nice-to-have,” it’s the reason the memory footprint is bounded under adversarial input.
- Query merger overloaded. A query fanning out to 64 shards × 60 buckets can slow. Mitigation: pre-merge periodically in the background (every 10 seconds, compute the current top-K and cache it); user-facing queries hit the cache, not the merger.
- Clock skew on event timestamps. Events with wrong client clocks land in the wrong buckets. Server-stamp the authoritative timestamp at the ingestion gateway; use it for bucketing, not the client’s timestamp field.
- Hot-key melt without salting. A trending key saturates one Flink task. Detection: per-task backlog metric. Mitigation: enable salting for keys above a volume threshold, via a config flip without a deploy.
- Heavy-hitters eviction of genuine top-K. If m is set too close to K, a genuine top-K key can get evicted during a burst, losing its slot and its accumulated count. Mitigation: set m generously (10–100× K); monitor the displaced-key distribution.
The pattern to notice: name what fails, name what degrades gracefully, name what doesn’t. Because the product tolerates small error, almost everything here degrades to “slightly less accurate top-K” rather than “no top-K.” That’s a property of the approximation choice, not an accident.
10. Tradeoffs and alternatives
Briefly, three alternatives I considered and rejected.
- Exact counting with sharded Redis. Works up to ~100K events/sec before hot-key contention destroys it. Mentioned to show I know the naive answer exists.
- Batch-only (nightly Spark job). Fails the freshness SLA by two orders of magnitude. Mentioned because the alternative to streaming is always batch, and it’s worth explicitly rejecting here.
- CMS without Heavy Hitters. CMS gives you per-key estimates, but the top-K problem needs a candidate set. Scanning all possible keys at query time is infeasible. Heavy Hitters is what makes top-K actually tractable; CMS alone is not sufficient for this problem.
11. Extensions and what I’d skip
Likely follow-ups worth a few sentences on each:
- Cross-region. Per-region shards maintain local sketches; a global query merges across regions. Because sketches are mergeable, this is essentially free on the algorithm side; cross-region network cost is the only overhead.
- Personalized top-K. “Top trending in your network” is a cardinality explosion — one sketch per user is infeasible. Solution: segment users into a small number of interest buckets, maintain per-bucket sketches. Approximate personalization at feasible cost.
- Different window sizes. Top-K for the last minute, hour, day, week? Maintain finer-grained buckets (per-second for the last minute, per-minute for the last hour, per-hour for the last day). Storage cost is sublinear because each tier has fewer buckets.
- Exactness for small K. For K=10 over all-time, full-fidelity counting for just the top 1000 candidates (identified via streaming heavy hitters) is feasible. Worth naming as a variant.
Explicit defers:
- Spam and bot filtering on events. Real; a separate upstream system.
- Per-tenant isolation if this is a multi-tenant API.
- Historical archival and backfill. Possible via the sketch archive, but a separate pipeline.
- Alerting on trending changes (“this hashtag just jumped 10×”). A consumer of the top-K stream, not a responsibility of this system.
12. Wrap-up
One crisp sentence before the interviewer’s next question:
This design trades exactness for tractability: the top-K is approximate with bounded error, but it holds the bound under 1M events/sec, a sliding hour window, and arbitrary sharding. That trade-off is the design.
That’s the kind of framing that lands.
What separates SDE II from SDE III on this question
- SDE II usually lands a sharded counter, a sliding window, and reaches for approximation when prompted on memory.
- SDE III names approximation in the first five minutes, explains Count-Min Sketch well enough to re-implement it, pairs it with Space-Saving for direct top-K maintenance, articulates the mergeability property as the reason sharding is free at the algorithm level, picks bucketed sliding windows with the merge math, and names hot-key salting as a real production concern.
The differentiator isn’t tool knowledge. It’s whether you recognize that “top-K at scale” is an approximation problem, and whether you can walk through the algorithms that make the approximation actually work.
Further reading
- Count-Min Sketch paper (Cormode & Muthukrishnan, 2005) — the original paper. Short and readable; the canonical reference for the sketch used here.
- Space-Saving paper (Metwally, Agrawal, El Abbadi, 2005) — the original Heavy Hitters algorithm. The mergeability is explicit in the paper and is the property this design relies on.
- Mergeable Summaries (Agarwal et al., 2012) — the formal treatment of why these data structures compose correctly under sharding. Background reading if you want the rigorous version.
- Designing Data-Intensive Applications — Kleppmann. Chapter 11 on streaming and chapter 4 on encoding are the textbook references for this shape of system.
- Apache Flink: windowing documentation — the mechanics of tumbling vs sliding windows in the engine most production implementations use.
Related on calm.rocks:
- Walkthrough: Designing an Ad Click Aggregation System — the same streaming architecture with a different aggregation and a different correctness story. Hot-key salting is discussed there in full.
- Reference: Cache Access and Invalidation Patterns — the top-K query cache described in §3 is a canonical short-TTL read-through cache.