
Walkthrough: Designing a Full Ranking System

A candidate's-eye walkthrough of the full-ranking / arbitrary-rank lookup system design question — when top-K isn't enough, why sorted sets don't scale to billions, and how histograms, bucketed counts, and offline precomputation compose into a multi-tier design.

The problem

You’re asked to design a system that returns arbitrary ranks, not just the top-K. Given a video ID, return its global rank (which might be 54,311 or 1,234,545,678). Given a rank, return the video at that position. The interviewer says something like: “Design a ranking system for YouTube videos by view count. A user should be able to look up the rank of any specific video and browse the full leaderboard at any position.”

This looks like a harmless extension of top-K — “just sort everything instead of taking the top 100.” It isn’t. The trap is that top-K tolerates approximation on the tail (“is #xyz in the top 100 or not?”), but full ranking asks about items deep in the tail, where approximation breaks down and every structure we used for top-K either doesn’t scale or doesn’t support the query. Candidates who reach for a single Redis sorted set will be out of memory at roughly the first billion entries. Candidates who recognize that no single structure serves all rank queries — and who propose a tiered design with different storage for hot top positions, mid-tail ranks, and deep archival — pass the gate. The interviewer is testing whether you can decompose a query surface by access pattern instead of forcing one data structure to do everything.

Below is how I’d walk through this, roughly in the order I’d speak the words, with the architecture evolving stage by stage.

1. Clarify before you design

First 3–5 minutes. Resist the urge to start drawing.

Questions I’d ask:

  • What’s the ranked dimension? View count, upvotes, revenue, a composite score? Composite scores change the update pattern significantly.
  • What are the query shapes?
    • rank_of(item_id) — given an item, return its rank.
    • item_at(rank) — given a rank, return the item.
    • range(rank_from, rank_to) — browse a contiguous range.
    • neighbors(item_id, ±N) — “show me my rank and nearby.”
    • Which of these are hot? Which are rare?
  • Total items to rank. Millions? Billions? 12 billion videos and 10 million products have very different designs.
  • Update rate. How often does the ranked value change? Every view is a +1; every upvote is a +1; daily batch recompute is different.
  • Freshness SLA. “My rank should be current within a second” is a different system than “ranks updated daily.”
  • Exact or approximate? Top-100 ranks usually need to be exact (user-facing leaderboard). Rank 54,311 vs 54,320 — does anyone actually notice? Deep tail ranks are almost always fine to approximate.
  • Range constraints. Does the product expose rank 1 billion, or does it cap at rank 10,000? Many real products never show ranks past a certain depth.

Say the interviewer confirms: ranking YouTube videos by view count, all query shapes, ~12 billion total videos, views arrive at ~1M/sec globally, top 100K must be exact and fresh within seconds, deeper ranks can be approximate and minutes stale, deepest ranks (beyond ~10M) can be hours stale.

The tiered freshness/accuracy requirement is the key signal. The interviewer is telling you: different rank ranges get different treatment. If you build one uniform system, you’ll either overpay massively or fail the SLA.

2. Capacity estimate

Brief. The numbers decide the architecture.

  • 12B videos × (~20-byte ID + 8-byte count) ≈ ~340 GB just for the (id, count) pairs, no index.
  • A Redis sorted set at ~80 bytes per entry × 12B ≈ ~1 TB in memory. Possible with a sharded cluster, but expensive and the hot-key write pattern makes it fragile.
  • View events at 1M/sec × 100 bytes ≈ 8.6 TB per day of raw events. Archival-scale, not online-query-scale.
  • Top-100K sorted set: 100K × 80 bytes = ~8 MB. Trivial.
  • Mid-tier (say 100K to 10M): 10M × 80 bytes = ~800 MB. Fits on one Redis instance.
  • Tail (10M to 12B): needs a column store or offline index.

Out loud: “The memory gradient tells the design. Top positions fit in memory on a laptop; the tail needs a different architecture. The query gradient is the same shape — top positions are hot, the tail is cold. Match storage tiers to access tiers.”
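
If you want to sanity-check those numbers live, the whole estimate is a few lines of arithmetic (same byte-size assumptions as the bullets above):

# Back-of-envelope check of the capacity numbers above — pure arithmetic, no external data.
TOTAL_VIDEOS = 12e9
ID_BYTES, COUNT_BYTES = 20, 8          # ~20-byte video ID + 8-byte counter
ZSET_BYTES_PER_ENTRY = 80              # rough Redis ZSet overhead per member
EVENT_BYTES, EVENTS_PER_SEC = 100, 1e6

raw_pairs_gb = TOTAL_VIDEOS * (ID_BYTES + COUNT_BYTES) / 1e9
full_zset_tb = TOTAL_VIDEOS * ZSET_BYTES_PER_ENTRY / 1e12
top_100k_mb = 100_000 * ZSET_BYTES_PER_ENTRY / 1e6
mid_tier_mb = 10_000_000 * ZSET_BYTES_PER_ENTRY / 1e6
events_tb_per_day = EVENTS_PER_SEC * EVENT_BYTES * 86_400 / 1e12

print(f"(id, count) pairs:   ~{raw_pairs_gb:.0f} GB")       # ~336 GB
print(f"full 12B ZSet:       ~{full_zset_tb:.2f} TB")       # ~0.96 TB
print(f"top-100K ZSet:       ~{top_100k_mb:.0f} MB")        # ~8 MB
print(f"mid-tier ZSet:       ~{mid_tier_mb:.0f} MB")        # ~800 MB
print(f"raw events per day:  ~{events_tb_per_day:.1f} TB")  # ~8.6 TB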

3. API design

GET /videos/{video_id}/rank
  returns: { video_id, rank, view_count, confidence: "exact|approx" }

GET /rankings/at/{rank}
  returns: { rank, video_id, view_count, confidence }

GET /rankings/range?from={r1}&to={r2}
  returns: [{ rank, video_id, view_count }, ...]

GET /videos/{video_id}/neighbors?radius=10
  returns: [{ rank, video_id, view_count }, ...]  # ±10 around the target

Two decisions worth calling out:

  • confidence field is part of the contract. The response is labeled exact or approximate, so consumers know what they can trust. Top-100K queries return exact; deep-tail queries return approx.
  • Range queries are bounded server-side: to - from ≤ 1000. No request can ask for “ranks 5M through 10M” in one shot — that’s a scan, not a lookup.

4. Evolving the architecture step by step

Four stages, each fixing the previous stage’s bottleneck.

Stage 1: Naive — one Redis sorted set

flowchart LR
  Client[Client] --> API[API server]
  View[View event] --> API
  API -->|ZINCRBY| Redis[(Redis · one giant ZSet)]
  API -->|ZRANK / ZREVRANGE| Redis

One sorted set for all 12B videos. ZINCRBY on each view. ZRANK(video_id) for rank lookup. ZREVRANGE rank rank for item_at(rank). Range queries via ZREVRANGE from to.

Why this fails:

  • Memory. ~1 TB in a single ZSet. Even sharded across a Redis cluster, every write is O(log N) against a 12-billion-element skiplist.
  • Hot write path. 1M/sec ZINCRBYs on a single sorted set is impossible. Sharding ZSets by video ID breaks ZRANK — rank is a global property, not shardable by key.
  • Hot keys. A viral video sends bursts of writes to one shard.
  • Read amplification. ZRANK on a sharded ZSet requires computing ranks across all shards and summing offsets — expensive per query.

Call it out quickly: “A single sorted set is the obvious first answer and it breaks at roughly 10M videos or 10K writes/sec. Let’s separate the hot path from the cold path.”
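
For concreteness, stage 1 is only a few calls — a minimal sketch assuming redis-py, using ZREVRANK / ZREVRANGE for descending (most-viewed-first) order. It works at toy scale and fails for exactly the reasons above:

import redis

r = redis.Redis()
KEY = "videos:by_views"  # one global ZSet — the thing that doesn't scale

def record_view(video_id: str) -> None:
    # Every view is a +1 against the single sorted set.
    r.zincrby(KEY, 1, video_id)

def rank_of(video_id: str) -> int | None:
    # ZREVRANK: 0-based position in descending score order.
    pos = r.zrevrank(KEY, video_id)
    return None if pos is None else pos + 1

def item_at(rank: int) -> tuple[str, int] | None:
    # ZREVRANGE rank rank: the single item at that position.
    rows = r.zrevrange(KEY, rank - 1, rank - 1, withscores=True)
    return (rows[0][0].decode(), int(rows[0][1])) if rows else None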

Stage 2: Tier the storage by rank range

Key insight: rank lookup for the top 100K is an entirely different problem from rank lookup for the 1-billionth item. Top positions are hot, small, need exactness; the tail is cold, enormous, tolerates approximation. Don’t force one structure to serve both.

flowchart LR
  View[View event] --> API[API server]
  API -->|ZINCRBY| RedisTop[(Redis · top 100K ZSet)]
  API -->|UPSERT count| DB[(OLAP store · full 12B)]
  Query[Query] --> Router{Rank range?}
  Router -->|top 100K| RedisTop
  Router -->|deeper| DB

  classDef new fill:#eef2f1,stroke:#2c5f5d,stroke-width:2px,color:#1f2937;
  class DB,Router new

Two tiers:

  • Hot tier: Redis sorted set for top 100K. Exact, fast, small (~8 MB). ZINCRBY on writes; ZRANK / ZREVRANGE on reads. Videos that fall out of the top 100K are evicted from this set.
  • Cold tier: OLAP column store (ClickHouse) for all 12B. One row per video: (video_id, view_count, updated_at). Indexed on view_count for range queries.

For a rank query:

  • If video_id is in the Redis top-100K set → return its exact rank from the ZSet.
  • Otherwise → go to the OLAP store. Use the counting trick:
SELECT count() FROM videos WHERE view_count > (
    SELECT view_count FROM videos WHERE video_id = 'abc'
)

The answer + 1 is the rank. This reframes “find the position in a sorted list” as “count items above a threshold” — which OLAP stores handle well.
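
A minimal sketch of that two-tier lookup, assuming redis-py and clickhouse-driver (table and key names are illustrative):

import redis
from clickhouse_driver import Client

r = redis.Redis()
ch = Client(host="clickhouse")
TOP_KEY = "videos:top100k"

def rank_of(video_id: str) -> dict:
    # Hot tier: exact rank straight from the top-100K ZSet.
    pos = r.zrevrank(TOP_KEY, video_id)
    if pos is not None:
        return {"rank": pos + 1, "confidence": "exact"}
    # Cold tier: the counting trick against the full table.
    (count_above,) = ch.execute(
        "SELECT count() FROM videos WHERE view_count > "
        "(SELECT view_count FROM videos WHERE video_id = %(id)s)",
        {"id": video_id},
    )[0]
    return {"rank": count_above + 1, "confidence": "approx"}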

Complexity of the counting trick. This is not O(log N). For ClickHouse it’s an O(N/P) parallel scan (P = degree of parallelism); for Elasticsearch BKD trees it’s O(log N + K) (K = number of matches). For a video in the top 1000, K is small and the query is fast. For a video deep in the tail, K is in the billions and the query is slow. A naive counting query at the tail can take seconds.

What’s still broken:

  • Tail rank queries are too slow for online use.
  • item_at(rank) for huge rank values requires ORDER BY ... LIMIT 1 OFFSET N where N is billions — infeasible.
  • Writes at 1M/sec to ClickHouse are too much for per-event updates.

Stage 3: Add Kafka + stream aggregation for writes, precompute tail ranks offline

flowchart LR
  View[View event] --> Gateway[Ingestion gateway]
  Gateway --> Kafka[(Kafka · partitioned by video_id)]
  Kafka --> Flink[Flink · aggregate per video_id]
  Flink -->|hot videos| RedisTop[(Redis · top 100K)]
  Flink -->|all videos| DB[(ClickHouse · 12B rows)]
  Kafka -.raw.-> S3[(S3 · raw event archive)]
  Batch[Nightly Spark job] --> DB
  Batch --> RankIndex[(Rank index · per-rank-range files)]
  Batch --> Histogram[(View-count histogram)]

  Query[Query] --> Router{Query shape?}
  Router -->|top 100K rank / range| RedisTop
  Router -->|mid-tier rank| DB
  Router -->|deep-tail rank| Histogram
  Router -->|item_at rank| RankIndex

  classDef new fill:#eef2f1,stroke:#2c5f5d,stroke-width:2px,color:#1f2937;
  class Gateway,Kafka,Flink,S3,Batch,RankIndex,Histogram,Router new

Now the write path is properly decoupled and the tail is precomputed.

Write path:

  • Raw view events go to Kafka, partitioned by video_id so a given video’s updates are serialized.
  • Flink aggregates per video over a small window (e.g., every 10 seconds): collapse N views into a single +N update.
  • Flink writes the aggregate to Redis (if the video is in the top 100K) and to ClickHouse. Raw events go to S3 for replay.

Read path, by tier:

  • Top 100K. Redis ZSet. ZRANK / ZREVRANGE. Exact, ms.
  • Mid-tier (100K to 10M). ClickHouse counting trick. The key optimization: precomputed view_count histogram. Divide the view-count range into buckets (say 100K buckets by log-scale), and maintain count(videos) per bucket. Rank estimate:
    • Video has view_count = 5,000.
    • Histogram: buckets above view_count 5,000 sum to 54,310.
    • Rank ≈ 54,311.
    • Updated every few minutes in the background. Accuracy within the bucket’s resolution (~100 videos).
  • Deep tail (10M+). Pure histogram estimate. Accept rank error of hundreds or thousands — no one cares about the difference between rank 1,234,545,678 and 1,234,545,800.
  • item_at(rank). Precomputed daily by a Spark job: sort all videos by view_count, partition into fixed-size files (1M ranks per file). Lookup: file_index = rank / 1M; read that file; return the item at offset rank % 1M. Daily freshness.

The histogram is the key trick. It converts “rank lookup in a billion-row table” from “count billions of rows” to “sum 100K precomputed bucket counts” — milliseconds instead of seconds.
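
A minimal sketch of the histogram estimate — the bucket resolution, refresh cadence, and in-memory representation are all assumptions:

import bisect

# Log-scale bucket edges covering view counts from 1 up to ~10^12 (~96K buckets).
BUCKET_EDGES = [10 ** (i / 8000) for i in range(96_000)]

# bucket_counts[i] = number of videos whose view_count falls in bucket i,
# refreshed every few minutes (or maintained incrementally from the stream).
bucket_counts = [0] * len(BUCKET_EDGES)

def bucket_of(view_count: float) -> int:
    return bisect.bisect_right(BUCKET_EDGES, view_count) - 1

def estimated_rank(view_count: float) -> int:
    # Rank ≈ 1 + number of videos in strictly higher buckets; the error is
    # bounded by the population of the video's own bucket.
    b = bucket_of(view_count)
    return 1 + sum(bucket_counts[b + 1:])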

What’s still limited:

  • Mid-tier counting trick can still be slow if the histogram is coarse. Typically fine.
  • item_at(rank) for huge ranks is at most a day stale.
  • Top-100K membership changes require eviction logic when a video’s count drops out.

Stage 4: Close the loop — incremental histogram, approximate tail with sketches

Last optimization pass, only if the interviewer pushes further.

  • Incremental histogram maintenance. Instead of recomputing the histogram every few minutes via full scan, maintain it incrementally from the Kafka stream: each aggregate update decrements the old bucket and increments the new bucket. Histogram stays fresh within seconds.
  • Quantile sketches for very deep tail. For ranks past 100M, even the histogram is overkill. A t-digest or KLL sketch over the view_count distribution gives rank estimates with logarithmic memory. These structures answer “at what rank is view_count X” directly, without a bucket scan.
  • Neighbors queries. neighbors(video_id, ±10) for a top-100K video is free from Redis. For a mid-tier video, the counting trick gives the rank, then a second query SELECT ... WHERE view_count BETWEEN (count - ε) AND (count + ε) ORDER BY view_count LIMIT 20 returns neighbors. ε is chosen from the histogram density (sketched after this list).
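
A minimal sketch of the mid-tier neighbors query from the last bullet, assuming clickhouse-driver; the ε heuristic is illustrative:

from clickhouse_driver import Client

ch = Client(host="clickhouse")

def neighbors(video_id: str, radius: int = 10) -> list[tuple[str, int]]:
    # Look up the video's own count, then pull a band of rows around it.
    (count,) = ch.execute(
        "SELECT view_count FROM videos WHERE video_id = %(id)s", {"id": video_id})[0]
    eps = max(1, count // 100)  # band width; in practice derived from histogram density
    return ch.execute(
        "SELECT video_id, view_count FROM videos "
        "WHERE view_count BETWEEN %(lo)s AND %(hi)s "
        "ORDER BY view_count DESC LIMIT %(lim)s",
        {"lo": count - eps, "hi": count + eps, "lim": 2 * radius + 1},
    )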

This is the final design. Top positions are exact and live; the middle is precomputed to milliseconds via histogram; the tail is approximate and acceptable.

5. Why no single structure works

A common misconception: “just use a sorted index on view_count.”

It’s instructive to enumerate why each candidate structure fails alone:

  • B-tree / RDBMS index on view_count. Write amplification at 1M/sec is too high; COUNT(*) WHERE view_count > X scans the tail; ORDER BY ... OFFSET N is linear.
  • Redis sorted set. Memory is ~1 TB; single-key bottleneck on writes; sharding doesn’t preserve global rank semantics.
  • Elasticsearch. BKD tree on numeric fields handles range queries well (O(log N + K)), and it’s often the right answer for the mid-tier. But at 12B documents the index is huge and reindex costs dominate; deep from pagination is explicitly capped (max_result_window).
  • ClickHouse alone. Great for counting queries, but 1M/sec per-event writes are expensive; ORDER BY ... LIMIT OFFSET at billion-scale offsets is still slow.
  • Count-Min Sketch / Heavy Hitters. Great for “what’s popular,” useless for “what’s the rank of this specific item” — the sketch can’t enumerate and Heavy Hitters only knows its top-m.

Each structure is right for one part of the query surface. The design lesson: map each query class to the structure that serves it, and let the API router dispatch.

6. The counting trick in detail

This is the move that turns rank lookup into a tractable query. Worth explaining carefully because it’s the non-obvious part.

Naive approach: “Find video_abc in the sorted list and count how many items are before it.” This requires either the full sorted list (infeasible) or an ORDER BY ... OFFSET N scan.

Counting approach:

  1. Look up the video’s own view_count: O(1) KV lookup.
  2. Count how many other videos have view_count strictly greater.
  3. That count + 1 is the rank.

Step 2 is a range aggregate, which is what columnar stores and inverted indexes are built for.

  • ClickHouse: SELECT count() FROM videos WHERE view_count > X. Columnar scan, parallelized, with a sparse min/max index to skip data parts. Very fast when X is high (few parts contain matches); slower but still parallel when X is deep in the distribution.
  • Elasticsearch: count API with a range query on view_count. BKD tree narrows to matching documents. O(log N + K) where K is the match count.
  • Precomputed histogram: SELECT sum(count) FROM histogram WHERE bucket > bucket_of(X). Millisecond-level. Used for the mid-tier.

The fastest version always wins: precompute the histogram, answer from the histogram when acceptable, fall back to the live count query when not.

7. Data model and storage

views_live        (Redis · hot)
  top 100K videos as a ZSet, score = view_count

video_counts      (ClickHouse · all 12B)
  video_id, view_count, updated_at
  MergeTree, ORDER BY (view_count, video_id)
  index on video_id for point lookups

count_histogram   (ClickHouse materialized view, or Redis)
  bucket (log-scale of view_count), cumulative_count
  updated incrementally or via periodic refresh

rank_index        (S3 · precomputed daily)
  sorted parquet files, 1M ranks per file
  lookup: file_idx = rank / 1M

raw_events        (S3 · archive)
  partitioned by hour, used only for replay

Storage choices:

  • views_live — Redis ZSet. Sub-millisecond rank / range queries for the top 100K. Eviction via a periodic job that compares current top 100K from ClickHouse and trims.
  • video_counts — ClickHouse. All 12B rows. view_count ordering supports efficient counting and ranges. Updated from Flink aggregates at batched write rates (not per event).
  • count_histogram — either a ClickHouse materialized view or a small Redis hash (bucket → count). Histograms are tiny (~100K buckets × 16 bytes = ~1.6 MB). Feeds the mid-tier rank query.
  • rank_index — S3 parquet, precomputed by a nightly Spark job. Serves item_at(rank) for deep ranks. Read in via a lightweight API server that knows the file layout.
  • raw_events — S3. Not on the query path. Used for recomputation if the ranking dimension changes or for audit.

Separation of concerns: online queries read from Redis / ClickHouse / histogram; offline queries (deep item_at) read from S3. No online query hits raw events.
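
A minimal sketch of the deep item_at(rank) lookup against the precomputed files — the 1M-ranks-per-file layout comes from above, but the paths and use of pyarrow are assumptions:

import pyarrow.parquet as pq

RANKS_PER_FILE = 1_000_000
BASE = "s3://rank-index/latest"  # illustrative path to the nightly snapshot

def item_at(rank: int) -> dict:
    file_idx = (rank - 1) // RANKS_PER_FILE   # which precomputed file
    offset = (rank - 1) % RANKS_PER_FILE      # row offset within that file
    # In practice you'd read only the row group containing the offset;
    # a full-file read keeps the sketch short.
    table = pq.read_table(f"{BASE}/part-{file_idx:05d}.parquet",
                          columns=["video_id", "view_count"])
    row = table.slice(offset, 1).to_pylist()[0]
    return {"rank": rank, **row, "confidence": "approx"}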

8. The write path

flowchart LR
  View[View event] --> Gateway[Ingestion gateway]
  Gateway --> Kafka[(Kafka · partitioned by video_id)]
  Kafka --> Flink[Flink · aggregate + batch]
  Flink -->|+N batch| Redis[(Redis · top 100K)]
  Flink -->|+N batch| DB[(ClickHouse · all 12B)]
  Flink -.->|histogram delta| Hist[(count_histogram)]
  Kafka -.archive.-> S3[(S3)]

  • Kafka partitioned by video_id so updates to a given video serialize.
  • Flink aggregates over 10-second windows: one video with 1000 views in a window becomes one +1000 update, not 1000 increments. Reduces write load to ClickHouse by orders of magnitude.
  • The aggregation also emits histogram deltas: when a video moves from bucket B1 to B2, emit (B1, -1) and (B2, +1) to count_histogram. This keeps the histogram live without full recomputation (sketched after this list).
  • Eviction from Redis top 100K happens in a background job that trims the ZSet to size 100K periodically. (Redis ZSets grow unbounded under ZINCRBY; you have to prune.)
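
A minimal sketch of the aggregation step — plain Python standing in for the Flink job, with windowing, keyed state, and sinks elided; the bucket resolution is illustrative:

import math
from collections import Counter

def bucket_of(view_count: int) -> int:
    # Log-scale bucket index (resolution is illustrative).
    return int(math.log10(view_count + 1) * 8000)

def aggregate_window(events: list[str]) -> Counter:
    # Collapse raw view events in a 10-second window into one +N delta per video.
    return Counter(events)

def histogram_deltas(deltas: Counter, counts: dict[str, int]) -> list[tuple[int, int]]:
    # Emit (bucket, ±1) pairs whenever a video crosses a bucket boundary, so
    # count_histogram stays fresh without a full rescan. `counts` stands in for
    # the per-video running totals the streaming job would keep in state.
    out = []
    for video_id, n in deltas.items():
        old = counts.get(video_id, 0)
        new = old + n
        if bucket_of(old) != bucket_of(new):
            out += [(bucket_of(old), -1), (bucket_of(new), +1)]
        counts[video_id] = new
    return out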

9. Distribution and hot keys

  • Sharding ClickHouse. Shard by video_id. Queries for a specific video hit one shard; range queries on view_count fan out to all shards and merge. The ORDER BY (view_count, video_id) within each shard keeps per-shard ranges cheap; the fan-out merge is bounded by shard count, not data size.
  • Hot-key writes. A viral video sends 10% of the event stream to one Kafka partition. Because Flink aggregates per window, 1M views in 10 seconds collapse to a single +1,000,000 update — hot keys are actually easier in this architecture than in the top-K one, because the aggregation absorbs the burst before it hits storage.
  • Hot-key reads. “What’s the rank of the top video” is a common query. Served from Redis ZSet in microseconds. Cache the full top-100 result at the API layer with a few-second TTL.

10. Failure modes

  • Redis top-100K loss. Reconstruct from ClickHouse. A few minutes of rebuild; during rebuild, top-100K queries fall back to ClickHouse with a warning flag in the response.
  • ClickHouse shard down. Count queries return partial results. Degrade to histogram-based approximate rank until the shard recovers. For mid-tier queries, mark the response as confidence: degraded.
  • Histogram staleness. If incremental updates fall behind, the mid-tier rank is slightly stale. Monitor histogram lag; if above threshold, force a refresh from ClickHouse.
  • Kafka partition unavailable. acks=all, min.insync.replicas=2. Single broker loss invisible. Double loss stalls writes for events keyed to that partition; gateway rejects with retry.
  • Flink task crash. Restart from last checkpoint. Aggregates in flight during the crash window are lost — bounded, small, and dwarfed by histogram bucket resolution.
  • Rank index (S3) stale. Up to a day stale by design. item_at(rank) for deep ranks is labeled confidence: approx with the index timestamp in the response.
  • Boundary flapping for top-100K eviction. A video bouncing in/out of rank 99,999 / 100,001 causes repeated Redis ZSet churn. Mitigation: hysteresis — evict only when rank drops below 105K; admit only when rank rises above 95K. Smoother boundary.

The pattern: every tier has a graceful degradation story. Top tier falls back to mid tier; mid tier falls back to histogram; histogram falls back to last-known batch. The response always says confidence: ... so the consumer knows what they got.
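
A minimal sketch of the hysteresis rule from the boundary-flapping bullet above (thresholds are illustrative):

TOP_SIZE = 100_000
ADMIT_AT = 95_000    # admit only when the video rises clearly inside the boundary
EVICT_AT = 105_000   # evict only when it falls clearly outside

def reconcile_top_set(current_rank: int, in_top_set: bool) -> str:
    # Decide whether the periodic job should add, drop, or leave the video alone.
    if not in_top_set and current_rank <= ADMIT_AT:
        return "admit"
    if in_top_set and current_rank > EVICT_AT:
        return "evict"
    return "keep"  # inside the hysteresis band: no churn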

11. Tradeoffs and alternatives

  • Single massive Redis cluster. Mentioned in stage 1. Works to maybe 100M entries, melts past that. Too expensive at 12B.
  • Single Elasticsearch index on view_count. Viable for hundreds of millions, but reindex costs and deep from pagination caps make it unsuitable for the full 12B with deep item_at queries.
  • Full offline precomputation (batch only). Every rank query served from daily-precomputed parquet files. Fails the freshness SLA for the top 100K by orders of magnitude. Mentioned to explicitly reject.
  • Approximation everywhere (sketches for all ranks). Works for analytics, fails for the user-facing leaderboard top-100, which must be exact.

12. Extensions and what I’d skip

Follow-ups worth a few sentences:

  • Multi-dimensional ranking. Rank by (views, recency, engagement score). Composite scores change at every event. Precompute materialized views per ranking dimension; the tiered design generalizes.
  • Personalized ranking. “Top videos for me.” One ranking per user is infeasible; segment users into a few hundred interest buckets, maintain per-bucket rankings. Approximate personalization at feasible cost.
  • Historical ranking. “What was this video’s rank a year ago?” The raw event archive in S3 supports it via batch replay. Not online-query-shape.
  • Percentile queries. “What’s my percentile among all videos?” Served directly from the histogram — percentile is the histogram CDF, no rank computation needed.
  • Ranking windows. “Top videos this week” vs all-time. Each window is its own materialized view; tier the same way.

Explicit defers:

  • Spam / fraud filtering on views. Upstream system.
  • Tie-breaking for equal view counts. Use video_id as secondary sort; mention it’s a correctness detail.
  • Real-time rank change notifications (“your video just entered the top 100”) — a consumer of the Redis ZSet change stream, not part of this system.

13. Wrap-up

One crisp sentence:

The design tiers storage to match the query surface: exact and live for the top 100K via Redis, counting queries against ClickHouse with a precomputed histogram for the mid-tier, and offline-precomputed parquet for the deep tail — with every response carrying a confidence label so consumers know what they got.

What separates SDE II from SDE III on this question

  • SDE II lands a sorted set or an indexed SQL table, recognizes it doesn’t scale, and struggles to articulate the right decomposition.
  • SDE III recognizes immediately that the query surface has heterogeneous access patterns and decomposes by rank range. Articulates the counting trick as the reframe from “position in list” to “count above threshold.” Names histograms as the mid-tier accelerator. Names precomputed parquet for deep item_at. Labels every response with a confidence field so the consumer contract stays honest. Names hysteresis for boundary flapping and graceful degradation per tier.

The differentiator isn’t tool knowledge. It’s whether you can split a query surface by access pattern and map each split to the right storage shape, rather than searching for one magic structure.
