The problem
You’re asked to design a system that returns arbitrary ranks, not just the top-K. Given a video ID, return its global rank (which might be 54,311 or 1,234,545,678). Given a rank, return the video at that position. The interviewer says something like: “Design a ranking system for YouTube videos by view count. A user should be able to look up the rank of any specific video and browse the full leaderboard at any position.”
This looks like a harmless extension of top-K — “just sort everything
instead of taking the top 100.” It isn’t. The trap is that top-K
tolerates approximation on the tail (“is #xyz in the top 100 or
not?”), but full ranking asks about items deep in the tail, where
approximation breaks down and every structure we used for top-K either
doesn’t scale or doesn’t support the query. Candidates who reach for a
single Redis sorted set run out of memory somewhere around the first
billion entries. Candidates who recognize that no single structure
serves all rank queries — and who propose a tiered design with
different storage for hot top positions, mid-tail ranks, and deep
archival — pass the gate. The interviewer is testing whether you can
decompose a query surface by access pattern instead of forcing one
data structure to do everything.
Below is how I’d walk through this, roughly in the order I’d speak the words, with the architecture evolving stage by stage.
1. Clarify before you design
First 3–5 minutes. Resist the urge to start drawing.
Questions I’d ask:
- What’s the ranked dimension? View count, upvotes, revenue, a composite score? Composite scores change the update pattern significantly.
- What are the query shapes?
  - rank_of(item_id) — given an item, return its rank.
  - item_at(rank) — given a rank, return the item.
  - range(rank_from, rank_to) — browse a contiguous range.
  - neighbors(item_id, ±N) — “show me my rank and nearby.”
- Which of these are hot? Which are rare?
- Total items to rank. Millions? Billions? 12 billion videos and 10 million products have very different designs.
- Update rate. How often does the ranked value change? Every view is a +1; every upvote is a +1; daily batch recompute is different.
- Freshness SLA. “My rank should be current within a second” is a different system than “ranks updated daily.”
- Exact or approximate? Top-100 ranks usually need to be exact (user-facing leaderboard). Rank 54,311 vs 54,320 — does anyone actually notice? Deep tail ranks are almost always fine to approximate.
- Range constraints. Does the product expose rank 1 billion, or does it cap at rank 10,000? Many real products never show ranks past a certain depth.
Say the interviewer confirms: ranking YouTube videos by view count, all query shapes, ~12 billion total videos, views arrive at ~1M/sec globally, top 100K must be exact and fresh within seconds, deeper ranks can be approximate and minutes stale, deepest ranks (beyond ~10M) can be hours stale.
The tiered freshness/accuracy requirement is the key signal. The interviewer is telling you: different rank ranges get different treatment. If you build one uniform system, you’ll either overpay massively or fail the SLA.
2. Capacity estimate
Brief. The numbers decide the architecture.
- 12B videos × (~20-byte ID + 8-byte count) ≈ ~340 GB just for the (id, count) pairs, no index.
- A Redis sorted set at ~80 bytes per entry × 12B ≈ ~1 TB in memory. Possible with a sharded cluster, but expensive, and the hot-key write pattern makes it fragile.
- View events at 1M/sec × 100 bytes ≈ 8.6 TB per day of raw events. Archival-scale, not online-query-scale.
- Top-100K sorted set: 100K × 80 bytes = ~8 MB. Trivial.
- Mid-tier (say 100K to 10M): 10M × 80 bytes = ~800 MB. Fits on one Redis instance.
- Tail (10M to 12B): needs a column store or offline index.
Out loud: “The memory gradient tells the design. Top positions fit in memory on a laptop; the tail needs a different architecture. The query gradient is the same shape — top positions are hot, the tail is cold. Match storage tiers to access tiers.”
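As a sanity check, the arithmetic above as a runnable back-of-envelope sketch. The byte sizes are the assumptions already stated, not measurements:

```python
# Back-of-envelope capacity check. All byte sizes are the assumptions
# from the estimates above, not measured values.
N_VIDEOS = 12e9          # total videos to rank
ENTRY_RAW = 20 + 8       # ~20-byte ID + 8-byte count
ENTRY_ZSET = 80          # rough per-entry overhead in a Redis ZSet
EVENTS_PER_SEC = 1e6     # global view rate
EVENT_BYTES = 100        # raw event size

print(f"raw (id, count) pairs: {N_VIDEOS * ENTRY_RAW / 1e9:.0f} GB")      # ~336 GB
print(f"single giant ZSet:     {N_VIDEOS * ENTRY_ZSET / 1e12:.1f} TB")    # ~1.0 TB
print(f"raw events per day:    {EVENTS_PER_SEC * EVENT_BYTES * 86400 / 1e12:.1f} TB")  # ~8.6 TB
print(f"top-100K ZSet:         {100_000 * ENTRY_ZSET / 1e6:.0f} MB")      # ~8 MB
print(f"mid-tier ZSet (10M):   {10e6 * ENTRY_ZSET / 1e6:.0f} MB")         # ~800 MB
```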
3. API design
GET /videos/{video_id}/rank
returns: { video_id, rank, view_count, confidence: "exact|approx" }
GET /rankings/at/{rank}
returns: { rank, video_id, view_count, confidence }
GET /rankings/range?from={r1}&to={r2}
returns: [{ rank, video_id, view_count }, ...]
GET /videos/{video_id}/neighbors?radius=10
returns: [{ rank, video_id, view_count }, ...] # ±10 around the target
Two decisions worth calling out:
- The confidence field is part of the contract. The response is labeled exact or approximate, so consumers know what they can trust. Top-100K queries return exact; deep-tail queries return approx.
- Range queries are bounded server-side: to - from ≤ 1000 at most. No request can ask for “ranks 5M through 10M” in one shot — that’s a scan, not a lookup.
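To make the contract concrete, a minimal routing sketch. The tier lookups are stubbed in memory; in the real system they would hit Redis and the histogram tier described later, and the names here (top_zset, histogram_rank) are hypothetical:

```python
# Minimal sketch of the rank router and the confidence contract.
# The tier lookups are in-memory stand-ins, not the real Redis /
# histogram services; names are hypothetical.
TOP_EXACT = 100_000       # size of the exact, seconds-fresh tier
MAX_RANGE = 1_000         # server-side bound on range queries

top_zset: dict[str, int] = {"abc": 1}   # stand-in for the Redis top-100K ZSet

def histogram_rank(video_id: str) -> int:
    """Stand-in for the bucket-sum estimate described in stage 3."""
    return 54_311

def rank_of(video_id: str) -> dict:
    rank = top_zset.get(video_id)
    if rank is not None:
        return {"video_id": video_id, "rank": rank, "confidence": "exact"}
    return {"video_id": video_id, "rank": histogram_rank(video_id), "confidence": "approx"}

def check_range(r_from: int, r_to: int) -> None:
    if r_to - r_from > MAX_RANGE:
        raise ValueError("range wider than 1000: that's a scan, not a lookup")
```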
4. Evolving the architecture step by step
Four stages, each fixing the previous stage’s bottleneck.
Stage 1: Naive — one Redis sorted set
flowchart LR
Client[Client] --> API[API server]
View[View event] --> API
API -->|ZINCRBY| Redis[(Redis · one giant ZSet)]
API -->|ZRANK / ZREVRANGE| Redis
One sorted set for all 12B videos. ZINCRBY on each view.
ZRANK(video_id) for rank lookup. ZREVRANGE rank rank for
item_at(rank). Range queries via ZREVRANGE from to.
Why this fails:
- Memory. ~1 TB in a single ZSet. Even sharded across a Redis cluster, every write is O(log N) against a 12-billion-element skiplist.
- Hot write path. 1M/sec ZINCRBYs on a single sorted set is impossible. Sharding ZSets by video ID breaks ZRANK — rank is a global property, not shardable by key.
- Hot keys. A viral video sends bursts of writes to one shard.
- Read amplification. ZRANK on a sharded ZSet requires computing ranks across all shards and summing offsets — expensive per query.
Call it out quickly: “A single sorted set is the obvious first answer and it breaks at roughly 10M videos or 10K writes/sec. Let’s separate the hot path from the cold path.”
Stage 2: Tier the storage by rank range
Key insight: rank lookup for the top 100K is an entirely different problem from rank lookup for the 1-billionth item. Top positions are hot, small, need exactness; the tail is cold, enormous, tolerates approximation. Don’t force one structure to serve both.
flowchart LR
View[View event] --> API[API server]
API -->|ZINCRBY| RedisTop[(Redis · top 100K ZSet)]
API -->|UPSERT count| DB[(OLAP store · full 12B)]
Query[Query] --> Router{Rank range?}
Router -->|top 100K| RedisTop
Router -->|deeper| DB
classDef new fill:#eef2f1,stroke:#2c5f5d,stroke-width:2px,color:#1f2937;
class DB,Router new
Two tiers:
- Hot tier: Redis sorted set for top 100K. Exact, fast, small (~8 MB). ZINCRBY on writes; ZRANK / ZREVRANGE on reads. Videos that fall out of the top 100K are evicted from this set.
- Cold tier: OLAP column store (ClickHouse) for all 12B. One row per video: (video_id, view_count, updated_at). Indexed on view_count for range queries.
For a rank query:
- If video_id is in the Redis top-100K set → return its exact rank from the ZSet.
- Otherwise → go to the OLAP store. Use the counting trick:
SELECT count() FROM videos WHERE view_count > (
SELECT view_count FROM videos WHERE video_id = 'abc'
)
The answer + 1 is the rank. This reframes “find the position in a sorted list” as “count items above a threshold” — which OLAP stores handle well.
Complexity of the counting trick. This is not O(log N). It’s O(N/P) parallel scan for ClickHouse, or O(log N + K) for Elasticsearch BKD trees (K = number of matches). For a video in the top 1000, K is small and the query is fast. For a video deep in the tail, K is in the billions and the query is slow. A naive counting query at the tail can take seconds.
What’s still broken:
- Tail rank queries are too slow for online use.
- item_at(rank) for huge rank values requires ORDER BY ... LIMIT 1 OFFSET N where N is billions — infeasible.
- Writes at 1M/sec to ClickHouse are too much for per-event updates.
Stage 3: Add Kafka + stream aggregation for writes, precompute tail ranks offline
flowchart LR
View[View event] --> Gateway[Ingestion gateway]
Gateway --> Kafka[(Kafka · partitioned by video_id)]
Kafka --> Flink[Flink · aggregate per video_id]
Flink -->|hot videos| RedisTop[(Redis · top 100K)]
Flink -->|all videos| DB[(ClickHouse · 12B rows)]
Kafka -.raw.-> S3[(S3 · raw event archive)]
Batch[Nightly Spark job] --> DB
Batch --> RankIndex[(Rank index · per-rank-range files)]
Batch --> Histogram[(View-count histogram)]
Query[Query] --> Router{Query shape?}
Router -->|top 100K rank / range| RedisTop
Router -->|mid-tier rank| DB
Router -->|deep-tail rank| Histogram
Router -->|item_at rank| RankIndex
classDef new fill:#eef2f1,stroke:#2c5f5d,stroke-width:2px,color:#1f2937;
class Gateway,Kafka,Flink,S3,Batch,RankIndex,Histogram,Router new
Now the write path is properly decoupled and the tail is precomputed.
Write path:
- Raw view events go to Kafka, partitioned by video_id so a given video’s updates are serialized.
- Flink aggregates per video over a small window (e.g., every 10 seconds): collapse N views into a single +N update (sketched just below).
- Flink writes the aggregate to Redis (if the video is in the top 100K) and to ClickHouse. Raw events go to S3 for replay.
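A minimal sketch of the aggregation step, in plain Python rather than Flink’s API; a real Flink job would express the same logic as a keyed 10-second tumbling window with a sum aggregate:

```python
from collections import defaultdict

# Collapse raw view events into one +N delta per video per window.
# This is the window-aggregation step in plain Python; the Flink job
# would be a keyed tumbling window with a sum aggregate.
WINDOW_SECONDS = 10

def aggregate_window(events: list[tuple[str, float]]) -> dict[str, int]:
    """events: (video_id, timestamp) pairs within one 10-second window."""
    deltas: dict[str, int] = defaultdict(int)
    for video_id, _ts in events:
        deltas[video_id] += 1
    return dict(deltas)  # e.g. {"abc": 1000} instead of 1000 increments

# Each delta is then applied downstream: ZINCRBY to Redis if the video
# is in the top 100K, and a batched upsert into ClickHouse.
```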
Read path, by tier:
- Top 100K. Redis ZSet. ZRANK / ZREVRANGE. Exact, ms.
- Mid-tier (100K to 10M). ClickHouse counting trick. The key optimization: a precomputed view_count histogram. Divide the view-count range into buckets (say 100K buckets by log-scale) and maintain count(videos) per bucket. Rank estimate:
  - Video has view_count = 5,000.
  - Histogram: buckets above view_count 5,000 sum to 54,310.
  - Rank ≈ 54,311.
  - Updated every few minutes in the background. Accuracy within the bucket’s resolution (~100 videos).
- Deep tail (10M+). Pure histogram estimate. Accept rank error of hundreds or thousands — no one cares about the difference between rank 1,234,545,678 and 1,234,545,800.
- item_at(rank). Precomputed daily by a Spark job: sort all videos by view_count, partition into fixed-size files (1M ranks per file). Lookup: file_index = rank / 1M, read that file, return the item at offset rank % 1M. Daily freshness.
The histogram is the key trick. It converts “rank lookup in a billion-row table” from “count billions of rows” to “sum 100K precomputed bucket counts” — milliseconds instead of seconds.
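A minimal sketch of both mid-tier moves, the histogram rank estimate and the precomputed item_at file lookup. The power-of-two bucketing is a toy stand-in for the ~100K log-scale buckets described above:

```python
import bisect

# Toy log-scale histogram: power-of-two bucket edges. The real system
# would use ~100K finer buckets; the mechanics are the same.
bucket_edges = [2 ** i for i in range(40)]     # lower view-count bound of each bucket
bucket_counts = [0] * len(bucket_edges)        # videos per bucket, kept fresh

def estimate_rank(view_count: int) -> int:
    """Rank ≈ 1 + number of videos in buckets strictly above this one.
    Accuracy is the bucket's resolution, which is the mid-tier tradeoff."""
    b = bisect.bisect_right(bucket_edges, view_count) - 1
    return 1 + sum(bucket_counts[b + 1:])

# item_at(rank) against the daily precomputed index: fixed-size sorted
# files, 1M ranks each, so locating a rank is pure arithmetic.
RANKS_PER_FILE = 1_000_000

def item_at_location(rank: int) -> tuple[int, int]:
    file_index, offset = divmod(rank, RANKS_PER_FILE)
    return file_index, offset   # read file `file_index`, take row `offset`
```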
What’s still limited:
- Mid-tier counting trick can still be slow if the histogram is coarse. Typically fine.
- item_at(rank) for huge ranks is at most a day stale.
- Top-100K membership changes require eviction logic when a video’s count drops out.
Stage 4: Close the loop — incremental histogram, approximate tail with sketches
Last optimization pass, only if the interviewer pushes further.
- Incremental histogram maintenance. Instead of recomputing the histogram every few minutes via full scan, maintain it incrementally from the Kafka stream: each aggregate update decrements the old bucket and increments the new bucket. Histogram stays fresh within seconds.
- Quantile sketches for very deep tail. For ranks past 100M, even the histogram is overkill. A t-digest or KLL sketch over the view_count distribution gives rank estimates with logarithmic memory. These structures answer “at what rank is view_count X” directly, without a bucket scan.
- Neighbors queries. neighbors(video_id, ±10) for a top-100K video is free from Redis. For a mid-tier video, the counting trick gives the rank, then a second query — SELECT ... WHERE view_count BETWEEN (count - ε) AND (count + ε) ORDER BY view_count LIMIT 20 — returns the neighbors. ε is chosen from the histogram density.
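Spelled out, the two-step mid-tier neighbors lookup might look like this. The table and column names follow the data model in section 7; the exact SQL is illustrative, not a fixed implementation:

```python
# Two-step mid-tier neighbors lookup (illustrative SQL; table and
# column names follow the data model in section 7). Step 1 gets the
# anchor's count; step 2 pulls a band of nearby counts, with epsilon
# sized from the histogram density around that count.
def neighbors_sql(video_id: str, epsilon: int, limit: int = 21) -> tuple[str, str]:
    anchor = f"SELECT view_count FROM video_counts WHERE video_id = '{video_id}'"
    band = (
        "SELECT video_id, view_count FROM video_counts "
        f"WHERE view_count BETWEEN ({anchor}) - {epsilon} "
        f"AND ({anchor}) + {epsilon} "
        f"ORDER BY view_count DESC LIMIT {limit}"
    )
    return anchor, band
```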
This is the final design. Top positions are exact and live; the middle is precomputed to milliseconds via histogram; the tail is approximate and acceptable.
5. Why no single structure works
A common misconception: “just use a sorted index on view_count.”
It’s instructive to enumerate why each candidate structure fails alone:
- B-tree / RDBMS index on view_count. Write amplification at 1M/sec is too high; COUNT(*) WHERE view_count > X scans the tail; ORDER BY ... OFFSET N is linear.
- Redis sorted set. Memory is ~1 TB; single-key bottleneck on writes; sharding doesn’t preserve global rank semantics.
- Elasticsearch. BKD tree on numeric fields handles range queries well (O(log N + K)), and it’s often the right answer for the mid-tier. But at 12B documents the index is huge and reindex costs dominate; deep from pagination is explicitly capped (max_result_window).
- ClickHouse alone. Great for counting queries, but 1M/sec per-event writes are expensive; ORDER BY ... LIMIT OFFSET at billion-scale offsets is still slow.
- Count-Min Sketch / Heavy Hitters. Great for “what’s popular,” useless for “what’s the rank of this specific item” — the sketch can’t enumerate, and Heavy Hitters only knows its top-m.
Each structure is right for one part of the query surface. The design lesson: map each query class to the structure that serves it, and let the API router dispatch.
6. The counting trick in detail
This is the move that turns rank lookup into a tractable query. Worth explaining carefully because it’s the non-obvious part.
Naive approach: “Find video_abc in the sorted list and count
how many items are before it.” This requires either the full sorted
list (infeasible) or an ORDER BY ... OFFSET N scan.
Counting approach:
1. Look up the video’s own view_count: O(1) KV lookup.
2. Count how many other videos have view_count strictly greater.
3. That count + 1 is the rank.
Step 2 is a range aggregate, which is what columnar stores and inverted indexes are built for.
- ClickHouse: SELECT count() FROM videos WHERE view_count > X. Columnar scan, parallelized, with a sparse min/max index to skip data parts. For top-range queries (X is high) very fast; for bottom-range queries slower but still parallel.
- Elasticsearch: the count API with a range query on view_count. The BKD tree narrows to matching documents. O(log N + K) where K is the match count.
- Precomputed histogram: SELECT sum(count) FROM histogram WHERE bucket > bucket_of(X). Millisecond-level. Used for the mid-tier.
The fastest version always wins: precompute the histogram, answer from the histogram when acceptable, fall back to the live count query when not.
7. Data model and storage
views_live (Redis · hot)
top 100K videos as a ZSet, score = view_count
video_counts (ClickHouse · all 12B)
video_id, view_count, updated_at
MergeTree, ORDER BY (view_count, video_id)
index on video_id for point lookups
count_histogram (ClickHouse materialized view, or Redis)
bucket (log-scale of view_count), cumulative_count
updated incrementally or via periodic refresh
rank_index (S3 · precomputed daily)
sorted parquet files, 1M ranks per file
lookup: file_idx = rank / 1M
raw_events (S3 · archive)
partitioned by hour, used only for replay
Storage choices:
- views_live — Redis ZSet. Sub-millisecond rank / range queries for the top 100K. Eviction via a periodic job that compares the current top 100K from ClickHouse and trims.
- video_counts — ClickHouse. All 12B rows. view_count ordering supports efficient counting and ranges. Updated from Flink aggregates at batched write rates (not per event).
- count_histogram — either a ClickHouse materialized view or a small Redis hash (bucket → count). Histograms are tiny (~100K buckets × 16 bytes = ~1.6 MB). Feeds the mid-tier rank query.
- rank_index — S3 parquet, precomputed by a nightly Spark job. Serves item_at(rank) for deep ranks. Read via a lightweight API server that knows the file layout.
- raw_events — S3. Not on the query path. Used for recomputation if the ranking dimension changes, or for audit.
Separation of concerns: online queries read from Redis /
ClickHouse / histogram; offline queries (deep item_at) read from
S3. No online query hits raw events.
8. The write path
flowchart LR
View[View event] --> Gateway[Ingestion gateway]
Gateway --> Kafka[(Kafka · partitioned by video_id)]
Kafka --> Flink[Flink · aggregate + batch]
Flink -->|+N batch| Redis[(Redis · top 100K)]
Flink -->|+N batch| DB[(ClickHouse · all 12B)]
Flink -.->|histogram delta| Hist[(count_histogram)]
Kafka -.archive.-> S3[(S3)]
- Kafka is partitioned by video_id so updates to a given video serialize.
- Flink aggregates over 10-second windows: one video with 1000 views in a window becomes one +1000 update, not 1000 increments. Reduces write load to ClickHouse by orders of magnitude.
- The aggregation also emits histogram deltas: when a video moves from bucket B1 to B2, emit (B1, -1) and (B2, +1) to count_histogram. This keeps the histogram live without full recomputation (sketched below).
- Eviction from Redis top 100K happens in a background job that trims the ZSet to size 100K periodically. (Redis ZSets grow unbounded under ZINCRBY; you have to prune.)
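A sketch of the histogram-delta step, using the same toy power-of-two bucketing as the earlier histogram sketch; the names are illustrative:

```python
# Incremental histogram maintenance from the aggregate stream: a +N
# delta that moves a video across a bucket boundary emits (B1, -1)
# and (B2, +1). Toy power-of-two bucketing, names illustrative.
def bucket_of(view_count: int) -> int:
    return view_count.bit_length()       # 0 -> 0, 1 -> 1, 2-3 -> 2, 4-7 -> 3, ...

def apply_delta(counts: dict, hist: list, video_id: str, delta: int) -> None:
    old = counts.get(video_id, 0)
    new = old + delta
    counts[video_id] = new
    b_old, b_new = bucket_of(old), bucket_of(new)
    if b_old != b_new:                   # crossed a bucket boundary
        if old > 0:
            hist[b_old] -= 1             # (B1, -1): left the old bucket
        hist[b_new] += 1                 # (B2, +1): entered the new bucket

counts, hist = {}, [0] * 64
apply_delta(counts, hist, "abc", 1000)   # new video lands in bucket 10
```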
9. Distribution and hot keys
- Sharding ClickHouse. Shard by video_id. Queries for a specific video hit one shard; range queries on view_count fan out to all shards and merge. The ORDER BY (view_count, video_id) within each shard keeps per-shard ranges cheap; the fan-out merge is bounded by shard count, not data size.
- Hot-key writes. A viral video can send 10% of the event stream to one Kafka partition. Because Flink aggregates per window, 1M views in 10 seconds collapse to a single +1,000,000 update — hot keys are actually easier in this architecture than in the top-K one, because the aggregation absorbs the burst before it hits storage.
- Hot-key reads. “What’s the rank of the top video” is a common query. Served from the Redis ZSet in microseconds. Cache the full top-100 result at the API layer with a few-second TTL.
10. Failure modes
- Redis top-100K loss. Reconstruct from ClickHouse. A few minutes of rebuild; during rebuild, top-100K queries fall back to ClickHouse with a warning flag in the response.
- ClickHouse shard down. Count queries return partial results. Degrade to histogram-based approximate rank until the shard recovers. For mid-tier queries, mark the response as confidence: degraded.
- Histogram staleness. If incremental updates fall behind, the mid-tier rank is slightly stale. Monitor histogram lag; if above threshold, force a refresh from ClickHouse.
- Kafka partition unavailable. acks=all, min.insync.replicas=2. A single broker loss is invisible. A double loss stalls writes for events keyed to that partition; the gateway rejects with retry.
- Flink task crash. Restart from the last checkpoint. Aggregates in flight during the crash window are lost — bounded, small, and dwarfed by histogram bucket resolution.
- Rank index (S3) stale. Up to a day stale by design. item_at(rank) for deep ranks is labeled confidence: approx with the index timestamp in the response.
- Boundary flapping for top-100K eviction. A video bouncing in/out of rank 99,999 / 100,001 causes repeated Redis ZSet churn. Mitigation: hysteresis — evict only when rank drops below 105K; admit only when rank rises above 95K. Smoother boundary (sketched below).
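The hysteresis rule fits in a few lines; the thresholds are the ones named above:

```python
# Hysteresis band for top-100K membership: admit only when clearly
# inside the boundary, evict only when clearly outside it, so a video
# oscillating around rank ~100K doesn't churn the Redis ZSet.
ADMIT_AT, EVICT_AT = 95_000, 105_000     # numeric rank thresholds

def next_membership(currently_in: bool, rank: int) -> bool:
    if currently_in:
        return rank <= EVICT_AT          # keep until rank worsens past 105K
    return rank <= ADMIT_AT              # admit only once rank betters 95K
```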
The pattern: every tier has a graceful degradation story. Top
tier falls back to mid tier; mid tier falls back to histogram;
histogram falls back to last-known batch. The response always says
confidence: ... so the consumer knows what they got.
11. Tradeoffs and alternatives
- Single massive Redis cluster. Mentioned in stage 1. Works to maybe 100M entries, melts past that. Too expensive at 12B.
- Single Elasticsearch index on view_count. Viable for hundreds of millions, but reindex costs and deep from pagination caps make it unsuitable for the full 12B with deep item_at queries.
- Full offline precomputation (batch only). Every rank query served from daily-precomputed parquet files. Fails the freshness SLA for the top 100K by orders of magnitude. Mentioned only to explicitly reject.
- Approximation everywhere (sketches for all ranks). Works for analytics, fails for the user-facing leaderboard top-100, which must be exact.
12. Extensions and what I’d skip
Follow-ups worth a few sentences:
- Multi-dimensional ranking. Rank by (views, recency, engagement score). Composite scores change at every event. Precompute materialized views per ranking dimension; the tiered design generalizes.
- Personalized ranking. “Top videos for me.” One ranking per user is infeasible; segment users into a few hundred interest buckets, maintain per-bucket rankings. Approximate personalization at feasible cost.
- Historical ranking. “What was this video’s rank a year ago?” The raw event archive in S3 supports it via batch replay. Not online-query-shape.
- Percentile queries. “What’s my percentile among all videos?” Served directly from the histogram — percentile is the histogram CDF, no rank computation needed.
- Ranking windows. “Top videos this week” vs all-time. Each window is its own materialized view; tier the same way.
Explicit defers:
- Spam / fraud filtering on views. Upstream system.
- Tie-breaking for equal view counts. Use video_id as secondary sort; mention it’s a correctness detail.
- Real-time rank change notifications (“your video just entered the top 100”) — a consumer of the Redis ZSet change stream, not part of this system.
13. Wrap-up
One crisp sentence:
The design tiers storage to match the query surface: exact and live for the top 100K via Redis, counting queries against ClickHouse with a precomputed histogram for the mid-tier, and offline-precomputed parquet for the deep tail — with every response carrying a confidence label so consumers know what they got.
What separates SDE II from SDE III on this question
- SDE II lands a sorted set or an indexed SQL table, recognizes it doesn’t scale, and struggles to articulate the right decomposition.
- SDE III recognizes immediately that the query surface has
heterogeneous access patterns and decomposes by rank range.
Articulates the counting trick as the reframe from “position in
list” to “count above threshold.” Names histograms as the
mid-tier accelerator. Names precomputed parquet for deep
item_at. Labels every response with a confidence field so the consumer contract stays honest. Names hysteresis for boundary flapping and graceful degradation per tier.
The differentiator isn’t tool knowledge. It’s whether you can split a query surface by access pattern and map each split to the right storage shape, rather than searching for one magic structure.
Further reading
- ClickHouse documentation: MergeTree and sparse indexes — why columnar range aggregates are fast, and how the sparse primary index supports the counting trick.
- Elasticsearch BKD trees — how Lucene indexes numeric fields for range queries. Relevant if you choose ES over ClickHouse for the mid-tier.
- t-digest paper (Dunning, 2019) — quantile sketch used for approximate rank / percentile queries on the deep tail.
- KLL sketch paper (Karnin, Lang, Liberty, 2016) — another quantile sketch with better theoretical guarantees.
- Designing Data-Intensive Applications — Kleppmann. Chapters 3 (storage and retrieval) and 10 (batch processing) cover the columnar and precomputation patterns this design depends on.
Related on calm.rocks:
- Walkthrough: Designing a Top K System — the sibling problem. Top-K tolerates approximation everywhere; full ranking forces tiered exactness. Comparing the two walkthroughs is the fastest way to see why query surface drives storage choice.
- Walkthrough: Designing an Ad Click Aggregation System — same streaming ingest pattern with a different aggregation and correctness story.
- Reference: Cache Access and Invalidation Patterns — the top-100K Redis tier is a canonical read-through cache with scheduled eviction.