Walkthrough: Designing a Post Search System

A candidate's-eye walkthrough of the post search system design question — indexing, sharding, two-stage ranking, visibility, freshness, and the failure modes that matter.

The problem

You’re asked to design the search backend for a social network — Facebook post search, Twitter search, Reddit search. The interviewer says something like: “Design a system that lets a user type a query and get back posts that match, ranked by relevance, including posts from their friends as well as public content.”

Sounds like a search engine. It’s three systems pretending to be one. This is the canonical “social-scale search” problem, and the traps are in three places at once.

Write throughput dominates — posts are created orders of magnitude faster than webpages, so indexing pipeline design matters more than query optimization. Ranking is not pure text relevance — recency and social affinity dominate BM25 almost entirely, and getting that wrong ships a product that technically works and nobody uses. And visibility is a first-class correctness constraint — a post set to “friends only” must appear in the searcher’s results if-and-only-if they’re a friend, and getting that wrong is not a bug, it’s an incident. The interviewer is watching for whether you can keep retrieval, ranking, and visibility named and separate — or whether you collapse them into “just use Elasticsearch.”

Below is how I’d walk through this, roughly in the order I’d speak the words.

1. Clarify before you design

First 3–5 minutes. Resist the urge to start drawing.

Questions I’d ask:

  • Corpus characteristics. Total post count, posts per day, average length? 100B total posts and 1B/day new is the Facebook-realistic anchor. Changes the indexing pipeline materially.
  • Query shape. Keyword only, or filters too (author, date range, location, media type)? Filtered search requires a different index shape than full-text.
  • Freshness SLA. How soon after posting must a post be searchable — seconds, a minute, an hour? Near-real-time indexing is a different architecture from hourly batch.
  • Ranking signals. Pure text relevance, or social signals (recency, author affinity, engagement)? Pure text is rarely the right answer on a social product.
  • Visibility rules. Public posts, friends-only, private groups, page-level privacy? Who is allowed to see what, and when is that filter applied?
  • Query QPS. Search traffic is far smaller than feed-read traffic, and on a social network it lands in the same order of magnitude as post creation. ~10K searches/sec peak is a common anchor.
  • Language and geo. One language, or multilingual? Tokenization and stemming depend on language.
  • Autocomplete. Part of this system, or separate? Usually separate, worth confirming.

Freshness and visibility matter most — they define the pipeline shape and the correctness contract. Ranking signals matter because they reshape what “retrieval” even means.

Say the interviewer confirms: ~100B total posts, ~1B/day new, ~10K searches/sec peak, filters on author + date + location, 30-second freshness SLA, ranking by recency + social affinity + engagement, visibility rules (public / friends / private), English-primary multilingual, autocomplete deferred to a separate system.

2. Capacity estimate

Brief. The point is to size the problem, not to be precise.

  • 1B posts/day × ~500 bytes ≈ 500 GB/day of raw post storage, ~180 TB/year. The existing corpus of ~100B posts × 500 bytes ≈ 50 TB of raw post content.
  • Inverted index ~30% of raw size (compressed postings) → ~15 TB live index across the fleet.
  • 10K searches/sec peak. Each search touches multiple index shards — effective shard-level QPS is ~10K × shard-count-per-query, easily 100K+ shard-hits/sec.
  • Ingestion at 1B/day ≈ ~12K posts/sec average, ~50K/sec peak.

I’d say out loud: “This tells me three things. One — ingestion rate is within an order of magnitude of query rate, so indexing throughput is a first-class design constraint, not a background problem. Two — the live index is tens of TB, which means it has to be sharded; no single node holds it. Three — query latency will be dominated by the slowest shard in a scatter-gather, so tail latency control is a real design concern.”
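
The arithmetic is worth having at your fingertips. A back-of-envelope in code, using the confirmed anchors (the 4× peak factor is an assumption, not a given):

POSTS_PER_DAY   = 1e9
AVG_POST_BYTES  = 500
TOTAL_POSTS     = 100e9
INDEX_RATIO     = 0.30                          # compressed postings vs. raw text

raw_per_day  = POSTS_PER_DAY * AVG_POST_BYTES   # 5e11 B  -> ~500 GB/day
raw_per_year = raw_per_day * 365                # ~180 TB/year of new content
raw_corpus   = TOTAL_POSTS * AVG_POST_BYTES     # 5e13 B  -> ~50 TB today
live_index   = raw_corpus * INDEX_RATIO         # ~15 TB across the fleet
ingest_avg   = POSTS_PER_DAY / 86_400           # ~12K posts/sec average
ingest_peak  = ingest_avg * 4                   # assumed ~4x diurnal peak -> ~50K/sec

print(f"{raw_per_day / 1e9:.0f} GB/day, {raw_corpus / 1e12:.0f} TB corpus, "
      f"{live_index / 1e12:.0f} TB index, {ingest_avg / 1e3:.0f}K posts/sec avg")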

3. API design

Two surfaces. Search (user-facing, latency-sensitive) and indexing (from the post-creation pipeline, high throughput).

POST /search
  body:    { query, filters{author?, date_range?, location?},
             cursor?, limit?, searcher_context }
  returns: { results: [{post_id, snippet, score, ...}],
             next_cursor, total_approx }

POST /index  (internal, called by post-creation pipeline)
  body:    { post_id, author_id, content, created_at,
             visibility{type, audience_id?}, metadata{...} }
  returns: 202 Accepted

Three decisions worth calling out:

  • searcher_context on the search body. The viewer’s identity and relevant social-graph context (friend IDs) are part of the query contract, not a post-filter. Visibility is enforced at retrieval time, not after. This is the single largest decision in the whole design.
  • 202 on index. Indexing is asynchronous; the contract is “eventually searchable within the SLA,” not “searchable on write.”
  • total_approx, not total. Exact result counts at this scale are expensive and nobody needs them. An approximate count (“about 1,200 results”) is free from the inverted index, and that’s all the UX needs.
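
As Python types, the search contract looks roughly like this; the names are illustrative, not a real SDK:

from dataclasses import dataclass, field

@dataclass
class SearcherContext:
    user_id: str
    friend_ids: list[str]       # resolved upstream by a social-graph service
    group_ids: list[str]        # private groups the searcher belongs to

@dataclass
class SearchRequest:
    query: str
    searcher_context: SearcherContext            # identity travels with the query
    filters: dict = field(default_factory=dict)  # author?, date_range?, location?
    cursor: str | None = None
    limit: int = 20

@dataclass
class SearchResponse:
    results: list[dict]         # [{post_id, snippet, score, ...}]
    next_cursor: str | None
    total_approx: int           # approximate by design; exact counts cost too much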

4. Core design: the retrieval architecture

This is the question. It’s where SDE II and SDE III answers diverge.

Three common approaches:

(a) Pure inverted index. A standard text-search engine (Elasticsearch, Solr, Vespa). Tokenize posts, build an inverted index, query with BM25, return top-K.

  • Pros: Mature tooling, well-understood, handles full-text recall well.
  • Cons: Ranks by text relevance, which is the wrong signal on social content. Filtering by visibility after ranking is catastrophic at scale.

(b) Inverted index plus auxiliary ranking store. Same as (a) but maintain a separate per-post “forward index” of ranking signals (author ID, timestamp, engagement counts) that the ranker consults after retrieval.

  • Pros: Ranking decoupled from text index. Signals can be updated independently.
  • Cons: Two stores to keep consistent. Ranker still runs over whatever the inverted index returns — visibility is a post-filter.

(c) Hybrid two-stage retrieval with visibility-aware scoping. Stage 1: inverted index returns a large candidate set (top 1,000 by BM25 + recency boost) scoped to posts the searcher can see. Stage 2: a learned ranker scores the candidates using social signals, returning the top K.

  • Pros: Separates recall (index) from precision (ranker) from visibility (scoping). Each concern has a dedicated system; each can scale and evolve independently.
  • Cons: More moving parts; three systems to keep coherent.

I’d pick (c) for this problem. The justification is topic-specific: social search is not web search. The signal users actually care about is “posts from people I know, recently, about this topic,” in roughly that order. Text match is the gate; social signals are the ranker. Visibility is not a filter — it’s part of the retrieval contract, because post-filtering at rank time means the system has to score 10× more posts than it returns, wasting compute and blowing the latency budget. This mirrors the two-stage retrieval pattern from the RAG walkthrough — retrieve broadly, rank precisely, scope before you rank.
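
The shape of option (c), sketched as the query path. The service handles (inverted_index, forward_index, ranker, acl) are stand-ins for the systems described below, not real APIs:

def search(query: str, ctx: SearcherContext, k: int = 20) -> list:
    # Resolve the visibility scope once, before retrieval ever runs.
    scope = ({"public"}
             | {f"friends_of:{fid}" for fid in ctx.friend_ids}
             | {f"group:{gid}" for gid in ctx.group_ids})

    # Stage 1, recall: the inverted index returns ~1,000 candidates,
    # already restricted to postings the searcher is allowed to see.
    candidates = inverted_index.retrieve_scoped(query, scope, limit=1_000)

    # Stage 2, precision: the learned ranker scores candidates on social signals.
    signals = forward_index.batch_get([c.post_id for c in candidates])
    scored  = ranker.score(query, ctx, candidates, signals)
    top     = sorted(scored, key=lambda r: r.score, reverse=True)[:k]

    # Defense-in-depth: a final ACL check before anything leaves the system.
    return [r for r in top if acl.can_see(ctx.user_id, r.post_id)]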

Baseline pipeline:

flowchart LR
  User[User query] --> SearchAPI[Search API]
  SearchAPI --> Retrieval[Retrieval · inverted index + visibility scope]
  Retrieval --> Ranker[Learned ranker]
  Ranker --> Results[Ranked results]

  Posts[New post] --> Indexer[Indexing pipeline]
  Indexer --> InvIdx[(Inverted index)]
  Indexer --> Forward[(Forward index · signals)]
  Indexer --> ACL[(Visibility index)]
  Retrieval --> InvIdx
  Retrieval --> ACL
  Ranker --> Forward

Everything else in this walkthrough is a deep dive on one of these boxes.

5. Data model and storage

Four stores, each doing what it’s good at.

posts              (source of truth)
  post_id (PK), author_id, content, created_at, updated_at,
  visibility_type, visibility_audience_id, media_refs, metadata

inverted_index     (text search)
  term → [posting: {post_id, position, frequency, timestamp}, ...]
  sharded by document (post_id range or hash)

forward_index      (per-post ranking signals)
  post_id → {author_id, created_at, like_count, comment_count,
             share_count, engagement_rate, ...}

visibility_index   (who can see what)
  post_id → visibility_set  (or inverse: user_id → accessible_post_ids)

Storage choices:

  • posts — a relational or document store (Postgres, MongoDB) is the editable source of truth. Updates to a post must propagate to the other three stores; this is the write that fans out.
  • inverted_index — a dedicated search engine (Elasticsearch, Vespa, or a Lucene-based custom stack). Explicitly named: the architecture doesn’t care which. Elasticsearch and Solr wrap Lucene; Vespa brings its own engine. The interesting decisions are sharding and near-real-time behavior, not the vendor.
  • forward_index — a KV or wide-column store (Cassandra, DynamoDB, Redis at hot tier). Point lookup by post_id returning a flat bag of signals. This is exactly the KV sweet spot.
  • visibility_index — the tricky one. For public-only systems this is trivial (every post is public). For social graphs, options diverge (see §9).

I’d say explicitly: “These are four cooperating stores, not one search database. The interviewer is testing whether you can keep them separate in your head — because every one of them has different access patterns, different update cadences, and different scaling profiles.”
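
The forward index access pattern, made concrete: one pipelined batch of point lookups keyed by post_id. Sketched against Redis with JSON values; any KV store with batch reads works the same way:

import json
import redis    # assumes the redis-py client; substitute your KV store of choice

r = redis.Redis()

def forward_batch_get(post_ids: list[str]) -> dict[str, dict]:
    # The ranker needs the whole candidate set before it can score anything,
    # so fetch all ~1,000 rows in a single pipelined round trip.
    pipe = r.pipeline()
    for pid in post_ids:
        pipe.get(f"fwd:{pid}")
    rows = pipe.execute()
    return {pid: json.loads(raw) for pid, raw in zip(post_ids, rows) if raw}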

6. Deep dive: the indexing pipeline

Ingestion is a streaming pipeline. Post creation writes to posts, then a change feed drives indexing into the other three stores.

Extending the architecture:

flowchart LR
  Post[Post creation] --> PostDB[(posts DB)]
  PostDB --> Kafka[(Kafka · post events)]
  Kafka --> Indexer[Indexer · tokenize + analyze]
  Indexer --> InvIdx[(Inverted index · hot segment)]
  Indexer --> Forward[(Forward index)]
  Indexer --> ACL[(Visibility index)]

  User[Search query] --> SearchAPI[Search API]
  SearchAPI --> Retrieval[Retrieval]
  Retrieval --> InvIdx
  Retrieval --> ACL
  Retrieval --> Ranker[Ranker]
  Ranker --> Forward

  classDef new fill:#eef2f1,stroke:#2c5f5d,stroke-width:2px,color:#1f2937;
  class Kafka,Indexer new

The interesting problem is freshness. Lucene-based engines use segment-based indexing: writes go to an in-memory segment that is periodically flushed to disk and merged with older segments. Segments are immutable once written; deletions are marked as tombstones.

Near-real-time (NRT) search means new posts become queryable before a durable commit: the engine periodically reopens a searcher over the in-memory segment, so a post indexed at T=0 becomes searchable within a few seconds, not at T=60s. The trade-off is that each refresh produces a small segment that must later be merged, so refreshing too aggressively inflates merge load; in practice, NRT pipelines refresh every 1–5 seconds and queries federate across the in-memory and on-disk segments.

At 50K/sec peak ingestion, the indexer pipeline needs to be parallelized across many indexer instances, each owning a subset of shards. Kafka partition-by-post_id gives us the parallelism for free; each indexer consumes its partitions independently.
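
One indexer instance, sketched with the kafka-python client. tokenize, extract_signals, and the three store writers are stand-ins for components described elsewhere in this design:

import json
from kafka import KafkaConsumer   # pip install kafka-python

consumer = KafkaConsumer(
    "post-events",
    group_id="indexer",                        # consumer group handles partition assignment
    value_deserializer=lambda b: json.loads(b),
)

for msg in consumer:            # messages keyed by post_id: same post, same partition
    post = msg.value
    tokens = tokenize(post["content"])         # analyzer: lowercase, stem, language-aware
    inverted_index.add(post["post_id"], tokens, post["created_at"])   # hot segment
    forward_index.put(post["post_id"], extract_signals(post))
    visibility_index.put(post["post_id"], post["visibility"])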

7. Deep dive: sharding strategy

The inverted index is tens of TB. No single node holds it; sharding is mandatory. Two sharding strategies, with materially different trade-offs:

Document sharding (by post_id). Each shard holds a subset of posts, plus the inverted index over just those posts. Queries scatter-gather: the coordinator fans the query out to every shard, each returns its local top-N, the coordinator merges.

Term sharding (by token). Each shard owns specific tokens and their full posting lists. Queries touch only the shards for the tokens in the query.

Dimension          | Document sharding                      | Term sharding
-------------------|----------------------------------------|-----------------------------------------------
Write throughput   | Excellent — writes go to one shard     | Poor — every write touches one shard per token
Query latency      | Scatter-gather over all shards         | Only the query’s tokens’ shards
Multi-word queries | Each shard scores locally, then merge  | Intersection across term shards
Hot partitions     | Natural spreading by post_id hash      | Common tokens become hot shards
Scaling            | Add more shards; rebalance posts       | Add more shards; rebalance tokens

Document sharding is the standard choice for web-scale search (Elasticsearch defaults to it, Google uses it). Term sharding is historically interesting but rarely the right answer at modern write rates. I’d commit to document sharding, and name the tail-latency cost as the thing to mitigate: the slowest shard determines query latency, so shard-level p99 matters more than cluster-level p50.

Mitigations for tail latency:

  • Request hedging. Send the query to a shard replica after the p95 latency mark if the primary hasn’t returned (see the sketch after this list).
  • Shard-level timeouts with partial results. Return what came back in 200ms; drop slow shards. The UX is “your results might be slightly incomplete,” which beats a timeout error.
  • Replica selection by load. Route to the least-loaded replica of each shard, not round-robin.
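
A minimal asyncio sketch of the first two mitigations: hedged per-shard requests plus a scatter-gather deadline that returns partial results. query_shard is a stand-in for the per-shard RPC, assumed to return hits with a .score field; the timing constants are illustrative:

import asyncio

HEDGE_AFTER = 0.05   # roughly p95 shard latency; fire the backup request after this
DEADLINE    = 0.20   # shards slower than 200ms get dropped; partial results beat errors

async def query_with_hedge(shard_id: int, query: str):
    primary = asyncio.create_task(query_shard(shard_id, query, replica=0))
    try:
        # shield() keeps the primary running even if the hedge timer fires first
        return await asyncio.wait_for(asyncio.shield(primary), HEDGE_AFTER)
    except asyncio.TimeoutError:
        backup = asyncio.create_task(query_shard(shard_id, query, replica=1))
        done, pending = await asyncio.wait({primary, backup},
                                           return_when=asyncio.FIRST_COMPLETED)
        for task in pending:
            task.cancel()
        return done.pop().result()

async def scatter_gather(query: str, shard_count: int, k: int):
    tasks = [asyncio.create_task(query_with_hedge(s, query))
             for s in range(shard_count)]
    done, pending = await asyncio.wait(tasks, timeout=DEADLINE)
    for task in pending:
        task.cancel()                  # drop slow shards, keep what came back
    hits = [h for t in done if t.exception() is None for h in t.result()]
    return sorted(hits, key=lambda h: h.score, reverse=True)[:k]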

8. Deep dive: two-stage ranking

Retrieval and ranking are separate. Retrieval returns a recall-optimized candidate set; ranking picks the top-K from it with richer signals.

Stage 1: retrieval

The inverted index returns the top ~1,000 candidates for the query using:

  • BM25 — standard text-relevance score. Baseline recall.
  • Recency boost — a time-decay factor applied to BM25. A post from an hour ago beats a textually-identical post from a year ago for most social queries.
  • Filter predicates — author/date/location filters applied efficiently at the posting-list level.

1,000 candidates is enough recall that the ranker has material to work with, but small enough that the ranker’s per-candidate cost is bounded.
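
The recency boost is usually an exponential time decay multiplied into the text score. A sketch; the 24-hour half-life is an assumed tunable, not a magic number:

import time

HALF_LIFE_HOURS = 24.0   # assumed: a day-old post scores half a fresh one, all else equal

def stage1_score(bm25: float, created_at: float, now: float | None = None) -> float:
    now = now or time.time()
    age_hours = max(0.0, (now - created_at) / 3600)   # clamp: no boost for "future" posts
    return bm25 * 0.5 ** (age_hours / HALF_LIFE_HOURS)

The max(0, ...) clamp is the cheap half of the clock-skew defense in §10: a mis-stamped post from the future gets no extra boost.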

Stage 2: ranking

A learned ranker (gradient-boosted trees like LightGBM, or a shallow neural net) scores each candidate using features the forward index provides:

  • Social affinity. Is the author a friend? A friend-of-friend? Interacted-with recently?
  • Engagement signals. Post like/comment/share counts, recent velocity.
  • Query-post match features. BM25 itself, title-match, hashtag match.
  • Content features. Media type, language, post length.
  • Searcher features. Language, region, recent search history.

The ranker outputs a score per candidate; the top K (say 20) are returned. The ranker model is trained offline on click-through data and deployed as a serving artifact.
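
Feature assembly for a single candidate might look like the following; the feature names are illustrative, and model.predict stands in for the deployed LightGBM artifact:

import math

def ranker_features(ctx, candidate, signals) -> list[float]:
    return [
        candidate.bm25,                                           # query-post text match
        signals["age_hours"],                                     # recency
        1.0 if signals["author_id"] in ctx.friend_ids else 0.0,   # direct friend
        float(signals.get("recent_interactions", 0)),             # affinity strength
        math.log1p(signals.get("like_count", 0)),                 # engagement, log-scaled
        math.log1p(signals.get("comment_count", 0)),
        signals.get("engagement_velocity", 0.0),                  # likes per hour, last hour
    ]

# At serve time, one batched call over the whole candidate set:
# scores = model.predict([ranker_features(ctx, c, sig[c.post_id]) for c in candidates])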

I’d say out loud: “The ranker is where the product lives. Retrieval is plumbing. At interview scope I’ll name the features and the training signal, but I’ll defer the full ML system design — model training, feature pipelines, online eval — to a sibling system design question.”

9. Deep dive: visibility and ACLs

The hardest problem in the whole system, and the one most candidates underprepare for. A naive design runs the query, ranks the results, then filters by visibility. At scale this is catastrophic: if only 10% of your candidates are visible to the searcher, the ranker wastes 90% of its compute, and worse, the top K returned might be empty after filtering.

The correct design scopes visibility at retrieval time, before ranking.

Three techniques, typically combined:

Technique 1: index-level visibility tags. Each posting in the inverted index carries a visibility scope field: public, or a small-cardinality identifier (friends_of_user_X, group_Y_members). The query passes the searcher’s identity + relevant group IDs, and the index returns only postings whose scope intersects the searcher’s set. Works cleanly for public + group-scoped posts.

Technique 2: per-user index shards for restricted content. For “friends-only” posts — where the audience is a set that varies per poster — maintain a smaller, denser index scoped to each user’s “friends’ posts” and query it in parallel with the public index. The coordinator merges. This is how Meta’s Unicorn system scales friend-graph search.

Technique 3: post-hoc visibility check as defense-in-depth. Even with scoping, run a final visibility check before returning results. Never rely on a single filter to protect private content; compliance and trust demand belt-and-suspenders.
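
Technique 1, made concrete: the searcher's scope set becomes a filter clause on the index query, so scoping happens before scoring. Sketched as Elasticsearch-style query DSL, assuming a keyword field named visibility_scope on each indexed post:

def scoped_query(text: str, ctx) -> dict:
    scope = (["public"]
             + [f"friends_of:{fid}" for fid in ctx.friend_ids]
             + [f"group:{gid}" for gid in ctx.group_ids])
    return {
        "query": {
            "bool": {
                "must":   [{"match": {"content": text}}],            # BM25 text match
                "filter": [{"terms": {"visibility_scope": scope}}],  # scope BEFORE scoring
            }
        },
        "size": 1000,   # stage-1 candidate set for the ranker
    }

For high-degree users the friends term list gets large; that pressure is exactly what pushes friends-only content toward Technique 2's per-user indexes.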

Extending the architecture once more:

flowchart LR
  Post[Post creation] --> PostDB[(posts DB)]
  PostDB --> Kafka[(Kafka)]
  Kafka --> Indexer[Indexer]
  Indexer --> InvIdxPub[(Public inverted index)]
  Indexer --> InvIdxFriends[(Per-user friend-scoped index)]
  Indexer --> Forward[(Forward index)]
  Indexer --> ACL[(Visibility index)]

  User[Search query] --> SearchAPI[Search API]
  SearchAPI --> Retrieval[Retrieval coordinator]
  Retrieval --> InvIdxPub
  Retrieval --> InvIdxFriends
  Retrieval --> Merge[Merge + ACL check]
  Merge --> ACL
  Merge --> Ranker[Ranker]
  Ranker --> Forward
  Ranker --> Final[Final ACL check]

  classDef new fill:#eef2f1,stroke:#2c5f5d,stroke-width:2px,color:#1f2937;
  class InvIdxFriends,Merge,Final new

The failure mode to name explicitly: visibility leak. If a private post appears in a stranger’s results, that’s not a bug to fix later; that’s a reportable incident. The three-layer design (index scoping + ACL lookup + final check) is expensive precisely because one layer is not enough.

10. Failure modes

I’d proactively walk through what breaks. This is one of the clearest differentiation signals.

  • Indexer backlog. Kafka absorbs; new posts aren’t searchable until the indexer catches up. Freshness degrades; search still works. Mitigation: alarm on indexer lag; scale indexer instances horizontally; partition-level backpressure.
  • Shard loss. One index shard goes offline. Scatter-gather now returns partial results. Mitigation: replicate every shard 3×; return partial results with a staleness indicator rather than failing the query.
  • Hot-query DDoS. A trending query (“celebrity name” post-controversy) melts specific shards. Mitigation: query-level caching at the edge for the top 1% of queries; short TTL (seconds).
  • Visibility leak. The worst failure mode. A bug in the scoping layer exposes a private post to a non-friend. Detection: adversarial eval set of known private posts, run continuously in production, page on any positive. Recovery: immediate rollback of the suspect change; audit log of every search containing leaked content.
  • Ranker model regression. A new model ships with a scoring bug; top results go stale. Mitigation: canary rollout, online A/B with a small fraction of traffic; automatic rollback if engagement metrics drop past a threshold.
  • Stale forward index. Engagement counts lag; ranker uses outdated signals. Usually acceptable (approximate ranking); becomes incident only if completely frozen. Mitigation: alarm on forward-index write lag; gossip updates from the engagement service directly.
  • Clock skew on post timestamps. Recency boost relies on created_at; a client with wrong clock injects a “post from the future” that dominates rankings. Mitigation: server-stamp the authoritative timestamp; cap recency boost at the server clock.

The pattern to notice: name what fails, name what degrades gracefully, name what doesn’t. Visibility leak is the “what doesn’t” — every other failure has a graceful degradation, visibility doesn’t.
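
The continuous adversarial eval from the visibility-leak bullet, sketched. The probe tuples, load_searcher_context, and page_oncall are stand-ins; the point is the shape of the check:

CANARY_PROBES = [
    # (searcher who must NOT see it, known-private post, query that matches it)
    ("stranger_user_id", "private_post_id", "matching query text"),
    # ... a curated set, refreshed as posts and privacy settings change
]

def run_visibility_canary():
    for searcher_id, private_post_id, query in CANARY_PROBES:
        ctx = load_searcher_context(searcher_id)   # a real non-friend, by construction
        hits = search(query, ctx, k=100)
        if any(h.post_id == private_post_id for h in hits):
            page_oncall(f"VISIBILITY LEAK: {private_post_id} "
                        f"returned to {searcher_id}")   # incident, not a ticket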

11. Tradeoffs and alternatives

Briefly, three alternatives I considered and rejected.

  • Putting everything in one Elasticsearch cluster. Rejected because it conflates retrieval, ranking, and visibility. You can make ES do all three, but the ranker and the visibility scoper are better served as separate systems. Using ES just for inverted-index retrieval is the right scope.
  • Precomputing per-user search results. For a fixed set of popular queries, sometimes worth it — but at user × query cardinality it’s an explosion. Mentioned only because candidates occasionally propose it.
  • Serving from the main posts DB with LIKE queries. Works at small scale, dies above ~1M posts. Worth naming so the interviewer knows you know it’s not the answer.

12. Extensions and what I’d skip

Likely follow-ups worth a few sentences on each:

  • Cross-region. Per-region ingestion and per-region indexes; searches route to the searcher’s region. The visibility index is global (or globally replicated) because friend graphs are global. Cross-region is mostly a problem about replicating raw posts, which is the easy half.
  • Autocomplete. A sibling system, typically a separate in-memory trie / FST over the query log. Not the same engine as full-post search. Worth naming as a distinct design question.
  • Personalization beyond friend graph. Interest-based ranking (the viewer likes sports posts more than politics) is a ranker feature, not a separate system; worth naming it as such and moving on.
  • Trending-query cache. The top 0.1% of queries get sub-second cached responses. Belongs at the edge, refreshed every few seconds.

Explicit defers:

  • Full ML ranking pipeline. Feature store, training pipeline, online eval, model monitoring. An entire separate design question.
  • Spam and abuse filtering at ingestion. Real; deferrable.
  • Localization and cross-lingual search. Real; deferrable.
  • Query-understanding layer (entity recognition, intent classification). A pre-retrieval layer in mature systems; a future v2.

13. Wrap-up

One crisp sentence before the interviewer’s next question:

“This design treats post search as three cooperating systems — retrieval, ranking, visibility — each with its own scaling profile. The architectural wins all come from keeping them separate; the architectural failures all come from collapsing them back into one.”

That’s the kind of framing that lands.

What separates SDE II from SDE III on this question

  • SDE II usually lands an Elasticsearch cluster, describes tokenization and BM25, proposes sharding by post_id, and sketches a visibility post-filter.
  • SDE III drives scoping in the first five minutes, separates retrieval from ranking as a first-class design decision, names visibility as a retrieval-time constraint and walks through the three-layer scoping pattern, handles tail latency in scatter-gather explicitly, and names visibility leak as an incident-class failure mode with continuous adversarial eval.

The differentiator isn’t tool knowledge. It’s whether you treat this as “search” or as “search plus ranker plus ACL, three systems cooperating.”
