Walkthrough: Designing an Ad Click Aggregation System

A full candidate's-eye walkthrough of the ad click aggregation system design question — from ingestion and streaming windows through exactly-once semantics, anti-entropy, failure modes, and cross-region extensions.

The problem

You’re asked to design the backend that powers an ad network’s click analytics — Google Ads, Meta Ads, the Trade Desk. The interviewer says something like: “Design a system that ingests ad click events and lets advertisers query, in near real time, how many clicks their ads got — grouped by ad, campaign, country, time bucket.”

Sounds like an analytics dashboard. It isn’t. This is the canonical “streaming aggregation at scale” problem, and the traps are in three places at once: exactly-once counting (clicks drive billing, so duplicates and losses are both expensive), out-of-order and late events (mobile clients buffer, then flush), and the tension between a fast streaming path and a correct batch-reconciled truth. The interviewer is watching for whether you treat this as a Lambda-architecture problem or a naive “Kafka → DB” pipe.

Below is how I’d walk through this, roughly in the order I’d speak the words.

1. Clarify before you design

First 3–5 minutes. Resist the urge to start drawing.

Questions I’d ask:

  • Scale. How many clicks per day? Ad networks cite numbers anywhere from 100M/day (small DSP) to 10B+/day (Google). The architecture is different on each side of ~1B/day.
  • Query granularity. What’s the finest time bucket advertisers can query — by minute, hour, day? Finest granularity drives the aggregate row count.
  • Query dimensions. What can they group by? Ad, campaign, country, device, publisher, user segment? Every dimension is a potential cube face.
  • Freshness SLA. How fresh do aggregates need to be — seconds, minutes, an hour? This is the single biggest architecture driver.
  • Exactly-once or approximate? Clicks drive billing. A 1% duplicate rate is real money. This usually pushes toward exactly-once semantics, not at-least-once.
  • Late events. How late can a click arrive and still be counted? Mobile clients flush after reconnect; five minutes is common, an hour is realistic.
  • Fraud/invalid filtering. Do we filter bots and invalid clicks in the hot path, or later? Affects where dedup lives.

The freshness SLA and late-event tolerance matter most — together they define the shape of the windowing system. The billing implication matters because it closes the door on approximate-counting shortcuts.

Say the interviewer confirms: ~10B clicks/day, query down to 1-minute buckets, group by ad × country × device, aggregates fresh within ~1 minute, exactly-once required (billing), late events accepted up to 1 hour, fraud filtering out of scope for this interview.

2. Capacity estimate

Brief. The point is to size the problem, not to be precise.

  • 10B/day ≈ 115K events/sec average, ~1M/sec peak (ad traffic is diurnal and correlated with TV ad breaks, Super Bowl, etc.)
  • Event size ~500 bytes (ad_id, campaign_id, user fingerprint, geo, device, referrer, timestamp, signed token). 10B × 500 B ≈ 5 TB/day raw.
  • At 1-minute granularity × (say) 10M active ads × 200 countries × 5 device buckets, the full dimensional cross product is enormous, but the number of populated cells is bounded by the event count, ~10^10/day worst case. In practice many clicks share a cell, so roughly 1% of that bound is populated: call it ~100M aggregate rows/day.
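
The arithmetic is worth sanity-checking out loud. A back-of-the-envelope sketch in Python; the inputs are the assumptions confirmed in §1, and the peak-to-average ratio is a guess:

# Back-of-the-envelope sizing; inputs are the assumptions from the clarification step.
clicks_per_day = 10_000_000_000              # 10B clicks/day
event_bytes    = 500                         # per-event payload estimate

avg_eps  = clicks_per_day / 86_400           # ~115,740 events/sec average
peak_eps = avg_eps * 8                       # assumed peak-to-average ratio; call it ~1M/sec
raw_tb   = clicks_per_day * event_bytes / 1e12   # ~5 TB/day raw
agg_rows = clicks_per_day * 0.01             # ~1% of the event-count bound is populated

print(f"{avg_eps:,.0f} ev/s avg | ~{peak_eps:,.0f} ev/s peak | "
      f"{raw_tb:.1f} TB/day raw | ~{agg_rows:,.0f} aggregate rows/day")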

I’d say out loud: “This tells me three things. One — ingestion is the firehose, not the query side. Two — we cannot scan raw events at query time, we must pre-aggregate. Three — storing 5 TB/day raw is fine for a month, expensive for a year; retention will matter.”

3. API design

Two surfaces. Ingestion (from ad servers, extremely high volume) and query (from advertiser dashboards, low volume but sensitive to latency).

POST /ingest/click
  body:    { event_id, ad_id, campaign_id, user_id, country, device,
             publisher_id, client_ts, signed_token }
  returns: 202 Accepted

GET /query/clicks
  params:  ad_id | campaign_id, from_ts, to_ts, granularity, group_by[]
  returns: [{ bucket_ts, dimensions{...}, click_count }, ...]

Two decisions worth calling out:

  • event_id is client-generated, server-validated. This is the idempotency key the entire exactly-once story hangs on. Without it, retries duplicate clicks and advertisers get billed twice. I’d make it a UUID the ad-server SDK generates at impression time.
  • 202 Accepted, not 200 OK. Ingestion is asynchronous by design — we acknowledge receipt once the event is durably in Kafka, not once it’s aggregated. This is a deliberate SLA choice, worth saying aloud.
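
To make the idempotency decision concrete, here is a minimal sketch of the ad-server SDK side: the event_id is minted exactly once at click time and reused verbatim across retries, so server-side dedup makes retries harmless. The endpoint, retry budget, and use of requests are illustrative assumptions, not a real SDK.

import time
import uuid
import requests  # illustrative HTTP client; a real SDK would bring its own transport

INGEST_URL = "https://ads.example.com/ingest/click"  # hypothetical endpoint

def record_click(ad_id, campaign_id, user_id, country, device, publisher_id, signed_token):
    # Mint the idempotency key exactly once, at click time. Every retry reuses it,
    # so server-side dedup by event_id makes retries harmless.
    event = {
        "event_id": str(uuid.uuid4()),
        "ad_id": ad_id,
        "campaign_id": campaign_id,
        "user_id": user_id,
        "country": country,
        "device": device,
        "publisher_id": publisher_id,
        "client_ts": int(time.time() * 1000),
        "signed_token": signed_token,
    }
    for attempt in range(5):                     # bounded retry budget (assumption)
        try:
            resp = requests.post(INGEST_URL, json=event, timeout=2)
            if resp.status_code == 202:          # durably in Kafka, not yet aggregated
                return True
        except requests.RequestException:
            pass
        time.sleep(2 ** attempt)                 # exponential backoff
    return False                                 # caller may spool to disk and retry later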

4. Core design: streaming aggregation with exactly-once semantics

This is the question. It’s where SDE II and SDE III answers diverge.

Three common approaches:

(a) Batch only. Dump raw events to S3, run hourly Spark job, write aggregates to a warehouse.

  • Pros: Simple, easy to make correct, cheap.
  • Cons: Fails the 1-minute freshness SLA by two orders of magnitude. Non-starter for the stated requirement.

(b) Naive streaming. Consumers read Kafka, increment counters in Redis or Cassandra using INCR-style operations.

  • Pros: Near-real-time.
  • Cons: At-least-once delivery means duplicates. Counters drift. On consumer crash and replay, you double-count. For billing data this is unacceptable.

(c) Windowed stream processor with checkpointed state. A stream engine like Apache Flink (or Kafka Streams) reads from Kafka, groups events into tumbling windows by (ad_id, country, device, 1-minute bucket), and emits aggregates to an OLAP store. Exactly-once is achieved through the combination of Kafka transactional producers and Flink’s checkpoint barriers — the read offset, the in-flight aggregate state, and the downstream write commit all move forward atomically.

  • Pros: Near-real-time, exactly-once, handles late events via watermarks.
  • Cons: Operationally heavier. State backend must be sized.

I’d pick (c), and I’d walk through why exactly-once actually works, not just name it — this is a topic where hand-waving is the failure mode.

How the exactly-once windowing actually works

Walk through a single event’s lifecycle:

  1. Ingestion. Ad server posts {event_id: u-42, ad_id: A, country: US, device: mobile, client_ts: 10:00:15} to the /ingest API. Gateway validates the signed token, stamps server_ts, and produces to the clicks Kafka topic partitioned by ad_id. Kafka write is transactional.
  2. Dedup. The Flink job reads from Kafka. A KeyedProcessFunction keyed on event_id checks a short-lived RocksDB-backed state (TTL 1 hour — matches our late-event window). If u-42 is already seen, drop. Otherwise store it and pass through.
  3. Windowing. Event enters a tumbling window keyed by (ad_id, country, device), bucket size 1 minute. The window’s key here is (A, US, mobile, 10:00).
  4. Watermark. Flink tracks a watermark — the timestamp below which we believe all events have arrived. With a 1-hour allowed lateness, the window for 10:00–10:01 closes at watermark ≥ 11:01, then emits the count.
  5. Commit. The emitted aggregate ((A, US, mobile, 10:00) = 4,812) is written to the OLAP store via a transactional sink. The Kafka consumer offset advances in the same checkpoint as the state snapshot and the sink commit. On crash, we restart from the last checkpoint and replay exactly the events between that offset and the failure point — no double count, no loss.

The state we keep per window is just {(ad_id, country, device, bucket) → count} plus the dedup set. At 10M active ads this is ~100 GB of RocksDB state, comfortably sharded across Flink task managers.

Late events arriving after window close are either (a) dropped (if beyond allowed lateness) or (b) routed to a side output that triggers a correction write to the OLAP store. The batch reconciliation layer (§8) catches anything that slipped past.
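
To ground the lifecycle above, here is a single-process Python sketch of the dedup → window → watermark logic the Flink job runs. It stands in for the Flink operators rather than reproducing the Flink API; the in-memory dicts play the role of RocksDB state, and the 1-minute bucket and 1-hour lateness are the figures from the requirements.

from collections import defaultdict

BUCKET_MS   = 60_000        # 1-minute tumbling windows
LATENESS_MS = 3_600_000     # 1 hour of allowed lateness

seen_event_ids = {}                 # event_id -> expiry ts; stands in for TTL'd RocksDB state
open_windows   = defaultdict(int)   # (ad_id, country, device, bucket_start) -> count
watermark      = 0                  # highest event time seen, minus allowed lateness

def process(event):
    """Dedup one click, add it to its tumbling window, close windows the watermark passed."""
    global watermark
    ts = event["client_ts"]

    # 1. Dedup by event_id (expiry cleanup omitted here; Flink would use state TTL).
    if event["event_id"] in seen_event_ids:
        return []
    seen_event_ids[event["event_id"]] = ts + LATENESS_MS

    # 2. Assign to the (ad, country, device, minute-bucket) cell.
    bucket = ts - (ts % BUCKET_MS)
    key = (event["ad_id"], event["country"], event["device"], bucket)

    # 3. If the watermark already closed this bucket, route to the correction side output.
    if bucket + BUCKET_MS <= watermark:
        return [("late_correction", key, 1)]
    open_windows[key] += 1

    # 4. Advance the watermark and emit finals for every window it has now passed.
    watermark = max(watermark, ts - LATENESS_MS)
    closed = [k for k in open_windows if k[3] + BUCKET_MS <= watermark]
    return [("final", k, open_windows.pop(k)) for k in closed]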

5. Data model and storage

Three stores, each doing what it’s good at.

raw_clicks  (append-only event log, retention 30 days)
  event_id (PK), ad_id, campaign_id, user_id, country, device,
  publisher_id, client_ts, server_ts, signed_token

aggregates  (query-serving, 1-minute granularity)
  bucket_ts, ad_id, country, device  → click_count
  partition: (ad_id, bucket_day)
  clustering: (bucket_ts, country, device)

cold_raw  (S3/GCS, columnar parquet, retention 2 years)
  same shape as raw_clicks, partitioned by date + hour

Storage choices:

  • raw_clicks — Kafka itself, with 30-day retention, serves as the short-term event log. Kafka is already the ingestion buffer; making it also the replayable source of truth for the streaming job avoids a second write.
  • aggregates — an OLAP store. I’d pick Apache Druid or ClickHouse. The access pattern — “give me sum of click_count for ad_id X, time range T, grouped by country” — is exactly what columnar OLAP is built for. A relational DB collapses at 100M rows/day of insert plus sub-second group-by queries; a KV store can’t group-by at all.
  • cold_raw — Parquet on S3. Cheap. Queried by the batch reconciliation job and by ad-hoc analytics via Presto/Trino. The raw events are the audit trail that billing disputes resolve against — they’re the only store we promise is complete.

I’d say explicitly: “I want three tiers here because they answer three different questions — ‘what just happened’ (Kafka), ‘what do advertisers see on the dashboard’ (Druid), and ‘what is the truth’ (S3 raw). The streaming aggregate is fast but approximate; the batch reconciliation makes it correct.”
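
On the query side, the API from §3 translates almost mechanically into a group-by over the aggregates table. A minimal sketch, assuming a SQL-speaking OLAP store with pyformat-style parameters; table and column names are the ones from the data model above, while the dialect and client are assumptions:

ALLOWED_GROUP_BY = {"country", "device"}   # whitelist; never interpolate raw input into SQL

def build_clicks_query(ad_id, from_ts, to_ts, group_by):
    """Translate GET /query/clicks params into a group-by over the 1-minute aggregates.

    Returns (sql, params) for whatever OLAP client executes it. Coarser granularities
    (hour, day) would re-bucket bucket_ts before grouping.
    """
    dims = [d for d in group_by if d in ALLOWED_GROUP_BY]
    select_dims = "".join(d + ", " for d in dims)
    group_cols = ", ".join(["bucket_ts"] + dims)
    sql = (
        "SELECT bucket_ts, " + select_dims + "sum(click_count) AS clicks "
        "FROM aggregates "
        "WHERE ad_id = %(ad_id)s AND bucket_ts >= %(from_ts)s AND bucket_ts < %(to_ts)s "
        "GROUP BY " + group_cols + " ORDER BY bucket_ts"
    )
    return sql, {"ad_id": ad_id, "from_ts": from_ts, "to_ts": to_ts}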

6. The write path

This is the firehose. Walk it end to end.

flowchart LR
  AdServer[Ad server / SDK] --> GW[Ingest gateway]
  GW --> Kafka[(Kafka · clicks topic)]
  Kafka --> Flink[Flink aggregator]
  Flink --> Druid[(Druid · aggregates)]
  Dashboard[Advertiser dashboard] --> QueryAPI[Query API]
  QueryAPI --> Druid

A few layers worth naming:

  • Ingest gateway. Stateless, horizontally scalable. Validates the signed token (prevents forged clicks), stamps server_ts, produces to Kafka. Returns 202 as soon as the Kafka write is acknowledged with acks=all.
  • Kafka clicks topic. Partitioned by ad_id — more on this in §7. Replication factor 3, min.insync.replicas=2. This is the durable buffer; if downstream is slow, Kafka absorbs.
  • Flink aggregator. Reads from Kafka, does the dedup → window → emit flow described in §4. Checkpoint interval 30 seconds; the checkpoint is what makes exactly-once real.
  • Druid. Ingests the aggregate rows. Advertiser queries land here, answered in sub-second from the columnar index.

Ingest gateway capacity at 1M events/sec peak with ~500-byte events is ~500 MB/sec network in, same out to Kafka. A few hundred gateway instances behind an L7 load balancer handle this; the bottleneck in practice is Kafka partition count, not gateway CPU.
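
A minimal sketch of the gateway hot path, assuming the confluent-kafka Python client; the broker address, topic name, and token-validation hook are illustrative, and a production gateway would batch across requests rather than block per click as this sketch does:

from confluent_kafka import Producer
import json, time

producer = Producer({
    "bootstrap.servers": "kafka:9092",    # illustrative address
    "acks": "all",                        # don't ack the client until min.insync.replicas have it
    "enable.idempotence": True,           # broker-side retries can't duplicate the produce
    "linger.ms": 5,                       # small batching window for throughput
})

def handle_click(event, token_is_valid):
    """Gateway hot path: validate the signed token, stamp server_ts, produce keyed by ad_id."""
    if not token_is_valid(event.get("signed_token"), event):
        return 403                        # forged or missing token: reject, never produce

    event["server_ts"] = int(time.time() * 1000)
    delivered = {"ok": False}

    def on_delivery(err, msg):
        delivered["ok"] = err is None

    producer.produce(
        topic="clicks",
        key=event["ad_id"].encode(),      # partition by ad_id (hot-key salting comes in §7)
        value=json.dumps(event).encode(),
        on_delivery=on_delivery,
    )
    producer.flush(5)                     # wait (bounded) for the acks=all confirmation
    return 202 if delivered["ok"] else 503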

7. Sharding and hot-key mitigation

Kafka’s clicks topic is partitioned by ad_id. Flink’s keyed state uses the same key. This gives us linear scaling — add partitions and Flink task managers together.

Hot-partition risk is real here. During the Super Bowl, one ad might get 10% of total traffic. A naive hash(ad_id) lands all of it on one partition, which pins one Flink task manager at 100% while the rest idle.

Mitigation: salt the hot keys. Detect ads in the top 0.01% by volume (a running count-min sketch in the gateway is enough), and for those, use ad_id + random_suffix(0..N) as the partition key. Downstream, the Flink job merges the salted sub-aggregates back into the real ad_id in a second stage. N is chosen per-ad based on observed load — typically 4–16 is plenty.
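
A sketch of the salting decision as it might run in the gateway. The count-min sketch is a toy implementation, and the hot-ad threshold and bucket count are illustrative assumptions:

import hashlib
import random

class CountMinSketch:
    """Tiny count-min sketch: approximate per-ad_id click counts in fixed memory."""
    def __init__(self, width=16384, depth=4):
        self.width, self.depth = width, depth
        self.tables = [[0] * width for _ in range(depth)]

    def _rows(self, key):
        for d in range(self.depth):
            h = hashlib.blake2b(f"{d}:{key}".encode(), digest_size=8).digest()
            yield d, int.from_bytes(h, "big") % self.width

    def add(self, key):
        for d, i in self._rows(key):
            self.tables[d][i] += 1

    def estimate(self, key):
        return min(self.tables[d][i] for d, i in self._rows(key))

sketch = CountMinSketch()
HOT_THRESHOLD = 50_000   # estimated clicks before an ad counts as "hot" (assumption; reset periodically in practice)
SALT_BUCKETS  = 8        # N salted sub-keys for hot ads; tuned per observed load

def partition_key(ad_id):
    """Normal ads key by ad_id; hot ads spread across N salted sub-keys.

    Stage 2 of the Flink job re-keys on the unsalted ad_id and sums the sub-aggregates.
    """
    sketch.add(ad_id)
    if sketch.estimate(ad_id) > HOT_THRESHOLD:
        return f"{ad_id}#{random.randrange(SALT_BUCKETS)}"
    return ad_id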

Extending the architecture:

flowchart LR
  AdServer[Ad server / SDK] --> GW[Ingest gateway]
  GW --> Kafka[(Kafka · clicks topic · N partitions)]
  Kafka --> Flink[Flink stage 1 · salted keys]
  Flink --> Flink2[Flink stage 2 · merge salt]
  Flink2 --> Druid[(Druid · aggregates)]
  Dashboard[Advertiser dashboard] --> QueryAPI[Query API]
  QueryAPI --> Druid

  classDef new fill:#eef2f1,stroke:#2c5f5d,stroke-width:2px,color:#1f2937;
  class Flink2 new

The second Flink stage is key-partitioned on the unsalted ad_id and sums the N salted buckets for each 1-minute window. This costs us one extra shuffle but keeps the hot partitions from melting.
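
The merge stage itself is simple. A sketch of the re-keying logic, in plain Python standing in for the second Flink stage:

from collections import defaultdict

def merge_salted(sub_aggregates):
    """Collapse salted sub-aggregates back onto the real ad_id.

    sub_aggregates: iterable of ((salted_ad_id, country, device, bucket_ts), count)
    from stage 1; salted keys look like "ad123#5".
    """
    merged = defaultdict(int)
    for (salted_ad_id, country, device, bucket_ts), count in sub_aggregates:
        ad_id = salted_ad_id.split("#", 1)[0]     # strip the salt suffix, if any
        merged[(ad_id, country, device, bucket_ts)] += count
    return dict(merged)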

8. Anti-entropy: the batch reconciliation layer

This is where the Lambda-architecture instinct pays off. The streaming layer is fast but has failure modes that subtly corrupt aggregates — late events past the allowed lateness window, a bug in the dedup logic, a Flink job that was restarted from a slightly stale checkpoint during an incident. None of these are caught by the streaming layer itself, by construction.

The fix: an hourly batch job over cold_raw in S3 that recomputes aggregates for completed hours and overwrites the rows the streaming path previously wrote to Druid.

flowchart LR
  AdServer[Ad server / SDK] --> GW[Ingest gateway]
  GW --> Kafka[(Kafka · clicks topic)]
  Kafka --> Flink[Flink stage 1]
  Flink --> Flink2[Flink stage 2 · merge salt]
  Flink2 --> Druid[(Druid · aggregates)]
  Dashboard[Advertiser dashboard] --> QueryAPI[Query API]
  QueryAPI --> Druid

  Kafka -.->|tiered storage| S3[(S3 · cold_raw parquet)]
  S3 --> Batch[Spark batch job · hourly]
  Batch -->|reconciled rows| Druid

  classDef new fill:#eef2f1,stroke:#2c5f5d,stroke-width:2px,color:#1f2937;
  class S3,Batch new

How reconciliation actually works:

  1. Kafka tiered storage (or a Kafka Connect S3 sink) lands raw events as Parquet in S3, partitioned by dt=YYYY-MM-DD/hour=HH/.
  2. Every hour, a Spark job reads the hour that just closed, does the same dedup-by-event_id + group-by-(ad_id, country, device, bucket_ts) the streaming job does, and batch-ingests the result into Druid, replacing that hour's data wholesale.
  3. Advertiser queries over buckets older than ~1 hour therefore read only batch-written data. Queries over fresher buckets read the streaming rows. The query layer doesn't have to distinguish: for any interval, Druid serves whichever write landed most recently.
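
A minimal PySpark sketch of step 2, under the assumptions above. Paths are illustrative, and the final hand-off would go through Druid's batch ingestion rather than the plain Parquet write shown here:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("click-reconciliation").getOrCreate()

def reconcile_hour(dt: str, hour: str):
    """Recompute exact aggregates for one closed hour from the raw S3 parquet."""
    raw = spark.read.parquet(f"s3://ads-cold-raw/clicks/dt={dt}/hour={hour}/")

    reconciled = (
        raw
        .dropDuplicates(["event_id"])                       # same dedup rule as the streaming job
        .withColumn("bucket_ts", F.date_trunc("minute", F.col("client_ts")))
        .groupBy("ad_id", "country", "device", "bucket_ts")
        .agg(F.count(F.lit(1)).alias("click_count"))
    )

    # Stage the reconciled rows; a downstream Druid batch-ingestion task replaces the
    # hour's data with these rows, overwriting whatever the streaming path wrote.
    (reconciled.write
        .mode("overwrite")
        .parquet(f"s3://ads-reconciled/aggregates/dt={dt}/hour={hour}/"))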

The property this gives us: the streaming path can be wrong, and we’ll still be eventually correct. Billing reads from aggregates older than 1 hour, which means billing always reads reconciled data. Dashboard reads from the streaming tail, which is “probably right, within seconds-to-minutes.” That split matches what advertisers and billing actually need.

I’d say: “This is a Lambda architecture. Kappa is philosophically cleaner — just replay the stream — but for billing data where we have strict correctness requirements and cheap batch compute already, the explicit batch layer is worth its operational weight.”

9. Failure modes

I’d proactively walk through what breaks, because this is one of the clearest differentiation signals. Don’t wait for the interviewer to ask.

  • Ingest gateway overloaded. Shed load at the LB by returning 503 for a slice of traffic. The ad-server SDK retries with backoff, so most shed events come back; clicks are only lost if shedding outlasts the SDK's retry budget. Because the payload carries client_ts, late retries still attribute to the correct window.
  • Kafka partition unavailable. With acks=all and min.insync.replicas=2, a single broker loss is invisible. Losing two of a partition's three replicas stalls writes for that ad_id range. Gateway queues briefly in memory (bounded) and rejects beyond that; SDK retries.
  • Flink job crash. Restart from the last checkpoint (every 30s). Replays ~30s of Kafka. Exactly-once semantics mean no double count — the write to Druid is transactional and the checkpoint includes the Kafka offset. This is the property we architected for.
  • Druid unavailable. Streaming layer buffers aggregates downstream of Flink (a secondary Kafka topic works well here) and retries. Advertiser queries return a staleness banner instead of wrong data. Crucially, the batch reconciliation still runs against S3 and catches up when Druid is back.
  • Entire streaming path is down for hours. Advertisers see stale dashboards. Billing is unaffected because it reads reconciled hourly data from the batch path, which only needs S3 and Spark — a fully independent failure domain. This is the deepest reason the Lambda architecture is worth its weight.
  • Clock skew on ad servers. client_ts is used for windowing. A client with a 2-hour skew writes events that fall outside the allowed lateness window and get dropped by the streaming layer, then get picked up by batch reconciliation (which uses client_ts as well but has no lateness bound). The dashboard is briefly wrong, the billing is eventually right.

The pattern to notice: name what fails, name what degrades gracefully, name what doesn’t. The billing SLA — always eventually correct — is what the whole design is organized around.

10. Cross-region (as a likely follow-up)

If the interviewer asks about multi-region — and at this scale they usually do — the shape is:

  • Regional ingestion. Each region has its own gateway + Kafka cluster. Clicks are ingested in the region closest to the ad server that served the impression, keeping ingest latency low and surviving cross-region network partitions.
  • Regional streaming aggregation. Each region runs its own Flink job over its local Kafka, writing per-region aggregate rows to a global Druid cluster. Aggregate rows are small (~100 bytes × 100M/day globally), so cross-region writes on the aggregate side are cheap.
  • Global batch reconciliation. Raw events from all regions are replicated to a single S3 bucket (or to regional buckets with a union view). The hourly Spark job reads globally, so reconciled aggregates are globally correct even if one region’s streaming path was broken.
  • Tie-breakers for replays. event_id is globally unique (UUID), so cross-region deduplication in the batch job is a straightforward group-by. No distributed consensus needed — the UUID space is the tie-breaker.
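
To show how little machinery the cross-region story needs, here is a sketch of the global reconciliation read. The regional bucket names are assumptions; the union-then-dedup is the entire cross-region consistency mechanism, because event_id is a globally unique UUID:

from functools import reduce
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("global-click-reconciliation").getOrCreate()

REGIONAL_BUCKETS = [                      # illustrative bucket names
    "s3://ads-cold-raw-us-east",
    "s3://ads-cold-raw-eu-west",
    "s3://ads-cold-raw-ap-south",
]

def read_global_hour(dt: str, hour: str):
    """Union the regional raw dumps and dedup by the globally unique event_id."""
    frames = [
        spark.read.parquet(f"{bucket}/clicks/dt={dt}/hour={hour}/")
        for bucket in REGIONAL_BUCKETS
    ]
    return reduce(lambda a, b: a.unionByName(b), frames).dropDuplicates(["event_id"])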

The key insight I’d articulate: “Because the batch path is the source of truth, cross-region is mostly a problem about replicating raw events, not about keeping streaming state consistent across regions. That’s a much easier problem — it’s just S3 replication.”

11. What I’d skip, and say I’m skipping

Time check: five minutes left. Things I’d explicitly defer:

  • Click fraud and bot detection. This is its own system — an ML pipeline that flags invalid clicks, often with a 24-hour lag. It plugs in as a filter between raw events and the reconciliation job’s input, not into the streaming hot path.
  • Attribution windows (last-click, multi-touch). Attribution is a separate problem that reads this aggregate layer, not something we solve here.
  • Real-time bidding integration. RTB is the impression pipeline, not the click pipeline; the ingestion API is the only seam.
  • GDPR / user deletion. Important in production. Handled by maintaining a deletion log and applying it during the batch reconciliation pass. Not a whiteboard topic.
  • Schema evolution. Confluent Schema Registry with Avro. Mention and move on.

Saying “I’d skip this, and here’s why” is a strong SDE III signal. It shows you know the full surface and are making deliberate scoping choices, not just running out of things to talk about.

12. Wrap-up

One crisp sentence before the interviewer’s next question:

This design optimizes for fast approximate dashboards and correct billing, by accepting that the streaming layer is eventually reconciled by a batch layer that reads the audit trail. Everything else falls out of that split.

That’s the kind of articulation that lands.

What separates SDE II from SDE III on this question

  • SDE II usually lands the pipeline shape (Kafka → stream processor → OLAP), names Flink or Spark Streaming, mentions windows, and describes a reasonable query API.
  • SDE III drives scoping in the first five minutes, explains exactly-once as a property of checkpoint + transactional sink + dedup by event_id (not just naming it), calls out hot-partition salting as a real production concern, and — most distinctively — proposes the batch reconciliation layer as a peer system that owns correctness while the streaming layer owns freshness. They’ll also name at least one failure mode where the streaming layer is wrong and the batch layer saves the bill.

The difference isn’t knowledge of tools. It’s which correctness story you’re telling and whether you can name the moment the story breaks.
