The problem
You’re asked to design a chat system — WhatsApp, Slack, or Messenger. The interviewer says something like: “Design a service that lets users send messages to each other in one-on-one or group conversations, in real time, with messages persisted and delivered even if the recipient is offline.”
Sounds like send-and-receive. It isn’t. A chat system is really three coupled systems — a long-lived connection fabric, a durable delivery pipeline, and a push-notification fabric for offline users — and the hidden trap is treating any of them as an afterthought. Skip the durable pipeline and messages get lost. Skip the push fabric and offline users see stale chats. Skip the connection fabric and you’re back to polling a database every second.
Below is how I’d walk through this, start to finish.
1. Clarify before you design
First 3–5 minutes. Questions I’d ask:
- Scale. DAU? How many concurrent connections — not just registered users, but actively connected right now? How many messages per user per day? This number drives almost every subsequent choice.
- Conversation shape. 1:1 only, or groups? Max group size? Group fan-out at 1,000 members is manageable; at 100,000 it’s a different problem.
- Message types. Text only, or media (images, video, voice)? Media goes to object storage with a CDN in front — usually defer the details.
- History retention. Do messages live forever, or is there a TTL? This changes the storage cost story dramatically.
- Delivery semantics. At-most-once, at-least-once, exactly-once? Read receipts? Typing indicators? These are user-visible and worth pinning down.
- Offline behavior. If a recipient is offline for a week, do they see all missed messages when they reconnect? Two weeks? Six months? This determines the offline buffer design.
- End-to-end encryption. In scope, or gesture at it and move on?
Say the interviewer confirms: 1B DAU, 100M concurrent connections, 50B messages/day, groups up to 1,000 members, text + media, messages kept for 1 year, at-least-once delivery with client-side dedup, read receipts yes, offline buffer for up to 30 days, E2E noted but not deep-dived.
2. Capacity estimate
- Connections: 100M concurrent. Each open WebSocket consumes a few KB of kernel memory + TCP state. Even a generous 10KB per connection means ~1TB of RAM across the connection fleet — spread across thousands of connection servers.
- Messages: 50B/day ≈ 580K messages/sec average. Peak (timezone overlap, news events) is easily 2–3× that, so design for ~2M writes/sec at the message pipeline.
- Storage: 50B × 365 × ~200 bytes per message (text + metadata) ≈ 3.6 PB/year for text alone. Media is an order of magnitude larger but goes to a separate store.
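These estimates are easy to sanity-check. A quick Python sketch reproducing the arithmetic (all inputs are the figures from the brief; the 3× peak factor and 10 KB per connection are the assumptions stated above):

```python
# Back-of-envelope capacity check using the figures confirmed in Section 1.

SECONDS_PER_DAY = 86_400

messages_per_day = 50e9            # 50B messages/day
avg_writes_per_sec = messages_per_day / SECONDS_PER_DAY
peak_writes_per_sec = avg_writes_per_sec * 3   # assume ~3x peak-to-average

concurrent_connections = 100e6     # 100M open sockets
ram_per_connection = 10 * 1024     # generous 10 KB of kernel + TCP state each
fleet_ram_bytes = concurrent_connections * ram_per_connection

bytes_per_message = 200            # text + metadata
storage_per_year = messages_per_day * 365 * bytes_per_message

print(f"avg writes/sec:  {avg_writes_per_sec:,.0f}")
print(f"peak writes/sec: {peak_writes_per_sec:,.0f}")
print(f"fleet RAM:       {fleet_ram_bytes / 1e12:.2f} TB")
print(f"text storage/yr: {storage_per_year / 1e15:.2f} PB")
```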
I’d say out loud: “Two dominant constraints: the connection fleet has to hold 100M open sockets, and the delivery pipeline has to absorb millions of messages per second while guaranteeing per-conversation ordering. These are different problems, and I’ll design them as distinct services.”
3. API design
Three surfaces do the work:
POST /api/messages
body: { conversation_id, text, media_ids?, client_msg_id }
returns: { message_id, server_ts }
GET /api/conversations/:conversation_id/messages?before=<cursor>&limit=50
returns: { messages: [...], next_cursor }
WebSocket /ws
(after auth handshake)
client → { type: "send", ... }
server → { type: "message", conversation_id, message_id, from, text, server_ts }
server → { type: "ack", client_msg_id, message_id }
Non-obvious decision: client_msg_id on every send. This is the client-generated
idempotency key — a UUID the client picks before sending. If the network drops and the
client retries, the server deduplicates by client_msg_id and returns the existing
message_id. Without this, retries produce duplicate messages. With it, the client
can retry freely and the server does exactly-once-from-the-user’s-perspective. Worth
saying out loud — it’s the single API decision that makes the delivery semantics work.
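A minimal sketch of the server-side dedup, with an in-memory dict standing in for the real idempotency store (the function and store names are illustrative, not a real API):

```python
import uuid

# In-memory stand-in for the server's idempotency store,
# keyed by (sender_id, client_msg_id).
_idempotency: dict[tuple[str, str], str] = {}

def handle_send(sender_id: str, client_msg_id: str, text: str) -> str:
    """Return the message_id for this send, creating it at most once."""
    key = (sender_id, client_msg_id)
    if key in _idempotency:
        return _idempotency[key]       # retry: same message_id, no second write
    message_id = str(uuid.uuid4())     # stand-in for the Snowflake ID (Section 4)
    # ... persist (conversation_id, message_id, text) here ...
    _idempotency[key] = message_id
    return message_id

# A retry with the same client_msg_id gets the original message_id back.
first = handle_send("alice", "c-123", "hi")
retry = handle_send("alice", "c-123", "hi")
assert first == retry
```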
4. Core design choice: delivery and ordering guarantees
This is the question. Two coupled decisions: what delivery semantics, and how do we order messages?
Delivery semantics. Three options:
- Fire-and-forget. Server accepts, tries to deliver once, moves on. Cheapest; also unacceptable — messages get lost on any transient failure.
- Exactly-once. A promise distributed systems can rarely keep. What people usually mean is at-least-once delivery + idempotent handling on the receiver. Fine — but the implementation is at-least-once underneath.
- At-least-once with client-side dedup. Server guarantees the message is delivered
at least once to each recipient; client deduplicates on
(conversation_id, message_id). This is what WhatsApp, Slack, iMessage all do.
I’d commit to at-least-once with client dedup. The justification is topic-specific:
chat messages are small, immutable, and addressed by an opaque message_id. Dedup on
the receiver is a hash-set lookup — nearly free. The cost of at-least-once (occasional
duplicate delivery) is shifted to a place where it’s cheap to handle.
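The "nearly free" claim is easy to make concrete. A toy per-conversation client view, where dedup really is one set lookup (all names illustrative):

```python
class ConversationView:
    """Client-side view of one conversation; drops duplicate deliveries.

    Scoped per conversation, so deduping on message_id alone matches
    the (conversation_id, message_id) key described in the text.
    """

    def __init__(self) -> None:
        self.seen: set[str] = set()
        self.messages: list[str] = []

    def on_deliver(self, message_id: str, text: str) -> bool:
        """Return True if the message was new and displayed."""
        if message_id in self.seen:
            return False       # duplicate delivery from a retry/replay: ignore
        self.seen.add(message_id)
        self.messages.append(text)
        return True

view = ConversationView()
assert view.on_deliver("m1", "hello") is True
assert view.on_deliver("m1", "hello") is False   # redelivered: absorbed
assert view.messages == ["hello"]
```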
Ordering. Users expect per-conversation monotonic ordering — never “reply before question” within a single chat. They do not expect global ordering across conversations (that’s impossible at scale and nobody would notice anyway).
The mechanism: Snowflake-style IDs
generated at the message service, with the conversation routed to a single partition
at write time. A single partition owner per conversation means IDs within a conversation
are monotonically increasing; clients sort by message_id and display in order.
SDE II vs III diverges here. SDE II says “at-least-once, Snowflake IDs.” SDE III
names the client_msg_id idempotency key, explains per-conversation (not global)
ordering as the only achievable and useful guarantee, and names the partition owner as
the source of monotonicity.
5. Data model and storage
Four stores:
messages (authoritative message history):
conversation_id (partition key)
message_id (clustering key, descending)
sender_id
text
media_ids
server_ts
client_msg_id
conversations (membership and metadata):
conversation_id (PK)
type (1:1 or group)
member_ids
created_at
user_devices (for offline push):
user_id
device_id
push_token (APNs or FCM token)
last_seen
presence (ephemeral):
user_id → { status, last_heartbeat }
Storage choices:
- messages — the single hottest store. Append-heavy, queried by (conversation_id, recent range). This access pattern is the textbook case for Cassandra: partition by conversation_id, cluster by message_id descending, and a “last 50 messages in this conversation” query is a single partition scan. No joins, no cross-conversation transactions.
- conversations and user_devices — relatively small, moderate write rate. Sharded MySQL or DynamoDB works fine; the access pattern is point lookups.
- presence — ephemeral, rewritten constantly (heartbeats every 30 seconds), read opportunistically. Redis with short TTL. Don’t persist it — if Redis goes down, presence briefly shows everyone offline; the system recovers in seconds as clients heartbeat again.
Topic-specific justification: “The messages table is basically a per-conversation append-only log. That’s exactly Cassandra’s sweet spot — partitioning aligns with the query, writes are monotonic within a partition, and we never need to query across conversations. Using a relational DB here would work, but we’d pay for features we don’t use.”
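To make the access pattern concrete, here is a toy in-memory stand-in (not the Cassandra driver): one sorted, append-only list per partition, with “last N messages” touching exactly one partition:

```python
from collections import defaultdict

# Toy stand-in for the messages table: one append-only list per partition,
# i.e. per conversation_id, kept in message_id order.
_partitions: dict[str, list[tuple[int, str]]] = defaultdict(list)

def write_message(conversation_id: str, message_id: int, text: str) -> None:
    # Writes within a conversation arrive in message_id order (single
    # partition owner, Section 4), so a plain append keeps the partition sorted.
    _partitions[conversation_id].append((message_id, text))

def recent_messages(conversation_id: str, limit: int = 50) -> list[tuple[int, str]]:
    # "Last N messages" reads exactly one partition -- the query the schema
    # is shaped around. Returned newest-first, like the clustering order.
    return _partitions[conversation_id][-limit:][::-1]

for i in range(60):
    write_message("conv-1", i, f"msg {i}")
page = recent_messages("conv-1", limit=50)
assert len(page) == 50 and page[0][0] == 59 and page[-1][0] == 10
```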
6. Connection management
The connection fleet is half the system. Two decisions: the protocol, and how to route a client to a server that holds their socket.
Protocol: WebSocket. Bidirectional, persistent, low per-message overhead. Long-polling works but wastes cycles tearing down and re-issuing HTTP requests. SSE is server-push only — fine for notifications, not for a chat system where clients also send. WebSocket is the obvious choice, but worth naming the alternatives and dismissing them for specific reasons.
Connection servers are stateful — they hold open sockets. A client connects once and stays connected for minutes to hours. Sizing: a well-tuned connection server handles ~100K–500K open WebSocket connections. At 100M concurrent, that’s a fleet of roughly 200–1,000 connection servers.
Routing. A client needs to reach the server holding its socket (for outbound messages), and the system needs to know which server holds a given recipient’s socket (for inbound delivery). This is the user-to-server registry problem.
Mermaid diagram of the connection path:
flowchart LR
Client[Client] --> LB[Load balancer]
LB --> CS[Connection server]
CS --> Registry[(User-to-server registry · Redis)]
CS --> Auth[Auth service]
Step by step:
- Client connects to the load balancer via WebSocket upgrade.
- Load balancer forwards to a connection server. Sticky routing is not required at this layer — the connection server, once chosen, publishes (user_id, device_id) → this_server into the user-to-server registry.
- Connection server completes the auth handshake.
- On every heartbeat, the connection server refreshes the registry entry’s TTL.
- On disconnect (clean or via timeout), the registry entry expires and is evicted.
Topic-specific justification: “The registry is the one piece of shared state in an otherwise stateless-looking system. It’s eventually consistent — if a client reconnects to a new server, there’s a brief window where two entries exist. That’s fine: the old one expires, and during the overlap a duplicate delivery is absorbed by the client-side dedup from Section 4.”
7. Message delivery path
Now the full send flow. Extending the architecture:
flowchart LR
Client[Client] --> LB[Load balancer]
LB --> CS[Connection server]
CS --> Registry[(User-to-server registry · Redis)]
CS --> Auth[Auth service]
CS --> MsgSvc[Message service]
MsgSvc --> DedupCache[(Redis idempotency cache)]
MsgSvc --> MsgDB[(Cassandra messages)]
MsgSvc -->|emit delivery event| Kafka[(Kafka · messages topic)]
Kafka --> DeliveryWorker[Delivery workers]
DeliveryWorker --> Registry
DeliveryWorker --> CS2[Recipient connection server]
CS2 --> RecipientClient[Recipient client]
classDef new fill:#eef2f1,stroke:#2c5f5d,stroke-width:2px,color:#1f2937;
class MsgSvc,DedupCache,MsgDB,Kafka,DeliveryWorker,CS2,RecipientClient new
Send, step by step:
- Sender emits a WebSocket frame { type: "send", conversation_id, text, client_msg_id }.
- Connection server forwards to the message service.
- Message service checks the Redis idempotency cache for (sender_id, client_msg_id). Hit → return the existing message_id. Miss → proceed.
- Message service generates a Snowflake message_id for this conversation, writes the message to Cassandra (partition conversation_id, sort message_id).
- Message service emits a delivery event to Kafka with the recipient list (for 1:1, one recipient; for a group, all members).
- Message service returns ack { client_msg_id, message_id } to the sender’s connection server, which forwards it to the sender’s client.
- Delivery workers consume the Kafka event. For each recipient:
  - Query the registry: which connection server holds this recipient’s socket?
  - Hit: forward the message via an internal RPC to that connection server, which pushes the WebSocket frame to the client.
  - Miss (offline): enqueue for the offline path (next section).
- Each recipient client receives, deduplicates on message_id, displays.
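The worker’s per-recipient branch (online push vs. offline hand-off) can be sketched with the registry lookup and RPC injected as callables. Everything here is a stand-in, not a real Kafka consumer:

```python
def deliver_event(event: dict, lookup_server, rpc_push, enqueue_offline) -> None:
    """Process one delivery event from the messages topic: fan out per recipient."""
    for recipient in event["recipients"]:
        server = lookup_server(recipient)          # user-to-server registry lookup
        if server is not None:
            # Online: internal RPC to the connection server holding the socket,
            # which pushes the WebSocket frame to the client.
            rpc_push(server, recipient, {
                "type": "message",
                "conversation_id": event["conversation_id"],
                "message_id": event["message_id"],
                "from": event["sender_id"],
                "text": event["text"],
            })
        else:
            # Offline: hand off to the push-notification path (Section 8).
            enqueue_offline(recipient, event)

pushed, offline = [], []
deliver_event(
    {"conversation_id": "c1", "message_id": 42, "sender_id": "alice",
     "recipients": ["bob", "carol"], "text": "hi"},
    lookup_server=lambda u: "cs-3" if u == "bob" else None,
    rpc_push=lambda s, u, frame: pushed.append((s, u)),
    enqueue_offline=lambda u, e: offline.append(u),
)
assert pushed == [("cs-3", "bob")] and offline == ["carol"]
```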
Why async via Kafka:
- Decouples send latency from fan-out. The sender’s ack returns as soon as the authoritative write lands. A group message to 1,000 members isn’t 1,000 synchronous RPCs on the send path.
- Absorbs bursts. Big group announcements create spikes; Kafka smooths them.
- Replay for recovery. If a delivery worker drops events due to a bug, we replay from Kafka and re-deliver. Client-side dedup makes this safe.
Topic-specific justification: “Per-conversation ordering is preserved because a single
Kafka partition (keyed by conversation_id) is consumed by a single worker. Within a
partition, Kafka guarantees order, and the worker dispatches in that order. Across
conversations, ordering is deliberately unconstrained.”
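That argument rests on keyed partitioning: the same conversation_id always hashes to the same partition. A sketch of the key-to-partition mapping (real Kafka clients use murmur2 by default; the hash and partition count here are illustrative):

```python
import hashlib

NUM_PARTITIONS = 64   # illustrative partition count for the messages topic

def partition_for(conversation_id: str) -> int:
    # Stable hash of the key: every event for a conversation lands on the
    # same partition, so the single consumer of that partition sees the
    # conversation's events in order.
    digest = hashlib.sha256(conversation_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS

# Same conversation -> same partition, every time.
assert partition_for("conv-42") == partition_for("conv-42")
# Different conversations spread across partitions; no ordering between them.
spread = {partition_for(f"conv-{i}") for i in range(1000)}
assert len(spread) > 1
```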
8. Sharding and offline notifications
Sharding falls out of the data model:
- messages — partitioned by conversation_id. Hot-partition risk exists (a viral group chat), but it’s bounded by group size. Mitigation: per-group rate limits if needed.
- Connection server fleet — no sharding per se; clients land on whichever server the LB picks. The registry is the source of truth for “who is where.”
- Kafka messages topic — partitioned by conversation_id for ordering. This is the same key as the messages DB, which means a given conversation’s events always flow through the same Kafka partition and land in the same DB partition. Clean.
Offline notifications are the third coupled system. Extending the architecture one last time:
flowchart LR
Client[Client] --> LB[Load balancer]
LB --> CS[Connection server]
CS --> Registry[(User-to-server registry · Redis)]
CS --> Auth[Auth service]
CS --> MsgSvc[Message service]
MsgSvc --> DedupCache[(Redis idempotency cache)]
MsgSvc --> MsgDB[(Cassandra messages)]
MsgSvc -->|emit delivery event| Kafka[(Kafka · messages topic)]
Kafka --> DeliveryWorker[Delivery workers]
DeliveryWorker --> Registry
DeliveryWorker --> CS2[Recipient connection server]
CS2 --> RecipientClient[Recipient client]
DeliveryWorker --> Devices[(user_devices DB)]
DeliveryWorker --> PushGateway[Push gateway]
PushGateway --> APNS[APNs]
PushGateway --> FCM[FCM]
classDef new fill:#eef2f1,stroke:#2c5f5d,stroke-width:2px,color:#1f2937;
class Devices,PushGateway,APNS,FCM new
When a delivery worker finds no registry entry for a recipient (offline), it:
- Looks up the recipient’s devices in the user_devices store.
- For each device, sends a push notification payload via the push gateway to APNs (iOS) or FCM (Android). The payload is typically a preview; the full message is fetched when the user opens the app.
- Messages remain in Cassandra. When the recipient eventually reconnects, their client issues GET /conversations/:id/messages?before=<last_seen> and hydrates missed content.
Topic-specific justification: “We don’t need a separate ‘offline message queue’ — the
message store is the queue. The client fetches missed messages on reconnect by querying
the conversations it cares about, bounded by last_seen. Push notifications are just a
wake-up signal; they don’t carry the delivery guarantee.”
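The “store is the queue” claim as client-side reconnect logic, sketched against the paging API from Section 3 (fetch_page is a stand-in for the GET endpoint; all names are illustrative):

```python
def hydrate_on_reconnect(conversation_id, last_seen_id, fetch_page, apply):
    """Walk history pages backward until we reach the last message already seen."""
    cursor = None
    while True:
        # Stand-in for GET /conversations/:id/messages?before=<cursor>&limit=50
        page, cursor = fetch_page(conversation_id, before=cursor, limit=50)
        for msg_id, text in page:           # pages arrive newest-first
            if msg_id <= last_seen_id:
                return                      # caught up: older messages are local
            apply(msg_id, text)             # client-side dedup makes replay safe
        if cursor is None:
            return                          # reached the beginning of history

# Demo against an in-memory history of message IDs 100..1, newest-first.
history = [(i, f"msg {i}") for i in range(100, 0, -1)]

def fetch_page(conversation_id, before, limit):
    start = 0 if before is None else next(
        idx for idx, (mid, _) in enumerate(history) if mid < before)
    page = history[start:start + limit]
    next_cursor = page[-1][0] if start + limit < len(history) else None
    return page, next_cursor

got = []
hydrate_on_reconnect("conv-1", last_seen_id=30,
                     fetch_page=fetch_page, apply=lambda m, t: got.append(m))
assert got == list(range(100, 30, -1))   # messages 100..31, newest-first
```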
9. Failure modes
Four to walk through — name what fails, name what degrades gracefully, name what doesn’t.
- A connection server crashes. All its open sockets drop. Clients reconnect (WebSocket clients have exponential-backoff reconnect built in), land on a new server, register, and the registry converges within seconds. Missed messages during the outage window are fetched via history query on reconnect. Graceful.
- Registry (Redis) is unavailable. Delivery workers can’t look up recipients. Two paths: queue the events (Kafka is already doing this), or fall back to pushing offline notifications for everyone until the registry recovers. The second is slightly worse UX but keeps users informed. This is a severity-2, not a severity-1 — messages still land in the durable store.
- Cassandra partition unavailable. A whole conversation becomes unreadable and un-writable. This is severe — there’s no graceful degradation for “you can’t see your chat.” Mitigation: quorum writes across 3 replicas, so one node down is tolerable; multi-region replication so a regional outage doesn’t take a conversation fully offline.
- APNs/FCM is down. Offline users don’t get notifications. When they manually open the app, they still see all messages — the durable store doesn’t depend on the push provider. The system degrades to “you find out when you check.” Mitigation: retry with backoff; push gateways implement circuit breakers — see Martin Fowler on the pattern — so a bad upstream doesn’t cascade.
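For the first failure mode, the reconnect loop is where jitter earns its keep: when a connection server dies, hundreds of thousands of clients retry at once, and full jitter spreads them out instead of stampeding the fleet. A sketch of full-jitter exponential backoff delays (parameter values are illustrative):

```python
import random

def backoff_delays(base=1.0, cap=60.0, attempts=6, rng=random.random):
    """Full-jitter exponential backoff: delay_n ~ Uniform(0, min(cap, base * 2^n))."""
    return [rng() * min(cap, base * (2 ** n)) for n in range(attempts)]

delays = backoff_delays()
assert len(delays) == 6
# Each delay is bounded by the (capped) exponential ceiling for its attempt.
for n, d in enumerate(delays):
    assert 0.0 <= d <= min(60.0, 1.0 * 2 ** n)
```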
10. What I’d skip, and say I’m skipping
Time check. Things I’d explicitly defer:
- End-to-end encryption. Signal Protocol is the industry standard; at a chat-system design level, the key observation is that E2E pushes all message-content concerns to the client and changes the server to a ciphertext relay. Worth one sentence, not a section.
- Media pipeline. Object storage (S3) + CDN + thumbnail generation is a peer system — it doesn’t belong in the core design.
- Full-text search across history. A separate indexing pipeline feeding Elasticsearch. Worth mentioning as a second system; not worth designing.
- Compliance and retention. Message retention policies, GDPR erasure, legal hold — real products need these; interviews don’t.
- Federation. Matrix-style cross-server chat is a different problem shape entirely. Out of scope.
Saying “I’d skip this, and here’s why” is a strong SDE III signal. It shows you know the full surface and are making deliberate scoping choices.
11. Wrap-up
One crisp sentence:
This design treats chat as three coupled systems — a stateful connection fleet, a durable delivery pipeline, and an offline-notification fabric — joined by a shared idempotency-and-ordering contract on message IDs.
What separates SDE II from SDE III on this question
- SDE II usually picks WebSocket, names at-least-once delivery, describes message persistence, and sketches a basic send flow.
- SDE III names the client-side idempotency key (client_msg_id) as the mechanism that makes at-least-once work, explains per-conversation (not global) ordering as the only achievable and useful guarantee, treats the offline push fabric as a peer system with its own failure story, walks through connection-server registry eventual consistency, and explicitly defers E2E/media/search as “second systems” that don’t belong in this answer.
The difference isn’t knowledge. It’s naming the contracts at the boundaries — between client and server (idempotency key), between send and deliver (at-least-once with per-partition ordering), between online and offline (durable store as queue, push as wake-up).
Further reading
- Twitter’s Timeline at Scale (InfoQ) — not a chat system, but the operational patterns for long-lived fan-out at scale transfer directly.
- WhatsApp Architecture (High Scalability) — the seminal public-facing writeup on how a small team ran billions of messages through a tiny Erlang cluster. Old but still a foundational read on connection-fleet sizing.
- Discord Engineering blog — modern operational posts on running trillions of messages, Cassandra-to-ScyllaDB migration, and the realities of at-scale chat. The “How Discord Stores Trillions of Messages” post is the direct parallel to Section 5.
- Designing Data-Intensive Applications — Kleppmann. Chapters 9 (consistency and consensus) and 11 (stream processing) are the backbone of the delivery-semantics and Kafka sections above.
- System Design Interview Vol. 1 — Alex Xu. Chapter 12 is the textbook walkthrough.
- Twitter Snowflake — the ID-generation scheme underlying per-conversation ordering.