The problem
You’re asked to design a chat system — WhatsApp, Slack, or Messenger. The interviewer says something like: “Design a service that lets users send messages to each other in one-on-one or group conversations, in real time, with messages persisted and delivered even if the recipient is offline.”
Sounds like send-and-receive. It isn’t. A chat system is really three coupled systems — a long-lived connection fabric, a durable delivery pipeline, and a push-notification fabric for offline users — and the hidden trap is treating any of them as an afterthought. Skip the durable pipeline and messages get lost. Skip the push fabric and offline users see stale chats. Skip the connection fabric and you’re back to polling a database every second.
Below is how I’d walk through this, start to finish.
1. Clarify before you design
First 3–5 minutes. Questions I’d ask:
- Scale. DAU? How many concurrent connections — not just registered users, but actively connected right now? How many messages per user per day? This number drives almost every subsequent choice.
- Conversation shape. 1:1 only, or groups? Max group size? Group fan-out at 1,000 members is manageable; at 100,000 it’s a different problem.
- Message types. Text only, or media (images, video, voice)? Media goes to object storage with a CDN in front — usually defer the details.
- History retention. Do messages live forever, or is there a TTL? This changes the storage cost story dramatically.
- Delivery semantics. At-most-once, at-least-once, exactly-once? Read receipts? Typing indicators? These are user-visible and worth pinning down.
- Offline behavior. If a recipient is offline for a week, do they see all missed messages when they reconnect? Two weeks? Six months? This determines the offline buffer design.
- End-to-end encryption. In scope, or gesture at it and move on?
Say the interviewer confirms: 1B DAU, 100M concurrent connections, 50B messages/day, groups up to 1,000 members, text + media, messages kept for 1 year, at-least-once delivery with client-side dedup, read receipts yes, offline buffer for up to 30 days, E2E noted but not deep-dived.
2. Capacity estimate
- Connections: 100M concurrent. Each open WebSocket consumes a few KB of kernel memory + TCP state. Even a generous 10KB per connection means ~1TB of RAM across the connection fleet — spread across thousands of connection servers.
- Messages: 50B/day ≈ 580K messages/sec average. Peak (timezone overlap, news events) is easily 2–3× that, so design for ~2M writes/sec at the message pipeline.
- Storage: 50B × 365 × ~200 bytes per message (text + metadata) ≈ 3.6 PB/year for text alone. Media is an order of magnitude larger but goes to a separate store.
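These estimates are easy to sanity-check. A quick Python sketch reproducing the arithmetic (all inputs are the figures from the brief; the 3× peak factor and 10 KB per connection are the assumptions stated above):

```python
# Back-of-envelope capacity check using the figures confirmed in Section 1.

SECONDS_PER_DAY = 86_400

messages_per_day = 50e9            # 50B messages/day
avg_writes_per_sec = messages_per_day / SECONDS_PER_DAY
peak_writes_per_sec = avg_writes_per_sec * 3   # assume ~3x peak-to-average

concurrent_connections = 100e6     # 100M open sockets
ram_per_connection = 10 * 1024     # generous 10 KB of kernel + TCP state each
fleet_ram_bytes = concurrent_connections * ram_per_connection

bytes_per_message = 200            # text + metadata
storage_per_year = messages_per_day * 365 * bytes_per_message

print(f"avg writes/sec:  {avg_writes_per_sec:,.0f}")
print(f"peak writes/sec: {peak_writes_per_sec:,.0f}")
print(f"fleet RAM:       {fleet_ram_bytes / 1e12:.2f} TB")
print(f"text storage/yr: {storage_per_year / 1e15:.2f} PB")
```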
I’d say out loud: “Two dominant constraints: the connection fleet has to hold 100M open sockets, and the delivery pipeline has to absorb millions of messages per second while guaranteeing per-conversation ordering. These are different problems, and I’ll design them as distinct services.”
3. API design
Three surfaces do the work:
POST /api/messages
body: { conversation_id, text, media_ids?, client_msg_id }
returns: { message_id, server_ts }
GET /api/conversations/:conversation_id/messages?before=<cursor>&limit=50
returns: { messages: [...], next_cursor }
WebSocket /ws
(after auth handshake)
client → { type: "send", ... }
server → { type: "message", conversation_id, message_id, from, text, server_ts }
server → { type: "ack", client_msg_id, message_id }
Non-obvious decision: client_msg_id on every send. This is the client-generated
idempotency key — a UUID the client picks before sending. If the network drops and the
client retries, the server deduplicates by client_msg_id and returns the existing
message_id. Without this, retries produce duplicate messages. With it, the client
can retry freely and the server does exactly-once-from-the-user’s-perspective. Worth
saying out loud — it’s the single API decision that makes the delivery semantics work.
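A minimal sketch of the server-side dedup, with an in-memory dict standing in for the real idempotency store (the function and store names are illustrative, not a real API):

```python
import uuid

# In-memory stand-in for the server's idempotency store,
# keyed by (sender_id, client_msg_id).
_idempotency: dict[tuple[str, str], str] = {}

def handle_send(sender_id: str, client_msg_id: str, text: str) -> str:
    """Return the message_id for this send, creating it at most once."""
    key = (sender_id, client_msg_id)
    if key in _idempotency:
        return _idempotency[key]       # retry: same message_id, no second write
    message_id = str(uuid.uuid4())     # stand-in for the Snowflake ID (Section 4)
    # ... persist (conversation_id, message_id, text) here ...
    _idempotency[key] = message_id
    return message_id

# A retry with the same client_msg_id gets the original message_id back.
first = handle_send("alice", "c-123", "hi")
retry = handle_send("alice", "c-123", "hi")
assert first == retry
```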
4. Core design choice: delivery and ordering guarantees
This is the question. Two coupled decisions: what delivery semantics, and how do we order messages?
Delivery semantics. Three options:
- Fire-and-forget. Server accepts, tries to deliver once, moves on. Cheapest; also unacceptable — messages get lost on any transient failure.
- Exactly-once. A promise distributed systems can rarely keep. What people usually mean is at-least-once delivery + idempotent handling on the receiver. Fine — but the implementation is at-least-once underneath.
- At-least-once with client-side dedup. Server guarantees the message is delivered
at least once to each recipient; client deduplicates on
(conversation_id, message_id). This is what WhatsApp, Slack, iMessage all do.
I’d commit to at-least-once with client dedup. The justification is topic-specific:
chat messages are small, immutable, and addressed by an opaque message_id. Dedup on
the receiver is a hash-set lookup — nearly free. The cost of at-least-once (occasional
duplicate delivery) is shifted to a place where it’s cheap to handle.
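The "nearly free" claim is easy to make concrete. A toy per-conversation client view, where dedup really is one set lookup (all names illustrative):

```python
class ConversationView:
    """Client-side view of one conversation; drops duplicate deliveries.

    Scoped per conversation, so deduping on message_id alone matches
    the (conversation_id, message_id) key described in the text.
    """

    def __init__(self) -> None:
        self.seen: set[str] = set()
        self.messages: list[str] = []

    def on_deliver(self, message_id: str, text: str) -> bool:
        """Return True if the message was new and displayed."""
        if message_id in self.seen:
            return False       # duplicate delivery from a retry/replay: ignore
        self.seen.add(message_id)
        self.messages.append(text)
        return True

view = ConversationView()
assert view.on_deliver("m1", "hello") is True
assert view.on_deliver("m1", "hello") is False   # redelivered: absorbed
assert view.messages == ["hello"]
```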
Ordering. Users expect per-conversation monotonic ordering — never “reply before question” within a single chat. They do not expect global ordering across conversations (that’s impossible at scale and nobody would notice anyway).
The mechanism: Snowflake-style IDs
generated at the message service, with the conversation routed to a single partition
at write time. A single partition owner per conversation means IDs within a conversation
are monotonically increasing; clients sort by message_id and display in order.
SDE II vs III diverges here. SDE II says “at-least-once, Snowflake IDs.” SDE III
names the client_msg_id idempotency key, explains per-conversation (not global)
ordering as the only achievable and useful guarantee, and names the partition owner as
the source of monotonicity.
5. Data model and storage
Four stores:
messages (authoritative message history):
conversation_id (partition key)
message_id (clustering key, descending)
sender_id
text
media_ids
server_ts
client_msg_id
conversations (membership and metadata):
conversation_id (PK)
type (1:1 or group)
member_ids
created_at
user_devices (for offline push):
user_id
device_id
push_token (APNs or FCM token)
last_seen
presence (ephemeral):
user_id → { status, last_heartbeat }
Storage choices:
- messages — the single hottest store. Append-heavy, queried by (conversation_id, recent range). This access pattern is the textbook case for Cassandra: partition by conversation_id, cluster by message_id descending, and a “last 50 messages in this conversation” query is a single partition scan. No joins, no cross-conversation transactions.
- conversations and user_devices — relatively small, moderate write rate. Sharded MySQL or DynamoDB works fine; the access pattern is point lookups.
- presence — ephemeral, rewritten constantly (heartbeats every 30 seconds), read opportunistically. Redis with short TTL. Don’t persist it — if Redis goes down, presence briefly shows everyone offline; the system recovers in seconds as clients heartbeat again.
Topic-specific justification: “The messages table is basically a per-conversation append-only log. That’s exactly Cassandra’s sweet spot — partitioning aligns with the query, writes are monotonic within a partition, and we never need to query across conversations. Using a relational DB here would work, but we’d pay for features we don’t use.”
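To make the access pattern concrete, here is a toy in-memory stand-in (not the Cassandra driver): one sorted, append-only list per partition, with “last N messages” touching exactly one partition:

```python
from collections import defaultdict

# Toy stand-in for the messages table: one append-only list per partition,
# i.e. per conversation_id, kept in message_id order.
_partitions: dict[str, list[tuple[int, str]]] = defaultdict(list)

def write_message(conversation_id: str, message_id: int, text: str) -> None:
    # Writes within a conversation arrive in message_id order (single
    # partition owner, Section 4), so a plain append keeps the partition sorted.
    _partitions[conversation_id].append((message_id, text))

def recent_messages(conversation_id: str, limit: int = 50) -> list[tuple[int, str]]:
    # "Last N messages" reads exactly one partition -- the query the schema
    # is shaped around. Returned newest-first, like the clustering order.
    return _partitions[conversation_id][-limit:][::-1]

for i in range(60):
    write_message("conv-1", i, f"msg {i}")
page = recent_messages("conv-1", limit=50)
assert len(page) == 50 and page[0][0] == 59 and page[-1][0] == 10
```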
6. Connection management
The connection fleet is half the system. Two decisions: the protocol, and how to route a client to a server that holds their socket.
Protocol: WebSocket. Bidirectional, persistent, low per-message overhead. Long-polling works but wastes cycles tearing down and re-issuing HTTP requests. SSE is server-push only — fine for notifications, not for a chat system where clients also send. WebSocket is the obvious choice, but worth naming the alternatives and dismissing them for specific reasons.
Connection servers are stateful — they hold open sockets. A client connects once and stays connected for minutes to hours. Sizing: a well-tuned connection server handles ~100K–500K open WebSocket connections. At 100M concurrent, that’s a fleet of roughly 200–1,000 connection servers.
Routing. A client needs to reach the server holding its socket (for outbound messages), and the system needs to know which server holds a given recipient’s socket (for inbound delivery). This is the user-to-server registry problem.
Mermaid diagram of the connection path:
flowchart LR
Client[Client] --> LB[Load balancer]
LB --> CS[Connection server]
CS --> Registry[(User-to-server registry · Redis)]
CS --> Auth[Auth service]
Step by step:
- Client connects to the load balancer via WebSocket upgrade.
- Load balancer forwards to a connection server. Sticky routing is not required at this layer — the connection server, once chosen, publishes (user_id, device_id) → this_server into the user-to-server registry.
- Connection server completes the auth handshake.
- On every heartbeat, the connection server refreshes the registry entry’s TTL.
- On disconnect (clean or via timeout), the registry entry expires and is evicted.
Topic-specific justification: “The registry is the one piece of shared state in an otherwise stateless-looking system. It’s eventually consistent — if a client reconnects to a new server, there’s a brief window where two entries exist. That’s fine: the old one expires, and during the overlap a duplicate delivery is absorbed by the client-side dedup from Section 4.”
7. Message delivery path
Now the full send flow. Extending the architecture:
flowchart LR
Client[Client] --> LB[Load balancer]
LB --> CS[Connection server]
CS --> Registry[(User-to-server registry · Redis)]
CS --> Auth[Auth service]
CS --> MsgSvc[Message service]
MsgSvc --> DedupCache[(Redis idempotency cache)]
MsgSvc --> MsgDB[(Cassandra messages)]
MsgSvc -->|emit delivery event| Kafka[(Kafka · messages topic)]
Kafka --> DeliveryWorker[Delivery workers]
DeliveryWorker --> Registry
DeliveryWorker --> CS2[Recipient connection server]
CS2 --> RecipientClient[Recipient client]
classDef new fill:#eef2f1,stroke:#2c5f5d,stroke-width:2px,color:#1f2937;
class MsgSvc,DedupCache,MsgDB,Kafka,DeliveryWorker,CS2,RecipientClient new
Send, step by step:
- Sender emits a WebSocket frame { type: "send", conversation_id, text, client_msg_id }.
- Connection server forwards to the message service.
- Message service checks the Redis idempotency cache for (sender_id, client_msg_id). Hit → return the existing message_id. Miss → proceed.
- Message service generates a Snowflake message_id for this conversation, writes the message to Cassandra (partition conversation_id, sort message_id).
- Message service emits a delivery event to Kafka with the recipient list (for 1:1, one recipient; for a group, all members).
- Message service returns ack { client_msg_id, message_id } to the sender’s connection server, which forwards it to the sender’s client.
- Delivery workers consume the Kafka event. For each recipient:
  - Query the registry: which connection server holds this recipient’s socket?
  - Hit: forward the message via an internal RPC to that connection server, which pushes the WebSocket frame to the client.
  - Miss (offline): enqueue for the offline path (next section).
- Each recipient client receives, deduplicates on message_id, displays.
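The worker’s per-recipient branch (online push vs. offline hand-off) can be sketched with the registry lookup and RPC injected as callables. Everything here is a stand-in, not a real Kafka consumer:

```python
def deliver_event(event: dict, lookup_server, rpc_push, enqueue_offline) -> None:
    """Process one delivery event from the messages topic: fan out per recipient."""
    for recipient in event["recipients"]:
        server = lookup_server(recipient)          # user-to-server registry lookup
        if server is not None:
            # Online: internal RPC to the connection server holding the socket,
            # which pushes the WebSocket frame to the client.
            rpc_push(server, recipient, {
                "type": "message",
                "conversation_id": event["conversation_id"],
                "message_id": event["message_id"],
                "from": event["sender_id"],
                "text": event["text"],
            })
        else:
            # Offline: hand off to the push-notification path (Section 8).
            enqueue_offline(recipient, event)

pushed, offline = [], []
deliver_event(
    {"conversation_id": "c1", "message_id": 42, "sender_id": "alice",
     "recipients": ["bob", "carol"], "text": "hi"},
    lookup_server=lambda u: "cs-3" if u == "bob" else None,
    rpc_push=lambda s, u, frame: pushed.append((s, u)),
    enqueue_offline=lambda u, e: offline.append(u),
)
assert pushed == [("cs-3", "bob")] and offline == ["carol"]
```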
Why async via Kafka:
- Decouples send latency from fan-out. The sender’s ack returns as soon as the authoritative write lands. A group message to 1,000 members isn’t 1,000 synchronous RPCs on the send path.
- Absorbs bursts. Big group announcements create spikes; Kafka smooths them.
- Replay for recovery. If a delivery worker drops events due to a bug, we replay from Kafka and re-deliver. Client-side dedup makes this safe.
Topic-specific justification: “Per-conversation ordering is preserved because a single
Kafka partition (keyed by conversation_id) is consumed by a single worker. Within a
partition, Kafka guarantees order, and the worker dispatches in that order. Across
conversations, ordering is deliberately unconstrained.”
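That argument rests on keyed partitioning: the same conversation_id always hashes to the same partition. A sketch of the key-to-partition mapping (real Kafka clients use murmur2 by default; the hash and partition count here are illustrative):

```python
import hashlib

NUM_PARTITIONS = 64   # illustrative partition count for the messages topic

def partition_for(conversation_id: str) -> int:
    # Stable hash of the key: every event for a conversation lands on the
    # same partition, so the single consumer of that partition sees the
    # conversation's events in order.
    digest = hashlib.sha256(conversation_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS

# Same conversation -> same partition, every time.
assert partition_for("conv-42") == partition_for("conv-42")
# Different conversations spread across partitions; no ordering between them.
spread = {partition_for(f"conv-{i}") for i in range(1000)}
assert len(spread) > 1
```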
8. Sharding and offline notifications
Sharding falls out of the data model:
- messages — partitioned by conversation_id. Hot-partition risk exists (a viral group chat), but it’s bounded by group size. Mitigation: per-group rate limits if needed.
- Connection server fleet — no sharding per se; clients land on whichever server the LB picks. The registry is the source of truth for “who is where.”
- Kafka messages topic — partitioned by conversation_id for ordering. This is the same key as the messages DB, which means a given conversation’s events always flow through the same Kafka partition and land in the same DB partition. Clean.
Offline notifications are the third coupled system. Extending the architecture one last time:
flowchart LR
Client[Client] --> LB[Load balancer]
LB --> CS[Connection server]
CS --> Registry[(User-to-server registry · Redis)]
CS --> Auth[Auth service]
CS --> MsgSvc[Message service]
MsgSvc --> DedupCache[(Redis idempotency cache)]
MsgSvc --> MsgDB[(Cassandra messages)]
MsgSvc -->|emit delivery event| Kafka[(Kafka · messages topic)]
Kafka --> DeliveryWorker[Delivery workers]
DeliveryWorker --> Registry
DeliveryWorker --> CS2[Recipient connection server]
CS2 --> RecipientClient[Recipient client]
DeliveryWorker --> Devices[(user_devices DB)]
DeliveryWorker --> PushGateway[Push gateway]
PushGateway --> APNS[APNs]
PushGateway --> FCM[FCM]
classDef new fill:#eef2f1,stroke:#2c5f5d,stroke-width:2px,color:#1f2937;
class Devices,PushGateway,APNS,FCM new
When a delivery worker finds no registry entry for a recipient (offline), it:
- Looks up the recipient’s devices in the user_devices store.
- For each device, sends a push notification payload via the push gateway to APNs (iOS) or FCM (Android). The payload is typically a preview; the full message is fetched when the user opens the app.
- Messages remain in Cassandra. When the recipient eventually reconnects, their client issues GET /conversations/:id/messages?before=<last_seen> and hydrates missed content.
Topic-specific justification: “We don’t need a separate ‘offline message queue’ — the
message store is the queue. The client fetches missed messages on reconnect by querying
the conversations it cares about, bounded by last_seen. Push notifications are just a
wake-up signal; they don’t carry the delivery guarantee.”
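The “store is the queue” claim as client-side reconnect logic, sketched against the paging API from Section 3 (fetch_page is a stand-in for the GET endpoint; all names are illustrative):

```python
def hydrate_on_reconnect(conversation_id, last_seen_id, fetch_page, apply):
    """Walk history pages backward until we reach the last message already seen."""
    cursor = None
    while True:
        # Stand-in for GET /conversations/:id/messages?before=<cursor>&limit=50
        page, cursor = fetch_page(conversation_id, before=cursor, limit=50)
        for msg_id, text in page:           # pages arrive newest-first
            if msg_id <= last_seen_id:
                return                      # caught up: older messages are local
            apply(msg_id, text)             # client-side dedup makes replay safe
        if cursor is None:
            return                          # reached the beginning of history

# Demo against an in-memory history of message IDs 100..1, newest-first.
history = [(i, f"msg {i}") for i in range(100, 0, -1)]

def fetch_page(conversation_id, before, limit):
    start = 0 if before is None else next(
        idx for idx, (mid, _) in enumerate(history) if mid < before)
    page = history[start:start + limit]
    next_cursor = page[-1][0] if start + limit < len(history) else None
    return page, next_cursor

got = []
hydrate_on_reconnect("conv-1", last_seen_id=30,
                     fetch_page=fetch_page, apply=lambda m, t: got.append(m))
assert got == list(range(100, 30, -1))   # messages 100..31, newest-first
```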
9. Failure modes
Four to walk through — name what fails, name what degrades gracefully, name what doesn’t.
- A connection server crashes. All its open sockets drop. Clients reconnect (WebSocket clients have exponential-backoff reconnect built in), land on a new server, register, and the registry converges within seconds. Missed messages during the outage window are fetched via history query on reconnect. Graceful.
- Registry (Redis) is unavailable. Delivery workers can’t look up recipients. Two paths: queue the events (Kafka is already doing this), or fall back to pushing offline notifications for everyone until the registry recovers. The second is slightly worse UX but keeps users informed. This is a severity-2, not a severity-1 — messages still land in the durable store.
- Cassandra partition unavailable. A whole conversation becomes unreadable and un-writable. This is severe — there’s no graceful degradation for “you can’t see your chat.” Mitigation: quorum writes across 3 replicas, so one node down is tolerable; multi-region replication so a regional outage doesn’t take a conversation fully offline.
- APNs/FCM is down. Offline users don’t get notifications. When they manually open the app, they still see all messages — the durable store doesn’t depend on the push provider. The system degrades to “you find out when you check.” Mitigation: retry with backoff; push gateways implement circuit breakers — see Martin Fowler on the pattern — so a bad upstream doesn’t cascade.
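For the first failure mode, the reconnect loop is where jitter earns its keep: when a connection server dies, hundreds of thousands of clients retry at once, and full jitter spreads them out instead of stampeding the fleet. A sketch of full-jitter exponential backoff delays (parameter values are illustrative):

```python
import random

def backoff_delays(base=1.0, cap=60.0, attempts=6, rng=random.random):
    """Full-jitter exponential backoff: delay_n ~ Uniform(0, min(cap, base * 2^n))."""
    return [rng() * min(cap, base * (2 ** n)) for n in range(attempts)]

delays = backoff_delays()
assert len(delays) == 6
# Each delay is bounded by the (capped) exponential ceiling for its attempt.
for n, d in enumerate(delays):
    assert 0.0 <= d <= min(60.0, 1.0 * 2 ** n)
```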
10. What I’d skip, and say I’m skipping
Time check. Things I’d explicitly defer:
- End-to-end encryption. Signal Protocol is the industry standard; at a chat-system design level, the key observation is that E2E pushes all message-content concerns to the client and changes the server to a ciphertext relay. Worth one sentence, not a section.
- Media pipeline. Object storage (S3) + CDN + thumbnail generation is a peer system — it doesn’t belong in the core design.
- Full-text search across history. A separate indexing pipeline feeding Elasticsearch. Worth mentioning as a second system; not worth designing.
- Compliance and retention. Message retention policies, GDPR erasure, legal hold — real products need these; interviews don’t.
- Federation. Matrix-style cross-server chat is a different problem shape entirely. Out of scope.
Saying “I’d skip this, and here’s why” is a strong SDE III signal. It shows you know the full surface and are making deliberate scoping choices.
11. Wrap-up
One crisp sentence:
This design treats chat as three coupled systems — a stateful connection fleet, a durable delivery pipeline, and an offline-notification fabric — joined by a shared idempotency-and-ordering contract on message IDs.
What separates SDE II from SDE III on this question
- SDE II usually picks WebSocket, names at-least-once delivery, describes message persistence, and sketches a basic send flow.
- SDE III names the client-side idempotency key (client_msg_id) as the mechanism that makes at-least-once work, explains per-conversation (not global) ordering as the only achievable and useful guarantee, treats the offline push fabric as a peer system with its own failure story, walks through connection-server registry eventual consistency, and explicitly defers E2E/media/search as “second systems” that don’t belong in this answer.
The difference isn’t knowledge. It’s naming the contracts at the boundaries — between client and server (idempotency key), between send and deliver (at-least-once with per-partition ordering), between online and offline (durable store as queue, push as wake-up).
Further reading
- Twitter’s Timeline at Scale (InfoQ) — not a chat system, but the operational patterns for long-lived fan-out at scale transfer directly.
- WhatsApp Architecture (High Scalability) — the seminal public-facing writeup on how a small team ran billions of messages through a tiny Erlang cluster. Old but still a foundational read on connection-fleet sizing.
- Discord Engineering blog — modern operational posts on running trillions of messages, Cassandra-to-ScyllaDB migration, and the realities of at-scale chat. The “How Discord Stores Trillions of Messages” post is the direct parallel to Section 5.
- Designing Data-Intensive Applications — Kleppmann. Chapters 9 (consistency and consensus) and 11 (stream processing) are the backbone of the delivery-semantics and Kafka sections above.
- System Design Interview Vol. 1 — Alex Xu. Chapter 12 is the textbook walkthrough.
- Twitter Snowflake — the ID-generation scheme underlying per-conversation ordering.