R — Requirements

Functional Requirements

1. Core Meeting Flow

The application should support:

Create a meeting room
Join a meeting using room ID or link
Leave meeting
End meeting (host only)
Rejoin automatically after temporary disconnect

2. Audio & Video Controls

Participants should be able to:

Turn camera on/off
Mute/unmute microphone
Switch camera or microphone device
See video streams of other participants
View mute/video status of others
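
One common way to implement the mute and camera toggles is to flip the `enabled` flag on the local track rather than stopping it. The sketch below assumes that approach; `ToggleableTrack` and `setTrackEnabled` are hypothetical names, and the interface only mirrors the `kind` and `enabled` fields of a browser `MediaStreamTrack` so the logic can run outside a browser.

```typescript
// Minimal structural view of a browser MediaStreamTrack (only the fields used here).
interface ToggleableTrack {
  kind: string; // "audio" or "video"
  enabled: boolean;
}

// Enable or disable every local track of the given kind.
// Returns how many tracks actually changed state.
function setTrackEnabled(
  tracks: ToggleableTrack[],
  kind: "audio" | "video",
  enabled: boolean
): number {
  let changed = 0;
  tracks.forEach((track) => {
    if (track.kind === kind && track.enabled !== enabled) {
      track.enabled = enabled;
      changed += 1;
    }
  });
  return changed;
}

// Stand-in tracks; in a browser these would come from stream.getTracks().
const localTracks: ToggleableTrack[] = [
  { kind: "audio", enabled: true },
  { kind: "video", enabled: true },
];
setTrackEnabled(localTracks, "audio", false); // mute the microphone, leave the camera on
```

A disabled audio track sends silence and a disabled video track sends black frames, so the connection does not need to be renegotiated. Stopping the track instead releases the device, at the cost of re-acquiring it on unmute.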

3. Participant Management

The UI should support:

Dynamic grid layout based on participant count
Active speaker detection
Display participant names
Host controls (mute participant, remove participant)
Raise hand indicator
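
As an illustration of the dynamic grid requirement, the following sketch picks a near-square grid for a given participant count. The function name `gridDimensions` and the square-root heuristic are assumptions for this example, not a prescribed layout algorithm.

```typescript
interface GridDimensions {
  cols: number;
  rows: number;
}

// Choose the smallest near-square grid that fits every participant tile.
function gridDimensions(participantCount: number): GridDimensions {
  if (participantCount <= 0) {
    return { cols: 0, rows: 0 };
  }
  const cols = Math.ceil(Math.sqrt(participantCount));
  const rows = Math.ceil(participantCount / cols);
  return { cols, rows };
}

// Example: 5 participants fit in a 3 x 2 grid.
const fiveUp = gridDimensions(5);
```

The result can feed CSS grid properties (for example the column count), so a participant joining or leaving only changes two numbers.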

4. Screen Sharing

Participants should be able to:

Share entire screen or specific window
Stop screen sharing
View screen share in primary layout
Switch between grid and presentation mode

5. In-Meeting Chat

Participants should be able to:

Send text messages
Receive messages in real time
Send reactions (emoji)
View message history during session

Non-Functional Requirements

1. Internationalization (i18n / L10n)

Multi-language UI
Localized time formats
Support for captions in multiple languages

2. Offline & Network Handling

Automatic reconnection on network drop
Graceful degradation on low bandwidth
Maintain call state during short disconnects
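
Automatic reconnection is often paced with exponential backoff plus jitter, so that many clients dropped at once do not retry in lockstep. The sketch below assumes the "full jitter" variant; `reconnectDelayMs`, `baseMs`, and `maxMs` are hypothetical names and values chosen for illustration.

```typescript
interface BackoffOptions {
  baseMs: number; // delay ceiling for the first retry
  maxMs: number;  // upper bound on any delay ceiling
}

// Delay before retry number `attempt` (0-based).
// The ceiling doubles per attempt, is capped at maxMs, and the actual delay
// is drawn uniformly from [0, ceiling) ("full jitter").
// `random` is injectable so the function can be tested deterministically.
function reconnectDelayMs(
  attempt: number,
  opts: BackoffOptions,
  random: () => number = Math.random
): number {
  const exponential = opts.baseMs * Math.pow(2, Math.max(0, attempt));
  const ceiling = Math.min(opts.maxMs, exponential);
  return Math.floor(random() * ceiling);
}

// Example: third retry with a 500 ms base and a 10 s cap.
const nextDelay = reconnectDelayMs(2, { baseMs: 500, maxMs: 10000 });
```

While retries are in flight, the client can keep the meeting UI and participant list on screen, which is what "maintain call state during short disconnects" amounts to on the frontend.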

3. Security Considerations

Secure meeting access (room tokens / passwords)
Encrypted media streams
Permission-based device access
Prevent unauthorized participants

4. Accessibility (a11y)

Keyboard controls for meeting actions
Screen reader support for controls
Clear focus management
High contrast mode support

5. Performance Expectations

Low join latency
Minimal audio/video lag
Stable frame rate
Smooth layout transitions during participant changes

6. Recording & Media Controls (Optional Advanced Feature)

Start/stop recording (host)
Show recording indicator
Display network quality indicator

A — Architecture

This section defines the major architectural decisions for a browser-based video calling system and why each choice fits real-time communication.

1. SSR vs CSR vs SSG

Chosen: CSR

Area	Rendering Strategy	Why
Video call experience	CSR	Depends on browser media APIs
Meeting landing page	SSR (optional)	Faster initial load

Video calling relies heavily on browser APIs like camera, microphone, and screen capture. Rendering the call UI on the server provides no real value.

2. REST APIs (App Backend)

REST APIs are used for:

Creating and joining meeting rooms
Fetching participant metadata
Authentication and meeting permissions
Sending chat messages and reactions

These APIs handle room lifecycle and meeting metadata, not media streaming.

3. Communication Protocol — WebRTC

Chosen: WebRTC (Web Real-Time Communication) for audio and video streaming

Real-time video calling has extremely strict latency and reliability requirements. The communication protocol must support continuous, bi-directional media transfer with minimal delay, automatic network adaptation, and strong security.

Available Options Considered

HTTP streaming (HLS / DASH)
WebSockets
WebRTC

Why HTTP Streaming Is Not Suitable

HTTP streaming is designed for watching video, not for talking.

It works by downloading small video chunks and buffering several seconds ahead of playback. This buffering introduces multi-second latency, which is acceptable for streaming platforms but unacceptable for conversations.

In a video call, even a 1–2 second delay breaks natural interaction.

Why WebSockets Alone Are Not Suitable

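WebSockets provide a persistent connection and low-latency messaging, but they are not designed for media streaming. They run over TCP, so a single lost packet delays everything queued behind it (head-of-line blocking), which hurts live audio and video.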

Using WebSockets for video would require building:

Custom video encoding and decoding
Congestion control algorithms
Adaptive bitrate logic
Packet loss recovery
Jitter buffering

This would effectively mean rebuilding a media streaming protocol from scratch.

Why WebRTC Is the Correct Choice

WebRTC is specifically designed for real-time communication in browsers and already solves the hard problems required for video calling.

Ultra-Low Latency Media Transport

WebRTC is built for real-time interaction and typically delivers sub-second latency.
This allows natural conversation without noticeable delay.

Built-In Encryption

All WebRTC media streams are encrypted by default using:

DTLS for key exchange
SRTP for media transport

This ensures audio, video, and screen sharing are secure without additional implementation.

Adaptive Bitrate & Congestion Control

Network conditions constantly change during calls. WebRTC automatically:

Adjusts video resolution
Adjusts bitrate and frame rate
Recovers from packet loss
Handles jitter and unstable networks

This keeps calls stable even on weak connections.

NAT Traversal & Firewall Handling

Most users are behind routers and firewalls. WebRTC includes ICE, STUN, and TURN mechanisms to establish connections even in restricted network environments.

Without this, most peer-to-peer connections would fail.

Browser-Native Support

Modern browsers provide built-in APIs for:

Camera and microphone access
Screen sharing
Media encoding and decoding
Network transport

This allows real-time calls without plugins or external software.

6. Peer-to-Peer vs SFU vs MCU

This is the most critical decision in video calling architecture because it determines scalability, latency, bandwidth usage, and infrastructure cost.

There are three major ways multi-participant calls can be built.

6.1 Mesh (Pure Peer-to-Peer)

In a mesh architecture, every participant sends their audio/video stream directly to every other participant.

If there are N participants, each user sends N-1 streams and receives N-1 streams.

Example with 5 people in a meeting:

Each user uploads 4 video streams
Each user downloads 4 video streams

Total connections grow rapidly as participants increase.

Why Mesh Works for Very Small Calls

For 1-to-1 calls:

Only one connection is required
No server bandwidth needed
Lowest infrastructure cost

This is why WebRTC originally started as a peer-to-peer technology.

Why Mesh Fails for Group Calls

Per-user bandwidth grows linearly with meeting size, and the total number of connections grows quadratically: N(N-1)/2.

For N users:

Upload bandwidth per user ≈ (N-1) streams
CPU encoding cost multiplies per stream

With 10 participants, each user would need to:

Encode and upload 9 video streams simultaneously
Download 9 streams

Most laptops and networks cannot sustain this.

Mesh becomes impractical beyond ~4 participants.
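
The arithmetic behind these numbers can be sketched as follows. The per-stream bitrate of 1.5 Mbps is an assumed figure for illustration, and `meshCost` is a hypothetical helper.

```typescript
interface MeshCost {
  totalLinks: number;        // peer-to-peer connections in the whole meeting
  uploadsPerUser: number;    // streams each user must encode and upload
  uploadMbpsPerUser: number; // upload bandwidth each user needs
}

// Cost of a full mesh with `participants` users, each stream costing `streamMbps`.
function meshCost(participants: number, streamMbps: number): MeshCost {
  const n = Math.max(0, Math.floor(participants));
  const uploadsPerUser = Math.max(0, n - 1);
  return {
    totalLinks: (n * (n - 1)) / 2,
    uploadsPerUser,
    uploadMbpsPerUser: uploadsPerUser * streamMbps,
  };
}

// 5 users: 10 links, 4 uploads each (6 Mbps up at the assumed 1.5 Mbps per stream).
const smallCall = meshCost(5, 1.5);
// 10 users: 45 links, 9 uploads each (13.5 Mbps up).
const largeCall = meshCost(10, 1.5);
```

Per-user load grows linearly with N while the total number of links grows quadratically, and the upload side is usually the first bottleneck on home connections.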

6.2 MCU (Multipoint Control Unit)

In an MCU architecture, all participants send their video to a central server.
The server mixes all streams into a single combined video and sends one stream back to each user.

So each user:

Uploads 1 stream
Downloads 1 combined stream

Advantages of MCU

Very simple for clients
Low bandwidth usage on user devices
Works well on weak devices

This was common in early enterprise video conferencing systems.

Why MCU Is Not Ideal Today

The server must:

Decode every participant’s video
Mix all videos into a single composition
Re-encode the final video for every participant

This is extremely CPU and GPU intensive.

Problems:

Very high infrastructure cost
Increased latency due to encoding pipeline
Fixed layout (less flexibility for UI)
Hard to support dynamic layouts (speaker view, grid view)

MCU pushes too much work to the server.

6.3 SFU (Selective Forwarding Unit) — Chosen

An SFU acts as a smart router for media streams.

Participants send their video stream to the SFU.
The SFU does not decode or mix video.
It simply forwards streams to other participants.

So each user:

Uploads 1 stream
Downloads multiple streams (one per participant)

But the heavy encoding work stays on the client, not the server.

Why SFU Is the Best Balance

Scales to Large Meetings

Users upload only one stream regardless of meeting size.
The SFU distributes streams efficiently to all participants.

This removes the per-user upload growth that makes mesh impractical.

Lower Latency Than MCU

Since the SFU does not decode or re-encode video:

No heavy media processing on the server
Streams are forwarded quickly
Lower end-to-end latency

Dynamic Quality Selection (Simulcast)

Clients can send multiple quality layers of the same video:

Low resolution
Medium resolution
High resolution

The SFU decides which quality to forward based on:

Participant screen size
Network quality
Active speaker priority

For example:

Active speaker → high quality
Small grid tiles → low quality

This saves bandwidth and improves performance.
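
A rough sketch of how a forwarding decision like this might look is shown below. The layer bitrates, the pixel thresholds, and the names `pickLayer` and `SubscriberView` are all assumptions for illustration; real SFUs use their own heuristics and bandwidth estimators.

```typescript
type SimulcastLayer = "low" | "medium" | "high";

interface SubscriberView {
  tileWidthPx: number;      // rendered width of this stream on the receiver
  availableKbps: number;    // estimated downlink budget for this stream
  isActiveSpeaker: boolean;
}

// Assumed bitrates per layer, in kbps.
const LAYER_KBPS: Record<SimulcastLayer, number> = { low: 150, medium: 500, high: 1500 };

// Layers ordered best first, so the selection can step down when bandwidth is short.
const LAYERS_BEST_FIRST: SimulcastLayer[] = ["high", "medium", "low"];

function pickLayer(view: SubscriberView): SimulcastLayer {
  // 1. Start from what the tile size can actually display.
  let wanted: SimulcastLayer = "low";
  if (view.tileWidthPx >= 960) wanted = "high";
  else if (view.tileWidthPx >= 480) wanted = "medium";

  // 2. Give the active speaker at least medium quality.
  if (view.isActiveSpeaker && wanted === "low") wanted = "medium";

  // 3. Step down until the layer fits the bandwidth budget.
  const candidates = LAYERS_BEST_FIRST.slice(LAYERS_BEST_FIRST.indexOf(wanted));
  for (const layer of candidates) {
    if (LAYER_KBPS[layer] <= view.availableKbps) return layer;
  }
  return "low"; // nothing fits: fall back to the cheapest layer
}

// A small grid tile gets the low layer even on a fast connection.
const gridTileLayer = pickLayer({ tileWidthPx: 200, availableKbps: 2000, isActiveSpeaker: false });
```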

Flexible Layout on Client

Since streams are not mixed, each client can decide:

Grid view
Speaker view
Screen share priority

This enables modern meeting UI layouts.

It is called a Selective Forwarding Unit because the server does not mix or process video like an MCU; instead, it simply receives media streams from participants and forwards them to others. The “selective” part comes from the fact that it decides which streams to send, to whom, and in what quality. For example, the active speaker may be forwarded in high quality while other participants are sent in lower quality or not sent at all, depending on screen size, bandwidth, and device capability. This smart forwarding reduces bandwidth usage and allows video calls to scale efficiently.

Why Modern Platforms Use SFU

Large platforms such as Zoom, Google Meet, Microsoft Teams, and Discord are widely reported to rely on SFU-style media routing because it provides the best balance between:

Scalability
Cost
Latency
Flexibility

SFU has become the industry standard for large real-time meetings.

7. Signaling Server

The signaling server is the coordination layer that helps participants discover each other and establish WebRTC connections before any audio/video flows. WebRTC handles media transport, but it does not define how peers find each other or exchange connection metadata. That responsibility belongs to the signaling service.

Why a Signaling Server Is Required

Before two browsers can send audio/video, they must agree on:

Who is in the meeting
What media capabilities each device supports
How to reach each other across NAT/firewalls
Which codecs and encryption settings to use

This coordination must happen in real time, which is why signaling typically uses WebSockets.

Once the connection is established, media flows directly between the endpoints. The signaling server never sits in the media path.

Responsibilities of the Signaling Server

Room and Participant Coordination

The signaling service manages meeting rooms and participant presence.

It handles events such as:

User joins room
User leaves room
Participant list updates
Host actions (mute all, remove participant)

This allows every client to maintain a real-time view of the meeting state.

SDP Offer / Answer Exchange

To start a WebRTC connection, peers must exchange SDP messages that describe their media capabilities.

The signaling server relays these messages between participants or between client and SFU:

Offer: “Here is how I can send/receive media.”
Answer: “Here is what I support. Let’s connect.”

This negotiation step ensures both sides agree on codecs, encryption, and media directions.

ICE Candidate Exchange

Most users are behind routers and firewalls.
To find a working network path, WebRTC gathers ICE candidates (possible connection routes).

The signaling server relays these candidates so peers can discover a viable route.

Without this step, most WebRTC connections would fail.

Real-Time Meeting Events

The signaling server also distributes lightweight meeting events:

Mute/unmute notifications
Camera on/off updates
Screen share start/stop
Active speaker signals
Raise hand / reactions

These events must reach all participants instantly, making WebSockets ideal.

Typical Connection Flow

1. User Joins Meeting via HTTP API

When a user clicks a meeting link:

The browser sends an HTTP request to the backend.
The backend authenticates the user.
The backend verifies meeting permissions.
The backend returns:
- Meeting ID
- User identity
- Access token
- Signaling server URL
- ICE server configuration (STUN/TURN)

At this stage, no media connection exists yet.

2. Client Opens WebSocket to Signaling Server

The browser establishes a persistent WebSocket connection to the signaling server.

This connection is used for:

Real-time room coordination
Connection negotiation
Participant event broadcasting

Now the client is officially connected to the meeting room.

3. Server Sends Current Participant List

The signaling server responds with:

List of participants currently in the room
Host information
Active speaker (if any)
Ongoing screen share status

The client updates its UI accordingly.

4. WebRTC Offer Creation (Client → SFU)

Because we are using SFU architecture:

The client creates a WebRTC PeerConnection.
The client generates an SDP (Session Description Protocol) offer describing:
- Supported codecs
- Media tracks (audio/video)
- Encryption parameters
- Network candidates

The offer is sent via WebSocket to the signaling server, which forwards it to the SFU.

5. SFU Responds with SDP Answer

The SFU:

Reviews the client’s capabilities
Decides media routing strategy
Returns an SDP answer

The signaling server forwards this answer back to the client.

Now both sides agree on how media will be exchanged.
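
Steps 4 and 5 can be sketched as a single negotiation function. The interfaces below are simplified structural stand-ins so the flow can run outside a browser: `PeerLike` mirrors the `createOffer`, `setLocalDescription`, and `setRemoteDescription` methods of `RTCPeerConnection`, while `SignalingLike.request` is a hypothetical helper that sends the offer over the WebSocket and resolves with the SFU's answer.

```typescript
interface SessionDescription {
  type: "offer" | "answer";
  sdp: string;
}

// Subset of RTCPeerConnection used during negotiation (simplified types).
interface PeerLike {
  createOffer(): Promise<SessionDescription>;
  setLocalDescription(desc: SessionDescription): Promise<void>;
  setRemoteDescription(desc: SessionDescription): Promise<void>;
}

// Hypothetical request/response wrapper around the signaling WebSocket.
interface SignalingLike {
  request(offer: SessionDescription): Promise<SessionDescription>;
}

async function negotiateWithSfu(peer: PeerLike, signaling: SignalingLike): Promise<void> {
  // Step 4: describe what this client can send and receive.
  const offer = await peer.createOffer();
  await peer.setLocalDescription(offer);

  // The signaling server forwards the offer to the SFU and relays the reply.
  const answer = await signaling.request(offer);
  if (answer.type !== "answer") {
    throw new Error("expected an SDP answer from the SFU");
  }

  // Step 5: apply the SFU's answer so both sides agree on codecs and directions.
  await peer.setRemoteDescription(answer);
}
```

The order matters: the local description is set before the offer leaves the client, and the remote description is applied only after the answer arrives.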

6. ICE Candidate Exchange

Both client and SFU gather network candidates:

Host candidates (local network addresses)
Server-reflexive candidates (public addresses discovered via STUN)
Relay candidates (addresses allocated on a TURN server)

These candidates are exchanged via the signaling server.

This step ensures:

NAT traversal
Firewall bypass
Reliable connectivity

Once a working candidate pair is found, the WebRTC connection becomes active.
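
One practical detail of this exchange: remote candidates can arrive over signaling before the remote description has been applied, and adding them too early fails. A common workaround, sketched here under that assumption, is to buffer them. `RemoteCandidateQueue` is a hypothetical helper; in a browser the `apply` callback would call `peerConnection.addIceCandidate(candidate)`.

```typescript
// Simplified shape of an ICE candidate as it travels over signaling.
interface IceCandidateMessage {
  candidate: string;
  sdpMid?: string | null;
  sdpMLineIndex?: number | null;
}

class RemoteCandidateQueue {
  private pending: IceCandidateMessage[] = [];
  private ready = false;
  private readonly apply: (candidate: IceCandidateMessage) => void;

  constructor(apply: (candidate: IceCandidateMessage) => void) {
    this.apply = apply;
  }

  // Call for every candidate received from the signaling server.
  add(candidate: IceCandidateMessage): void {
    if (this.ready) {
      this.apply(candidate);
    } else {
      this.pending.push(candidate); // too early: hold until the answer is applied
    }
  }

  // Call once setRemoteDescription has resolved; flushes buffered candidates in order.
  markRemoteDescriptionSet(): void {
    this.ready = true;
    const queued = this.pending;
    this.pending = [];
    queued.forEach((candidate) => this.apply(candidate));
  }
}
```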

7. WebRTC Connection Established

Now:

The client starts sending audio/video tracks to the SFU.
The SFU begins forwarding other participants' streams to this client.

Media traffic now flows directly between:

Client ↔ SFU

The signaling server is not involved in media transport.

8. Media Streaming Begins

At this point:

Camera video flows continuously.
Microphone audio flows continuously.
Screen share (if any) flows as a separate media track.
WebRTC handles congestion control and bitrate adaptation automatically.

The call is now fully active.

9. Signaling Remains Active for Control Events

Even after media starts flowing, the WebSocket signaling connection remains open.

It handles:

New participant joins
Participant leaves
Mute/unmute events
Screen share start/stop
Host actions
Reconnection events

It acts as the control plane, while WebRTC handles the media plane.

D — Data Model

This section defines how data flows between the frontend, backend, signaling service, and SFU.
For a video calling system, the data layer focuses on:

Meeting lifecycle
Participant state
Chat and reactions
Real-time meeting events

Core Endpoints

Method	Endpoint	Purpose
POST	/api/meetings	Create meeting
GET	/api/meetings/:meetingId	Fetch meeting details
POST	/api/meetings/:meetingId/join	Join meeting and receive tokens
POST	/api/meetings/:meetingId/leave	Leave meeting
POST	/api/meetings/:meetingId/end	End meeting (host)
GET	/api/meetings/:meetingId/participants	Fetch participant list
GET	/api/meetings/:meetingId/chat	Fetch chat history
POST	/api/meetings/:meetingId/chat	Send chat message

Join Meeting Response (Important Payload)

When a user joins a meeting, the backend returns all information needed to start signaling and WebRTC.

{
  "meetingId": "room_123",
  "user": {
    "id": "u1",
    "name": "Alex",
    "role": "host"
  },
  "signalingUrl": "wss://signal.app.com",
  "iceServers": [
    { "urls": "stun:stun.l.google.com:19302" },
    { "urls": "turn:turn.app.com", "username": "...", "credential": "..." }
  ],
  "sfuUrl": "wss://sfu.app.com",
  "token": "short_lived_jwt"
}

This response bootstraps the entire connection flow.
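
A client might type this payload and sanity-check it before opening any connection. The sketch below is one possible shape: the interface follows the JSON above, while `validateJoinResponse`, `toRtcConfiguration`, and the specific checks are assumptions for illustration.

```typescript
// Types mirroring the join payload shown above.
interface IceServerEntry {
  urls: string;
  username?: string;
  credential?: string;
}

interface JoinResponse {
  meetingId: string;
  user: { id: string; name: string; role: string };
  signalingUrl: string;
  iceServers: IceServerEntry[];
  sfuUrl: string;
  token: string;
}

function hasPrefix(value: string, prefix: string): boolean {
  return value.indexOf(prefix) === 0;
}

// Returns a list of problems; an empty list means the payload looks usable.
function validateJoinResponse(res: JoinResponse): string[] {
  const problems: string[] = [];
  if (!hasPrefix(res.signalingUrl, "wss://")) problems.push("signalingUrl must use wss://");
  if (!hasPrefix(res.sfuUrl, "wss://")) problems.push("sfuUrl must use wss://");
  if (res.token.length === 0) problems.push("token is missing");
  if (res.iceServers.length === 0) problems.push("at least one ICE server is required");
  res.iceServers.forEach((server) => {
    const isTurn = hasPrefix(server.urls, "turn:") || hasPrefix(server.urls, "turns:");
    if (isTurn && (!server.username || !server.credential)) {
      problems.push("TURN server " + server.urls + " is missing credentials");
    }
  });
  return problems;
}

// Builds the object passed to `new RTCPeerConnection(config)` in a browser.
function toRtcConfiguration(res: JoinResponse): { iceServers: IceServerEntry[] } {
  return { iceServers: res.iceServers.map((server) => ({ ...server })) };
}
```

Failing fast here gives the user a clear error on the join screen instead of a call that silently never connects.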

Chat Message Model

{
  "id": "msg1",
  "userId": "u1",
  "message": "Hello everyone",
  "createdAt": 1700000
}

Participant Model

{
  "id": "u1",
  "name": "Alex",
  "avatar": "avatar.jpg",
  "isMuted": false,
  "isVideoOn": true,
  "isScreenSharing": false,
  "isHost": true
}

2. Real-Time Signaling Events

Signaling messages flow through WebSocket.

Examples:

Participant Joined

{
  "type": "participant_joined",
  "participant": { ... }
}

Participant Left

{
  "type": "participant_left",
  "userId": "u2"
}

Mute / Camera Toggle

{
  "type": "participant_updated",
  "userId": "u2",
  "isMuted": true
}

Screen Share Started

{
  "type": "screen_share_started",
  "userId": "u1"
}
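
On the client, these events are typically folded into local participant state. The sketch below models them as a discriminated union and applies them with a pure reducer. `applyEvent` is a hypothetical name, and the optional `isVideoOn` field on `participant_updated` is an assumption based on the "Mute / Camera Toggle" heading, since the sample above only shows `isMuted`.

```typescript
interface Participant {
  id: string;
  name: string;
  avatar: string;
  isMuted: boolean;
  isVideoOn: boolean;
  isScreenSharing: boolean;
  isHost: boolean;
}

type SignalingEvent =
  | { type: "participant_joined"; participant: Participant }
  | { type: "participant_left"; userId: string }
  | { type: "participant_updated"; userId: string; isMuted?: boolean; isVideoOn?: boolean }
  | { type: "screen_share_started"; userId: string };

// Returns a new participant list; the input list is never mutated.
function applyEvent(state: Participant[], event: SignalingEvent): Participant[] {
  switch (event.type) {
    case "participant_joined": {
      const joined = event.participant;
      // Replace any stale entry for the same user (for example after a reconnect).
      return state.filter((p) => p.id !== joined.id).concat([joined]);
    }
    case "participant_left": {
      const leftId = event.userId;
      return state.filter((p) => p.id !== leftId);
    }
    case "participant_updated": {
      const { userId, isMuted, isVideoOn } = event;
      return state.map((p) =>
        p.id !== userId
          ? p
          : {
              ...p,
              isMuted: isMuted !== undefined ? isMuted : p.isMuted,
              isVideoOn: isVideoOn !== undefined ? isVideoOn : p.isVideoOn,
            }
      );
    }
    case "screen_share_started": {
      const sharerId = event.userId;
      // Assumes a single active screen share per meeting.
      return state.map((p) => ({ ...p, isScreenSharing: p.id === sharerId }));
    }
  }
}
```

Keeping the reducer pure makes it easy to batch several events before touching the UI and to unit-test meeting state without a live socket.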

O — Optimisation, Security, Accessibility

Video calling is one of the most performance-sensitive frontend systems. Small inefficiencies can cause lag, dropped frames, or poor call quality.

Optimisation

Join Time Optimisation

Reducing time to join a meeting is critical.

Request device permissions early (for example, on a pre-join screen)
Preconnect to signaling and SFU domains
Lazy-load non-critical UI (chat, settings panels)
Show local preview before joining meeting

Goal: Reduce “time to first frame”.

Adaptive Video Rendering

Rendering multiple video streams is expensive.

Render only visible participant videos
Pause off-screen video tiles
Limit number of HD streams simultaneously
Prioritize active speaker video quality

Goal: Prevent CPU and GPU overload.

Efficient Re-Renders

Meeting state updates frequently.

Memoize video tiles
Avoid re-rendering entire grid on participant changes
Use requestAnimationFrame for layout transitions
Batch participant updates from signaling

Goal: Keep UI responsive during large meetings.
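
Batching signaling updates can be sketched as a small queue that flushes once per scheduled tick. `UpdateBatcher` is a hypothetical helper, and the scheduler is injected so the logic is testable; in a browser it could be `(run) => requestAnimationFrame(() => run())`.

```typescript
class UpdateBatcher<T> {
  private queue: T[] = [];
  private scheduled = false;
  private readonly flush: (batch: T[]) => void;
  private readonly schedule: (run: () => void) => void;

  constructor(flush: (batch: T[]) => void, schedule: (run: () => void) => void) {
    this.flush = flush;
    this.schedule = schedule;
  }

  // Queue an update; only the first push in a tick schedules a flush.
  push(update: T): void {
    this.queue.push(update);
    if (this.scheduled) return;
    this.scheduled = true;
    this.schedule(() => {
      const batch = this.queue;
      this.queue = [];
      this.scheduled = false;
      this.flush(batch); // one state update (and one render) for the whole batch
    });
  }
}
```

With this in place, a burst of join and mute events arriving within one frame results in a single state update instead of one re-render per event.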

Bandwidth Optimisation

Network conditions vary widely.

Prefer lower quality video in grid view
Increase quality for active speaker
Automatically disable video on poor connections
Allow users to turn off incoming video

Goal: Maintain stable calls on weak networks.

Screen Share Optimisation

Screen sharing requires different priorities.

Prioritize screen share stream quality
Reduce other participant video quality during sharing
Switch layout to presentation mode automatically

Goal: Ensure readable screen sharing.

Security

Secure Meeting Access

Meeting join tokens should be short-lived
Validate permissions before joining room
Prevent unauthorized participants

Encrypted Media

WebRTC provides built-in encryption using DTLS and SRTP, ensuring audio and video streams remain secure.

Token Protection

Store tokens in memory or HTTP-only cookies
Avoid exposing tokens in URLs
Rotate tokens during long meetings

Abuse Prevention

Rate limit chat messages and reactions
Allow host to remove participants
Provide meeting lock functionality
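
Rate limiting chat messages and reactions is often done with a token bucket, which allows short bursts but caps the sustained rate. The sketch below assumes that technique; `TokenBucket`, its capacity, and its refill rate are illustrative choices. The clock is passed in explicitly so the behaviour is deterministic.

```typescript
class TokenBucket {
  private tokens: number;
  private lastRefillMs: number;
  private readonly capacity: number;
  private readonly refillPerSecond: number;

  constructor(capacity: number, refillPerSecond: number, nowMs: number) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.tokens = capacity; // start full so an initial burst is allowed
    this.lastRefillMs = nowMs;
  }

  // Returns true if the action (for example sending a message) is allowed at `nowMs`.
  tryConsume(nowMs: number): boolean {
    const elapsedSeconds = Math.max(0, nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
    this.lastRefillMs = Math.max(this.lastRefillMs, nowMs);
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}

// Example: bursts of up to 3 messages, then 1 message per second.
const chatLimiter = new TokenBucket(3, 1, 0);
```

A client-side check like this only improves feedback (for example disabling the send button); the server still has to enforce the limit, since a modified client can skip it.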

Accessibility

Keyboard Accessibility

Keyboard shortcuts for mute, camera, screen share
Accessible focus order for meeting controls

Screen Reader Support

Announce participant join/leave events
Announce mute/unmute changes
Label all meeting controls clearly

Captions Readiness

Support live captions integration
Allow caption customization (size, contrast)

Visual Accessibility

High contrast mode support
Clear participant name labels
Visible focus indicators

Designing Zoom (Video Calling App) - Frontend System Design
