R — Requirements

Functional Requirements

1. Core Meeting Flow

The application should support:

  • Create a meeting room
  • Join a meeting using room ID or link
  • Leave meeting
  • End meeting (host only)
  • Rejoin automatically after temporary disconnect

2. Audio & Video Controls

Participants should be able to:

  • Turn camera on/off
  • Mute/unmute microphone
  • Switch camera or microphone device
  • See video streams of other participants
  • View mute/video status of others

3. Participant Management

The UI should support:

  • Dynamic grid layout based on participant count
  • Active speaker detection
  • Display participant names
  • Host controls (mute participant, remove participant)
  • Raise hand indicator

4. Screen Sharing

Participants should be able to:

  • Share entire screen or specific window
  • Stop screen sharing
  • View screen share in primary layout
  • Switch between grid and presentation mode

5. In-Meeting Chat

Participants should be able to:

  • Send text messages
  • Receive messages in real time
  • Send reactions (emoji)
  • View message history during session

Non-Functional Requirements

1. Internationalization & Localization (i18n / l10n)

  • Multi-language UI
  • Localized time formats
  • Support for captions in multiple languages

2. Offline & Network Handling

  • Automatic reconnection on network drop
  • Graceful degradation on low bandwidth
  • Maintain call state during short disconnects

3. Security Considerations

  • Secure meeting access (room tokens / passwords)
  • Encrypted media streams
  • Permission-based device access
  • Prevent unauthorized participants

4. Accessibility (a11y)

  • Keyboard controls for meeting actions
  • Screen reader support for controls
  • Clear focus management
  • High contrast mode support

5. Performance Expectations

  • Low join latency
  • Minimal audio/video lag
  • Stable frame rate
  • Smooth layout transitions during participant changes

6. Recording & Media Controls (Optional Advanced Feature)

  • Start/stop recording (host)
  • Show recording indicator
  • Display network quality indicator

A — Architecture

This section defines the major architectural decisions for a browser-based video calling system and why each choice fits real-time communication.


1. SSR vs CSR vs SSG

Chosen: CSR

Area                  | Rendering Strategy | Why
Video call experience | CSR                | Depends on browser media APIs
Meeting landing page  | SSR (optional)     | Faster initial load

Video calling relies heavily on browser APIs for camera, microphone, and screen capture, which exist only on the client. Rendering the call UI on the server therefore provides no real value.

2. REST APIs (App Backend)

REST APIs are used for:

  • Creating and joining meeting rooms
  • Fetching participant metadata
  • Authentication and meeting permissions
  • Sending chat messages and reactions

These APIs handle room lifecycle and meeting metadata, not media streaming.

3. Communication Protocol — WebRTC

Chosen: WebRTC (Web Real-Time Communication) for audio and video streaming

Real-time video calling has extremely strict latency and reliability requirements. The communication protocol must support continuous, bi-directional media transfer with minimal delay, automatic network adaptation, and strong security.

Available Options Considered

  • HTTP streaming (HLS / DASH)
  • WebSockets
  • WebRTC

Why HTTP Streaming Is Not Suitable

HTTP streaming is designed for watching video, not for talking.

It works by downloading small video chunks and buffering several seconds ahead of playback. This buffering introduces multi-second latency, which is acceptable for streaming platforms but unacceptable for conversations.

In a video call, even a 1–2 second delay breaks natural interaction.

Why WebSockets Alone Are Not Suitable

WebSockets provide a persistent connection and low-latency messaging, but they are not designed for media streaming.

Using WebSockets for video would require building:

  • Custom video encoding and decoding
  • Congestion control algorithms
  • Adaptive bitrate logic
  • Packet loss recovery
  • Jitter buffering

This would effectively mean rebuilding a media streaming protocol from scratch.

Why WebRTC Is the Correct Choice

WebRTC is specifically designed for real-time communication in browsers and already solves the hard problems required for video calling.

Ultra-Low Latency Media Transport

WebRTC is built for real-time interaction and typically delivers sub-second latency, often a few hundred milliseconds end to end on a healthy network.
This allows natural conversation without noticeable delay.

Built-In Encryption

All WebRTC media streams are encrypted by default using:

  • DTLS for key exchange
  • SRTP for media transport

This ensures audio, video, and screen sharing are secure without additional implementation.

Adaptive Bitrate & Congestion Control

Network conditions constantly change during calls. WebRTC automatically:

  • Adjusts video resolution
  • Adjusts bitrate and frame rate
  • Recovers from packet loss
  • Handles jitter and unstable networks

This keeps calls stable even on weak connections.

NAT Traversal & Firewall Handling

Most users are behind routers and firewalls. WebRTC includes ICE, STUN, and TURN mechanisms to establish connections even in restricted network environments.

Without this, most peer-to-peer connections would fail.

Browser-Native Support

Modern browsers provide built-in APIs for:

  • Camera and microphone access
  • Screen sharing
  • Media encoding and decoding
  • Network transport

This allows real-time calls without plugins or external software.

6. Peer-to-Peer vs SFU vs MCU

This is the most critical decision in video calling architecture because it determines scalability, latency, bandwidth usage, and infrastructure cost.

There are three major ways multi-participant calls can be built.

6.1 Mesh (Pure Peer-to-Peer)

In a mesh architecture, every participant sends their audio/video stream directly to every other participant.

If there are N participants, each user sends N-1 streams and receives N-1 streams.

Example with 5 people in a meeting:

  • Each user uploads 4 video streams
  • Each user downloads 4 video streams

The room as a whole needs N(N-1)/2 peer connections, so 5 participants already require 10 connections and 10 participants require 45.

Why Mesh Works for Very Small Calls

For 1-to-1 calls:

  • Only one connection is required
  • No server bandwidth needed
  • Lowest infrastructure cost

This is why WebRTC originally started as a peer-to-peer technology.

Why Mesh Fails for Group Calls

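Per-user bandwidth grows linearly with the number of participants, and the total number of streams in the room grows quadratically.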

For N users:

  • Upload bandwidth per user ≈ (N-1) streams
  • CPU encoding cost multiplies per stream

With 10 participants, each user would need to:

  • Encode and upload 9 video streams simultaneously
  • Download 9 streams

Most laptops and networks cannot sustain this.

Mesh becomes impractical beyond ~4 participants.
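
As a rough illustration of the numbers above, the following TypeScript sketch computes the load of a full-mesh call. The function name meshLoad and the 1500 kbps per-stream bitrate are assumptions chosen for the example, not measurements from any product.

```typescript
// Per-user and room-wide load in a full-mesh call (illustrative model only).
interface MeshLoad {
  uploadsPerUser: number;    // streams each participant encodes and sends
  downloadsPerUser: number;  // streams each participant receives
  totalConnections: number;  // distinct peer-to-peer links in the room
  uploadKbpsPerUser: number; // uplink needed at the assumed per-stream bitrate
}

function meshLoad(participants: number, kbpsPerStream: number): MeshLoad {
  if (participants < 1 || Math.floor(participants) !== participants) {
    throw new RangeError("participants must be a positive integer");
  }
  const peers = participants - 1;
  return {
    uploadsPerUser: peers,
    downloadsPerUser: peers,
    totalConnections: (participants * peers) / 2,
    uploadKbpsPerUser: peers * kbpsPerStream,
  };
}

// 10 participants at an assumed 1500 kbps per stream:
// 9 uploads per user, 45 connections in total, 13500 kbps of uplink per user.
const tenPersonCall = meshLoad(10, 1500);
```

The per-user cost is linear in N, but it lands on the weakest link in most home networks: the uplink.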

6.2 MCU (Multipoint Control Unit)

In an MCU architecture, all participants send their video to a central server.
The server mixes all streams into a single combined video and sends one stream back to each user.

So each user:

  • Uploads 1 stream
  • Downloads 1 combined stream

Advantages of MCU

  • Very simple for clients
  • Low bandwidth usage on user devices
  • Works well on weak devices

This was common in early enterprise video conferencing systems.

Why MCU Is Not Ideal Today

The server must:

  • Decode every participant’s video
  • Mix all videos into a single composition
  • Re-encode the final video for every participant

This is extremely CPU and GPU intensive.

Problems:

  • Very high infrastructure cost
  • Increased latency due to encoding pipeline
  • Fixed server-side layout, so dynamic client layouts (speaker view, grid view) are hard to support

MCU pushes too much work to the server.

6.3 SFU (Selective Forwarding Unit) — Chosen

An SFU acts as a smart router for media streams.

Participants send their video stream to the SFU.
The SFU does not decode or mix video.
It simply forwards streams to other participants.

So each user:

  • Uploads 1 stream
  • Downloads multiple streams (one per participant)

But the heavy encoding work stays on the client, not the server.

Why SFU Is the Best Balance

Scales to Large Meetings

Users upload only one stream regardless of meeting size.
The SFU distributes streams efficiently to all participants.

This removes the per-user upload growth that makes mesh impractical.

Lower Latency Than MCU

Since the SFU does not decode or re-encode video:

  • No heavy media processing on the server
  • Streams are forwarded quickly
  • Lower end-to-end latency

Dynamic Quality Selection (Simulcast)

Clients can send multiple quality layers of the same video:

  • Low resolution
  • Medium resolution
  • High resolution

The SFU decides which quality to forward based on:

  • Participant screen size
  • Network quality
  • Active speaker priority

For example:

  • Active speaker → high quality
  • Small grid tiles → low quality

This saves bandwidth and improves performance.
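
A minimal sketch of how such a forwarding decision could look. The layer bitrates, the pixel thresholds, and the names pickLayer and SubscriberHints are all hypothetical; a real SFU also reacts to bandwidth estimates that change continuously.

```typescript
type SimulcastLayer = "low" | "medium" | "high";

interface SubscriberHints {
  tileHeightPx: number;     // rendered height of the tile on the receiving client
  availableKbps: number;    // estimated downlink budget for this one stream
  isActiveSpeaker: boolean; // active speakers are prioritised
}

// Assumed per-layer bitrates; real values depend on codec and encoder settings.
const LAYER_KBPS: Record<SimulcastLayer, number> = { low: 150, medium: 500, high: 1500 };

function pickLayer(hints: SubscriberHints): SimulcastLayer {
  // Start from what the tile can usefully display.
  let wanted: SimulcastLayer =
    hints.isActiveSpeaker || hints.tileHeightPx >= 540
      ? "high"
      : hints.tileHeightPx >= 270
        ? "medium"
        : "low";

  // Step down until the layer fits the bandwidth budget ("low" is the floor).
  if (wanted === "high" && hints.availableKbps < LAYER_KBPS.high) wanted = "medium";
  if (wanted === "medium" && hints.availableKbps < LAYER_KBPS.medium) wanted = "low";
  return wanted;
}
```

With these assumed thresholds, a 180 px grid tile receives the low layer even on a fast connection, while an active speaker on a 600 kbps budget is stepped down to the medium layer.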

Flexible Layout on Client

Since streams are not mixed, each client can decide:

  • Grid view
  • Speaker view
  • Screen share priority

This enables modern meeting UI layouts.
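
Because layout is a client decision, it can be computed locally from the participant count. The sketch below uses a near-square grid heuristic and a simple mode switch; gridFor and chooseLayout are hypothetical names, and the heuristic is an assumption rather than the layout algorithm of any particular product.

```typescript
interface GridDimensions {
  columns: number;
  rows: number;
}

// Near-square grid: the smallest column count whose square covers all tiles.
function gridFor(tileCount: number): GridDimensions {
  if (tileCount <= 0) return { columns: 0, rows: 0 };
  const columns = Math.ceil(Math.sqrt(tileCount));
  const rows = Math.ceil(tileCount / columns);
  return { columns, rows };
}

type LayoutMode = "grid" | "speaker" | "presentation";

// An active screen share takes priority over the user's preferred view.
function chooseLayout(screenShareActive: boolean, userPrefersSpeaker: boolean): LayoutMode {
  if (screenShareActive) return "presentation";
  return userPrefersSpeaker ? "speaker" : "grid";
}

// 5 participants fit a 3 x 2 grid, 10 participants a 4 x 3 grid.
const fivePersonGrid = gridFor(5);
```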

It is called a Selective Forwarding Unit because the server does not mix or process video like an MCU; instead, it simply receives media streams from participants and forwards them to others. The “selective” part comes from the fact that it decides which streams to send, to whom, and in what quality. For example, the active speaker may be forwarded in high quality while other participants are sent in lower quality or not sent at all, depending on screen size, bandwidth, and device capability. This smart forwarding reduces bandwidth usage and allows video calls to scale efficiently.

Why Modern Platforms Use SFU

Platforms such as Zoom, Google Meet, Microsoft Teams, and Discord are generally understood to rely on SFU-style media routing, because it provides the best balance between:

  • Scalability
  • Cost
  • Latency
  • Flexibility

SFU has become the industry standard for large real-time meetings.


7. Signaling Server

The signaling server is the coordination layer that helps participants discover each other and establish WebRTC connections before any audio/video flows. WebRTC handles media transport, but it does not define how peers find each other or exchange connection metadata. That responsibility belongs to the signaling service.

Why a Signaling Server Is Required

Before two browsers can send audio/video, they must agree on:

  • Who is in the meeting
  • What media capabilities each device supports
  • How to reach each other across NAT/firewalls
  • Which codecs and encryption settings to use

This coordination must happen in real time, which is why signaling typically uses WebSockets.

The signaling server never carries media. Once the WebRTC connection is established, it stays out of the media path and only handles control messages.

Responsibilities of the Signaling Server

Room and Participant Coordination

The signaling service manages meeting rooms and participant presence.

It handles events such as:

  • User joins room
  • User leaves room
  • Participant list updates
  • Host actions (mute all, remove participant)

This allows every client to maintain a real-time view of the meeting state.


SDP Offer / Answer Exchange

To start a WebRTC connection, peers must exchange SDP messages that describe their media capabilities.

The signaling server relays these messages between participants or between client and SFU:

  • Offer: “Here is how I can send/receive media.”
  • Answer: “Here is what I support. Let’s connect.”

This negotiation step ensures both sides agree on codecs, encryption, and media directions.
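
The offer/answer exchange can be viewed as a small state machine. The sketch below models a simplified subset of the signaling states that RTCPeerConnection exposes ("stable", "have-local-offer", "have-remote-offer"); provisional answers and rollback are deliberately left out, and nextState is a hypothetical helper, not a browser API.

```typescript
type NegotiationState = "stable" | "have-local-offer" | "have-remote-offer";
type SdpKind = "offer" | "answer";
type SdpSide = "local" | "remote";

// Returns the state after applying a local or remote description.
function nextState(state: NegotiationState, side: SdpSide, kind: SdpKind): NegotiationState {
  if (state === "stable" && kind === "offer") {
    return side === "local" ? "have-local-offer" : "have-remote-offer";
  }
  if (state === "have-local-offer" && side === "remote" && kind === "answer") {
    return "stable"; // our offer was answered
  }
  if (state === "have-remote-offer" && side === "local" && kind === "answer") {
    return "stable"; // we answered the other side's offer
  }
  throw new Error(`invalid transition: ${side} ${kind} while ${state}`);
}

// Offering side (for example, a client negotiating with the SFU):
const afterOffer = nextState("stable", "local", "offer");
const afterAnswer = nextState(afterOffer, "remote", "answer");
```

Modelling the states explicitly makes it easier to reject out-of-order messages, such as an answer that arrives when no offer is pending.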


ICE Candidate Exchange

Most users are behind routers and firewalls.
To find a working network path, WebRTC gathers ICE candidates (possible connection routes).

The signaling server relays these candidates so peers can discover a viable route.

Without this step, most WebRTC connections would fail.


Real-Time Meeting Events

The signaling server also distributes lightweight meeting events:

  • Mute/unmute notifications
  • Camera on/off updates
  • Screen share start/stop
  • Active speaker signals
  • Raise hand / reactions

These events must reach all participants instantly, making WebSockets ideal.


Typical Connection Flow

[Diagram: the client UI (video grid, participant list, live chat, camera module) talks to the backend over HTTP/2, to the signaling service over WebSocket, and to the SFU over WebRTC.]

1. User Joins Meeting via HTTP API

When a user clicks a meeting link:

  • The browser sends an HTTP request to the backend.
  • The backend authenticates the user.
  • The backend verifies meeting permissions.
  • The backend returns:
    • Meeting ID
    • User identity
    • Access token
    • Signaling server URL
    • ICE server configuration (STUN/TURN)

At this stage, no media connection exists yet.


2. Client Opens WebSocket to Signaling Server

The browser establishes a persistent WebSocket connection to the signaling server.

This connection is used for:

  • Real-time room coordination
  • Connection negotiation
  • Participant event broadcasting

Now the client is officially connected to the meeting room.


3. Server Sends Current Participant List

The signaling server responds with:

  • List of participants currently in the room
  • Host information
  • Active speaker (if any)
  • Ongoing screen share status

The client updates its UI accordingly.


4. WebRTC Offer Creation (Client → SFU)

Because we are using SFU architecture:

  • The client creates a WebRTC PeerConnection.
  • The client generates an SDP (Session Description Protocol) offer describing:
    • Supported codecs
    • Media tracks (audio/video)
    • Encryption parameters
    • Any ICE candidates gathered so far (the rest are exchanged in step 6)

The offer is sent via WebSocket to the signaling server, which forwards it to the SFU.


5. SFU Responds with SDP Answer

The SFU:

  • Reviews the client’s capabilities
  • Decides media routing strategy
  • Returns an SDP answer

The signaling server forwards this answer back to the client.

Now both sides agree on how media will be exchanged.


6. ICE Candidate Exchange

Both client and SFU gather network candidates:

  • Host candidates (local network addresses)
  • Server-reflexive candidates (public address discovered via STUN)
  • Relay candidates (allocated on a TURN server)

These candidates are exchanged via the signaling server.

This step ensures:

  • NAT traversal
  • Firewall traversal
  • Reliable connectivity

Once a working candidate pair is found, the WebRTC connection becomes active.
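
The candidate types are not equally preferred. ICE (RFC 8445) recommends a priority formula that ranks local network candidates (type "host") above public IP candidates discovered via STUN (type "srflx") and relay candidates (type "relay"). The sketch below applies that formula with the recommended type preferences. The function names are hypothetical. Browsers compute these priorities internally and then run connectivity checks on candidate pairs, which this sketch does not model.

```typescript
type CandidateType = "host" | "srflx" | "prflx" | "relay";

// Type preferences recommended by RFC 8445 (higher means more preferred).
const TYPE_PREFERENCE: Record<CandidateType, number> = {
  host: 126,
  prflx: 110,
  srflx: 100,
  relay: 0,
};

// priority = 2^24 * typePreference + 2^8 * localPreference + (256 - componentId)
function candidatePriority(
  type: CandidateType,
  localPreference: number = 65535,
  componentId: number = 1
): number {
  return (
    Math.pow(2, 24) * TYPE_PREFERENCE[type] +
    Math.pow(2, 8) * localPreference +
    (256 - componentId)
  );
}

// Orders candidate types from most to least preferred without mutating the input.
function rankByPriority(types: CandidateType[]): CandidateType[] {
  return types.slice().sort((a, b) => candidatePriority(b) - candidatePriority(a));
}

// host: 2130706431, srflx: 1694498815, relay: 16777215
const hostPriority = candidatePriority("host");
```

Relay candidates rank last because every packet takes an extra hop through the TURN server, but they are often the only route that works on restrictive networks.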


7. WebRTC Connection Established

Now:

  • The client starts sending audio/video tracks to the SFU.
  • The SFU begins forwarding other participants' streams to this client.

Media traffic now flows directly between:

Client ↔ SFU

The signaling server is not involved in media transport.


8. Media Streaming Begins

At this point:

  • Camera video flows continuously.
  • Microphone audio flows continuously.
  • Screen share (if any) flows as a separate media track.
  • WebRTC handles congestion control and bitrate adaptation automatically.

The call is now fully active.


9. Signaling Remains Active for Control Events

Even after media starts flowing, the WebSocket signaling connection remains open.

It handles:

  • New participant joins
  • Participant leaves
  • Mute/unmute events
  • Screen share start/stop
  • Host actions
  • Reconnection events

It acts as the control plane, while WebRTC handles the media plane.
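
The reconnection events listed above imply that the client retries its signaling connection after a drop. The design does not prescribe a retry policy. One common choice, assumed here, is exponential backoff with full jitter so that many clients do not reconnect in lockstep after an outage. The names reconnectDelayMs and BackoffOptions and the example timings are hypothetical.

```typescript
interface BackoffOptions {
  baseMs: number; // delay ceiling for the first retry
  maxMs: number;  // upper bound on the delay ceiling for any retry
}

// attempt is zero-based. `random` is injectable so the function can be tested.
function reconnectDelayMs(
  attempt: number,
  options: BackoffOptions,
  random: () => number = Math.random
): number {
  const exponential = options.baseMs * Math.pow(2, attempt);
  const ceiling = Math.min(options.maxMs, exponential);
  // Full jitter: pick uniformly in [0, ceiling).
  return Math.floor(random() * ceiling);
}

// With a fixed random value of 0.5 the delays are 250, 500, 1000, 2000 ms
// and eventually stay at 5000 ms once the 10 s cap is reached.
const exampleOptions: BackoffOptions = { baseMs: 500, maxMs: 10000 };
const firstDelay = reconnectDelayMs(0, exampleOptions, () => 0.5);
```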


D — Data Model

This section defines how data flows between the frontend, backend, signaling service, and SFU.
For a video calling system, the data layer focuses on:

  • Meeting lifecycle
  • Participant state
  • Chat and reactions
  • Real-time meeting events

1. Core Endpoints

Method | Endpoint                              | Purpose
POST   | /api/meetings                         | Create meeting
GET    | /api/meetings/:meetingId              | Fetch meeting details
POST   | /api/meetings/:meetingId/join         | Join meeting and receive tokens
POST   | /api/meetings/:meetingId/leave        | Leave meeting
POST   | /api/meetings/:meetingId/end          | End meeting (host)
GET    | /api/meetings/:meetingId/participants | Fetch participant list
GET    | /api/meetings/:meetingId/chat         | Fetch chat history
POST   | /api/meetings/:meetingId/chat         | Send chat message

Join Meeting Response (Important Payload)

When a user joins a meeting, the backend returns all information needed to start signaling and WebRTC.

{
  "meetingId": "room_123",
  "user": {
    "id": "u1",
    "name": "Alex",
    "role": "host"
  },
  "signalingUrl": "wss://signal.app.com",
  "iceServers": [
    { "urls": "stun:stun.l.google.com:19302" },
    { "urls": "turn:turn.app.com", "username": "...", "credential": "..." }
  ],
  "sfuUrl": "wss://sfu.app.com",
  "token": "short_lived_jwt"
}

This response bootstraps the entire connection flow.
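
Since every later step depends on this payload, it is worth validating before use. The sketch below is one way to do that on the client. The JoinResponse type mirrors the example above, and parseJoinResponse is a hypothetical helper. For simplicity it assumes "urls" is a single string as in the example, although WebRTC also accepts an array of URLs.

```typescript
interface IceServerConfig {
  urls: string;
  username?: string;
  credential?: string;
}

interface JoinResponse {
  meetingId: string;
  user: { id: string; name: string; role: string };
  signalingUrl: string;
  iceServers: IceServerConfig[];
  sfuUrl: string;
  token: string;
}

function isString(value: unknown): value is string {
  return typeof value === "string";
}

// Parses the join payload and throws if a field the client depends on is missing.
function parseJoinResponse(body: string): JoinResponse {
  const data = JSON.parse(body); // untyped JSON, checked field by field below
  const user = data && data.user;
  const ok =
    data !== null &&
    typeof data === "object" &&
    isString(data.meetingId) &&
    isString(data.signalingUrl) &&
    isString(data.sfuUrl) &&
    isString(data.token) &&
    user &&
    isString(user.id) &&
    isString(user.name) &&
    isString(user.role) &&
    Array.isArray(data.iceServers) &&
    data.iceServers.every((server: any) => server && isString(server.urls));
  if (!ok) throw new Error("malformed join response");
  return data as JoinResponse;
}
```

In a browser, the validated iceServers array can then be passed as the iceServers field of the RTCPeerConnection configuration.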


Chat Message Model

{
  "id": "msg1",
  "userId": "u1",
  "message": "Hello everyone",
  "createdAt": 1700000
}
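
Chat history arrives over HTTP while new messages arrive in real time, so the same message can reach the client twice. A minimal sketch of merging both sources, assuming that "id" is unique per message and that "createdAt" is a numeric timestamp that can be sorted; mergeMessages is a hypothetical helper.

```typescript
interface ChatMessage {
  id: string;
  userId: string;
  message: string;
  createdAt: number;
}

// Merges fetched history with live messages: duplicates (same id) are dropped,
// then the result is ordered by createdAt.
function mergeMessages(fetched: ChatMessage[], live: ChatMessage[]): ChatMessage[] {
  // Prototype-free object so that unusual ids cannot collide with built-in keys.
  const seen: Record<string, boolean> = Object.create(null);
  const merged: ChatMessage[] = [];
  for (const msg of fetched.concat(live)) {
    if (seen[msg.id]) continue;
    seen[msg.id] = true;
    merged.push(msg);
  }
  return merged.sort((a, b) => a.createdAt - b.createdAt);
}
```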

Participant Model

{
  "id": "u1",
  "name": "Alex",
  "avatar": "avatar.jpg",
  "isMuted": false,
  "isVideoOn": true,
  "isScreenSharing": false,
  "isHost": true
}

2. Real-Time Signaling Events

Signaling messages flow through WebSocket.

Examples:

Participant Joined

{
  "type": "participant_joined",
  "participant": { ... }
}

Participant Left

{
  "type": "participant_left",
  "userId": "u2"
}

Mute / Camera Toggle

{
  "type": "participant_updated",
  "userId": "u2",
  "isMuted": true
}

Screen Share Started

{
  "type": "screen_share_started",
  "userId": "u1"
}
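
On the client, these events can be folded into local meeting state with a pure reducer. The sketch below covers only the four event types shown above. The isVideoOn field on participant_updated is an assumption based on the "Mute / Camera Toggle" heading, and applyEvent, Roster, and ParticipantState are hypothetical names.

```typescript
interface ParticipantState {
  id: string;
  name: string;
  isMuted: boolean;
  isVideoOn: boolean;
  isScreenSharing: boolean;
}

type SignalingEvent =
  | { type: "participant_joined"; participant: ParticipantState }
  | { type: "participant_left"; userId: string }
  | { type: "participant_updated"; userId: string; isMuted?: boolean; isVideoOn?: boolean }
  | { type: "screen_share_started"; userId: string };

type Roster = Record<string, ParticipantState>;

function hasParticipant(roster: Roster, id: string): boolean {
  return Object.prototype.hasOwnProperty.call(roster, id);
}

// Returns a new roster; events for unknown users return the input unchanged.
function applyEvent(roster: Roster, event: SignalingEvent): Roster {
  const next: Roster = { ...roster };
  switch (event.type) {
    case "participant_joined":
      next[event.participant.id] = event.participant;
      return next;
    case "participant_left":
      delete next[event.userId];
      return next;
    case "participant_updated": {
      if (!hasParticipant(next, event.userId)) return roster;
      const current = next[event.userId];
      next[event.userId] = {
        ...current,
        isMuted: event.isMuted === undefined ? current.isMuted : event.isMuted,
        isVideoOn: event.isVideoOn === undefined ? current.isVideoOn : event.isVideoOn,
      };
      return next;
    }
    case "screen_share_started": {
      if (!hasParticipant(next, event.userId)) return roster;
      next[event.userId] = { ...next[event.userId], isScreenSharing: true };
      return next;
    }
  }
}
```

Keeping the reducer pure means the same function can drive UI state and be unit tested without a WebSocket.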


O — Optimisation, Security, Accessibility

Video calling is one of the most performance-sensitive frontend systems. Small inefficiencies can cause lag, dropped frames, or poor call quality.

Optimisation

Join Time Optimisation

Reducing time to join a meeting is critical.

  • Request camera and microphone permissions early (for example, on a pre-join screen)
  • Preconnect to signaling and SFU domains
  • Lazy-load non-critical UI (chat, settings panels)
  • Show local preview before joining meeting

Goal: Reduce “time to first frame”.

Adaptive Video Rendering

Rendering multiple video streams is expensive.

  • Render only visible participant videos
  • Pause off-screen video tiles
  • Limit number of HD streams simultaneously
  • Prioritize active speaker video quality

Goal: Prevent CPU and GPU overload.

Efficient Re-Renders

Meeting state updates frequently.

  • Memoize video tiles
  • Avoid re-rendering entire grid on participant changes
  • Use requestAnimationFrame for layout transitions
  • Batch participant updates from signaling

Goal: Keep UI responsive during large meetings.
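
Batching can be as simple as queueing incoming updates and collapsing them before touching UI state. A minimal sketch, assuming updates are partial patches keyed by userId where later fields win; coalescePatches and ParticipantPatch are hypothetical names.

```typescript
interface ParticipantPatch {
  userId: string;
  isMuted?: boolean;
  isVideoOn?: boolean;
}

// Collapses a burst of updates into one patch per user.
// Later patches override earlier fields; first-seen user order is preserved.
function coalescePatches(queue: ParticipantPatch[]): ParticipantPatch[] {
  const byUser: Record<string, ParticipantPatch> = Object.create(null);
  const order: string[] = [];
  for (const patch of queue) {
    const existing = byUser[patch.userId];
    if (!existing) order.push(patch.userId);
    byUser[patch.userId] = { ...existing, ...patch };
  }
  return order.map((id) => byUser[id]);
}
```

In a browser, the queue would typically be flushed once per animation frame, so a burst of signaling messages causes a single render instead of one render per message.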

Bandwidth Optimisation

Network conditions vary widely.

  • Prefer lower quality video in grid view
  • Increase quality for active speaker
  • Automatically disable video on poor connections
  • Allow users to turn off incoming video

Goal: Maintain stable calls on weak networks.

Screen Share Optimisation

Screen sharing requires different priorities.

  • Prioritize screen share stream quality
  • Reduce other participant video quality during sharing
  • Switch layout to presentation mode automatically

Goal: Ensure readable screen sharing.

Security

Secure Meeting Access

  • Meeting join tokens should be short-lived
  • Validate permissions before joining room
  • Prevent unauthorized participants

Encrypted Media

WebRTC provides built-in encryption using DTLS and SRTP, ensuring audio and video streams remain secure.

Token Protection

  • Store tokens in memory or HTTP-only cookies
  • Avoid exposing tokens in URLs
  • Rotate tokens during long meetings

Abuse Prevention

  • Rate limit chat messages and reactions
  • Allow host to remove participants
  • Provide meeting lock functionality
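
Rate limiting chat and reactions is commonly done with a token bucket, which allows short bursts but caps the sustained rate. The design above does not name an algorithm, so the TokenBucket class below is an illustrative assumption, with the clock passed in explicitly to keep it testable. Enforcement belongs on the server; a client-side copy only improves feedback for well-behaved users.

```typescript
class TokenBucket {
  private readonly capacity: number;
  private readonly refillPerSecond: number;
  private tokens: number;
  private lastRefillMs: number;

  constructor(capacity: number, refillPerSecond: number, nowMs: number) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.tokens = capacity; // start full so an initial burst is allowed
    this.lastRefillMs = nowMs;
  }

  // Returns true and spends one token if the action is allowed at time nowMs.
  tryConsume(nowMs: number): boolean {
    const elapsedSeconds = Math.max(0, nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
    this.lastRefillMs = nowMs;
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}

// Example policy (assumed values): bursts of 3 messages, then 1 message per second.
const chatLimiter = new TokenBucket(3, 1, 0);
```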

Accessibility (a11y)

Keyboard Accessibility

  • Keyboard shortcuts for mute, camera, screen share
  • Accessible focus order for meeting controls

Screen Reader Support

  • Announce participant join/leave events
  • Announce mute/unmute changes
  • Label all meeting controls clearly
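
Announcements are usually produced by writing short text into an ARIA live region. The sketch below covers only the text generation step; announcementFor, the event shapes, and the exact wording are hypothetical and would normally go through the localization layer.

```typescript
type AnnounceableEvent =
  | { kind: "joined"; name: string }
  | { kind: "left"; name: string }
  | { kind: "mute_changed"; name: string; muted: boolean };

// Builds the sentence a screen reader should speak for a meeting event.
function announcementFor(event: AnnounceableEvent): string {
  switch (event.kind) {
    case "joined":
      return `${event.name} joined the meeting`;
    case "left":
      return `${event.name} left the meeting`;
    case "mute_changed":
      return `${event.name} is now ${event.muted ? "muted" : "unmuted"}`;
  }
}

const joinAnnouncement = announcementFor({ kind: "joined", name: "Alex" });
```

In large meetings, a polite live region (aria-live="polite") combined with throttling keeps join and leave announcements from interrupting the conversation.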

Captions Readiness

  • Support live captions integration
  • Allow caption customization (size, contrast)

Visual Accessibility

  • High contrast mode support
  • Clear participant name labels
  • Visible focus indicators