Lossless Network for RDMA White Paper
1. RDMA Over Converged Ethernet (RoCE)
1.1 Principles of RDMA
In traditional TCP/IP communication, data transmission between two hosts involves several steps:
- On the sender side:
  - User-to-kernel memory copy: Application data is copied from user space to the socket send buffer in kernel space.
  - Protocol stack encapsulation: The kernel protocol stack adds TCP/IP headers layer by layer to form a complete packet.
  - DMA transfer: The packet is transferred from the kernel buffer to the NIC queue via DMA (copies can be reduced with zero-copy mechanisms such as sendfile).
- On the receiver side:
  - Hardware interrupt handling: Upon packet arrival, the NIC places the data into the DMA ring buffer and raises an interrupt; the kernel then processes the packets using the NAPI polling mechanism.
  - Protocol stack processing: The kernel decapsulates packet headers, performs checksum verification, and handles tasks such as out-of-order packet reassembly.
  - Ready data copy: Complete message data is copied from the kernel socket receive buffer to user space memory.
  - Context switch: The application is woken up, and execution returns from kernel mode to user mode to process the received data.
Performance bottlenecks include:
- Four context switches (two for send, two for receive).
- At least two full memory copies (user space to kernel space).
- Latency introduced by protocol stack processing (e.g., fragmentation/reassembly, ACK handling).
- Overhead from interrupt handling and software interrupt scheduling.
This architecture causes significant performance degradation in high-speed networking environments. RDMA (Remote Direct Memory Access) addresses these issues through kernel bypass and zero-copy; kernel-stack optimization techniques such as XDP and eBPF can further reduce the remaining overhead.
A diagram illustrates the RDMA verbs interface and traditional socket interface, showing the flow from application to user space, through the kernel, TCP/IP layers, Ethernet, and finally to the RDMA-enabled NIC and I/O space.
1.2 Overview of RoCE (RDMA over Converged Ethernet)
RoCE is an Ethernet-based Remote Direct Memory Access (RDMA) protocol that enables applications to directly read and write memory between hosts without operating system intervention. This significantly reduces communication latency and CPU overhead. By offloading data transfer tasks to the network adapter, RoCE delivers higher throughput and lower system resource consumption.
1.3 Three Key Features of RoCEv2
RoCEv2 builds upon RoCEv1 by introducing a UDP/IP protocol stack, enabling Layer 3 routing capability. This makes RoCEv2 well-suited for large-scale deployments in modern data centers.
Advantages of Broadcom RoCEv2 Solution
Broadcom's Ethernet adapters offer hardware-level support for RoCEv2, enabling high-performance, low-latency network communication across a wide range of scenarios, including AI/ML workloads, distributed storage, and high-performance computing (HPC). Key benefits include:
- High Throughput: Leverages NIC hardware acceleration with support for multi-queue, high-concurrency data transfers.
- Low Latency: Bypasses the kernel protocol stack, significantly reducing end-to-end latency for applications.
- Low CPU Utilization: The communication data path is offloaded from the CPU, freeing up host processing resources.
Key Features:
- Ethernet-Based Routed Communication: RoCEv2 utilizes UDP/IP encapsulation, enabling it to be routed across Layer 3 networks, making it suitable for large-scale data center deployments.
- IP QoS Assurance: RoCEv2 supports DiffServ (DSCP) and VLAN priority (802.1p) for priority-based traffic scheduling and control, ensuring stable performance for latency-sensitive workloads.
- IP-Based Congestion Control: RoCEv2 leverages ECN (Explicit Congestion Notification) for sender-side rate adjustment based on real-time network feedback, improving transmission efficiency.
Data Center Bridging (DCB) configurations can be integrated to enhance RoCEv2 network architecture, enabling coordinated optimization of priority-based flow control (PFC), ECN-based congestion management, and queue scheduling strategies.
Diagrams illustrate the difference in CPU involvement between TCP traffic and RDMA traffic, highlighting that RDMA traffic does not involve the CPU.
Core Mechanism: ECN Marking and CNP
- ECN Marking (Congestion Experienced, CE): When congestion is detected, the ECN field in the IP header is set to '11' (CE). This marking is primarily done by switches or ECN-capable routers when buffer thresholds are exceeded.
- Congestion Notification Packet (CNP): Upon receiving an ECN-marked packet, the receiver-side RNIC generates a CNP and sends it back to the sender's RNIC to request a reduction in transmission rate.
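To see this mechanism in action on a host, the RNIC's hardware counters can be inspected. The sketch below is a hedged example for a ConnectX (mlx5) RNIC; the device name, port, and counter names are assumptions and vary by vendor and driver version, and other vendors expose similar counters through their own tools.

```bash
# Hedged example: observing ECN marking and CNP activity on a ConnectX RNIC.
# Device name (mlx5_0), port number, and counter names are assumptions.
CNTRS=/sys/class/infiniband/mlx5_0/ports/1/hw_counters

# RoCE packets received with the ECN CE mark (notification point)
cat "${CNTRS}/np_ecn_marked_roce_packets"

# CNPs sent back to the sender (notification point) and CNPs handled by the
# sender-side rate limiter (reaction point)
cat "${CNTRS}/np_cnp_sent"
cat "${CNTRS}/rp_cnp_handled"
```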
1.4 Recommended RoCE RDMA Configuration Guidelines (Example)
A table outlines recommended settings for RoCE mode, ECN, CNP Response, PFC, Buffer, MTU, and DSCP/PCP.
| Item | Recommended Setting |
|---|---|
| RoCE Mode | Use RoCEv2 (UDP encapsulation, ECN supported) |
| ECN | Enable ECN marking on switches when buffer thresholds are exceeded |
| CNP Response | Driver supports it by default; DCQCN needs to be enabled |
| PFC | Enable PFC only for RDMA-specific traffic classes (e.g., TC1) |
| Buffer | Allocate at least 512 KB to 1 MB per enabled RDMA traffic class (TC) |
| MTU | Recommend enabling jumbo frames (e.g., MTU 9000) |
| DSCP / PCP | Ensure consistent mapping: DSCP → TC → Priority → PFC |
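As an illustration of the MTU row above, a minimal host-side sketch (interface and device names are placeholders):

```bash
# Hedged example: enable jumbo frames on the RoCE interface and confirm the
# MTU the RDMA device negotiated. Interface (ens1f0np0) and device (mlx5_0)
# names are placeholders.
sudo ip link set dev ens1f0np0 mtu 9000
ip link show dev ens1f0np0 | grep mtu

# The verbs layer caps the RoCE path MTU at 4096 bytes; check the active value
ibv_devinfo -d mlx5_0 | grep -i mtu
```

The same jumbo MTU must also be permitted on every switch port along the path.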
2. RoCE Deployment on FS Switches
This solution supports low-latency, lossless Ethernet environments based on RoCEv2, ideal for application scenarios requiring high bandwidth and low latency, such as AI training, high-performance computing (HPC), and distributed storage.
2.1 Device Selection Guidelines (Recommended Models)
- Broadcom-based Switching ASICs: Ensure RoCE feature support, including IEEE 802.1Qbb PFC, ECN, and QoS scheduling. Support for ECMP, DCQCN, and other congestion control enhancements at the ASIC level. Broad compatibility with mainstream RDMA NIC vendors.
- PicOS: A Flexible and Stable Network OS: Supports hybrid CLI (Cisco-like syntax) and Linux Shell for versatile operations. Open network OS with support for Ansible, ZTP, SNMP, and custom scripting. Unified configuration of PFC, ECN, and priority queues for large-scale RDMA traffic environments.
A table lists various FS switch models (N9600-64OD, N8650-320D, N8520-32D, etc.) with their port configurations, switching capacity, feature support (RoCEv2, MLAG, EVPN-VXLAN, PFC, ECN, QoS), and recommended use cases (Cloud computing, AI/ML clusters, Edge/DC interconnect, Deep learning networks, Hyperscale data centers, RDMA storage clusters).
Another table details switch models by data rate (800G, 400G, 100/200G, 10/25G) and their features, including ASIC chips.
2.2 Switch Configuration Recommendations for RoCE Networks
A table provides recommended settings for Port MTU, RoCE Traffic Priority, PFC, ECN, DSCP/COS Mapping, QoS Queue Binding, Lossless Queue Routing, and Broadcast/Unknown Multicast.
2.3 RoCE Configuration Steps on FS Switches (Example: N8560/N9550 Series)
Step 1: Enable PFC (Priority Flow Control)
Configure PFC on all relevant ports to prevent packet loss during congestion. This involves creating a PFC configuration profile and applying it to specific interfaces. Commands are provided to display service statistics and verify PFC configuration.
Step 2: Configure PFC Buffers
Fine-tune buffer thresholds for priority queues to optimize buffer resource usage. Commands are provided to set MTU, guaranteed buffer limits, static thresholds, and offsets for PFC queues.
Step 3: Configure PFC Watchdog
The PFC Watchdog detects and recovers from PFC deadlock conditions. Commands are provided to enable the watchdog, set detection and recovery intervals, and view configuration and statistics.
Step 4: Configure ECN (Explicit Congestion Notification)
Configure ECN (WRED) marking thresholds on the egress queues, then continuously monitor network performance and ECN marking rates and adjust the threshold values as needed. Commands are provided to enable WRED, set maximum and minimum thresholds and the packet drop probability, and enable ECN on queues.
Step 5: Enable Dynamic Load Balancing for ECMP
Enable dynamic load balancing for Equal-Cost Multi-Path (ECMP) routing to distribute traffic evenly across multiple member links, maximizing load balancing efficiency. A command is provided to set interface ECMP hash-mapping to dlb-normal.
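After the switch-side steps above are applied, a quick host-side sanity check helps confirm that PFC and ECN are actually taking effect. This is a hedged sketch; the interface name is a placeholder and the exact counter names differ between NIC drivers.

```bash
# PFC pause frames and ECN/CNP activity as seen by the NIC (counter names
# vary between bnxt_en and mlx5 drivers)
ethtool -S ens1f0np0 | grep -iE 'pause|pfc|ecn|cnp'

# Global link-level pause settings (typically disabled when PFC is in use)
ethtool -a ens1f0np0
```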
2.4 RoCE EasyDeploy Initialization
RoCE EasyDeploy simplifies the deployment and configuration of RoCE on switches, enabling seamless integration with servers for optimized network performance. It allows easy switching between lossless and lossy modes.
Features & Benefits: Simplifies RoCE deployment, enables fast switching between modes, fewer configuration steps, flexible interface-level control, and enhances network stability.
Limitations & Guidelines: Supported on specific platforms, requires alignment of ECN and PFC queues with server configuration, supports post-deployment fine-tuning, and requires continuous monitoring of RoCE statistics.
Configuration Example: Commands are provided to set RoCE mode to lossless and apply it to all interfaces, followed by verification steps.
A section details RoCE PCP/DSCP → LP mapping and LP → FC mapping configurations.
3. RoCE Configuration on Ethernet NICs
3.1 Hardware Requirements
Proper hardware provisioning is essential for RDMA. Key requirements include:
- CPU Configuration: PCIe 5.0 Support with sufficient lanes, high-frequency, multi-core CPUs. NUMA optimization is recommended.
- Memory Configuration: High bandwidth (DDR5 4800 MT/s or higher), sufficient capacity (128 GB minimum, 1 TB+ for large-scale tests), and ensuring 'ulimit -l unlimited' (unlimited locked memory) is set. NUMA affinity is crucial.
- BIOS Configuration: Disable virtualization, set CPU mode to performance, disable C-States and P-States, and disable IOMMU or set to pass-through mode.
A table compares CPU configuration requirements for Low-Latency RDMA (HPC, Small Packet) and High-Throughput RDMA (AI/Storage, Large Packet).
Another table details memory configuration requirements for both categories.
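A minimal sketch of the host-side tuning implied above (locked memory, CPU governor, NUMA affinity). Package names, the interface name, and the device name are assumptions; adapt them to the actual platform.

```bash
# 1. Allow unlimited locked memory for RDMA memory registration.
#    For persistence, add to /etc/security/limits.conf:
#      * soft memlock unlimited
#      * hard memlock unlimited
ulimit -l unlimited                      # current shell only

# 2. Set the CPU frequency governor to performance
sudo cpupower frequency-set -g performance

# 3. Check which NUMA node the NIC sits on and pin test processes to it
cat /sys/class/net/ens1f0np0/device/numa_node
numactl --cpunodebind=0 --membind=0 ib_write_bw -d mlx5_0   # example pinning
```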
3.2 NIC Selection
3.2.1 Broadcom
A table lists Broadcom NIC part numbers, ASIC chips, ports, and connectors, including models like BCM957504-P425G, BCM957508-P2100G, BCM957608-P1400G, etc.
Step 1: Install IP and RoCE Drivers
Instructions are provided for installing Broadcom drivers on Ubuntu 22.04 LTS, including downloading the driver package, extracting it, and running the installation script. Verification steps using 'dmesg' and 'modinfo' are also included.
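The general flow looks like the hedged sketch below; the archive and installer names are placeholders that change with each driver release, while bnxt_en/bnxt_re are the standard Broadcom Ethernet and RoCE kernel modules.

```bash
# Hedged sketch of the Broadcom driver install flow on Ubuntu 22.04.
# Archive and installer names are placeholders; follow the release README.
tar xzf bcm_driver_package.tar.gz
cd bcm_driver_package
sudo ./install.sh                 # installer name varies by release

# Verify that the Ethernet (bnxt_en) and RoCE (bnxt_re) modules are loaded
sudo dmesg | grep -i bnxt
modinfo bnxt_re | head
lsmod | grep bnxt
```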
Step 2: Manually Configure RoCE Settings
Details default RoCE configuration settings (RoCEv2 enabled, Congestion Control and PFC enabled, DSCP marking for traffic and CNP, MTU). Commands are provided to configure these settings, including ECN, PFC, and DSCP mapping.
Manually Modify NIC RoCE Configuration:
- DCQCN-ECN Configuration: Commands to query and set ROCE_CC_PRIO_MASK for DCQCN-ECN configuration.
- RoCE_np Settings: Commands to check and configure cnp_dscp and cnp_802p_prio for RoCE_np settings.
- PFC Configuration: Commands to check and configure PFC settings, including dscp2prio mapping and TC ratelimits.
3.2.2 NVIDIA
A table lists NVIDIA NIC part numbers, ASIC chips, ports, and connectors, including ConnectX-5, ConnectX-6, and ConnectX-7 series.
Step 1: Install IP and RoCE Drivers
Instructions for installing NVIDIA MLNX_OFED package on Ubuntu 22.04 LTS are provided, including downloading, extracting, and installing the package. Verification steps using 'dmesg' and 'modinfo' are included.
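A hedged sketch of that flow (the archive name depends on the release that was downloaded):

```bash
# MLNX_OFED install flow on Ubuntu 22.04; the archive name is an example only.
tar xzf MLNX_OFED_LINUX-24.10-2.1.8.0-ubuntu22.04-x86_64.tgz
cd MLNX_OFED_LINUX-24.10-2.1.8.0-ubuntu22.04-x86_64
sudo ./mlnxofedinstall
sudo /etc/init.d/openibd restart

# Verify the mlx5 driver and the RDMA devices it exposes
sudo dmesg | grep -i mlx5
modinfo mlx5_core | grep ^version
ibv_devices
```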
Step 2: Manually Configure RoCE Settings
Describes the Zero Touch RoCE (ZTR) solution for automatic configuration of PFC, ECN, and DSCP. The default RoCE configuration covers RoCEv2, PFC, ECN, DCQCN, and MTU settings. Commands are provided for manual configuration using the 'mlnx_qos' and 'mlxconfig' tools.
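As an illustration of the manual path, a hedged sketch using the standard ConnectX tools; the interface, device path, priority, and trust settings are assumptions and must match the switch-side mapping:

```bash
# Trust DSCP markings and enable PFC only on priority 3 (placeholder values)
sudo mlnx_qos -i ens1f0np0 --trust dscp
sudo mlnx_qos -i ens1f0np0 --pfc 0,0,0,1,0,0,0,0

# Enable DCQCN ECN handling (reaction point and notification point) for priority 3
echo 1 | sudo tee /sys/class/net/ens1f0np0/ecn/roce_rp/enable/3
echo 1 | sudo tee /sys/class/net/ens1f0np0/ecn/roce_np/enable/3

# Inspect RoCE-related firmware settings (device path is a placeholder;
# 'mst start' creates the /dev/mst node)
sudo mst start
sudo mlxconfig -d /dev/mst/mt4129_pciconf0 query | grep -i roce
```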
3.3 Verify RoCE Configuration
After driver installation and configuration, verify the RoCE setup by checking the GUID of the RoCE interface using 'ibv_devices' and 'ibv_devinfo'. Utilities from the 'perftest' package can then be used to verify RDMA traffic.
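A hedged example of such a verification run with the perftest utilities (device name, GID index, and server address are placeholders):

```bash
# List RDMA devices and their GUIDs
ibv_devices
ibv_devinfo -d mlx5_0

# Pick the GID index of the RoCEv2 GID (e.g., via 'ibv_devinfo -v' or show_gids)

# Bandwidth test: start the server side first, then the client
ib_write_bw  -d mlx5_0 -x 3 --report_gbits              # on the server
ib_write_bw  -d mlx5_0 -x 3 --report_gbits 192.0.2.10   # on the client

# Latency test
ib_write_lat -d mlx5_0 -x 3                              # on the server
ib_write_lat -d mlx5_0 -x 3 192.0.2.10                   # on the client
```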
4. RoCE Performance Testing and Results
Performance measurements were conducted on a cluster using Broadcom/NVIDIA NICs and FS switches. The following table summarizes the test environment.
| Server | Switch | NIC | Benchmarks |
|---|---|---|---|
| Dell R860 (CPU: 6448H; Memory: DDR4 5600 MT/s, 512 GB, 128 GB/socket; Kernel: 6.8.0-57-generic, Ubuntu 22.04) | N9550-32D (Software Version: 4.5.0E/3b574830da) | Broadcom P1400G (Driver: 1.10.3-232.0.155.5; Firmware: 230.2.36.0 / pkg 230.2.37.0; Congestion Control: DCQCN-p); NVIDIA MCX75310AAS-NEAT (Driver: 24.10-2.1.8; Firmware: 28.43.2566; Congestion Control: DCQCN-p) | Perftest 6.23 |
A diagram illustrates a spine-leaf network topology with FS switches and servers.
A table summarizes performance results for RDMA Point-to-Point Bandwidth, RDMA Point-to-Point Latency, RDMA Multi-Node Scalability, and Long-Term Stability Test, comparing Broadcom and NVIDIA NICs.
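For the long-term stability entry, perftest supports duration-based runs instead of fixed iteration counts; a hedged example (placeholder device, GID index, and address):

```bash
# Run the bandwidth test for a fixed time (here one hour) rather than a fixed
# number of iterations, to exercise the fabric for stability testing.
ib_write_bw -d mlx5_0 -x 3 -D 3600 --report_gbits              # server
ib_write_bw -d mlx5_0 -x 3 -D 3600 --report_gbits 192.0.2.10   # client
```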
FS Offices
FS has several offices around the world. Addresses and phone numbers for Shenzhen, Shanghai, and Wuhan are provided. FS and the FS logo are trademarks or registered trademarks of FS.