Lossless Network for RDMA White Paper

1. RDMA Over Converged Ethernet (RoCE)

1.1 Principles of RDMA

In traditional TCP/IP communication, data transmission between two hosts involves several steps: the application writes to a socket, the data is copied from the user buffer into kernel socket buffers, the kernel protocol stack segments it and adds TCP/IP headers, and the NIC transmits it onto the wire; the receiver reverses the process through interrupt handling, protocol processing, and another copy into user space.

Performance bottlenecks include multiple memory copies, frequent user/kernel context switches, CPU-intensive protocol processing, and interrupt handling, all of which consume CPU cycles and add latency.

This architecture causes significant performance degradation in high-speed networking environments. Kernel stack optimization techniques such as XDP and eBPF reduce some of this overhead, but RDMA (Remote Direct Memory Access) eliminates it: its kernel bypass and zero-copy mechanisms move data directly between application buffers without involving the host protocol stack.

A diagram illustrates the RDMA verbs interface and traditional socket interface, showing the flow from application to user space, through the kernel, TCP/IP layers, Ethernet, and finally to the RDMA-enabled NIC and I/O space.

1.2 Overview of RoCE (RDMA over Converged Ethernet)

RoCE is an Ethernet-based Remote Direct Memory Access (RDMA) protocol that enables applications to directly read and write memory between hosts without operating system intervention. This significantly reduces communication latency and CPU overhead. By offloading data transfer tasks to the network adapter, RoCE delivers higher throughput and lower system resource consumption.

1.3 Three Key Features of RoCEv2

RoCEv2 builds upon the original architecture by introducing the UDP/IP protocol stack, enabling Layer 3 routing capability. This makes RoCEv2 well-suited for large-scale deployments in modern data centers.
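Concretely, a RoCEv2 packet carries the InfiniBand transport headers and payload inside an ordinary UDP/IP packet with the IANA-assigned UDP destination port 4791, which is what makes it routable at Layer 3. The sketch below builds only the illustrative outer IPv4/UDP headers (checksums left at zero, field values hypothetical) to show where the routable encapsulation and the DSCP marking live:

```python
import struct

ROCEV2_UDP_DPORT = 4791  # IANA-assigned UDP destination port for RoCEv2


def build_rocev2_outer_headers(src_ip: str, dst_ip: str, src_port: int,
                               payload_len: int, dscp: int) -> bytes:
    """Build illustrative IPv4 + UDP outer headers for a RoCEv2 packet.

    The UDP payload would be the InfiniBand BTH plus RDMA payload and
    ICRC; checksums are left at 0 here for brevity.
    """
    def ip_to_bytes(ip: str) -> bytes:
        return bytes(int(octet) for octet in ip.split("."))

    udp_len = 8 + payload_len
    total_len = 20 + udp_len
    tos = dscp << 2                       # DSCP sits in the upper 6 bits of ToS
    ipv4 = struct.pack("!BBHHHBBH4s4s",
                       0x45, tos, total_len,  # version/IHL, ToS, total length
                       0, 0,                  # identification, flags/fragment
                       64, 17, 0,             # TTL, protocol=UDP(17), checksum
                       ip_to_bytes(src_ip), ip_to_bytes(dst_ip))
    udp = struct.pack("!HHHH", src_port, ROCEV2_UDP_DPORT, udp_len, 0)
    return ipv4 + udp


# Hypothetical flow: RDMA traffic marked with DSCP 26
hdrs = build_rocev2_outer_headers("10.0.0.1", "10.0.0.2", 49152, 1024, dscp=26)
```

Because the outer headers are plain UDP/IP, any standard router or Layer 3 switch can forward RoCEv2 traffic, and the DSCP field carries the QoS marking discussed in the configuration guidelines below.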

Advantages of Broadcom RoCEv2 Solution

Broadcom's Ethernet adapters offer hardware-level support for RoCEv2, enabling high-performance, low-latency network communication across a wide range of scenarios, including AI/ML workloads, distributed storage, and high-performance computing (HPC). Key benefits include:

Key Features:

Data Center Bridging (DCB) configurations can be integrated to enhance RoCEv2 network architecture, enabling coordinated optimization of priority-based flow control (PFC), ECN-based congestion management, and queue scheduling strategies.

Diagrams illustrate the difference in CPU involvement between TCP traffic and RDMA traffic, highlighting that RDMA traffic does not involve the CPU.

Core Mechanism: ECN Marking and CNP
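In this mechanism, switches set the ECN Congestion Experienced bit when a queue exceeds a threshold, the receiving NIC echoes Congestion Notification Packets (CNPs) back to the sender, and the sender's NIC reacts by cutting its rate and recovering when the CNPs stop. The sketch below models only the DCQCN-style reaction point in a highly simplified form (parameter names such as alpha and g follow the published DCQCN algorithm; real NIC firmware adds timers, byte counters, and fast-recovery stages):

```python
class DcqcnSenderSketch:
    """Highly simplified DCQCN-style reaction point.

    On each CNP the sending rate is cut in proportion to the running
    congestion estimate `alpha`; between CNPs the rate recovers toward
    a remembered target. Illustrative only, not NIC firmware behavior.
    """
    def __init__(self, line_rate_gbps: float, g: float = 1 / 16):
        self.rate = line_rate_gbps      # current sending rate
        self.target = line_rate_gbps    # rate to recover toward
        self.alpha = 1.0                # congestion estimate in [0, 1]
        self.g = g                      # alpha update gain

    def on_cnp(self) -> None:
        """Receiver echoed an ECN mark: cut the rate multiplicatively."""
        self.target = self.rate
        self.rate *= (1 - self.alpha / 2)
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_quiet_period(self) -> None:
        """No CNP in an update period: decay alpha, recover the rate."""
        self.alpha = (1 - self.g) * self.alpha
        self.rate = (self.rate + self.target) / 2  # halve the gap each period


s = DcqcnSenderSketch(line_rate_gbps=100.0)
s.on_cnp()                 # congestion seen: rate drops sharply
for _ in range(10):
    s.on_quiet_period()    # congestion clears: rate climbs back to ~line rate
```

The key property this models is that congestion is relieved by slowing the sender rather than by dropping packets, which is what keeps a lossless fabric lossless under load.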

1.4 Recommended RoCE RDMA Configuration Guidelines (Example)

A table outlines recommended settings for RoCE mode, ECN, CNP Response, PFC, Buffer, MTU, and DSCP/PCP.

Item: Recommended Setting
RoCE Mode: Use RoCEv2 (UDP encapsulation, ECN supported)
ECN: Enable ECN marking on switches when buffer thresholds are exceeded
CNP Response: Supported by the driver by default; DCQCN must be enabled
PFC: Enable PFC only for RDMA-specific traffic classes (e.g., TC1)
Buffer: Allocate at least 512 KB to 1 MB per enabled RDMA traffic class (TC)
MTU: Enable jumbo frames (e.g., MTU 9000)
DSCP / PCP: Ensure consistent mapping: DSCP → TC → Priority → PFC
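The last row deserves emphasis: the DSCP value the NIC marks must resolve to the same traffic class and PFC priority on every hop, or RDMA traffic silently lands in a lossy queue. A small consistency check of such a mapping chain can be sketched as follows (DSCP 26 / priority 3 for RDMA data and DSCP 48 / priority 6 for CNP are common conventions used here as assumptions, not requirements):

```python
# Hypothetical end-to-end QoS mapping tables; the same tables should be
# applied on the NICs and on every switch hop.
DSCP_TO_PRIORITY = {26: 3, 48: 6}   # RDMA data -> prio 3, CNP -> prio 6
PRIORITY_TO_TC = {3: 1, 6: 2}       # e.g. TC1 lossless data, TC2 for CNP
PFC_ENABLED_PRIORITIES = {3}        # PFC only on the RDMA traffic class


def check_mapping(dscp: int) -> dict:
    """Resolve a DSCP value through the mapping chain, flagging gaps."""
    prio = DSCP_TO_PRIORITY.get(dscp)
    if prio is None:
        return {"dscp": dscp, "ok": False, "reason": "unmapped DSCP"}
    tc = PRIORITY_TO_TC.get(prio)
    if tc is None:
        return {"dscp": dscp, "ok": False, "reason": "priority has no TC"}
    return {"dscp": dscp, "priority": prio, "tc": tc,
            "lossless": prio in PFC_ENABLED_PRIORITIES, "ok": True}


rdma = check_mapping(26)   # should resolve to the lossless class
cnp = check_mapping(48)    # CNP rides a separate, non-paused class
```

Running such a check against the exported NIC and switch configurations before go-live catches the most common RoCE misconfiguration: a hop where DSCP, TC, and PFC priority disagree.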

2. RoCE Deployment on FS Switches

This solution supports low-latency, lossless Ethernet environments based on RoCEv2, ideal for application scenarios requiring high bandwidth and low latency, such as AI training, high-performance computing (HPC), and distributed storage.

2.1 Device Selection Guidelines (Recommended Models)

A table lists various FS switch models (N9600-64OD, N8650-320D, N8520-32D, etc.) with their port configurations, switching capacity, feature support (RoCEv2, MLAG, EVPN-VXLAN, PFC, ECN, QoS), and recommended use cases (Cloud computing, AI/ML clusters, Edge/DC interconnect, Deep learning networks, Hyperscale data centers, RDMA storage clusters).

Another table details switch models by data rate (800G, 400G, 100/200G, 10/25G) and their features, including ASIC chips.

2.2 Switch Configuration Recommendations for RoCE Networks

A table provides recommended settings for Port MTU, RoCE Traffic Priority, PFC, ECN, DSCP/COS Mapping, QoS Queue Binding, Lossless Queue Routing, and Broadcast/Unknown Multicast.

2.3 RoCE Configuration Steps on FS Switches (Example: N8560/N9550 Series)

Step 1: Enable PFC (Priority Flow Control)

Configure PFC on all relevant ports to prevent packet loss during congestion. This involves creating a PFC configuration profile and applying it to specific interfaces. Commands are provided to display service statistics and verify PFC configuration.

Step 2: Configure PFC Buffers

Fine-tune buffer thresholds for priority queues to optimize buffer resource usage. Commands are provided to set MTU, guaranteed buffer limits, static thresholds, and offsets for PFC queues.
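The sizing intuition behind these thresholds is headroom: after a PAUSE frame is sent, bytes already in flight keep arriving for one cable round trip, plus up to two maximum-size frames the peer may have committed to sending, plus one local MTU. A back-of-envelope estimate can be sketched as below (the ~5 ns/m propagation figure and the two-frame response allowance are illustrative assumptions, not vendor-validated values):

```python
def pfc_headroom_bytes(link_gbps: float, cable_m: float, mtu: int,
                       peer_response_bytes: int = 2 * 9216) -> int:
    """Estimate per-queue PFC headroom in bytes.

    Accounts for one round trip of in-flight data over the cable
    (propagation taken as ~5 ns per meter of fiber), the peer's
    transmission commitment, and one local MTU. Illustrative only.
    """
    bits_per_sec = link_gbps * 1e9
    prop_delay_s = cable_m * 5e-9                     # ~5 ns per meter
    round_trip_bytes = bits_per_sec * 2 * prop_delay_s / 8
    return int(round_trip_bytes + peer_response_bytes + mtu)


# 100 Gb/s link, 100 m cable, 9000-byte MTU
headroom = pfc_headroom_bytes(100.0, 100.0, 9000)
```

Numbers in this range are why the guidelines above recommend reserving several hundred kilobytes per lossless traffic class: the guaranteed buffer must cover headroom with margin, and it grows with link speed and cable length.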

Step 3: Configure PFC Watchdog

The PFC Watchdog detects and recovers from PFC deadlock conditions. Commands are provided to enable the watchdog, set detection and recovery intervals, and view configuration and statistics.

Step 4: Configure ECN (Explicit Congestion Notification)

Continuously monitor network performance and ECN marking rates, adjusting ECN threshold values dynamically. Commands are provided to enable WRED, set maximum and minimum thresholds, packet drop probability, and enable ECN on queues.
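With WRED-based ECN, the marking probability ramps linearly from zero at the minimum threshold to the configured maximum probability at the maximum threshold; beyond that, every ECN-capable packet is marked. The threshold and probability values below are illustrative examples, not recommended settings:

```python
def wred_mark_probability(queue_kb: float, min_kb: float, max_kb: float,
                          max_prob: float) -> float:
    """Linear WRED curve used for ECN marking.

    Below `min_kb` nothing is marked; between the thresholds the
    probability rises linearly to `max_prob`; above `max_kb` every
    ECN-capable packet is marked.
    """
    if queue_kb <= min_kb:
        return 0.0
    if queue_kb >= max_kb:
        return 1.0
    return max_prob * (queue_kb - min_kb) / (max_kb - min_kb)


# e.g. min 100 KB, max 400 KB, 10 % maximum marking probability
p = wred_mark_probability(250, 100, 400, 0.10)   # halfway up the ramp
```

Tuning is a balance along this curve: thresholds set too low trigger rate cuts and sacrifice throughput, while thresholds set too high let queues deepen enough to trigger PFC and push latency up, which is why the marking rate should be monitored and adjusted over time.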

Step 5: Enable Dynamic Load Balancing for ECMP

Enable dynamic load balancing for Equal-Cost Multi-Path (ECMP) routing to distribute traffic evenly across multiple member links, maximizing load balancing efficiency. A command is provided to set interface ECMP hash-mapping to dlb-normal.
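The motivation is easiest to see against the static baseline: classic ECMP hashes each flow's five-tuple onto one member link, which preserves packet ordering but can leave members idle when a few elephant RDMA flows collide on the same hash bucket. Dynamic load balancing instead steers traffic toward less-loaded members. A sketch of the static baseline (CRC32 here is an illustrative stand-in for the switch ASIC's hash function):

```python
import zlib


def ecmp_member_static(five_tuple: tuple, n_members: int) -> int:
    """Static ECMP: hash a flow's five-tuple onto one member link.

    Every packet of the flow picks the same member, preserving order;
    with few large RDMA flows this can load members very unevenly,
    which is what dynamic load balancing corrects.
    """
    key = "|".join(str(field) for field in five_tuple).encode()
    return zlib.crc32(key) % n_members


# Hypothetical RoCEv2 flow (src IP, dst IP, src port, dst port, proto)
flow = ("10.0.0.1", "10.0.0.2", 49152, 4791, "udp")
member = ecmp_member_static(flow, n_members=4)
same = ecmp_member_static(flow, n_members=4)   # deterministic per flow
```

Because RDMA traffic typically consists of a small number of long-lived, high-rate flows, static hashing is a poor fit, and per-flowlet dynamic balancing recovers much of the otherwise stranded fabric bandwidth.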

2.4 RoCE EasyDeploy Initialization

RoCE EasyDeploy simplifies the deployment and configuration of RoCE on switches, enabling seamless integration with servers for optimized network performance. It allows easy switching between lossless and lossy modes.

Features & Benefits: Simplifies RoCE deployment, enables fast switching between lossless and lossy modes, reduces configuration steps, offers flexible interface-level control, and enhances network stability.

Limitations & Guidelines: Supported only on specific platforms; requires the ECN and PFC queues to be aligned with the server configuration; supports post-deployment fine-tuning; requires continuous monitoring of RoCE statistics.

Configuration Example: Commands are provided to set the RoCE mode to lossless and apply it to all interfaces, followed by verification steps.

A section details the RoCE PCP/DSCP→LP mapping and LP→FC mapping configurations.

3. RoCE Configuration on Ethernet NICs

3.1 Hardware Requirements

Proper hardware provisioning is essential for RDMA. Key requirements include:

A table compares CPU configuration requirements for Low-Latency RDMA (HPC, Small Packet) and High-Throughput RDMA (AI/Storage, Large Packet).

Another table details memory configuration requirements for both categories.

3.2 NIC Selection

3.2.1 Broadcom

A table lists Broadcom NIC part numbers, ASIC chips, ports, and connectors, including models like BCM957504-P425G, BCM957508-P2100G, BCM957608-P1400G, etc.

Step 1: Install IP and RoCE Drivers

Instructions are provided for installing Broadcom drivers on Ubuntu 22.04 LTS, including downloading the driver package, extracting it, and running the installation script. Verification steps using 'dmesg' and 'modinfo' are also included.

Step 2: Manually Configure RoCE Settings

Details default RoCE configuration settings (RoCEv2 enabled, Congestion Control and PFC enabled, DSCP marking for traffic and CNP, MTU). Commands are provided to configure these settings, including ECN, PFC, and DSCP mapping.

Manually Modify NIC RoCE Configuration:

3.2.2 NVIDIA

A table lists NVIDIA NIC part numbers, ASIC chips, ports, and connectors, including ConnectX-5, ConnectX-6, and ConnectX-7 series.

Step 1: Install IP and RoCE Drivers

Instructions for installing NVIDIA MLNX_OFED package on Ubuntu 22.04 LTS are provided, including downloading, extracting, and installing the package. Verification steps using 'dmesg' and 'modinfo' are included.

Step 2: Manually Configure RoCE Settings

Describes the Zero Touch RoCE (ZTR) solution for automatic configuration of PFC, ECN, and DSCP. The default RoCE configuration includes RoCEv2, PFC, ECN, DCQCN, and MTU settings. Commands are provided for manual configuration using the 'mlnx_qos' and 'mlxconfig' tools.

3.3 Verify RoCE Configuration

After driver installation and configuration, verify the RoCE setup by checking the GUID of the RoCE Interface using 'ibv_devices' and 'ibv_devinfo'. Commands are provided to check the perftest package utilities for traffic verification.

4. RoCE Performance Testing and Results

Performance measurements were conducted on a cluster using Broadcom/NVIDIA NICs and FS switches. A table summarizes the results.

Server
Model: Dell R860
CPU: 6448H
Memory: 512 GB DDR5 5600 MT/s (128 GB/socket)
Kernel: 6.8.0-57-generic (Ubuntu 22.04)

Switch
Model: N9550-32D
Hardware Revision: -
Software Version: 4.5.0E/3b574830da

NIC (Broadcom)
Model: Broadcom P1400G
Driver Version: 1.10.3-232.0.155.5
Firmware Version: 230.2.36.0 / pkg 230.2.37.0
Congestion Control: DCQCN-P

NIC (NVIDIA)
Model: MCX75310AAS-NEAT
Driver Version: 24.10-2.1.8
Firmware Version: 28.43.2566
Congestion Control: DCQCN-P

Benchmark Tool
Perftest: 6.23

A diagram illustrates a spine-leaf network topology with FS switches and servers.

A table summarizes performance results for RDMA Point-to-Point Bandwidth, RDMA Point-to-Point Latency, RDMA Multi-Node Scalability, and Long-Term Stability Test, comparing Broadcom and NVIDIA NICs.

FS Offices

FS has several offices around the world. Addresses and phone numbers for Shenzhen, Shanghai, and Wuhan are provided. FS and the FS logo are trademarks or registered trademarks of FS.


