Demystifying Ultra Ethernet
By Tom Emmons and John Peach
Introduction
In the early era of AI/ML, clusters were treated as specialist technology islands, developed independently from traditional services and networks. Today, AI/ML is becoming business-critical, requiring a common technology paradigm that fits within established enterprise parameters. This necessitates a model that supports accelerated compute for multiple workload types, built and operated using tried-and-trusted technologies, benefiting from the same skill sets, economics, and openness that underpin global networks.
Ethernet and IP have proven their versatility over the last 50 years, adapting to replace and improve upon legacy technologies. Ethernet is now the interconnect technology of choice for the majority of AI accelerators (XPUs). Advanced networking solutions, such as Arista's Etherlink™ portfolio, already outperform legacy proprietary interconnect technologies. Ultra Ethernet represents the next natural evolution in ubiquitous networking.
The Ultra Ethernet Consortium (UEC), of which Arista is a founding member, aims to enhance Ethernet for AI and HPC needs. Over 100 member companies and 1000 participants collaborated on the 1.0 specification, which is now driving hardware implementations to elevate cluster performance using advanced Ethernet switching platforms.
Core UEC Vision: Reimagining RDMA
Remote Direct Memory Access (RDMA) is critical for AI and HPC applications, enabling systems and processors to exchange data at high speeds (400 Gbps today, 800 Gbps in the near future). This efficient communication facilitates workload distribution across multiple servers and processors with minimal performance cost, enabling parallel computation for models spanning thousands of accelerators.
Historically, high flow rates and synchronized, large-volume RDMA traffic have challenged Ethernet networks, causing flow-hashing collisions and congestion. While Arista's Etherlink enhancements offer improvements, reaching the next level of performance requires rethinking how applications interact with the network. The UEC focuses on making RDMA a native Ethernet application, adding new traffic distribution semantics and modern congestion control on top of standard Ethernet and IP layers. The result is Ultra Ethernet Transport (UET): RDMA reimagined for modern workloads, without proprietary infrastructure.
Native Libraries and libfabric
Maximizing performance requires close interaction between applications and the transport. UEC adopts the mature and ubiquitous libfabric 2.0 API, standardized by the OpenFabrics Alliance. UET is exposed as a native libfabric transport, ensuring efficient interaction between the application, the API, and the transport. Libfabric's versatility and wide adoption centralize the various RDMA semantics under a single interface, simplifying application porting across different systems and accelerator architectures. For applications using PyTorch and *CCL libraries, network plugins can hide these complexities, making the transition to UET straightforward.
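To make this concrete, here is a minimal sketch of opening a libfabric endpoint of the kind a UET transport would sit behind, using the standard fi_getinfo/fi_fabric/fi_domain/fi_endpoint chain. The provider name "uet" and the requested capabilities are assumptions for illustration; the actual provider name and attributes come from the vendor's libfabric implementation, and most error handling is elided for brevity.

```c
/* Minimal sketch: selecting a libfabric provider and opening a reliable-
 * datagram endpoint.  The provider name "uet" is a placeholder assumption.
 * Build (example): gcc uet_fi.c -lfabric */
#include <stdio.h>
#include <string.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>

int main(void)
{
    struct fi_info *hints = fi_allocinfo(), *info = NULL;
    struct fid_fabric *fabric = NULL;
    struct fid_domain *domain = NULL;
    struct fid_ep *ep = NULL;

    /* Ask for reliable-datagram semantics with messaging and RMA caps. */
    hints->ep_attr->type = FI_EP_RDM;
    hints->caps = FI_MSG | FI_RMA;
    hints->fabric_attr->prov_name = strdup("uet");   /* assumed name */

    int ret = fi_getinfo(FI_VERSION(2, 0), NULL, NULL, 0, hints, &info);
    if (ret) { fprintf(stderr, "fi_getinfo failed: %d\n", ret); return 1; }

    /* Open fabric -> domain -> endpoint; further error checks elided. */
    fi_fabric(info->fabric_attr, &fabric, NULL);
    fi_domain(fabric, info, &domain, NULL);
    fi_endpoint(domain, info, &ep, NULL);

    printf("provider: %s\n", info->fabric_attr->prov_name);

    fi_close(&ep->fid);
    fi_close(&domain->fid);
    fi_close(&fabric->fid);
    fi_freeinfo(info);
    fi_freeinfo(hints);
    return 0;
}
```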
Traffic Forwarding: Packet Spraying
UET offers multiple forwarding paradigms for RDMA workloads. A fundamental concept is the evolution from flow-based traffic distribution to packet spraying from the source NIC. Traditional transport layers require packets to arrive in order, forcing all packets of a flow to follow the same network path. Packet spraying allows packets to take different paths, eliminating the need for flow hashing and reducing imbalanced traffic distribution that leads to congestion. The destination NIC must be able to receive packets out of order and reassemble the conversation for the application.
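A simple way to picture packet spraying is that the sender varies a per-packet entropy value (commonly carried in a field such as the UDP source port) so that ECMP hashing in the fabric sends consecutive packets down different paths. The sketch below is purely illustrative; the constants and the round-robin entropy choice are assumptions, not the UET wire format.

```c
/* Illustrative sketch of source-side packet spraying: each packet gets a
 * different entropy value, which ECMP-capable switches hash onto
 * different network paths. */
#include <stdint.h>
#include <stdio.h>

#define ENTROPY_BASE 0xC000   /* assumed start of an ephemeral port range */
#define ENTROPY_SPAN 256      /* number of distinct paths to spread over  */

/* Pick an entropy value per packet instead of per flow. */
static uint16_t pick_entropy(uint32_t packet_seq)
{
    return ENTROPY_BASE + (packet_seq % ENTROPY_SPAN);
}

int main(void)
{
    /* Consecutive packets of one message land on different ECMP paths. */
    for (uint32_t psn = 0; psn < 8; psn++)
        printf("psn %u -> entropy 0x%04x\n", psn, pick_entropy(psn));
    return 0;
}
```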
Existing solutions for out-of-order packet arrival are proprietary implementations built on the RoCEv2 standard, limiting interoperability and flexibility. UET is designed from the ground up for packet spraying, ensuring optimal efficiency. Each UET packet carries the information needed for a direct memory write, so the NIC only has to maintain per-packet metadata, enabling high throughput, low latency, and low cost. UET's native implementation supports reordering for all message types without costly payload buffering. Efficiency is further improved by new drop-detection mechanisms that allow individual lost packets to be retransmitted without incurring a full round trip, significantly improving latency and goodput while minimizing network load.
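The sketch below illustrates this receiver model under the assumptions stated above: each packet names its destination offset, the payload is written straight into registered application memory, and the NIC tracks only one bit of metadata per packet sequence number. The structure and field names are hypothetical.

```c
/* Sketch of receiver-side out-of-order handling with direct placement:
 * payloads land in application memory immediately, and only per-packet
 * metadata (one bit per PSN) is kept. */
#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define MSG_PACKETS 64

struct rx_state {
    uint8_t  *app_buffer;        /* registered application memory     */
    uint64_t  received_bitmap;   /* one bit of metadata per packet    */
};

static void on_packet(struct rx_state *s, uint32_t psn,
                      uint64_t offset, const uint8_t *payload, size_t len)
{
    memcpy(s->app_buffer + offset, payload, len);  /* direct placement */
    s->received_bitmap |= 1ULL << psn;             /* metadata only    */
}

static int message_complete(const struct rx_state *s)
{
    return s->received_bitmap == ~0ULL;            /* all 64 packets seen */
}

int main(void)
{
    static uint8_t buf[MSG_PACKETS * 1024];
    struct rx_state s = { .app_buffer = buf };
    uint8_t chunk[1024] = { 0 };

    /* Packets may arrive in any order; completion does not depend on it. */
    for (int psn = MSG_PACKETS - 1; psn >= 0; psn--)
        on_packet(&s, psn, (uint64_t)psn * 1024, chunk, sizeof chunk);

    printf("complete: %d\n", message_complete(&s));
    return 0;
}
```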
Congestion Management
Traditionally, gaps in packet sequence numbers are used to detect drops, often triggering inefficient go-back-N retransmissions. This approach is problematic with packet spraying, where variable path delays reorder packets in flight. UET natively solves packet loss detection and recovery by providing selective acknowledgment and retransmission of individual packets. The receiver specifies which packets were received, missing, or dropped, allowing the sender to resend only the necessary ones, reducing completion time and network load.
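A minimal sketch of this selective-acknowledgment behavior follows, assuming an illustrative SACK report of a base packet sequence number plus a received bitmap. This is not the UET wire encoding, only the control flow of resending the holes.

```c
/* Sketch of selective acknowledgment: the receiver reports which packet
 * sequence numbers arrived, and the sender retransmits only the missing
 * ones instead of going back N. */
#include <stdint.h>
#include <stdio.h>

struct sack {
    uint32_t base_psn;   /* first PSN covered by the bitmap        */
    uint64_t received;   /* bit i set => base_psn + i was received */
};

/* Sender side: walk the SACK and resend only the holes. */
static void retransmit_missing(const struct sack *s, uint32_t window)
{
    for (uint32_t i = 0; i < window; i++)
        if (!(s->received & (1ULL << i)))
            printf("resend psn %u\n", s->base_psn + i);
}

int main(void)
{
    /* Example: packets 3 and 9 of a 16-packet window were lost. */
    struct sack s = { .base_psn = 100, .received = 0xFFFFULL };
    s.received &= ~((1ULL << 3) | (1ULL << 9));
    retransmit_missing(&s, 16);
    return 0;
}
```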
To detect dropped packets proactively, UET uses heuristics based on packet reordering observed in the network. For congestion-related packet loss, UET optionally uses packet trimming, supported by Arista Etherlink platforms. When a packet arrives at a congested switch, it can be truncated to a minimal size (e.g., 64 bytes) instead of being dropped. The trimmed packet is placed in a higher-priority queue and forwarded. The destination reflects the trimmed packet back to the sender as a NAK (Negative Acknowledgment). This serves two purposes:
- Explicit Notification: The trimmed packet acts as an explicit notification of a dropped packet, enabling efficient detection and retransmission, crucial for recovery from network oversubscription.
- Congestion Notification: It provides powerful congestion notification, signaling both sender and receiver to slow down transmission rates. The high-priority forwarding of trimmed packets ensures this notification reaches the receiver quickly.
This trimming mechanism allows the sender to react to network congestion more quickly and intelligently, reducing its sending rate and attempting to route around congested paths.
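The sketch below models the trimming decision at a congested egress queue under simple assumptions: when the normal queue is full, only the first 64 bytes are kept, the packet is marked as trimmed, and it is forwarded at high priority so the reflected NAK reaches the sender quickly. The queue model and thresholds are illustrative, not Etherlink internals.

```c
/* Illustrative model of packet trimming on a congested egress queue. */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define TRIM_BYTES 64

struct pkt { size_t len; bool trimmed; };

static bool enqueue(struct pkt *p, size_t queue_depth, size_t queue_limit,
                    bool trimming_enabled)
{
    if (queue_depth < queue_limit)
        return true;                      /* normal forwarding            */
    if (!trimming_enabled)
        return false;                     /* tail drop: silent loss       */

    p->len = TRIM_BYTES;                  /* keep headers only            */
    p->trimmed = true;                    /* receiver will reflect a NAK  */
    return true;                          /* goes to the high-prio queue  */
}

int main(void)
{
    struct pkt p = { .len = 4096, .trimmed = false };
    bool kept = enqueue(&p, /*depth=*/1000, /*limit=*/1000, true);
    printf("kept=%d len=%zu trimmed=%d\n", kept, p.len, p.trimmed);
    return 0;
}
```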
Advanced Connection Setup and Host-based Flow Control
Avoiding congestion is preferable to managing it. UEC introduces "Ephemeral Connections" and two new congestion control schemes to achieve this without performance overhead.
Ephemeral Connections: Enable fast connection startup by eliminating the delay of a round-trip handshake before data flows. Connections are established on demand by the first data packet and do not require explicit termination. This reduces application latency and the need for costly connection state maintenance on the NIC.
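As a rough illustration of connection-on-first-packet, the sketch below creates responder state the moment a data packet from an unknown peer arrives, with no prior handshake round trip. The lookup table, its size, and the field names are hypothetical.

```c
/* Sketch of "connection on first packet": state is created by data
 * arrival rather than by a handshake. */
#include <stdint.h>
#include <stdio.h>

#define MAX_CONNS 128

struct conn { uint32_t peer_id; uint32_t next_psn; int in_use; };
static struct conn table[MAX_CONNS];

static struct conn *get_or_create(uint32_t peer_id)
{
    struct conn *free_slot = NULL;
    for (int i = 0; i < MAX_CONNS; i++) {
        if (table[i].in_use && table[i].peer_id == peer_id)
            return &table[i];                 /* existing connection    */
        if (!table[i].in_use && !free_slot)
            free_slot = &table[i];
    }
    if (free_slot) {                          /* first packet: create   */
        free_slot->peer_id = peer_id;
        free_slot->next_psn = 0;
        free_slot->in_use = 1;
    }
    return free_slot;
}

int main(void)
{
    struct conn *c = get_or_create(42);       /* created by data arrival */
    printf("conn for peer 42: %s\n", c ? "established" : "table full");
    return 0;
}
```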
Network Signal Congestion Control (NSCC): A sender-based method that uses signals such as network delay, trimmed packets, and ECN (Explicit Congestion Notification) marks to detect congestion and react by throttling transmission rates at the source.
Receiver Credit Congestion Control (RCCC): An optional receiver-based mechanism that efficiently manages incast scenarios, where many senders converge on a single receiver and their traffic must be serialized onto its link. RCCC allows each receiver to generate and allocate credits fairly across senders, preventing queue buildup in the last-hop switch and maximizing receiver throughput. NSCC and RCCC can be used independently or together for performance optimization.
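To show the flavor of sender-based control, the toy model below reduces the rate multiplicatively on trimmed-packet NAKs, ECN marks, or rising delay, and increases it gently otherwise. The constants and thresholds are invented for the sketch and are not taken from the UEC specification.

```c
/* Toy model of sender-based congestion control in the spirit of NSCC. */
#include <stdio.h>

struct cc_state { double rate_gbps; };

static void on_signal(struct cc_state *cc, int ecn_marked, int trimmed,
                      double rtt_us, double base_rtt_us)
{
    if (trimmed)
        cc->rate_gbps *= 0.5;                 /* strongest signal        */
    else if (ecn_marked || rtt_us > 1.5 * base_rtt_us)
        cc->rate_gbps *= 0.8;                 /* early congestion        */
    else
        cc->rate_gbps += 1.0;                 /* additive increase       */

    if (cc->rate_gbps > 400.0) cc->rate_gbps = 400.0;   /* line rate cap */
    if (cc->rate_gbps < 1.0)   cc->rate_gbps = 1.0;     /* floor         */
}

int main(void)
{
    struct cc_state cc = { .rate_gbps = 400.0 };
    on_signal(&cc, 1, 0, 12.0, 8.0);          /* ECN mark seen           */
    on_signal(&cc, 0, 1, 20.0, 8.0);          /* trimmed-packet NAK      */
    on_signal(&cc, 0, 0, 8.5, 8.0);           /* clear round trip        */
    printf("rate now %.1f Gbps\n", cc.rate_gbps);
    return 0;
}
```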
Security
With the increasing value of AI models and sensitive data, securing data in flight is essential, especially in multi-tenant environments. UET prioritizes security with optional end-to-end encryption and authentication using technologies such as AES-GCM, post-quantum-safe Key Derivation Functions (KDFs), and replay prevention between UET hosts. A key feature is a novel group keying scheme optimized for AI and HPC computations: a single group key is shared among all members of a job (e.g., all XPUs for a tenant), and each NIC derives a unique key for each connection. Encryption covers the transport payload above the IP header, protecting model data and preventing unauthorized access, data injection, or exfiltration.
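The sketch below illustrates only the group-keying idea: one group key shared across a job, with a distinct per-connection key derived from the connection identifiers. It uses plain HMAC-SHA256 via OpenSSL as a stand-in; the actual UET key-derivation function and its inputs are defined by the specification.

```c
/* Sketch of deriving a per-connection key from a job-wide group key.
 * Build (example): gcc kdf.c -lcrypto */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <openssl/evp.h>
#include <openssl/hmac.h>

static void derive_conn_key(const uint8_t *group_key, size_t group_key_len,
                            uint32_t src_id, uint32_t dst_id,
                            uint8_t out[32])
{
    uint8_t context[8];
    memcpy(context, &src_id, 4);              /* per-connection context  */
    memcpy(context + 4, &dst_id, 4);
    unsigned int out_len = 32;
    HMAC(EVP_sha256(), group_key, (int)group_key_len,
         context, sizeof context, out, &out_len);
}

int main(void)
{
    const uint8_t group_key[32] = { 0x01 };   /* shared across the job   */
    uint8_t k[32];
    derive_conn_key(group_key, sizeof group_key, 7, 9, k);
    printf("conn key byte 0: %02x\n", k[0]);  /* unique per (src, dst)   */
    return 0;
}
```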
Additional Future Capabilities
The UEC has standardized two optional hardware-based features for hop-by-hop performance improvement:
- Link Level Retry (LLR): A retransmit mechanism on an individual link basis. Switches implement a small buffer on each port. If packets are dropped due to uncorrectable FEC errors, LLR retransmits them without host involvement, avoiding costly end-to-end retransmissions and improving performance reliability, especially for time-sensitive collectives.
- Credit-Based Flow Control (CBFC): A modern alternative to Priority Flow Control (PFC) for avoiding drops. Unlike PFC, which requires per-link tuning and offers coarse granularity, CBFC allows the receiving switch to request exactly the number of packets it has buffer space for. This avoids complex link-specific tuning, allows more efficient buffer utilization, and supports a larger number of virtual traffic classes than the eight priorities PFC inherits from the 802.1p header.
These features require new logic design in switching silicon and will be available in future next-generation systems.
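For the credit-based flow control described above, the minimal model below captures the accounting: the downstream device grants credits matching its free buffer cells, the upstream device spends one credit per packet and stalls at zero, and credits are returned as buffers drain. Names and cell counts are illustrative.

```c
/* Minimal model of credit-based flow control on one virtual class. */
#include <stdbool.h>
#include <stdio.h>

struct cbfc_link { int credits; };            /* per virtual traffic class */

static bool can_send(struct cbfc_link *l)      { return l->credits > 0; }
static void send_pkt(struct cbfc_link *l)      { l->credits--; }
static void credit_return(struct cbfc_link *l, int freed) { l->credits += freed; }

int main(void)
{
    struct cbfc_link vc = { .credits = 4 };   /* receiver advertised 4 cells */

    for (int i = 0; i < 6; i++) {
        if (can_send(&vc)) {
            send_pkt(&vc);
            printf("sent packet %d (credits left %d)\n", i, vc.credits);
        } else {
            printf("packet %d waits: no credits\n", i);
        }
    }
    credit_return(&vc, 2);                    /* buffers drained downstream */
    printf("credits after return: %d\n", vc.credits);
    return 0;
}
```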
Summary
UEC updates the relationship between AI and HPC applications and networks. Tight integration between application semantics and network behaviors creates a native transport mechanism that combines the strengths of RDMA with best-in-class Ethernet solutions, providing a powerful platform for building next-generation applications on Ethernet Transport.
Arista, as a founding member of the UEC, is committed to this vision. They are laying the groundwork for best-in-class, open standards-based infrastructure with diverse platforms that offer freedom of choice and flexibility for re-architecture and redeployment, maximizing long-term investment protection. Arista's Etherlink portfolio is UET-ready, and the company is actively developing future systems and partnering with industry pioneers to build the best Ethernet networks for high-performance computing.