Demystifying Ultra Ethernet
By Tom Emmons and John Peach
Introduction
In the early era of AI/ML, clusters were treated as specialist technology islands, developed independently from traditional services and networks. Today, AI/ML is becoming business-critical, requiring a common technology paradigm that fits within established enterprise parameters. This necessitates a model that supports accelerated compute for multiple workload types, built and operated using tried-and-trusted technologies, benefiting from the same skill sets, economics, and openness that underpin global networks.
Ethernet and IP have proven their versatility over the last 50 years, adapting to replace and improve upon legacy technologies. Ethernet is now the interconnect technology of choice for the majority of AI accelerators (XPUs). Advanced networking solutions, such as Arista's Etherlink™ portfolio, already outperform legacy proprietary interconnect technologies. Ultra Ethernet represents the next natural evolution in ubiquitous networking.
The Ultra Ethernet Consortium (UEC), of which Arista is a founding member, aims to enhance Ethernet for AI and HPC needs. Over 100 member companies and 1000 participants collaborated on the 1.0 specification, which is now driving hardware implementations to elevate cluster performance using advanced Ethernet switching platforms.
Core UEC Vision: Reimagining RDMA
Remote Direct Memory Access (RDMA) is critical for AI and HPC applications, enabling systems and processors to exchange data at high speeds (400 Gbps today, 800 Gbps in the near future). This efficient communication facilitates workload distribution across multiple servers and processors with minimal performance cost, enabling parallel computation for models spanning thousands of accelerators.
Historically, high flow rates and synchronized, large-volume RDMA traffic have challenged Ethernet networks, causing flow-hashing collisions and congestion. While Arista's Etherlink enhancements offer improvements, reaching the next level of performance requires rethinking how applications interact with the network. The UEC focuses on making RDMA a native Ethernet application, adding new traffic distribution semantics and modern congestion control on top of standard Ethernet and IP layers. The result is Ultra Ethernet Transport (UET): RDMA reimagined for modern workloads, without proprietary infrastructure.
Native Libraries and libfabric
Maximizing performance requires close interaction between applications and the transport. UEC adopts the mature and ubiquitous libfabric 2.0 API, standardized by the OpenFabrics Alliance. UET is exposed as a native libfabric transport, ensuring efficient interaction between the application, the API, and the transport. Libfabric's versatility and wide adoption centralize the various RDMA semantics under a single interface, simplifying application porting across different systems and accelerator architectures. For applications using PyTorch and *CCL libraries, network plugins can hide these complexities, making the transition to UET straightforward.
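To make this concrete, here is a minimal sketch of opening a libfabric endpoint of the kind a UET transport would sit behind, using the standard fi_getinfo/fi_fabric/fi_domain/fi_endpoint chain. The provider name "uet" and the requested capabilities are assumptions for illustration; the actual provider name and attributes come from the vendor's libfabric implementation, and most error handling is elided for brevity.

```c
/* Minimal sketch: selecting a libfabric provider and opening a reliable-
 * datagram endpoint.  The provider name "uet" is a placeholder assumption.
 * Build (example): gcc uet_fi.c -lfabric */
#include <stdio.h>
#include <string.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>

int main(void)
{
    struct fi_info *hints = fi_allocinfo(), *info = NULL;
    struct fid_fabric *fabric = NULL;
    struct fid_domain *domain = NULL;
    struct fid_ep *ep = NULL;

    /* Ask for reliable-datagram semantics with messaging and RMA caps. */
    hints->ep_attr->type = FI_EP_RDM;
    hints->caps = FI_MSG | FI_RMA;
    hints->fabric_attr->prov_name = strdup("uet");   /* assumed name */

    int ret = fi_getinfo(FI_VERSION(2, 0), NULL, NULL, 0, hints, &info);
    if (ret) { fprintf(stderr, "fi_getinfo failed: %d\n", ret); return 1; }

    /* Open fabric -> domain -> endpoint; further error checks elided. */
    fi_fabric(info->fabric_attr, &fabric, NULL);
    fi_domain(fabric, info, &domain, NULL);
    fi_endpoint(domain, info, &ep, NULL);

    printf("provider: %s\n", info->fabric_attr->prov_name);

    fi_close(&ep->fid);
    fi_close(&domain->fid);
    fi_close(&fabric->fid);
    fi_freeinfo(info);
    fi_freeinfo(hints);
    return 0;
}
```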
Traffic Forwarding: Packet Spraying
UET offers multiple forwarding paradigms for RDMA workloads. A fundamental concept is the evolution from flow-based traffic distribution to packet spraying from the source NIC. Traditional transport layers require packets to arrive in order, forcing all packets of a flow to follow the same network path. Packet spraying allows packets to take different paths, eliminating the need for flow hashing and reducing imbalanced traffic distribution that leads to congestion. The destination NIC must be able to receive packets out of order and reassemble the conversation for the application.
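A simple way to picture packet spraying is that the sender varies a per-packet entropy value (commonly carried in a field such as the UDP source port) so that ECMP hashing in the fabric sends consecutive packets down different paths. The sketch below is purely illustrative; the constants and the round-robin entropy choice are assumptions, not the UET wire format.

```c
/* Illustrative sketch of source-side packet spraying: each packet gets a
 * different entropy value, which ECMP-capable switches hash onto
 * different network paths. */
#include <stdint.h>
#include <stdio.h>

#define ENTROPY_BASE 0xC000   /* assumed start of an ephemeral port range */
#define ENTROPY_SPAN 256      /* number of distinct paths to spread over  */

/* Pick an entropy value per packet instead of per flow. */
static uint16_t pick_entropy(uint32_t packet_seq)
{
    return ENTROPY_BASE + (packet_seq % ENTROPY_SPAN);
}

int main(void)
{
    /* Consecutive packets of one message land on different ECMP paths. */
    for (uint32_t psn = 0; psn < 8; psn++)
        printf("psn %u -> entropy 0x%04x\n", psn, pick_entropy(psn));
    return 0;
}
```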
Existing solutions for out-of-order packet arrival are proprietary implementations built on the RoCEv2 standard, limiting interoperability and flexibility. UET is designed from the ground up for packet spraying, ensuring optimal efficiency. Each UET packet carries the information needed for a direct memory write, so the NIC only has to maintain per-packet metadata, enabling high throughput, low latency, and low cost. UET's native implementation supports reordering for all message types without costly payload buffering. Efficiency is further improved by new drop-detection mechanisms that allow individual lost packets to be retransmitted without incurring a full round trip, significantly improving latency and goodput while minimizing network load.
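The sketch below illustrates this receiver model under the assumptions stated above: each packet names its destination offset, the payload is written straight into registered application memory, and the NIC tracks only one bit of metadata per packet sequence number. The structure and field names are hypothetical.

```c
/* Sketch of receiver-side out-of-order handling with direct placement:
 * payloads land in application memory immediately, and only per-packet
 * metadata (one bit per PSN) is kept. */
#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define MSG_PACKETS 64

struct rx_state {
    uint8_t  *app_buffer;        /* registered application memory     */
    uint64_t  received_bitmap;   /* one bit of metadata per packet    */
};

static void on_packet(struct rx_state *s, uint32_t psn,
                      uint64_t offset, const uint8_t *payload, size_t len)
{
    memcpy(s->app_buffer + offset, payload, len);  /* direct placement */
    s->received_bitmap |= 1ULL << psn;             /* metadata only    */
}

static int message_complete(const struct rx_state *s)
{
    return s->received_bitmap == ~0ULL;            /* all 64 packets seen */
}

int main(void)
{
    static uint8_t buf[MSG_PACKETS * 1024];
    struct rx_state s = { .app_buffer = buf };
    uint8_t chunk[1024] = { 0 };

    /* Packets may arrive in any order; completion does not depend on it. */
    for (int psn = MSG_PACKETS - 1; psn >= 0; psn--)
        on_packet(&s, psn, (uint64_t)psn * 1024, chunk, sizeof chunk);

    printf("complete: %d\n", message_complete(&s));
    return 0;
}
```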
Congestion Management
Traditionally, gaps in packet sequence numbers are used to detect drops, often triggering inefficient go-back-N retransmissions. This approach is problematic with packet spraying, where variable path delays reorder packets in flight. UET natively solves packet loss detection and recovery by providing selective acknowledgment and retransmission of individual packets. The receiver specifies which packets were received, missing, or dropped, allowing the sender to resend only the necessary ones, reducing completion time and network load.
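A minimal sketch of this selective-acknowledgment behavior follows, assuming an illustrative SACK report of a base packet sequence number plus a received bitmap. This is not the UET wire encoding, only the control flow of resending the holes.

```c
/* Sketch of selective acknowledgment: the receiver reports which packet
 * sequence numbers arrived, and the sender retransmits only the missing
 * ones instead of going back N. */
#include <stdint.h>
#include <stdio.h>

struct sack {
    uint32_t base_psn;   /* first PSN covered by the bitmap        */
    uint64_t received;   /* bit i set => base_psn + i was received */
};

/* Sender side: walk the SACK and resend only the holes. */
static void retransmit_missing(const struct sack *s, uint32_t window)
{
    for (uint32_t i = 0; i < window; i++)
        if (!(s->received & (1ULL << i)))
            printf("resend psn %u\n", s->base_psn + i);
}

int main(void)
{
    /* Example: packets 3 and 9 of a 16-packet window were lost. */
    struct sack s = { .base_psn = 100, .received = 0xFFFFULL };
    s.received &= ~((1ULL << 3) | (1ULL << 9));
    retransmit_missing(&s, 16);
    return 0;
}
```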
To detect dropped packets proactively, UET uses heuristics based on packet reordering observed in the network. For congestion-related packet loss, UET optionally uses packet trimming, supported by Arista Etherlink platforms. When a packet arrives at a congested switch, it can be truncated to a minimal size (e.g., 64 bytes) instead of being dropped. The trimmed packet is placed in a higher-priority queue and forwarded. The destination reflects the trimmed packet back to the sender as a NAK (Negative Acknowledgment). This serves two purposes:
- Explicit Notification: The trimmed packet acts as an explicit notification of a dropped packet, enabling efficient detection and retransmission, crucial for recovery from network oversubscription.
- Congestion Notification: It provides powerful congestion notification, signaling both sender and receiver to slow down transmission rates. The high-priority forwarding of trimmed packets ensures this notification reaches the receiver quickly.
This trimming mechanism allows the sender to react to network congestion more quickly and intelligently, reducing its sending rate and attempting to route around congested paths.
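The sketch below models the trimming decision at a congested egress queue under simple assumptions: when the normal queue is full, only the first 64 bytes are kept, the packet is marked as trimmed, and it is forwarded at high priority so the reflected NAK reaches the sender quickly. The queue model and thresholds are illustrative, not Etherlink internals.

```c
/* Illustrative model of packet trimming on a congested egress queue. */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define TRIM_BYTES 64

struct pkt { size_t len; bool trimmed; };

static bool enqueue(struct pkt *p, size_t queue_depth, size_t queue_limit,
                    bool trimming_enabled)
{
    if (queue_depth < queue_limit)
        return true;                      /* normal forwarding            */
    if (!trimming_enabled)
        return false;                     /* tail drop: silent loss       */

    p->len = TRIM_BYTES;                  /* keep headers only            */
    p->trimmed = true;                    /* receiver will reflect a NAK  */
    return true;                          /* goes to the high-prio queue  */
}

int main(void)
{
    struct pkt p = { .len = 4096, .trimmed = false };
    bool kept = enqueue(&p, /*depth=*/1000, /*limit=*/1000, true);
    printf("kept=%d len=%zu trimmed=%d\n", kept, p.len, p.trimmed);
    return 0;
}
```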
Advanced Connection Setup and Host-based Flow Control
Avoiding congestion is preferable to managing it. UEC introduces "Ephemeral Connections" and two new congestion control schemes to achieve this without performance overhead.
Ephemeral Connections: Enable fast connection startup by eliminating the delay of a round-trip handshake before data flows. Connections are established on demand by the first data packet and do not require explicit termination. This reduces application latency and the need for costly connection state maintenance on the NIC.
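As a rough illustration of connection-on-first-packet, the sketch below creates responder state the moment a data packet from an unknown peer arrives, with no prior handshake round trip. The lookup table, its size, and the field names are hypothetical.

```c
/* Sketch of "connection on first packet": state is created by data
 * arrival rather than by a handshake. */
#include <stdint.h>
#include <stdio.h>

#define MAX_CONNS 128

struct conn { uint32_t peer_id; uint32_t next_psn; int in_use; };
static struct conn table[MAX_CONNS];

static struct conn *get_or_create(uint32_t peer_id)
{
    struct conn *free_slot = NULL;
    for (int i = 0; i < MAX_CONNS; i++) {
        if (table[i].in_use && table[i].peer_id == peer_id)
            return &table[i];                 /* existing connection    */
        if (!table[i].in_use && !free_slot)
            free_slot = &table[i];
    }
    if (free_slot) {                          /* first packet: create   */
        free_slot->peer_id = peer_id;
        free_slot->next_psn = 0;
        free_slot->in_use = 1;
    }
    return free_slot;
}

int main(void)
{
    struct conn *c = get_or_create(42);       /* created by data arrival */
    printf("conn for peer 42: %s\n", c ? "established" : "table full");
    return 0;
}
```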
Network Signal Congestion Control (NSCC): A sender-based method that uses signals such as network delay, trimmed packets, and ECN (Explicit Congestion Notification) marks to detect congestion and react by throttling transmission rates at the source.
Receiver Credit Congestion Control (RCCC): An optional receiver-based mechanism that efficiently manages incast scenarios, where many senders converge on a single receiver and their traffic must be serialized onto its link. RCCC allows each receiver to generate and allocate credits fairly across senders, preventing queue buildup in the last-hop switch and maximizing receiver throughput. NSCC and RCCC can be used independently or together for performance optimization.
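To show the flavor of sender-based control, the toy model below reduces the rate multiplicatively on trimmed-packet NAKs, ECN marks, or rising delay, and increases it gently otherwise. The constants and thresholds are invented for the sketch and are not taken from the UEC specification.

```c
/* Toy model of sender-based congestion control in the spirit of NSCC. */
#include <stdio.h>

struct cc_state { double rate_gbps; };

static void on_signal(struct cc_state *cc, int ecn_marked, int trimmed,
                      double rtt_us, double base_rtt_us)
{
    if (trimmed)
        cc->rate_gbps *= 0.5;                 /* strongest signal        */
    else if (ecn_marked || rtt_us > 1.5 * base_rtt_us)
        cc->rate_gbps *= 0.8;                 /* early congestion        */
    else
        cc->rate_gbps += 1.0;                 /* additive increase       */

    if (cc->rate_gbps > 400.0) cc->rate_gbps = 400.0;   /* line rate cap */
    if (cc->rate_gbps < 1.0)   cc->rate_gbps = 1.0;     /* floor         */
}

int main(void)
{
    struct cc_state cc = { .rate_gbps = 400.0 };
    on_signal(&cc, 1, 0, 12.0, 8.0);          /* ECN mark seen           */
    on_signal(&cc, 0, 1, 20.0, 8.0);          /* trimmed-packet NAK      */
    on_signal(&cc, 0, 0, 8.5, 8.0);           /* clear round trip        */
    printf("rate now %.1f Gbps\n", cc.rate_gbps);
    return 0;
}
```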
Security
With the increasing value of AI models and sensitive data, securing data in flight is essential, especially in multi-tenant environments. UET prioritizes security with optional end-to-end encryption and authentication using technologies such as AES-GCM, post-quantum-safe Key Derivation Functions (KDFs), and replay prevention between UET hosts. A key feature is a novel group keying scheme optimized for AI and HPC computations: a single group key is shared among all members of a job (e.g., all XPUs for a tenant), and each NIC derives a unique key for each connection. Encryption covers the transport payload above the IP header, protecting model data and preventing unauthorized access, data injection, or exfiltration.
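The sketch below illustrates only the group-keying idea: one group key shared across a job, with a distinct per-connection key derived from the connection identifiers. It uses plain HMAC-SHA256 via OpenSSL as a stand-in; the actual UET key-derivation function and its inputs are defined by the specification.

```c
/* Sketch of deriving a per-connection key from a job-wide group key.
 * Build (example): gcc kdf.c -lcrypto */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <openssl/evp.h>
#include <openssl/hmac.h>

static void derive_conn_key(const uint8_t *group_key, size_t group_key_len,
                            uint32_t src_id, uint32_t dst_id,
                            uint8_t out[32])
{
    uint8_t context[8];
    memcpy(context, &src_id, 4);              /* per-connection context  */
    memcpy(context + 4, &dst_id, 4);
    unsigned int out_len = 32;
    HMAC(EVP_sha256(), group_key, (int)group_key_len,
         context, sizeof context, out, &out_len);
}

int main(void)
{
    const uint8_t group_key[32] = { 0x01 };   /* shared across the job   */
    uint8_t k[32];
    derive_conn_key(group_key, sizeof group_key, 7, 9, k);
    printf("conn key byte 0: %02x\n", k[0]);  /* unique per (src, dst)   */
    return 0;
}
```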
Additional Future Capabilities
The UEC has standardized two optional hardware-based features for hop-by-hop performance improvement:
- Link Level Retry (LLR): A retransmit mechanism on an individual link basis. Switches implement a small buffer on each port. If packets are dropped due to uncorrectable FEC errors, LLR retransmits them without host involvement, avoiding costly end-to-end retransmissions and improving performance reliability, especially for time-sensitive collectives.
- Credit-Based Flow Control (CBFC): A modern alternative to Priority Flow Control (PFC) for avoiding drops. Unlike PFC, which requires per-link tuning and offers coarse granularity, CBFC allows the receiving switch to request exactly the number of packets it has buffer space for. This avoids complex link-specific tuning, allows more efficient buffer utilization, and supports a larger number of virtual traffic classes than the eight priorities PFC inherits from the 802.1p header.
These features require new logic design in switching silicon and will be available in future next-generation systems.
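For the credit-based flow control described above, the minimal model below captures the accounting: the downstream device grants credits matching its free buffer cells, the upstream device spends one credit per packet and stalls at zero, and credits are returned as buffers drain. Names and cell counts are illustrative.

```c
/* Minimal model of credit-based flow control on one virtual class. */
#include <stdbool.h>
#include <stdio.h>

struct cbfc_link { int credits; };            /* per virtual traffic class */

static bool can_send(struct cbfc_link *l)      { return l->credits > 0; }
static void send_pkt(struct cbfc_link *l)      { l->credits--; }
static void credit_return(struct cbfc_link *l, int freed) { l->credits += freed; }

int main(void)
{
    struct cbfc_link vc = { .credits = 4 };   /* receiver advertised 4 cells */

    for (int i = 0; i < 6; i++) {
        if (can_send(&vc)) {
            send_pkt(&vc);
            printf("sent packet %d (credits left %d)\n", i, vc.credits);
        } else {
            printf("packet %d waits: no credits\n", i);
        }
    }
    credit_return(&vc, 2);                    /* buffers drained downstream */
    printf("credits after return: %d\n", vc.credits);
    return 0;
}
```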
Summary
UEC updates the relationship between AI and HPC applications and networks. Tight integration between application semantics and network behaviors creates a native transport mechanism that combines the strengths of RDMA with best-in-class Ethernet solutions, providing a powerful platform for building next-generation applications on Ethernet Transport.
Arista, as a founding member of the UEC, is committed to this vision. They are laying the groundwork for best-in-class, open standards-based infrastructure with diverse platforms that offer freedom of choice and flexibility for re-architecture and redeployment, maximizing long-term investment protection. Arista's Etherlink portfolio is UET-ready, and the company is actively developing future systems and partnering with industry pioneers to build the best Ethernet networks for high-performance computing.