Ethernet for AI Training

Overview

Ethernet for AI Training refers to the emerging practice of building large-scale AI training cluster interconnects on standard Ethernet, using merchant silicon and open networking stacks such as SONiC and the Switch Abstraction Interface (SAI), as an alternative to NVIDIA InfiniBand. InfiniBand has dominated AI training networking because of its native RDMA support and predictable performance. Ethernet-based solutions using RoCE (RDMA over Converged Ethernet) are nevertheless gaining traction at hyperscale operators, who value multi-vendor silicon choice, open-source software, and the operational familiarity of Ethernet.

The 2023 OCP Global Summit Networking track provided compelling evidence from multiple hyperscalers that Ethernet with purpose-built enhancements (adaptive routing, telemetry-based load balancing, programmable congestion control, and advanced traffic engineering) can approach InfiniBand-class performance for AI training workloads at scales exceeding 10,000 nodes. Alibaba operates production AI training clusters of more than 10,000 nodes on Broadcom Tomahawk 5 merchant silicon with SONiC. Tencent reports a 40% bandwidth improvement over baseline DCQCN (Data Center Quantized Congestion Notification) using load balancing driven by IFA (In-band Flow Analyzer) telemetry, at 64K-256K GPU scale. The Ultra Ethernet Consortium (UEC) is formalizing protocol extensions, including CSIG (Congestion Signaling), LLR (Link Layer Retry), and CBFC (Credit-Based Flow Control), to harden Ethernet for AI backend fabrics. This provides a standards path that complements proprietary enhancements from NVIDIA Spectrum-X and others.
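
As a rough illustration of why telemetry-based load balancing can outperform static hashing, the sketch below contrasts a load-blind hash of the flow key with a choice driven by per-path utilization. It is a toy model under stated assumptions, not Tencent's algorithm or the IFA data format: the function names, the utilization values, and the flow key are all hypothetical.

```python
import zlib


def static_hash_path(flow_key: bytes, num_paths: int) -> int:
    """Pick a path from a hash of the flow key, ignoring current load."""
    return zlib.crc32(flow_key) % num_paths


def least_loaded_path(utilization: list[float]) -> int:
    """Pick the path whose latest telemetry sample shows the lowest utilization."""
    return min(range(len(utilization)), key=lambda i: utilization[i])


# Hypothetical per-path link utilization (0.0 to 1.0), as if reported by telemetry.
utilization = [0.92, 0.35, 0.60, 0.88]
# Hypothetical flow identifier; real switches hash packet header fields.
flow = b"10.0.0.1:4791->10.0.1.7:4791"

print("static hash picks path", static_hash_path(flow, len(utilization)))
print("telemetry-aware picks path", least_loaded_path(utilization))
```

The static hash may place a large flow on an already congested path, which matters for AI training traffic because it tends to consist of a small number of long-lived, high-bandwidth flows. A telemetry-aware selector can steer such flows toward underused links, at the cost of collecting and acting on fresh measurements.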