AI Traffic Engineering

Overview

AI Traffic Engineering encompasses techniques for optimizing traffic distribution and managing congestion in networks that carry AI/ML training workloads. Unlike traditional data center traffic (many small TCP flows with natural entropy), AI training generates synchronous, low-entropy elephant flows over RoCE/RDMA that are extremely sensitive to packet loss, latency, and load imbalance. Standard ECMP hashing performs poorly here because training traffic consists of relatively few connections, so the hash has too little entropy to spread load evenly and several elephant flows can land on the same link. In addition, bursty collective communication patterns (AllToAll, AllReduce) create microsecond-scale incast conditions that overwhelm switch buffers.
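The ECMP problem can be illustrated with a small simulation. The following is a minimal sketch, not a model of any real switch: the SHA-256 hash stands in for a vendor hash function, and the addresses, flow sizes, source ports, and the helper names ecmp_link, link_loads, and imbalance are all hypothetical. Only the RoCEv2 destination UDP port (4791) is taken from the protocol itself.

```python
import hashlib

def ecmp_link(five_tuple, num_links):
    """Pick an uplink by hashing the 5-tuple (SHA-256 is a stand-in for a switch hash)."""
    key = "|".join(str(field) for field in five_tuple).encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % num_links

def link_loads(flows, num_links):
    """Sum bytes per uplink; each flow is (src, dst, sport, dport, proto, size)."""
    loads = [0] * num_links
    for *five_tuple, size in flows:
        loads[ecmp_link(five_tuple, num_links)] += size
    return loads

def imbalance(loads):
    """Ratio of the busiest link to the mean load (1.0 means perfectly even)."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean if mean else 0.0

NUM_LINKS = 8

# Hypothetical "traditional" traffic: thousands of small TCP flows with varied ports.
mice = [
    (f"10.0.0.{i % 50}", f"10.0.1.{i % 40}", 1024 + i, 443, 6, 1)
    for i in range(8000)
]

# Hypothetical training traffic: eight equal elephant flows to the RoCEv2 UDP port.
elephants = [
    (f"10.0.0.{i}", f"10.0.1.{i}", 49152, 4791, 17, 1000)
    for i in range(8)
]

if __name__ == "__main__":
    print("mice imbalance:     ", round(imbalance(link_loads(mice, NUM_LINKS)), 2))
    print("elephant imbalance: ", round(imbalance(link_loads(elephants, NUM_LINKS)), 2))
```

With thousands of small flows the per-link totals stay close to the mean. With only eight flows across eight links, a uniform hash places each flow on its own link in well under 1% of cases, so some links usually carry two or more elephants while others sit idle.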

Multiple approaches emerged at the 2023 OCP Global Summit: centralized controller-based traffic engineering using exact-match tables (Meta/Broadcom), telemetry-based adaptive load balancing (Broadcom IFA at Tencent), dynamic and global load balancing (Broadcom DLB/GLB at Alibaba), source routing with pre-programmed paths (Marvell FPR), and packet-level adaptive routing (NVIDIA Spectrum-X). The key insight across all approaches is that AI traffic requires fundamentally different treatment from traditional Ethernet traffic. These techniques are closely related to Adaptive Routing For AI and AI Network Telemetry, and they are a core enabler for Ethernet For AI Training.
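What these approaches share is that path selection is driven by knowledge of the flows or of link load rather than by a stateless hash. The following is a generic sketch of that idea, assuming a controller that knows every flow's size in advance. It is a simple greedy heuristic, not the algorithm used by any of the vendors or operators named above, and the function name assign_least_loaded and the flow identifiers are hypothetical.

```python
def assign_least_loaded(flow_sizes, num_links):
    """Greedily pin each flow to the currently least-loaded link, largest flows first.

    flow_sizes maps a flow identifier to its expected size.
    Returns (placement, loads): flow id -> link index, and the total load per link.
    """
    loads = [0] * num_links
    placement = {}
    # Placing the largest flows first keeps the greedy result closer to even.
    for flow_id, size in sorted(flow_sizes.items(), key=lambda item: -item[1]):
        link = min(range(num_links), key=loads.__getitem__)
        loads[link] += size
        placement[flow_id] = link
    return placement, loads

if __name__ == "__main__":
    # Eight equal elephant flows over eight links: one flow per link.
    flows = {f"flow{i}": 1000 for i in range(8)}
    placement, loads = assign_least_loaded(flows, 8)
    print(placement)
    print(loads)
```

The resulting placement could then be installed as per-flow forwarding entries, which is roughly the role exact-match tables play in the controller-based approach. Telemetry-driven and per-packet schemes make a comparable decision in the data plane and at much finer time scales.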
