Adaptive Routing For AI

Overview

Adaptive Routing for AI refers to the traffic engineering techniques that dynamically distribute AI training traffic across available fabric paths to maximize bandwidth utilization and maintain performance under link failures. Traditional ECMP hashing is fundamentally inadequate for AI workloads because training traffic has low entropy (few destination connections) and is bursty, causing hash imbalance. Three approaches are being deployed: NVIDIA's Global Adaptive Routing (fabric-wide failure awareness), Broadcom's ARS (local flowlet and per-packet spray modes), and Alibaba's custom multi-path transport (Solar protocol). The key insight across all approaches is that local-only routing decisions cause disproportionate performance collapse at scale.

These techniques are foundational to making Ethernet For AI Training competitive with InfiniBand, and are closely coupled to AI Network Telemetry and AI Traffic Engineering practices. At 100K+ GPU scale, link failures are statistically inevitable, making adaptive routing a prerequisite rather than an optimization.

Sign in to read the full article.

Sign In