In-Network Collective Acceleration

Overview

In-Network Collective Acceleration (INC) offloads collective communication operations (AllReduce, AllToAll, AllGather, ReduceScatter) from GPU endpoints to network switch silicon, reducing the time GPUs sit idle during AI training. Collectives account for 90%+ of AI fabric bandwidth, and exposed GPU idle time spent waiting on collectives can reach 20-30% of training step time. The compute cost on the switch is modest: a 51 Tbps switch carries roughly 3.2 trillion 16-bit elements per second, so about 3 TFLOPS of BF16 adder compute is enough to participate in reductions at line rate, making INC feasible without significant silicon area cost.
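The adder figure above can be checked with a short back-of-the-envelope calculation. This is a minimal sketch, not a description of any vendor's design: the function name `adder_tflops` is hypothetical, and it assumes that every bit crossing the switch is BF16 reduction payload (no header overhead) and that each incoming element costs one addition, which makes the result an upper bound.

```python
def adder_tflops(switch_tbps: float, bits_per_element: int = 16) -> float:
    """Estimate the adder throughput (in TFLOPS) needed to reduce at line rate.

    Assumptions (illustrative only):
      - all switch bandwidth carries reduction payload, with no header overhead
      - one addition per incoming element (BF16 = 16 bits by default)
    """
    elements_per_second = switch_tbps * 1e12 / bits_per_element
    return elements_per_second / 1e12  # one add per element, scaled to tera-ops


if __name__ == "__main__":
    # 51 Tbps of BF16 traffic works out to roughly 3.2 TFLOPS of adds.
    print(f"{adder_tflops(51.0):.1f} TFLOPS")
```

Because real traffic includes headers and non-reduction flows, the adder throughput actually needed would be somewhat lower than this estimate.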

INC is native to Ethernet fabrics for AI training. Unlike proprietary collective offload schemes that require custom NICs or headers, INC as implemented in Broadcom's Tomahawk Ultra operates over standard RoCE with no proprietary protocol modifications, so any RDMA-capable NIC can participate. The standardization push, co-driven by Oracle Cloud Infrastructure (operating clusters of 800k+ GPUs), targets the Switch Abstraction Interface (SAI) as the configuration plane, with the goal of multi-vendor interoperability rather than lock-in to a single switch vendor's INC implementation.