1. NVIDIA Technical Blog, "A Deep Dive into Congestion Control for Large-Scale AI and HPC" (May 17, 2023): This article explains the nature of AI workloads: "AI and HPC workloads are characterized by bursty and synchronized communication patterns... This leads to incast congestion, where multiple servers communicate with a single server simultaneously." It further details the inadequacy of traditional, reactive congestion-control mechanisms, stating, "Traditional lossless Ethernet relies on a priority-based flow control (PFC) mechanism... However, PFC is a hop-by-hop mechanism that is reactive and has no end-to-end visibility of the network congestion." This supports the idea that traditional mechanisms increase congestion risk.
2. NVIDIA Whitepaper, "NVIDIA Spectrum-X Ethernet Platform for AI Clouds" (WP-10663-001v1.0, May 2023): Page 4, Section "AI Cloud Networking Challenges," states: "Traditional networks use static routing protocols and load-balancing mechanisms such as Equal Cost Multipath (ECMP). These mechanisms are oblivious to the network state and can lead to network congestion, high latencies, and low network utilization." This directly supports option C and refutes option D.
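The obliviousness described in the whitepaper can be illustrated with a minimal sketch (not NVIDIA code; addresses and ports are hypothetical): standard ECMP hashes a flow's 5-tuple to pick a path, so the choice never changes with link load, and two large flows can collide on one path while other paths sit idle.

```python
# Minimal sketch of static ECMP path selection (illustrative only).
# The path depends solely on a hash of the flow 5-tuple, not on current
# link utilization -- hence "oblivious to the network state".
import hashlib

NUM_PATHS = 4

def ecmp_path(src_ip, dst_ip, src_port, dst_port, proto="udp"):
    """Static ECMP: same 5-tuple always maps to the same path index."""
    key = f"{src_ip},{dst_ip},{src_port},{dst_port},{proto}".encode()
    return int(hashlib.md5(key).hexdigest(), 16) % NUM_PATHS

# Two hypothetical elephant flows between GPU servers:
flows = [
    ("10.0.0.1", "10.0.1.1", 50000, 4791),
    ("10.0.0.2", "10.0.1.2", 50001, 4791),
]
load = [0] * NUM_PATHS
for f in flows:
    load[ecmp_path(*f)] += 1  # each flow pinned to one path, regardless of load
print(load)  # if both flows hash to the same index, that link congests while others idle
```

Because the mapping is deterministic per flow, a collision between two elephant flows persists for the lifetime of those flows, which is exactly the "network congestion, high latencies, and low network utilization" the whitepaper attributes to static mechanisms.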
3. NVIDIA Whitepaper, "A Deep Dive into the NVIDIA Quantum-2 InfiniBand Architecture" (WP-09993-001v1.1, November 2021): Page 10, Section "Adaptive Routing," describes the problem solved by adaptive routing: "AI and HPC application communication patterns can create highly imbalanced traffic loads... Static routing may lead to network congestion on some paths, while other paths are underutilized." This highlights the issue with static, flow-based routing (like ECMP) and the nature of AI traffic (imbalanced loads, i.e., elephant flows), supporting options A and C.
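By contrast, the adaptive routing the Quantum-2 whitepaper describes consults current path state at send time. A minimal sketch of the idea (assumed greedy least-loaded selection, not the actual InfiniBand implementation) shows how imbalanced traffic spreads across paths instead of piling onto one:

```python
# Minimal sketch of adaptive routing (illustrative, assumed logic):
# pick the least-loaded path at transmission time rather than a static hash.
def adaptive_path(loads):
    """Return the index of the currently least-loaded path."""
    return min(range(len(loads)), key=lambda p: loads[p])

loads = [0.0, 0.0, 0.0, 0.0]
for size in [10.0, 10.0, 1.0, 1.0]:   # two elephant flows, two mice (hypothetical sizes)
    p = adaptive_path(loads)
    loads[p] += size
print(loads)  # -> [10.0, 10.0, 1.0, 1.0]: each flow lands on a distinct path
```

With static routing the same two elephant flows could share a path while others stayed underutilized; here the load-aware choice keeps all paths in use, matching the whitepaper's contrast between static and adaptive routing.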