Q: 5
You are managing a Kubernetes cluster running AI training jobs using TensorFlow. The jobs require
access to multiple GPUs across different nodes, but inter-node communication seems slow,
impacting performance.
What is a potential networking configuration you would implement to optimize inter-node
communication for distributed training?
Options
Discussion
For GPU-heavy distributed training, InfiniBand (D) seems like the best shot. It gives you way lower latency and much better bandwidth than Ethernet, which matters when syncing model weights all the time. Pretty sure that's what you'd see in most real-world AI clusters, but someone chime in if they've made B work at scale.
Definitely D here. InfiniBand is what you'd typically see in HPC clusters for distributed AI training because of its low latency and high bandwidth, which matter more than jumbo frames on Ethernet. Pretty sure I saw a similar question in practice exams too. Agree?
It's D. InfiniBand is built for this sort of low-latency, high-throughput workload, so it beats standard Ethernet networking here.
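If it helps to see what D looks like on the software side, here's a minimal sketch of the TensorFlow worker code, assuming the pods can actually reach the InfiniBand fabric (e.g., host networking plus an RDMA device plugin on the nodes) and that the job launcher (such as a TFJob operator) sets TF_CONFIG for each worker. The NCCL env vars are standard knobs, but the interface name ib0 is an assumption about how the hosts expose the IB device, not something from the question.

import os
import tensorflow as tf

# Assumption: NCCL can see the node's InfiniBand verbs devices from inside the pod.
os.environ.setdefault("NCCL_IB_DISABLE", "0")       # let NCCL use InfiniBand/RDMA
os.environ.setdefault("NCCL_SOCKET_IFNAME", "ib0")  # hypothetical IB interface name

# Route multi-worker all-reduce traffic through NCCL instead of plain gRPC.
options = tf.distribute.experimental.CommunicationOptions(
    implementation=tf.distribute.experimental.CommunicationImplementation.NCCL
)
strategy = tf.distribute.MultiWorkerMirroredStrategy(communication_options=options)

with strategy.scope():
    # Toy model just to keep the sketch self-contained.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(32,)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

Point being, the strategy choice in TensorFlow and the fabric on the cluster go together: NCCL's all-reduce is what actually benefits from InfiniBand's latency and bandwidth during weight synchronization.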
B