Q: 5
You are managing a Kubernetes cluster running AI training jobs using TensorFlow. The jobs require
access to multiple GPUs across different nodes, but inter-node communication seems slow,
impacting performance.
What is a potential networking configuration you would implement to optimize inter-node
communication for distributed training?
Options
Discussion
Anyone used official NVIDIA docs or hands-on labs for cluster networking configs on this? Practice exams seem to push D but I've seen B suggested too.
D, but if the question had said you couldn't upgrade hardware, B might've been right. It all hinges on whether new gear is allowed.
D here. InfiniBand cuts latency and boosts bandwidth, which is key for distributed training jobs across nodes. B is tempting, but jumbo frames just tweak Ethernet, not the same performance jump. Pretty sure D's what they're looking for.
B, saw something like this on a practice set. Jumbo frames can help with large data transfers if you're on Ethernet.
Maybe B. Jumbo frames get mentioned a lot for high data throughput, especially if you're stuck on Ethernet instead of InfiniBand.
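For anyone wondering what the jumbo-frames tweak actually looks like in a cluster, here's a rough sketch assuming Calico as the CNI (the operator `Installation` resource is Calico-specific, and the 8900 value is illustrative; check your CNI's docs and confirm every NIC and switch in the path supports jumbo frames first):

```yaml
# Sketch: raising the pod-network MTU toward jumbo frames via the
# Calico operator. All host NICs and switches in between must support
# jumbo frames (typically MTU 9000 on the host side), and the pod MTU
# needs headroom for encapsulation overhead.
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  calicoNetwork:
    mtu: 8900   # host MTU 9000 minus encap overhead; adjust for your fabric
```

You can sanity-check end to end with a non-fragmenting ping between nodes, e.g. `ping -M do -s 8972 <other-node-ip>` (9000 minus 28 bytes of IP/ICMP headers). If that fails, something in the path isn't passing jumbo frames.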
Why is nobody talking about C? Does a dedicated storage network really help distributed training comms vs InfiniBand?
D
B or D? I was thinking B since enabling jumbo frames helps with larger AI job traffic, and not every cluster has InfiniBand installed out of the box. Pretty sure D's the high-perf answer if hardware can be upgraded, but for existing Ethernet setups, B is a common tweak. Anyone prefer B in practice?
InfiniBand is the key upgrade here, so D. It directly targets the latency and bandwidth issues common in distributed training jobs, whereas B (jumbo frames) only tweaks Ethernet but can't match InfiniBand performance. Pretty sure D is right unless there's a restriction on hardware changes.
D is right here since InfiniBand is built for this kind of low-latency, high-throughput traffic between nodes, perfect for distributed AI training. B's a common trap if you're thinking Ethernet only, but nothing in the question says you can't use better hardware. Pretty sure about D but open to pushback.
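Since D keeps coming up: here's a hedged sketch of what consuming InfiniBand from a training pod can look like, assuming an RDMA shared-device plugin (e.g. NVIDIA/Mellanox's k8s-rdma-shared-dev-plugin) is already deployed. The resource name, image, and HCA name below are illustrative, not fixed values; they depend on your plugin config and hardware.

```yaml
# Sketch: a TensorFlow worker pod requesting GPUs plus an RDMA device,
# with NCCL pointed at the InfiniBand HCAs. The rdma/... resource name
# is whatever your RDMA device plugin advertises (illustrative here).
apiVersion: v1
kind: Pod
metadata:
  name: tf-worker-0
spec:
  containers:
  - name: trainer
    image: tensorflow/tensorflow:latest-gpu   # placeholder image
    env:
    - name: NCCL_IB_DISABLE
      value: "0"        # let NCCL use the InfiniBand transport
    - name: NCCL_IB_HCA
      value: "mlx5"     # match your HCA device names (illustrative)
    resources:
      limits:
        nvidia.com/gpu: "4"
        rdma/rdma_shared_device_a: "1"   # name set by the RDMA plugin's config
    securityContext:
      capabilities:
        add: ["IPC_LOCK"]   # RDMA requires pinned (locked) memory
```

The point is that InfiniBand isn't just a cable swap: the cluster needs the device plugin so pods can request the HCA, and the collective library (NCCL, which TensorFlow's multi-worker strategies use under the hood) has to be told to use it. That's why D is a configuration answer and not only a hardware one.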