Q: 4
You are deploying a large-scale AI model training pipeline on a cloud-based infrastructure that uses
NVIDIA GPUs. During the training, you observe that the system occasionally crashes due to memory
overflows on the GPUs, even though the overall GPU memory usage is below the maximum capacity.
What is the most likely cause of the memory overflows, and what should you do to mitigate this
issue?
Options
Discussion
D, not A. Batch size matters for total capacity, but if overall usage stays below the maximum, the crashes point to memory fragmentation, as D describes: the allocator can't find a large enough contiguous block even though enough total memory is free. Unified memory management mitigates exactly that kind of failure. I think D is right, but open to other takes.
A Why wouldn't batch size be the main problem here?
D Unified memory management helps when fragmentation causes overflows even if usage looks fine. Seen this in practice, pretty sure that's it.
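For anyone who wants to see what "unified memory management" looks like in practice, here's a minimal CUDA sketch (my own illustration, not from the exam; the 1 GiB size and the prefetch hint are just for demonstration). cudaMallocManaged returns an allocation the driver can migrate between host and device on demand, which is part of why it can ride out fragmentation that would make a plain cudaMalloc fail for lack of a contiguous device block.

```cuda
// Minimal sketch, assuming an NVIDIA GPU and the CUDA runtime API.
// The buffer size and device id below are illustrative only.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = size_t(1) << 30;  // 1 GiB, illustrative only
    float *buf = nullptr;

    // Unified (managed) allocation: pages can live in host memory and be
    // migrated to the GPU when touched, easing pressure when the device
    // heap is fragmented.
    cudaError_t err = cudaMallocManaged(&buf, bytes);
    if (err != cudaSuccess) {
        std::printf("cudaMallocManaged failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Optional hint: prefer residency on device 0 when capacity allows.
    cudaMemPrefetchAsync(buf, bytes, 0);

    // ... training kernels would use buf here ...

    cudaFree(buf);
    return 0;
}
```

In a framework-based pipeline you usually don't call these APIs directly; the point is just that managed memory decouples "allocation succeeds" from "a contiguous device block is available right now," which is the failure mode the question describes.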