Q: 16
Your AI infrastructure team is observing out-of-memory (OOM) errors during the execution of large
deep learning models on NVIDIA GPUs. To prevent these errors and optimize model performance,
which GPU monitoring metric is most critical?
Options
Discussion
I don’t think it’s B. A is the one you want, since OOM errors relate directly to memory, not to how busy the cores are. B is a common trap: high utilization might imply heavy load, but it doesn't cause OOM by itself. Pretty sure A is right, but open to hearing if anyone sees it differently.
A GPU memory usage is what you want to keep an eye on for OOM issues.
B or A here. I went with B because high core utilization usually means the system is working at max capacity, and when you see OOM errors it's easy to blame overall resource exhaustion rather than just memory. Maybe that's off, though, since OOM is a memory-specific thing. If anyone's seen a similar question in exam reports, let me know if A is the safer bet.
Maybe B here. If the GPUs are being maxed out on compute cores, that could also impact model execution speed and efficiency, right? I know OOM errors are memory issues, but sometimes high core load hints at overall resource exhaustion. Not 100% sure.
Watching memory usage is the key for OOM errors, so A. None of the others really alert you before a crash happens.
A
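To illustrate the accepted answer (A): GPU memory usage can be polled with `nvidia-smi` and checked against a warning threshold before an OOM crash happens. A minimal sketch, assuming the `--query-gpu=memory.used,memory.total --format=csv,noheader,nounits` query (which prints one "used, total" line in MiB per GPU); the sample output, the 90% threshold, and the helper names are illustrative assumptions:

```python
# Sketch: parse nvidia-smi memory output and flag GPUs approaching OOM.
# In practice you would capture the output with subprocess.run([...]);
# a hypothetical sample string is used here so the example is self-contained.

def parse_gpu_memory(csv_output):
    """Parse nvidia-smi CSV output into (used_mib, total_mib) tuples, one per GPU."""
    stats = []
    for line in csv_output.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        stats.append((used, total))
    return stats

def check_oom_risk(stats, threshold=0.90):
    """Return indices of GPUs whose memory usage fraction exceeds the threshold."""
    return [i for i, (used, total) in enumerate(stats) if used / total > threshold]

if __name__ == "__main__":
    # Hypothetical output for two 40 GiB GPUs: GPU 0 at 95%, GPU 1 at 25%.
    sample = "38912, 40960\n10240, 40960"
    at_risk = check_oom_risk(parse_gpu_memory(sample))
    print(at_risk)  # -> [0]: GPU 0 is above the 90% warning threshold
```

This is exactly why memory usage is the actionable metric: a threshold alert fires before the allocation that triggers the OOM, whereas core utilization tells you nothing about how close memory is to being exhausted.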