Q: 15
In your AI infrastructure, several GPUs have recently failed during intensive training sessions. To
proactively prevent such failures, which GPU metric should you monitor most closely?
Options
Discussion
A tbh
It's B, not A. Some exam reports and the official study guide mention power metrics for proactive monitoring.
Call it A
B or A? I saw something like this show up in a practice test and they hinted at power consumption being an early sign before actual temp spikes. Not sure if that's actually the best indicator but figured I'd mention it.
A imo. Had something like this in a mock, and temp was the main thing to watch. Overheating leads straight to hardware issues, way more direct than power or memory utilization.
Not totally sure but I think it's A. Anyone else seeing temperature issues on their GPUs lately?
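For anyone who wants to watch both signals debated in this thread, here is a rough Python sketch that polls temperature and power draw through `nvidia-smi`. The query fields (`temperature.gpu`, `power.draw`, `power.limit`) are real `nvidia-smi` fields, but the function names and the 85 °C / 95% thresholds are assumptions made up for illustration, not vendor recommendations. Check your GPU's spec for real limits.

```python
import subprocess

# Real nvidia-smi query fields; everything else here is illustrative.
QUERY = "index,temperature.gpu,power.draw,power.limit"


def read_nvidia_smi():
    """Return raw CSV text from nvidia-smi (needs an NVIDIA driver installed)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def _to_float(field):
    """Convert a CSV field to float; return None for values like '[N/A]'."""
    try:
        return float(field)
    except ValueError:
        return None


def parse_readings(csv_text):
    """Parse 'index, temp, power_draw, power_limit' lines into dicts."""
    readings = []
    for line in csv_text.strip().splitlines():
        idx, temp, draw, limit = [f.strip() for f in line.split(",")]
        readings.append({
            "gpu": int(idx),
            "temp_c": _to_float(temp),
            "power_w": _to_float(draw),
            "limit_w": _to_float(limit),
        })
    return readings


def find_alerts(readings, temp_limit_c=85.0, power_fraction=0.95):
    """Flag GPUs at or above the (hypothetical) temperature or power thresholds."""
    alerts = []
    for r in readings:
        if r["temp_c"] is not None and r["temp_c"] >= temp_limit_c:
            alerts.append((r["gpu"], "temperature"))
        if (r["power_w"] is not None and r["limit_w"] is not None
                and r["power_w"] >= power_fraction * r["limit_w"]):
            alerts.append((r["gpu"], "power"))
    return alerts


if __name__ == "__main__":
    for gpu, reason in find_alerts(parse_readings(read_nvidia_smi())):
        print(f"GPU {gpu}: {reason} threshold reached")
```

Running this on a schedule (cron, or a loop with a sleep) during training would show whether power draw actually climbs before temperature on your hardware, which is the point the B and A answers disagree on.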