Q: 13
In managing an AI data center, you need to ensure continuous optimal performance and quickly
respond to any potential issues. Which monitoring tool or approach would best suit the need to
monitor GPU health, usage, and performance metrics across all deployed AI workloads?
Options
Discussion
Had something like this in a mock; D is correct for sure. DCGM gives deep GPU insights out of the box, which is what most exam scenarios are after. Anyone disagree?
I don't think B is right here, even though Node Exporter with Prometheus can be extended for GPU stats. D (NVIDIA DCGM) is purpose-built for GPU health, so it fits the question much better imo. Anyone else see similar on practice exams?
Option B. Prometheus with Node Exporter. Some setups use Node Exporter for GPU stats so it's a common trap here.
It's B, since Prometheus with Node Exporter can collect system metrics and you can add exporters for GPU. Not 100% sure, but I've seen setups use it for monitoring a range of hardware.
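To make the B vs D point concrete: Node Exporter itself covers host metrics (CPU, memory, disk, network), so in a Prometheus setup the GPU numbers normally come from a separate GPU exporter such as dcgm-exporter, which is built on DCGM. Below is a rough Python sketch of reading that kind of output. The sample text, hostname, values, and the 80 C threshold are all made up for illustration; only the `DCGM_FI_DEV_GPU_UTIL` and `DCGM_FI_DEV_GPU_TEMP` metric names are real DCGM fields, and the parser handles just the simple line shape shown here, not the full Prometheus text format.

```python
import re

# Hypothetical sample of Prometheus text-format output, shaped like what a
# GPU exporter such as dcgm-exporter exposes. Labels and values are invented.
SAMPLE = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-a"} 97
DCGM_FI_DEV_GPU_UTIL{gpu="1",Hostname="node-a"} 12
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0",Hostname="node-a"} 84
DCGM_FI_DEV_GPU_TEMP{gpu="1",Hostname="node-a"} 41
"""

# Matches lines of the form: name{label="v",...} value
METRIC_LINE = re.compile(r'^(\w+)\{([^}]*)\}\s+(\S+)$')
LABEL_PAIR = re.compile(r'(\w+)="([^"]*)"')


def parse_gpu_metrics(text):
    """Return {gpu_index: {metric_name: value}} from exposition text."""
    per_gpu = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = METRIC_LINE.match(line)
        if match is None:
            continue  # ignore anything this simplified parser cannot read
        name, labels, value = match.groups()
        gpu = dict(LABEL_PAIR.findall(labels)).get("gpu")
        per_gpu.setdefault(gpu, {})[name] = float(value)
    return per_gpu


def hot_gpus(per_gpu, max_temp_c=80.0):
    """List GPU indices whose temperature exceeds an assumed threshold."""
    return sorted(
        gpu for gpu, metrics in per_gpu.items()
        if metrics.get("DCGM_FI_DEV_GPU_TEMP", 0.0) > max_temp_c
    )


if __name__ == "__main__":
    metrics = parse_gpu_metrics(SAMPLE)
    print("GPUs over threshold:", hot_gpus(metrics))
```

So even the "Prometheus" route usually ends up relying on DCGM for the GPU side, which is why D fits the question's wording better.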