1. NVIDIA AI Enterprise Deployment Guide (v5.0): In the section "Monitoring," it states, "NVIDIA AI Enterprise includes features for monitoring the health and performance of GPUs... These metrics can be integrated with popular monitoring and alerting solutions like Prometheus and Grafana." This directly confirms the use of Grafana for observability. (Source: NVIDIA AI Enterprise Documentation).
2. NVIDIA DCGM Documentation - Integration with Prometheus and Grafana: This official documentation provides a dedicated section on how to use the dcgm-exporter to feed GPU metrics into Prometheus and visualize them with pre-built or custom Grafana dashboards. This is the standard documented method for achieving observability in this environment. (Source: NVIDIA DCGM Documentation).
3. HPE Reference Configuration for NVIDIA AI Enterprise on HPE ProLiant Servers: These documents detail the validated software stack. In the "Software Overview" or "Management and Operations" sections, they describe the inclusion of the NVIDIA AI Enterprise suite, which contains the necessary monitoring tools (like DCGM) that integrate with platforms like Grafana for a complete observability solution. (e.g., Document ID: a50002213enw).