Question 11 - NCP-AAI Exam Dumps 2026 – NVIDIA Agentic AI Professional Cert

Q: 11

You are deploying a multi-agent customer-support system on Kubernetes using NVIDIA GPU nodes and Triton Inference Server. Traffic spikes during product launches. You need <100ms response times, zero downtime, automatic GPU scaling, and full monitoring. Which deployment setup best achieves cost-effective, reliable, low-latency scaling?

Options

Correct Answer: C

Explanation

This option is a production-grade setup that addresses every requirement. Spreading nodes across availability zones keeps the service running if one zone fails, which supports the zero-downtime goal. A mix of GPU types can reduce cost. Cluster Autoscaler (for nodes) and Horizontal Pod Autoscaler (for pods) together provide automatic scaling at both levels. Critically, driving HPA from custom GPU and latency metrics collected in Prometheus ties scaling to the signals that actually reflect inference load, which helps keep responses under the 100 ms target. NVIDIA DCGM supplies detailed per-GPU monitoring for performance analysis and alerting, going beyond the default Kubernetes resource metrics, which cover only CPU and memory.
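To make the HPA behavior concrete, here is a minimal sketch of the replica calculation described in the Kubernetes HPA documentation: desired = ceil(current replicas × current metric / target metric), with no change when the ratio is within a tolerance (0.1 by default). The function name, the 80 ms latency target, and the replica bounds are assumptions chosen for illustration. A real HPA also applies stabilization windows and scaling policies that this sketch leaves out.

```python
import math


def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Approximate the HPA rule: ceil(current_replicas * current / target)."""
    ratio = current_value / target_value
    # Inside the tolerance band the replica count is left unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds (hypothetical values here).
    return max(min_replicas, min(max_replicas, wanted))


# 4 pods observing 150 ms p95 latency against a hypothetical 80 ms target.
print(desired_replicas(4, 150.0, 80.0, max_replicas=20))  # 8
```

Setting the target below the 100 ms requirement (80 ms in this sketch) leaves headroom so that new pods can start before the limit is breached.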

Why Incorrect

A. Skipping readiness probes compromises reliability and zero-downtime goals. Scaling on network throughput is an inaccurate proxy for GPU workload demand.

B. Disabling autoscaling and using a fixed pod count is not cost-effective and cannot handle traffic spikes efficiently, violating the core scaling requirement.

D. Spot instances risk preemption, which conflicts with the "zero downtime" requirement. Scaling on memory usage is inappropriate for GPU-bound inference tasks.
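The readiness point in option A can be illustrated with a small check against Triton's HTTP health endpoint, `/v2/health/ready`, which returns 200 once the server is ready for inference. This is only a sketch: the function name, the `localhost:8000` address (Triton's default HTTP port), and the injectable `fetch` parameter are assumptions for illustration. In a cluster the same path would normally be used in a Kubernetes `readinessProbe` rather than called from Python.

```python
import urllib.error
import urllib.request


def triton_ready(base_url="http://localhost:8000", timeout=1.0,
                 fetch=urllib.request.urlopen):
    """Return True only if /v2/health/ready answers with HTTP 200."""
    try:
        with fetch(f"{base_url}/v2/health/ready", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx error response:
        # treat the pod as not ready so it receives no traffic.
        return False
```

Without a check like this, a pod that is still loading models can receive requests during a rollout or scale-up, which is how skipped readiness probes undermine the zero-downtime goal.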

References

1. NVIDIA. (2024). NVIDIA GPU Operator Documentation. Section on "Scaling and Resource Management." The documentation details using the NVIDIA GPU Operator with Kubernetes Cluster Autoscaler and HPA based on GPU metrics for efficient scaling.

2. NVIDIA. (2024). NVIDIA DCGM (Data Center GPU Manager) Documentation. The documentation outlines how DCGM exposes detailed GPU metrics to monitoring systems like Prometheus, which is essential for observability and intelligent autoscaling as described in the correct answer.

3. Kubernetes Documentation. (2024). Horizontal Pod Autoscaler. The official documentation describes configuring HPA with custom metrics (e.g., from Prometheus), which is the standard mechanism for scaling workloads based on application-specific signals like GPU utilization or request latency.
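As a rough illustration of the DCGM-to-Prometheus path mentioned in reference 2, the sketch below averages GPU utilization from Prometheus text-format output. `DCGM_FI_DEV_GPU_UTIL` is the utilization gauge exported by dcgm-exporter. The sample values, label sets, and function name are made up for the example, and the parser assumes sample lines carry no trailing timestamp.

```python
def mean_gpu_util(exposition_text, metric="DCGM_FI_DEV_GPU_UTIL"):
    """Average one metric across all samples in Prometheus text format."""
    values = []
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The metric name ends at the first '{' (labels) or first space.
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name == metric:
            # Assumes the value is the last field (no timestamp).
            values.append(float(line.rsplit(" ", 1)[1]))
    return sum(values) / len(values) if values else None


# Fabricated sample in the shape dcgm-exporter emits.
SAMPLE = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-a"} 92
DCGM_FI_DEV_GPU_UTIL{gpu="1",Hostname="node-a"} 78
DCGM_FI_DEV_FB_USED{gpu="0",Hostname="node-a"} 10240
"""

print(mean_gpu_util(SAMPLE))  # 85.0
```

In practice Prometheus scrapes and aggregates these series itself, and a metrics adapter exposes the result to HPA. The sketch only shows what the underlying signal looks like.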
