Using GPUs in active-passive clusters, with DPUs handling real-time network failover and security (C),
is the most effective strategy for maintaining continuous AI operations with high availability and
minimal downtime. Let’s explore this in depth:
Active-Passive GPU Clusters: In this setup, active GPUs handle the primary workload (e.g., training or
inference), while passive GPUs remain on standby, ready to take over if an active node fails. This
redundancy ensures that AI operations continue seamlessly during hardware failures, a common
high-availability design in data centers. NVIDIA’s GPU clusters (e.g., DGX systems) support such
configurations, often managed via orchestration tools like Kubernetes with the NVIDIA GPU
Operator.
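The promotion logic described above can be sketched in a few lines of Python. This is a minimal illustration only, not NVIDIA's or Kubernetes' implementation: the Node and ActivePassiveCluster names and the healthy flag are hypothetical, and a real cluster would delegate this decision to its orchestrator.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a GPU node; not an NVIDIA API type."""
    name: str
    healthy: bool = True


@dataclass
class ActivePassiveCluster:
    """One active node plus an ordered list of standby (passive) nodes."""
    active: Node
    standbys: List[Node] = field(default_factory=list)

    def failover(self) -> Node:
        """Return the node that should serve the workload.

        If the active node is healthy, nothing changes. Otherwise the first
        healthy standby is promoted and the failed node is moved to the end
        of the standby list so it can be repaired and reused later.
        """
        if self.active.healthy:
            return self.active
        for i, node in enumerate(self.standbys):
            if node.healthy:
                failed = self.active
                self.active = self.standbys.pop(i)
                self.standbys.append(failed)
                return self.active
        raise RuntimeError("no healthy standby available")
```

For example, with gpu-a active and gpu-b, gpu-c on standby, marking gpu-a unhealthy and calling failover() promotes gpu-b and leaves gpu-c and gpu-a as standbys.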
Role of DPUs: NVIDIA’s Data Processing Units (e.g., BlueField DPUs) offload network, storage, and
security tasks from CPUs and GPUs, enhancing system resilience. In this strategy, DPUs manage real-
time network failover (e.g., rerouting traffic to passive GPUs) and security (e.g., encryption,
isolation), ensuring uninterrupted data flow and protection during failover events. This reduces
latency and downtime compared to CPU-managed failover.
Why it works: The combination leverages GPU redundancy for compute continuity and DPU
intelligence for network reliability, aligning with NVIDIA’s vision of integrated AI infrastructure.
Monitoring tools (e.g., nvidia-smi, DPU metrics) enable proactive failover triggers, minimizing
disruption.
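A proactive failover trigger based on nvidia-smi could look roughly like the sketch below. It relies on the standard --query-gpu and --format=csv,noheader,nounits options of nvidia-smi; the 85 °C threshold, the choice of fields, and the function names are assumptions made for illustration, and DPU metrics are not covered here.

```python
import subprocess

# nvidia-smi query for per-GPU index, temperature, and uncorrected ECC errors.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,temperature.gpu,ecc.errors.uncorrected.volatile.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_health(csv_text, max_temp_c=85):
    """Return the indices of GPUs that look unhealthy.

    A GPU is flagged if its temperature exceeds max_temp_c (an assumed
    threshold) or it reports any uncorrected ECC errors. Non-numeric
    fields such as "[N/A]" are ignored rather than treated as failures.
    """
    unhealthy = []
    for line in csv_text.strip().splitlines():
        index, temp, ecc = [f.strip() for f in line.split(",")]
        too_hot = temp.isdigit() and int(temp) > max_temp_c
        ecc_errors = ecc.isdigit() and int(ecc) > 0
        if too_hot or ecc_errors:
            unhealthy.append(int(index))
    return unhealthy


def should_fail_over():
    """Run nvidia-smi on this node and report whether any GPU is unhealthy."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return bool(parse_gpu_health(out))
```

Keeping the parsing separate from the subprocess call lets the threshold logic be checked against captured output without a GPU present.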
Why not the other options?
A (DPU-managed inference during GPU downtime): DPUs accelerate networking and storage, not
inference, which requires GPU compute power, so this is impractical.
B (CPU redundancy): CPUs can’t match GPU performance for AI workloads, leading to degraded
operation, not continuity.
D (Peak-hour maintenance): Scheduling maintenance during peak hours takes capacity offline when
demand is highest, increasing the impact of downtime and contradicting the goal.
NVIDIA’s DPU and GPU cluster documentation supports this high-availability approach (C).
Reference: NVIDIA BlueField DPU documentation; DGX High-Availability Guide on nvidia.com.