Q: 3
You are tasked with contributing to the operations of an AI data center that requires high availability
and minimal downtime. Which strategy would most effectively help maintain continuous AI
operations in collaboration with the data center administrator?
Options
Discussion
Makes sense to me, option C. Active-passive GPU with DPU-managed failover is exactly how you'd architect HA for AI workloads, at least from everything I've seen.
I remember a similar scenario from labs and from some exam reports; it's C. This matches what NVIDIA recommends for high-availability AI ops.
A is wrong; it's C. The DPU handles network and security offload, not inference jobs, and CPUs can't realistically match GPUs on AI workloads for HA ops. Active-passive GPU clusters are pretty much how NVIDIA does high availability now.
C. GPUs in active-passive clusters plus DPU network failover is the standard setup for minimizing downtime.
Option C. Had something like this in a mock; GPU active-passive with the DPU handling network failover is the standard HA setup for AI these days. Pretty sure that's what they want.
C vs A. C lines up with how NVIDIA builds high availability: active-passive GPU clusters, with DPUs for network failover and security. A looks tempting, but DPUs don't run inference, so it's misleading. Pretty sure C is the best fit; open to counterpoints if I'm missing something.
B
Interesting wording. Aren't options A and B a bit of a trap here? DPUs don't actually do the AI inference, and CPUs can't handle GPU-scale workloads for real-time AI ops. Does anyone see a scenario where failover to CPUs would genuinely deliver "minimal downtime" for the kind of workloads NVIDIA's targeting?
C, not B. CPUs just aren't a real substitute for GPUs on AI workloads; they're way slower. An active-passive GPU setup plus DPU-powered network failover is designed for exactly this high-availability scenario. Pretty sure that's what NVIDIA recommends, but let me know if there's a counterexample.
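For anyone unsure what "active-passive with failover" actually means mechanically, here's a toy sketch in Python. All the names (`GpuNode`, `FailoverController`) are made up for illustration; in a real deployment the DPU and cluster manager handle health checks and traffic redirection, not application code:

```python
class GpuNode:
    """Hypothetical stand-in for a GPU node and its health status."""
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy


class FailoverController:
    """Keeps one node active; promotes the standby if the active fails."""
    def __init__(self, active, standby):
        self.active = active
        self.standby = standby

    def tick(self):
        # Periodic health check: if the active node is down and the
        # standby is up, swap roles. In a real cluster, the DPU would
        # redirect network traffic to the newly active node here.
        if not self.active.healthy and self.standby.healthy:
            self.active, self.standby = self.standby, self.active
        return self.active.name


a = GpuNode("gpu-a")
b = GpuNode("gpu-b")
ctl = FailoverController(a, b)
print(ctl.tick())   # gpu-a serves while healthy
a.healthy = False
print(ctl.tick())   # standby gpu-b is promoted
```

The point of active-passive is that the standby GPU is already provisioned and idle, so failover is just role promotion plus network redirection, which is why it delivers lower downtime than re-provisioning or falling back to CPUs.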
A is wrong; I'd say B. Redundant CPUs could step in if the GPUs fail, so at least something keeps running. I know it's not ideal for full AI workloads, but it should help with uptime a bit. Not totally confident though, thoughts?