Q: 11
You are monitoring the resource utilization of a DGX SuperPOD cluster using NVIDIA Base Command
Manager (BCM). The system is experiencing slow performance, and you need to identify the cause.
What is the most effective way to monitor GPU usage across nodes?
Options
Discussion
B makes sense. The dashboard gives you real-time cluster stats so you’re not checking every node one by one.
Don't think D is right here, since nvidia-smi only gives info per node and isn't practical for tracking many nodes in a cluster. B's dashboard shows everything together, which is what the question asks for.
Had something like this in a mock, pretty sure the Base View dashboard (B) is what you want for cluster-wide GPU stats.
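For anyone curious what the "checking every node one by one" approach actually looks like versus the Base View dashboard, here is a rough sketch (node hostnames are made up):

    # manually polling GPU utilization per node over SSH -- workable for 2 nodes, painful for a SuperPOD
    for node in dgx-node01 dgx-node02; do
        echo "== $node =="
        ssh "$node" nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv,noheader
    done

The BCM Base View dashboard aggregates the same GPU metrics across all nodes in one place, which is why B is the practical answer here.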
Q: 12
You need to do maintenance on a node. What should you do first?
Options
Discussion
Why are B and C both listed if they say the same thing? Looks like a distractor.
Need to prevent new jobs hitting the node first, so A. That way existing jobs finish and Slurm handles it smoothly.
Yep, it's A. Drain lets running jobs end gracefully so nothing gets killed off early.
It’s A. Always best to drain the node with scontrol update so current jobs finish before maintenance. Setting the state to down would just kill running jobs, so draining is less disruptive. Pretty sure that’s standard Slurm procedure. Agree? Is there a real difference between B and C here or is that a trick?
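Rough sketch of the drain workflow being described (node name and reason are just examples):

    # stop new jobs from landing on the node; running jobs finish normally
    scontrol update NodeName=dgx-node01 State=DRAIN Reason="scheduled maintenance"

    # ...do the maintenance...

    # return the node to service afterwards
    scontrol update NodeName=dgx-node01 State=RESUME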
Q: 13
A DGX H100 system in a cluster is showing performance issues when running jobs.
Which command should be run to generate system logs related to the health report?
Options
Discussion
Option C but honestly not super sure on this one. Anyone else seen this before?
Q: 14
A Slurm user needs to submit a batch job script for execution tomorrow.
Which command should be used to complete this task?
Options
Discussion
Probably A. Had something like this in a mock, and sbatch with --begin lets you delay batch jobs. The other options aren't for batch script submission. Let me know if you think otherwise.
It's A
A tbh
Guessing C
I don’t think it’s B. A is the right one for Slurm batch jobs, since submit isn’t even a Slurm command. Easy to mix that up if you’re new to HPC tools.
Does the question mean the user wants the job to start automatically at midnight, or just anytime tomorrow? If it's about a specific scheduled time (like 8am), that could impact if A is still correct.
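On the timing question above: sbatch's --begin flag accepts both relative keywords and absolute timestamps, so either reading still points to A. A couple of hedged examples (script name, date, and time are placeholders):

    # defer until tomorrow (start of day)
    sbatch --begin=tomorrow train.sbatch

    # or pin it to a specific clock time
    sbatch --begin=2025-06-01T08:00:00 train.sbatch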
Q: 15
You have noticed that users can access all GPUs on a node even when they request only one GPU in
their job script using --gres=gpu:1. This is causing resource contention and inefficient GPU usage.
What configuration change would you make to restrict users’ access to only their allocated GPUs?
Options
Discussion
Option B. Seen this on other clusters: you have to set ConstrainDevices=yes in cgroup.conf or Slurm won't restrict GPU access. The other options don't deal with device isolation directly.
Totally agree, B. Setting ConstrainDevices=yes is how you stop jobs from hogging all GPUs on the node.
Pretty sure it's B here. Only ConstrainDevices=yes in cgroup.conf will actually limit the GPUs jobs see; D is a common trap since adding CPUs doesn't isolate devices. If someone saw it work differently, let me know.
B, enabling ConstrainDevices in cgroup.conf is the fix for GPU isolation. The other options won’t stop users from grabbing more GPUs than assigned. Pretty sure that’s the config needed if using Slurm.
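For reference, a minimal sketch of the change being described, assuming a standard Slurm + cgroup setup (file paths are the common defaults, adjust for your install):

    # /etc/slurm/cgroup.conf
    ConstrainDevices=yes

    # /etc/slurm/slurm.conf -- cgroup device enforcement needs the cgroup task plugin
    TaskPlugin=task/cgroup

After restarting slurmd, a job that requests --gres=gpu:1 should only see its one allocated GPU in nvidia-smi instead of every GPU on the node.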