Has anyone here actually used the Base View dashboard in BCM for a live cluster? Wondering if it shows real-time GPU stats for all nodes at once or if you still need to check per node sometimes. Practice exams usually point toward dashboards but curious about your real-world experience with this tool.
Option A, draining with scontrol update, is the safe call for routine maintenance. DRAIN lets current jobs finish before you take the node down, so nothing that's running gets lost. B and C jump straight to DOWN and can kill running jobs, which is a common mistake folks make here. Pretty sure it's A unless they want an urgent shutdown?
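For anyone who wants the concrete commands, a typical drain cycle looks like this (the node name is made up, adjust for your cluster):

```shell
# Mark the node DRAIN: running jobs finish, no new jobs land on it
scontrol update NodeName=dgx-node01 State=DRAIN Reason="scheduled maintenance"

# Watch the state: DRAINING while jobs are still running, DRAINED once idle
sinfo -n dgx-node01 -o "%N %T"

# After maintenance, put it back in service
scontrol update NodeName=dgx-node01 State=RESUME
```

Setting State=DOWN instead (the B/C options) kills anything still running on the node, which is exactly why DRAIN is the maintenance default.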
nvsm dump health collects a full health report with detailed diagnostics, not just regular logs. I think this is what support usually asks for when there are DGX performance issues. Let me know if you've seen it used differently.

C. Saw a similar question in other practice sets, and nvsm dump health is the one that bundles health diagnostics plus logs into a package for support, not just raw logs like B. Easy to miss since B looks tempting, but it's not as comprehensive. Open to corrections if anyone's got docs showing otherwise.
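For what it's worth, on the DGX systems I've seen it's run as root and it packages everything into one archive for support (exact output location can vary by NVSM version):

```shell
# Generate a full health report bundle for NVIDIA Enterprise Support
sudo nvsm dump health

# NVSM prints the path of the generated archive when it finishes;
# by default it lands under /tmp as a compressed tarball
ls /tmp/nvsm-health-*
```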
Anybody tried using D here? I swear the official guide listed something like nvsm health -dump-log for health reports. Practice tests might help confirm this.
D for me; srun accepts the --begin flag too, so it should work as well. Might be mixing up interactive and batch job details though. Correct me if sbatch is the only valid pick here.
sbatch is specifically for batch scripts, so A fits best here; srun is more for interactive runs. Pretty sure that's what the question's after, but open to correction if anyone's seen a Slurm update change this!
Why does NVIDIA keep tossing in fake commands like "submit" on these? Anyway, for scheduling a batch script to run tomorrow with Slurm, it's the sbatch --begin=tomorrow syntax. Only A matches the actual workflow you'd use. Correct me if I'm missing something, but that's what I've always seen in exam reports.
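Since a couple of posts asked about the exact syntax, here's a quick sketch (the script name is just a placeholder):

```shell
# Submit a batch script to start no earlier than tomorrow
sbatch --begin=tomorrow train_job.sh

# --begin also takes absolute timestamps and relative offsets
sbatch --begin=2025-06-01T08:00:00 train_job.sh
sbatch --begin=now+2hours train_job.sh

# Confirm the job is pending with the expected start time
squeue -u $USER --start
```

Note it's a double dash in real Slurm (--begin); the single-dash spelling that shows up in some dumps is a typo.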
Not D, pretty sure it has to be B. Only B (ConstrainDevices in cgroup.conf) actually restricts GPU device access at the OS level; changing the job script in D doesn't stop jobs from seeing all GPUs. D's a common distractor here.
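Rough sketch of the relevant config, assuming a standard cgroup setup (file paths vary by distro):

```
# /etc/slurm/cgroup.conf
ConstrainDevices=yes        # jobs only see the GPUs they were allocated

# /etc/slurm/slurm.conf must also load the cgroup plugins, e.g.:
#   ProctrackType=proctrack/cgroup
#   TaskPlugin=task/cgroup
```

With that in place, a job that requests --gres=gpu:1 can't even open the other GPU device files, which is what the question is getting at.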