Has anyone here actually used the Base View dashboard in BCM for a live cluster? Wondering if it shows real-time GPU stats for all nodes at once or if you still need to check per node sometimes. Practice exams usually point toward dashboards but curious about your real-world experience with this tool.
Option A, draining with scontrol update, is the safe call for routine maintenance. DRAIN lets current jobs finish before you take the node down, so nothing that's running gets lost. B and C jump straight to DOWN and can kill running jobs, which is a common mistake folks make here. Pretty sure it's A unless they want an urgent shutdown?
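For anyone who wants the concrete commands, a typical drain cycle looks like this (the node name is made up, adjust for your cluster):

```shell
# Mark the node DRAIN: running jobs finish, no new jobs land on it
scontrol update NodeName=dgx-node01 State=DRAIN Reason="scheduled maintenance"

# Watch the state: DRAINING while jobs are still running, DRAINED once idle
sinfo -n dgx-node01 -o "%N %T"

# After maintenance, put it back in service
scontrol update NodeName=dgx-node01 State=RESUME
```

Setting State=DOWN instead (the B/C options) kills anything still running on the node, which is exactly why DRAIN is the maintenance default.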
nvsm dump health collects a full health report with detailed diagnostics, not just regular logs. I think this is what support usually asks for when there are DGX performance issues. Let me know if you've seen it used differently.

C. Saw a similar question in other practice sets, and nvsm dump health is the one that bundles health diagnostics plus logs into a package for support, not just raw logs like B. Easy to miss since B looks tempting, but it's not as comprehensive. Open to corrections if anyone's got docs showing otherwise.
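For what it's worth, on the DGX systems I've seen it's run as root and it packages everything into one archive for support (exact output location can vary by NVSM version):

```shell
# Generate a full health report bundle for NVIDIA Enterprise Support
sudo nvsm dump health

# NVSM prints the path of the generated archive when it finishes;
# by default it lands under /tmp as a compressed tarball
ls /tmp/nvsm-health-*
```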
Anybody tried using D here? I swear the official guide listed something like nvsm health -dump-log for health reports. Practice tests might help confirm this.
D for me; srun accepts the --begin flag too, so it should work as well. Might be mixing up interactive and batch job details though. Correct me if sbatch is the only valid pick here.
sbatch is specifically for batch scripts, so A fits best here; srun is more for interactive runs. Pretty sure that's what the question's after, but open to correction if anyone's seen a Slurm update change this!
Why does NVIDIA keep tossing in fake commands like "submit" on these? Anyway, for scheduling a batch script to run tomorrow with Slurm, it's the sbatch --begin=tomorrow syntax. Only A matches the actual workflow you'd use. Correct me if I'm missing something, but that's what I've always seen in exam reports.
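Since a couple of posts asked about the exact syntax, here's a quick sketch (the script name is just a placeholder):

```shell
# Submit a batch script to start no earlier than tomorrow
sbatch --begin=tomorrow train_job.sh

# --begin also takes absolute timestamps and relative offsets
sbatch --begin=2025-06-01T08:00:00 train_job.sh
sbatch --begin=now+2hours train_job.sh

# Confirm the job is pending with the expected start time
squeue -u $USER --start
```

Note it's a double dash in real Slurm (--begin); the single-dash spelling that shows up in some dumps is a typo.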
Not D, pretty sure it has to be B. Only B (ConstrainDevices in cgroup.conf) actually restricts GPU device access at the OS level; changing the job script in D doesn't stop jobs from seeing all GPUs. D's a common distractor here.
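Rough sketch of the relevant config, assuming a standard cgroup setup (file paths vary by distro):

```
# /etc/slurm/cgroup.conf
ConstrainDevices=yes        # jobs only see the GPUs they were allocated

# /etc/slurm/slurm.conf must also load the cgroup plugins, e.g.:
#   ProctrackType=proctrack/cgroup
#   TaskPlugin=task/cgroup
```

With that in place, a job that requests --gres=gpu:1 can't even open the other GPU device files, which is what the question is getting at.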