Free NCP-AIO Practice Test Questions and Answers (2026) | Cert Empire

Free preview: 20 questions.

NVIDIA NCP AIO

Q: 1

You are managing a high availability (HA) cluster that hosts mission-critical applications. One of the nodes in the cluster has failed, but the application remains available to users. What mechanism is responsible for ensuring that the workload continues to run without interruption?

Options

Discussion

Aaron U. Jul 6, 2026 5:48 pm

C. Data replication (D) protects your data, but only the failover mechanism actually keeps the app running without service interruption. Some folks get tripped up by D here.

Casey K. Jul 11, 2026 7:44 pm

Going with C. Saw a nearly identical question in a mock: the failover mechanism takes over automatically with no user interruption. Data replication (D) helps with integrity but not the instant switch. Open to counterpoints if I missed something.

PriyaX Jul 25, 2026 9:52 am

Option C here. Failover is the key HA feature, not D, which is just about data integrity. Trap answer for sure.

MasonD Jul 22, 2026 8:41 pm

Option C, data replication (D) sounds tempting but it's really about data not app uptime.

PracticalLead7777 Jul 9, 2026 2:19 pm

C imo, this matches what the official guide says for HA clusters. Practice tests cover similar scenarios.

Robin Q. Jul 15, 2026 1:24 pm

Option D

RowanR Jul 14, 2026 12:54 pm

Honestly, I think D. Data replication feels like the key for making sure everything keeps working, since it protects against node loss. Seems like a trap for C here.

Taylor Jul 8, 2026 11:17 pm

C here. The question points to the automatic failover doing the heavy lifting for zero downtime, not data replication or manual intervention. Pretty sure about this but open if someone else sees it differently.

Zoe Jul 17, 2026 4:35 am

I think C, saw something similar on a practice test. Failover is what keeps things running when a node drops. Agree?

Nina Jul 17, 2026 2:58 am

Makes sense to choose C. Failover is what keeps the app running if a node crashes, not data replication or manual admin.

Correct Answer:

C

Explanation

A failover mechanism is a fundamental component of a high-availability (HA) cluster. Its primary function is to automatically detect the failure of a primary node and redirect its workload to a pre-configured standby or secondary node. This process ensures that the service or application experiences minimal to no downtime, maintaining operational continuity. The scenario described, where an application remains available after a node failure, is the direct result of a successful, automated failover operation.
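The detect-and-redirect behavior described above can be sketched in a few lines. This is a minimal illustration, not the logic of any particular HA product: the `Node` class, the `select_active` function, and the node names are all hypothetical, and a real cluster would drive the `healthy` flag from heartbeats or health checks.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    healthy: bool = True


def select_active(nodes, current):
    """Return the node that should serve the workload.

    Keeps the current node while it is healthy; otherwise promotes the
    first healthy standby. Returns None when no node is healthy.
    """
    if current is not None and current.healthy:
        return current
    for node in nodes:
        if node.healthy:
            return node
    return None


# Hypothetical two-node cluster: node-a is primary, node-b is standby.
primary = Node("node-a")
standby = Node("node-b")
cluster = [primary, standby]

active = select_active(cluster, primary)
print(active.name)  # node-a

primary.healthy = False  # simulate a failed health check on the primary
active = select_active(cluster, active)
print(active.name)  # node-b
```

Note that the switch happens without any administrator action, which is the difference between failover and option B.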

Why Incorrect

A. Load balancing across all nodes in the cluster.

Load balancing distributes incoming requests among healthy nodes but does not manage the stateful transfer of a running workload from a failed node.

B. Manual intervention by the system administrator to restart services.

This contradicts the principle of automated high availability. Manual intervention would result in service interruption until the administrator takes action.

D. Data replication between nodes to ensure data integrity.

Data replication is a critical prerequisite for failover, ensuring the standby node has current data, but it is not the mechanism that executes the switchover.

References

1. Official Vendor Documentation: NVIDIA Cumulus Linux User Guide, Chapter: High Availability and Redundancy. This chapter details mechanisms like Virtual Router Redundancy (VRR) and Multi-Chassis Link Aggregation (MLAG), which are failover technologies designed to provide network service continuity by switching to redundant hardware in the event of a failure. This exemplifies the principle of automated failover in a production environment.

2. University Courseware: Patterson, D. A., & Hennessy, J. L. (2017). Computer Organization and Design RISC-V Edition: The Hardware Software Interface. Morgan Kaufmann. In Chapter 6, "Parallel Processors from Client to Cloud," the text discusses dependability via redundancy, explaining that a key technique for high availability is to use redundant nodes and "when one fails, the other can take over" (Section 6.8, p. 552). This process is defined as failover.

3. Peer-reviewed Academic Publication: A. Avizienis, J.-C. Laprie, B. Randell, and C. Landwehr, "Basic Concepts and Taxonomy of Dependable and Secure Computing," IEEE Transactions on Dependable and Secure Computing, vol. 1, no. 1, pp. 11-33, Jan.-Mar. 2004, doi: 10.1109/TDSC.2004.2. The paper defines fault tolerance as the ability to provide service continuity despite faults, which is achieved through error processing and fault treatment, the automated form of which is a failover mechanism.

Q: 2

You are tasked with deploying a deep learning framework container from NVIDIA NGC on a standalone GPU-enabled server. What must you complete before pulling the container? (Choose two.)

Options

Discussion

Chloe Jul 21, 2026 9:43 pm

Docker setup and NGC login again? This pops up on every NVIDIA exam report. D imo, since you need the API key and registry auth, and A because without proper Docker plus the NVIDIA toolkit GPUs won't pass through. Pretty sure those are it, but chime in if you see a trick here.

JamieN Jul 15, 2026 4:19 pm

Makes sense to me that it's A and D. You need Docker with NVIDIA support to run GPU containers, and logging in with an NGC API key is required before pulling. Pretty sure that's what they're after, but correct me if I'm missing something.

Ravi O. Jul 11, 2026 5:23 pm

A or D. You don’t actually need to install TensorFlow or PyTorch first since the whole point is the container includes them, so C’s a trap. B isn’t needed on a stand-alone server either. Pretty sure it’s A and D but let me know if I missed something.

Kevin R. Jul 25, 2026 10:21 pm

Maybe C and D. I'm thinking you need to have TensorFlow or PyTorch on the server, otherwise the container won't run properly, and logging into NGC with an API key is also mandatory for pulls. Not totally sure about skipping Docker install if it's standalone though. Anyone disagree?

Noah V. Jul 25, 2026 6:36 pm

A/D tbh, but if NGC switched auth to a different method or made public pulls possible for these images, D could flip. Haven't seen that yet but exam wording can get weird on prereqs.

Jordan X. Jul 20, 2026 11:58 pm

Why do some still pick C? Standard NGC containers come preloaded, no manual install needed before pulling, right?

Sara P. Jul 25, 2026 7:49 pm

A/D tbh, B is a distractor and C isn't needed if you're using standard NGC framework containers.

SharpSec3490 Jul 25, 2026 2:15 pm

A and D are the must-haves here. Docker with the NVIDIA toolkit lets you run GPU containers, and NGC API key login is needed to pull images. No need to preinstall frameworks for standard NGC containers. Pretty sure that's right but open to input.

Mason M. Jul 9, 2026 1:22 am

A/D. Pre-reqs are just Docker with NVIDIA toolkit and NGC login, nothing more for standard containers.

Jack W. Jul 7, 2026 7:40 pm

Not C, it's A and D. Installing frameworks manually is a trap here since NGC containers already have them baked in.

Correct Answer:

A, D

Explanation

To deploy a GPU-accelerated container from the NVIDIA NGC registry on a standalone server, two primary prerequisites must be met. First, the server requires a container runtime (Docker) and the NVIDIA Container Toolkit. The toolkit enables the Docker runtime to expose the host's NVIDIA GPUs to the container. Second, the NGC registry (nvcr.io) requires authentication. This is accomplished by generating a personal API key from the NGC website and using it with the docker login command to authenticate the session before attempting to pull a container image.
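As a rough sketch of the sequence above, the snippet below only builds and prints the three Docker command lines; it does not execute them. The helper names are hypothetical, `<tag>` is a placeholder for a real image tag, and `nvidia/pytorch` is used as an example repository. The literal username `$oauthtoken` is what NGC expects when the API key is supplied as the password.

```python
import shlex

NGC_REGISTRY = "nvcr.io"
NGC_USERNAME = "$oauthtoken"  # literal string; the NGC API key is the password


def login_command():
    """docker login that reads the API key from stdin, keeping it out of shell history."""
    return ["docker", "login", NGC_REGISTRY,
            "--username", NGC_USERNAME, "--password-stdin"]


def pull_command(image, tag):
    """docker pull for an image hosted on the NGC registry."""
    return ["docker", "pull", f"{NGC_REGISTRY}/{image}:{tag}"]


def run_command(image, tag, *args):
    """docker run with GPU access; --gpus needs the NVIDIA Container Toolkit on the host."""
    return ["docker", "run", "--rm", "--gpus", "all",
            f"{NGC_REGISTRY}/{image}:{tag}", *args]


# "<tag>" is a placeholder, not a real tag.
commands = [
    login_command(),
    pull_command("nvidia/pytorch", "<tag>"),
    run_command("nvidia/pytorch", "<tag>", "nvidia-smi"),
]
for cmd in commands:
    print(shlex.join(cmd))
```

Running `nvidia-smi` inside the container is a quick way to confirm that the GPUs are actually visible to it.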

Why Incorrect

B. Setting up a Kubernetes cluster is for container orchestration across multiple nodes and is not a requirement for running a container on a single, standalone server.

C. The primary benefit of using NGC containers is that they come pre-packaged with frameworks like TensorFlow or PyTorch, eliminating the need for manual installation on the host system.

References

1. NVIDIA NGC Documentation, "Getting Started with NGC Containers": This guide outlines the prerequisites for running NGC containers. It explicitly lists installing the NVIDIA Driver, Docker, and the NVIDIA Container Toolkit as necessary setup steps. This directly supports option A. The guide also details the procedure for logging into the NGC container registry using an API key, which supports option D.

Reference: See the "Setting Up Your System" and "Log in to the NGC Container Registry" sections in the official NGC Getting Started guide.

2. NVIDIA Cloud-Native Documentation, "NVIDIA Container Toolkit Installation Guide": This document explains the purpose of the toolkit: "The NVIDIA Container Toolkit allows users to build and run GPU accelerated containers." It is a mandatory component for enabling GPU access from within a Docker container.

Reference: See the "Introduction" section of the NVIDIA Container Toolkit documentation.

3. NVIDIA NGC Documentation, "NGC Registry User Guide": This document details the authentication process. It states, "To download containers from the NGC registry, you must have Docker installed and you must log in to the NGC registry." It then provides the exact docker login nvcr.io command and explains the use of the NGC API Key for authentication.

Reference: See the "Logging In to the NGC Registry" section.

Q: 3

A data scientist is training a deep learning model and notices slower than expected training times. The data scientist alerts a system administrator to inspect the issue. The system administrator suspects the disk IO is the issue. What command should be used?

Options

Discussion

Liam Jul 12, 2026 8:03 pm

B. iostat is literally made for monitoring disk IO, which is exactly what the sysadmin suspects here. tcpdump would be for network, nvidia-smi for GPU, htop for processes and CPU. Pretty straightforward unless there's a trick I'm missing.

Chloe K. Jul 18, 2026 11:31 am

B. iostat is specific for checking disk IO bottlenecks. D (htop) might look helpful but it's more CPU/mem focused, not IO stats. Saw similar question in a practice test and the trap is picking D instead of B.

Jordan L. Jul 12, 2026 3:47 am

Nah, it's not D here; B is the tool for disk IO issues. htop tricks a lot of folks since it looks flashy, but it won't give detailed block device stats like iostat does. Pretty sure B is correct, anyone disagree?

Owen Jul 19, 2026 4:11 pm

B is the way to go here, since iostat directly shows disk IO bottlenecks. I remember seeing a similar scenario pop up in an exam simulation and it was always about matching the tool to the suspected hardware issue. The other options don't really give you block device stats. Pretty sure B is right, agree?

Ethan A. Jul 26, 2026 10:06 pm

I actually thought D (htop) because it shows system resource usage in real-time, including IO wait. But now realizing it doesn’t break down disk IO specifics like iostat does. Maybe I’m missing something, but htop was my first guess. Disagree?

CuriousAnalyst2045 Jul 22, 2026 3:19 am

It's B for this, but if the model data was being loaded over NFS or some network mount, you'd probably need something like tcpdump or nload to really see network IO impact, not just disk. With normal local disks though, iostat is what you want. Anyone see an edge case where htop would matter more?

Noah Jul 23, 2026 4:52 pm

Call it B for this one. iostat gives disk IO stats directly, which is what you'd want since the admin suspects IO bottlenecks. tcpdump and nvidia-smi don't fit here. Pretty confident but open if anyone disagrees.

Quinn X. Jul 11, 2026 10:15 pm

I remember a similar scenario from labs and practice exams; pretty sure it's B here.

Skyler Q. Jul 12, 2026 5:02 pm

Option B since iostat directly shows disk IO performance, but if the workload is network-based or on a cloud volume, this might not catch everything. In some edge setups htop might hint at IO-wait spikes though.

Nora Jul 15, 2026 7:16 pm

B imo, iostat is made for monitoring disk IO directly. The others don't give block device stats so you'd miss the actual bottleneck. If it was a GPU thing then C might fit, but for disk issues it's gotta be B. Anyone see a reason to pick something else?

Correct Answer:

B

Explanation

The system administrator's hypothesis is that a disk I/O bottleneck is causing the performance degradation. The iostat (input/output statistics) command is the standard Linux utility designed specifically to monitor system I/O device loading. It reports on CPU statistics as well as I/O statistics for block devices (i.e., disks). Key metrics provided by iostat, such as %util (device utilization), await (average time for I/O requests), and r/s / w/s (reads/writes per second), allow an administrator to directly diagnose whether the storage subsystem is saturated and causing a bottleneck for the data loading pipeline of the deep learning model.
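To make the metrics above less abstract, here is a small sketch of how r/s, w/s, and %util can be derived from two snapshots of the kernel's per-device counters, which Linux exposes in /proc/diskstats. It is an illustration of the arithmetic, not a replacement for iostat; the function names are hypothetical and the two sample lines are synthetic values for an imagined device `sda`.

```python
def parse_diskstats_line(line):
    """Extract the counters used below from one /proc/diskstats line."""
    fields = line.split()
    return {
        "name": fields[2],
        "reads": int(fields[3]),    # reads completed
        "writes": int(fields[7]),   # writes completed
        "io_ms": int(fields[12]),   # milliseconds the device spent doing I/O
    }


def io_rates(before, after, interval_s):
    """Derive r/s, w/s and %util from two samples taken interval_s apart."""
    return {
        "r/s": (after["reads"] - before["reads"]) / interval_s,
        "w/s": (after["writes"] - before["writes"]) / interval_s,
        "%util": 100.0 * (after["io_ms"] - before["io_ms"]) / (interval_s * 1000.0),
    }


# Synthetic samples for a hypothetical device, taken one second apart.
sample_1 = "8 0 sda 1000 0 80000 500 2000 0 160000 900 0 4000 1400"
sample_2 = "8 0 sda 1400 0 112000 700 2100 0 168000 950 0 4950 1650"

rates = io_rates(parse_diskstats_line(sample_1), parse_diskstats_line(sample_2), 1.0)
print(rates)
```

With these samples the device was busy for 950 ms of a 1000 ms window, so %util comes out at 95, the kind of reading that points to a saturated disk.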

Why Incorrect

A. tcpdump: This is a network packet analyzer used for monitoring and troubleshooting network traffic, not disk I/O performance.

C. nvidia-smi: This utility monitors NVIDIA GPU status, including utilization, memory usage, and temperature. It cannot diagnose disk-related bottlenecks.

D. htop: This is an interactive process viewer that primarily shows CPU and memory usage. It offers limited, high-level I/O information but lacks the detailed device-specific metrics iostat provides.

References

1. NVIDIA Documentation: The NVIDIA Deep Learning Performance Guide discusses identifying system-level bottlenecks, including the input pipeline. It states, "The first step in optimizing deep learning training performance is to ensure the GPUs are fully utilized... If they are not, the bottleneck is likely in the input data pipeline or the CPU." Diagnosing an input pipeline bottleneck on the system side requires tools like iostat.

Source: NVIDIA Deep Learning Performance Guide, "Is the Bottleneck the Input Pipeline?" section.

2. University Courseware: University courses on system administration and performance tuning identify iostat as the primary tool for this task. For example, course materials for high-performance computing often cover system-level profiling.

Source: University of Illinois at Urbana-Champaign, CS 433: Parallel Computer Architecture, Lecture 25 - "Measuring Performance," Slide 11. This lecture slide lists iostat as a key tool for measuring I/O performance.

3. Official Vendor Documentation (Linux): The official manual page for iostat defines its purpose clearly.

Source: iostat(1) Linux manual page. "The iostat command is used for monitoring system input/output device loading by observing the time the devices are active in relation to their average transfer rates." This is available on any standard Linux distribution.

Q: 4

A system administrator wants to run these two commands in Base Command Manager:

main showprofile
device status apc01

What command should the system administrator use from the management node system shell?

Options

Discussion

Chloe Q. Jul 17, 2026 8:08 am

B looks close since -p is there, but I thought -p was mostly for specifying a profile or path, not chaining commands together. Shouldn't it be used only when you need to select something specific to run inside cmsh? Correct me if that's off.

Sam M. Jul 19, 2026 10:28 am

A is right here, since cmsh -c will run both commands as a single string and then exit, which matches what the system admin wants. Haven't seen cmsh-system used like in D. If anyone's seen something different, let me know.

Sanjay O. Jul 22, 2026 5:28 pm

My pick: it's A here since cmsh -c lets you run both commands in sequence straight from the shell, non-interactively. The -p flag in B isn't right for passing multiple commands, and the other options don't really line up with Base Command Manager syntax. If anyone's seen cmsh accept -p for something like this, let me know, but I doubt it.

Reese I. Jul 20, 2026 1:04 pm

C/D? Had something like this in a mock, pretty sure A is the one that works from shell without interactive mode.

Amelia Jul 17, 2026 6:18 pm

HelpfulLead6673 Jul 20, 2026 4:30 am

Seen similar on practice tests and official docs, always A. Check the admin guide for the right cmsh flags if unsure.

Robin Jul 10, 2026 1:11 am

Not B, A does it. The -c switch lets you chain both commands in a single shell call, just like the question wants. -p is for path only, so can't be used here. Pretty sure that's the right syntax but happy to hear if I'm missing something.

Nina Q. Jul 25, 2026 9:56 am

I don’t think it’s A. B looks right since -p could be for path or profile, so it might let you run sequences like that too.

Jason Q. Jul 20, 2026 7:53 pm

A is wrong, use official admin guide or lab it in Base Command Manager for this syntax.

Sofia L. Jul 21, 2026 12:06 pm

B, since -p could let you set the path and then run the commands in sequence. Maybe I'm missing a syntax thing but looks close to what you'd need. Happy to be corrected if I'm off here.

Q: 5

You are managing a Kubernetes cluster running AI training jobs using TensorFlow. The jobs require access to multiple GPUs across different nodes, but inter-node communication seems slow, impacting performance. What is a potential networking configuration you would implement to optimize inter-node communication for distributed training?

Options

Discussion

Amelia Q. Jul 8, 2026 9:15 am

Anyone used official NVIDIA docs or hands-on labs for cluster networking configs on this? Practice exams seem to push D but I've seen B suggested too.

Kevin S. Jul 19, 2026 3:30 pm

D, but if the question said you couldn't upgrade hardware, B might've been right. All about whether new gear's allowed or not.

Piya B. Jul 22, 2026 10:21 pm

D here. InfiniBand cuts latency and boosts bandwidth, which is key for distributed training jobs across nodes. B is tempting, but jumbo frames just tweak Ethernet, not the same performance jump. Pretty sure D's what they're looking for.

Jack Q. Jul 12, 2026 4:31 pm

B, saw something like this on a practice set. Jumbo frames can help with large data transfers if you're on Ethernet.

Neha Jul 8, 2026 12:55 am

Maybe B. Jumbo frames get mentioned a lot for high data throughput, especially if you're stuck on Ethernet instead of InfiniBand.

Casey Jul 6, 2026 9:07 pm

Why is nobody talking about C? Does a dedicated storage network really help distributed training comms vs InfiniBand?

Amelia J. Jul 7, 2026 6:26 am

Casey W. Jul 16, 2026 12:58 am

B or D? I was thinking B since enabling jumbo frames helps with larger AI job traffic, and not every cluster has InfiniBand installed out of the box. Pretty sure D's the high-perf answer if hardware can be upgraded, but for existing Ethernet setups, B is a common tweak. Anyone prefer B in practice?

Ben A. Jul 23, 2026 1:20 pm

InfiniBand is the key upgrade here, so D. It directly targets the latency and bandwidth issues common in distributed training jobs, whereas B (jumbo frames) only tweaks Ethernet but can't match InfiniBand performance. Pretty sure D is right unless there's a restriction on hardware changes.

Ethan Jul 19, 2026 4:50 am

D is right here since InfiniBand is built for this kind of low-latency, high-throughput traffic between nodes, perfect for distributed AI training. B's a common trap if you're thinking Ethernet only, but nothing in the question says you can't use better hardware. Pretty sure about D but open to pushback.

Correct Answer:

D

Explanation

InfiniBand is a high-performance computing (HPC) interconnect that provides significantly higher throughput and lower latency than standard Ethernet. For distributed AI training, where frequent and large-volume gradient exchanges occur between nodes, network performance is critical. InfiniBand utilizes Remote Direct Memory Access (RDMA), which allows GPUs on different nodes to communicate directly, bypassing the CPU and kernel network stack. This minimizes communication overhead and latency, preventing GPUs from idling while waiting for data, thereby directly addressing the performance bottleneck in multi-node training scenarios.
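A back-of-the-envelope calculation shows why link bandwidth matters for gradient exchange. The sketch below uses the common bandwidth-only cost model for a ring all-reduce, in which each node transfers 2(N-1)/N of the payload; it ignores latency, protocol overhead, and overlap with compute. The 1 GB gradient size, the 4-node count, and the 10 Gb/s Ethernet figure are assumptions chosen for illustration.

```python
def ring_allreduce_seconds(payload_bytes, nodes, link_gbps):
    """Bandwidth-only estimate for one ring all-reduce.

    Each node sends and receives 2 * (nodes - 1) / nodes of the payload.
    Latency, protocol overhead and compute overlap are ignored.
    """
    bytes_per_second = link_gbps * 1e9 / 8
    return 2 * (nodes - 1) / nodes * payload_bytes / bytes_per_second


payload = 1e9  # assumed: 1 GB of gradients exchanged per training step
links = (("10 Gb/s Ethernet", 10), ("200 Gb/s InfiniBand", 200))
for label, gbps in links:
    seconds = ring_allreduce_seconds(payload, 4, gbps)
    print(f"{label}: {seconds:.2f} s per step")
```

Under these assumptions the exchange takes about 1.2 s per step on the slower link and about 0.06 s on the faster one, before RDMA's latency and CPU-offload benefits are even counted.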

Why Incorrect

A. Increasing job replicas scales out the number of independent training instances but does not improve the communication performance within a single, distributed training job.

B. While jumbo frames can slightly reduce packet overhead on Ethernet, this interconnect still has fundamentally higher latency and lower bandwidth compared to InfiniBand for HPC workloads.

C. A dedicated storage network optimizes the I/O for loading datasets from storage, not the inter-node, inter-GPU communication required for the training algorithm's synchronization steps.

References

1. NVIDIA Corporation. (2020). NVIDIA DGX A100 System Architecture Whitepaper.

Reference: Page 13, Section "Networking for Scale-Out".

Quote: "For scaling out, the DGX A100 system has eight single-port Mellanox ConnectX-6 VPI HDR InfiniBand adapters... These provide 200 Gb/s of bandwidth per adapter for clustering. This massive bandwidth is required to feed the multi-GPU training jobs that run on DGX A100." This document establishes InfiniBand as the standard, high-performance interconnect for scaling out NVIDIA's flagship AI systems.

2. NVIDIA Developer Documentation. (2023). NVIDIA Collective Communication Library (NCCL) Documentation.

Reference: Introduction / Overview Section.

Content: The documentation states that NCCL is "optimized to achieve high bandwidth and low latency... over NVIDIA Mellanox InfiniBand and Ethernet networking for multi-node." This confirms that the core library for distributed training is specifically optimized for high-speed interconnects like InfiniBand.

3. Sato, K., et al. (2021). Demystifying the Performance of HPC Cloud for Deep Learning. 2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW).

Reference: Section IV-B, "Inter-node Communication".

DOI: https://doi.org/10.1109/IPDPSW52791.2021.00090

Content: The study analyzes communication performance for deep learning and notes that interconnects supporting RDMA, such as InfiniBand, provide significantly higher performance for distributed training by reducing communication overhead compared to traditional TCP/IP over Ethernet.

Q: 6

You are managing an on-premises cluster using NVIDIA Base Command Manager (BCM) and need to extend your computational resources into AWS when your local infrastructure reaches peak capacity. What is the most effective way to configure cloudbursting in this scenario?

Options

Discussion

Meera Jul 22, 2026 10:06 pm

D is the right call here. Cluster Extension in BCM automates the whole cloudbursting process, so you don't have to manually spin up AWS nodes like in B. That's what most of the official guides point toward for efficiency.

FriendlyNeteng729 Jul 20, 2026 12:09 pm

This comes down to what "most effective" really means here. D is right since BCM's Cluster Extension actually automates spinning up AWS nodes only when you hit the local limit, so it saves time and manual effort compared to option B. But if you had some advanced compliance or network config that BCM automation can't handle, B could technically be safer in rare cases. For most setups though, I'm pretty sure D is what they want.

Riley M. Jul 11, 2026 4:35 pm

I'd call it D, since BCM's Cluster Extension actually automates the provisioning of AWS nodes when local capacity maxes out. Manual options like B work but aren't the most effective for seamless cloudbursting. Open to other views if anyone’s seen different on practice exams.

Rowan X. Jul 16, 2026 11:10 pm

D imo, Cluster Extension is built for this. Options B and C both want you to do manual work which defeats the point of cloudbursting. A is a distractor, since BCM's load balancer alone can't handle auto-provisioning into AWS. Seen similar in practice questions, so pretty sure D is right.

Adam W. Jul 23, 2026 10:00 am

B. Manual provisioning in AWS gives you full control over the new cloud nodes, which could help prevent unexpected behavior. Not totally sure though, since automation is nice for speed.

SteadyCandidate3000 Jul 23, 2026 10:58 pm

D imo. Cluster Extension handles auto cloud resource provisioning with BCM, so it's less work than manual options like B. Pretty sure that's what "most effective" means for these hybrid setups. Let me know if you disagree.

PreciseConsultant4394 Jul 14, 2026 12:42 am

I saw a similar question on a practice test, and I was debating between B and D. Isn’t manually provisioning in AWS (B) more reliable if you want tighter control? BCM’s automation sounds good but can be unpredictable sometimes, right?

CalmLearner6009 Jul 11, 2026 3:08 am

Why not B? Manually spinning up AWS nodes lets you control timing and which resources are used, so I’d think it could be the most effective if you want to avoid surprises.

Jordan X. Jul 16, 2026 7:37 am

Nah, it's not B here. D is right since Cluster Extension automates cloudbursting, while B is way too manual for "most effective".

Adam L. Jul 8, 2026 9:23 am

Had something similar in a mock recently and the pick was D. Cluster Extension is built for automatic cloudbursting so you don't have to manage node spin-up manually. Pretty sure that's what they mean by most effective here, but let me know if you see it differently.

Correct Answer:

D

Explanation

NVIDIA Base Command Manager (BCM) is designed to simplify the management of hybrid HPC and AI clusters. Its "Cluster Extension" feature provides an automated, policy-driven mechanism for cloudbursting. When the on-premises Slurm queue has pending jobs and local compute resources are fully utilized, BCM automatically provisions pre-configured nodes in a public cloud like AWS. These cloud nodes are seamlessly joined to the on-premises cluster to run the workloads. Once the jobs are complete, BCM automatically de-provisions the cloud resources, optimizing for cost and efficiency. This automation makes it the most effective method.
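The policy described above boils down to a simple scale-up and scale-down rule. The sketch below is a generic illustration of that rule only; it is not BCM's implementation or API, and the function name, its parameters, and the "one cloud node per pending job" simplification are all assumptions.

```python
def burst_decision(pending_jobs, idle_local_nodes, cloud_nodes, max_cloud_nodes):
    """Return the change in cloud node count.

    Positive means provision that many nodes, negative means release
    that many, zero means leave the cloud footprint unchanged.
    """
    if pending_jobs > 0 and idle_local_nodes == 0:
        # Local capacity is exhausted: add nodes for waiting jobs, up to the cap.
        headroom = max(0, max_cloud_nodes - cloud_nodes)
        return min(pending_jobs, headroom)
    if pending_jobs == 0 and cloud_nodes > 0:
        # Queue has drained: release all cloud nodes to stop incurring cost.
        return -cloud_nodes
    return 0


print(burst_decision(pending_jobs=5, idle_local_nodes=0, cloud_nodes=0, max_cloud_nodes=3))
print(burst_decision(pending_jobs=0, idle_local_nodes=2, cloud_nodes=3, max_cloud_nodes=3))
```

The first call asks for three nodes (capped by the limit) and the second releases all three, mirroring the provision-then-de-provision cycle in the explanation.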

Why Incorrect

A. BCM requires significant pre-configuration to integrate with a cloud provider; it is not a zero-configuration process. Also, its purpose is bursting, not continuous even distribution.

B. Manual provisioning is slow, prone to human error, and defeats the purpose of an automated management platform like BCM, making it highly ineffective for dynamic scaling.

C. A manual switchover to a standby deployment is a disaster recovery or migration strategy, not an efficient, on-demand cloudbursting method for handling peak loads.

References

1. NVIDIA Base Command Platform User Guide (Version 23.10), Chapter 10: Using a Hybrid On-Premises and Cloud Cluster. The guide states, "When there are jobs in the queue and no on-premises nodes are available, BCM will automatically provision nodes in the cloud to run the jobs. When the jobs are complete, the nodes in the cloud are terminated." This directly supports the automated process described in option D.

2. NVIDIA Base Command Manager Deployment Guide (Version 23.10), Chapter 11: Configuring a Hybrid On-Premises and Cloud Cluster. This chapter details the extensive configuration required for cloud integration (e.g., cloud credentials, network setup, node images), which directly refutes the "without any pre-configuration" claim in option A.

Q: 7

You are managing a Slurm cluster with multiple GPU nodes, each equipped with different types of GPUs. Some jobs are being allocated GPUs that should be reserved for other purposes, such as display rendering. How would you ensure that only the intended GPUs are allocated to jobs?

Options

Discussion

Hannah S. Jul 14, 2026 12:28 am

Makes sense to pick A. Direct config in gres.conf and slurm.conf is how GPU allocation is actually controlled with Slurm.

Emma Jul 22, 2026 1:24 pm

A, seeing this in recent exam reports too. Configs through gres.conf are what actually restricts which GPUs Slurm uses.

Zoe T. Jul 26, 2026 7:41 pm

Same, I'd pick A here. Only listing the right GPUs in gres.conf actually controls which ones Slurm will allocate. The other choices don’t prevent jobs from landing on display GPUs. Pretty sure that's what the exam wants.

Maya L. Jul 12, 2026 7:15 pm

Option A. Official Slurm admin guide and most practice exams highlight gres.conf config for this. Labs emphasize it too.

SeasonedDev5611 Jul 17, 2026 5:24 pm

A, official Slurm docs and admin labs push configuring gres.conf and slurm.conf for this exact scenario.

Hannah Jul 11, 2026 3:26 pm

A tbh, but watch out: if gres.conf lists a GPU ID that’s swapped by the OS (like after reboot), Slurm might still allocate the wrong one. Seen that tripping people up in similar exam questions. Anyone disagree?

Sean Jul 14, 2026 4:23 pm

A imo, saw this on a similar practice exam. Official guide covers gres.conf usage.

Leo Jul 21, 2026 2:11 am

Pretty sure it's A since Slurm relies on gres.conf to control which GPUs are visible for scheduling. Manual steps like nvidia-smi (B) or reinstalling drivers won’t restrict allocation, you have to exclude the display GPUs via config. Agree?

SaraU Jul 12, 2026 9:54 pm

Is there any scenario where B would even work reliably? I always thought nvidia-smi assignments can't enforce job-level GPU isolation the way Slurm's gres.conf does. Seems like manual GPU allocation just isn't scalable for cluster setups.

Luna F. Jul 10, 2026 12:06 pm

A tbh, D feels like a trap since just increasing GPU requests doesn’t solve the config issue long-term. A is the standard way via gres.conf and slurm.conf control, but if anyone’s had success with another method let me know.

Correct Answer:

A

Explanation

The standard and correct method for managing specific GPU device allocation in Slurm is through its Generic Resource (GRES) configuration. The slurm.conf file is used to declare that a node has GPUs available as a resource. The gres.conf file is then used to specify precisely which GPU devices on that node should be managed by Slurm. To exclude a GPU (e.g., one used for display), the administrator must explicitly list only the desired compute GPUs in gres.conf using their device files (e.g., /dev/nvidia0), indices, or UUIDs. GPUs not listed in this file will be ignored by the Slurm scheduler and will not be allocated to jobs.
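The pairing of the two files can be illustrated with a small generator that prints the relevant lines. This is a sketch under stated assumptions: the node name `node01`, the four-GPU layout, and the helper names are hypothetical, GPUs are listed manually rather than autodetected, and a real slurm.conf node entry would carry further parameters (CPUs, memory) plus a `GresTypes=gpu` line elsewhere in the file.

```python
def gres_conf_lines(node, compute_gpu_indices):
    """One gres.conf line per GPU device that Slurm is allowed to schedule."""
    return [
        f"NodeName={node} Name=gpu File=/dev/nvidia{index}"
        for index in sorted(compute_gpu_indices)
    ]


def slurm_conf_node_line(node, compute_gpu_indices):
    """Matching slurm.conf node entry; the count must equal the devices in gres.conf."""
    return f"NodeName={node} Gres=gpu:{len(compute_gpu_indices)}"


# Hypothetical node with four GPUs where /dev/nvidia3 drives a display,
# so only indices 0-2 are handed to Slurm.
compute = [0, 1, 2]
print("\n".join(gres_conf_lines("node01", compute)))
print(slurm_conf_node_line("node01", compute))
```

Because `/dev/nvidia3` never appears in the generated gres.conf lines, the scheduler has no record of it and cannot hand it to a job.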

Why Incorrect

B. nvidia-smi is a monitoring and management utility. It does not integrate with the Slurm scheduler to control job resource allocation, which is determined by Slurm's own configuration.

C. The problem described is a resource allocation policy issue, not a hardware detection failure. Reinstalling drivers will not change which GPUs Slurm is configured to manage.

D. Requesting more GPUs in a job script does not influence the scheduler's choice of which specific devices to allocate from the pool of GPUs it manages.

References

1. Official Slurm Documentation (SchedMD): The gres.conf manual page explicitly details how to configure specific devices. The configuration Name=gpu File=/dev/nvidia[0-6] on an 8-GPU node would make /dev/nvidia7 unavailable to Slurm.

Source: SchedMD LLC, gres.conf - Slurm configuration file for generic resource management., Section: "EXAMPLES". (URL: https://slurm.schedmd.com/gres.conf.html)

2. Official Slurm Documentation (SchedMD): The slurm.conf manual page describes the Gres parameter within a Node definition, which is required to associate the node with the resources defined in gres.conf.

Source: SchedMD LLC, slurm.conf - Slurm configuration file., Section: "NodeName", Parameter: "Gres". (URL: https://slurm.schedmd.com/slurm.conf.html)

3. NVIDIA Official Documentation: The NVIDIA Base Command Platform Deployment Guide provides a concrete example of configuring slurm.conf and gres.conf for GPU nodes, demonstrating the linkage between the two files for proper resource management.

Source: NVIDIA Base Command Platform On-Premises Deployment Guide, Version 23.08, Chapter 4: "Slurm Configuration", Section: "Configure Slurm". This section shows example configurations for both files.

Q: 8

An organization has multiple containers and wants to view STDIN, STDOUT, and STDERR I/O streams of a specific container. What command should be used?

Options

Discussion

Casey Jul 21, 2026 2:20 pm

C. Only docker logs gives you STDOUT and STDERR from the container; the others are for stats, process info or config details. Not 100 percent on the STDIN part but C matches most exam reports.

Anita Jul 11, 2026 7:17 am

C. Only docker logs CONTAINER-NAME will show you STDOUT and STDERR for a specific container, that's what they're after here.

Aaron B. Jul 13, 2026 9:52 pm

C tbh, "inspect" (D) looks good at first but that's more config info. Only C actually displays the output streams requested. They probably put STDIN in as a trick.

JackD Jul 16, 2026 8:45 pm

Yeah C, docker logs CONTAINER-NAME is how you view STDOUT/STDERR for a container. Pretty sure that's what they're after here.

Jack B. Jul 7, 2026 5:06 pm

Option D. I was thinking docker inspect shows everything about the container, so you might find the logs there. Not 100% sure, maybe missing something obvious with inspect output. If anyone has tried this recently let me know.

Liam Jul 14, 2026 10:21 pm

C all day, docker logs is how you pull STDOUT/STDERR from a running container. The others don't actually show the output, just stats or info. Not 100% on STDIN but I think this is what they're after. Agree?

Karan R. Jul 16, 2026 2:11 pm

Not sure I agree with everyone jumping straight to C. D is tempting since inspect gives a ton of details, but only C (docker logs) shows STDOUT and STDERR by default. Maybe "STDIN" in the question is a trap. Going with C.

Morgan J. Jul 23, 2026 10:23 pm

It's C, docker logs lets you check STDOUT/STDERR for the container. Other commands won't show those output streams directly.

Cameron J. Jul 17, 2026 5:35 pm

Not B, C. Only docker logs can show STDOUT and STDERR output for a specific container.

Amelia Jul 17, 2026 5:56 pm

C. Not totally sure, because if they wanted to see all streams live including STDIN, maybe something else is needed, but for STDOUT/STDERR logs docker logs does the job. Anyone disagree?

Correct Answer:

Explanation

The docker logs command is the standard utility for fetching the logs of a container. It displays the standard output (STDOUT) and standard error (STDERR) streams generated by the main process running inside the specified container; STDIN is not captured in the logs. This lets an administrator or developer view historical output, or follow live output with the --follow option, for debugging and monitoring, which makes it the best match among the listed commands for viewing the container's I/O streams.
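As an illustration, the following Python sketch assembles a docker logs invocation as an argument list. The helper name docker_logs_cmd and the container name web are made up for this example; --follow, --timestamps and --tail are options of the docker logs command.

```python
from typing import List, Optional


def docker_logs_cmd(
    container: str,
    follow: bool = False,
    tail: Optional[int] = None,
    timestamps: bool = False,
) -> List[str]:
    """Build the argument list for `docker logs` on one container."""
    cmd = ["docker", "logs"]
    if follow:
        cmd.append("--follow")        # keep streaming new output
    if timestamps:
        cmd.append("--timestamps")    # prefix each line with a timestamp
    if tail is not None:
        cmd += ["--tail", str(tail)]  # only the last N lines
    cmd.append(container)
    return cmd


# "web" is a hypothetical container name.
print(" ".join(docker_logs_cmd("web", tail=100)))
```

The resulting list can be passed to subprocess.run on a host where the Docker CLI is installed.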

Why Incorrect

docker top CONTAINER-NAME: This command lists the running processes inside a container, similar to the ps command in Linux, but does not show the I/O streams.

docker stats CONTAINER-NAME: This command provides a live stream of a container's resource usage statistics, such as CPU and memory, not its application logs.

docker inspect CONTAINER-NAME: This command returns detailed, low-level metadata about a container's configuration in JSON format, not the application's output.

References

1. Official Docker Documentation, docker logs command reference: "The docker logs command fetches the logs of a container. The command shows the STDOUT and STDERR of the main process of the container."

Source: Docker Docs, engine/reference/commandline/logs/

2. Official Docker Documentation, docker top command reference: "Display the running processes of a container."

Source: Docker Docs, engine/reference/commandline/top/

3. MIT Missing Semester of Your CS Education, Courseware: "To see the output of the command that was run in the container, you can use docker logs . Each time you run the command, it will print the full output from the command."

Source: MIT CSAIL, Missing Semester, 2020, "Containers", Section: "Docker".

4. Official Docker Documentation, docker stats command reference: "Display a live stream of container(s) resource usage statistics."

Source: Docker Docs, engine/reference/commandline/stats/

Q: 9

You are an administrator managing a large-scale Kubernetes-based GPU cluster using Run:AI. To automate repetitive administrative tasks and efficiently manage resources across multiple nodes, which of the following is essential when using the Run:AI Administrator CLI for environments where automation or scripting is required?

Options

Discussion

Drew C. Jul 12, 2026 4:08 pm

Option C flips things here. If you don't have cluster-admin rights in your kubeconfig, the Run:AI CLI basically can't automate anything meaningful across nodes. Saw a similar catch on a practice quiz, so pretty sure that's what they're looking for. Happy to be challenged if someone has made scripting work without it.

Ajay W. Jul 9, 2026 3:01 pm

Makes sense to pick C. Without a kubeconfig set with admin rights, the CLI can't automate tasks cluster-wide. Pretty sure that's essential for scripting in these setups, unless I'm missing something.

HelpfulAnalyst7315 Jul 17, 2026 7:13 am

C. Saw this in similar practice sets. Official docs and admin guides stress kubeconfig with cluster-admin for automation.

CalmNeteng5406 Jul 12, 2026 9:20 am

C. Nothing works without cluster-admin kubeconfig if you're automating Run:AI at scale.

Nora B. Jul 11, 2026 1:05 pm

Definitely C here. You can't automate Run:AI admin tasks across the cluster unless your kubeconfig is set up with cluster-admin rights.

Piya D. Jul 23, 2026 2:11 am

I'd actually go with B. Allocating specific GPUs per job just feels more hands-on for managing resources, especially in scripts.

Sofia L. Jul 18, 2026 3:53 pm

C makes sense here, you need a kubeconfig with admin rights or the CLI just can't talk to all the nodes for automation. Pretty standard setup for K8s admin tools. If someone disagrees let me know, but I think that's spot on.

PracticalMentor1884 Jul 10, 2026 9:15 am

C. Saw this or something close on an exam report; an admin kubeconfig is required for scripting with runai-adm.

Grace W. Jul 20, 2026 2:39 pm

It's C, you definitely need the kubeconfig with admin rights before automating anything with the CLI.

Grace T. Jul 20, 2026 7:47 am

It's C for sure. For scripting or automation you need kubeconfig with cluster-admin rights or the CLI just won't function.

Correct Answer:

Explanation

The Run:AI Administrator Command-Line Interface (runai-adm) interacts directly with the Kubernetes API to manage the Run:AI control plane components and resources. To perform cluster-wide administrative tasks such as creating Projects, managing users, or configuring system-wide settings, the CLI requires authenticated and authorized access. This is achieved through a Kubernetes configuration file (kubeconfig) that contains credentials with cluster-admin privileges. Without these administrative rights, the API server will reject the requests, rendering the CLI non-functional for its intended purpose and making automation impossible.
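Before scripting against the cluster, it can help to confirm that the kubeconfig actually carries broad rights. The sketch below is one possible pre-flight check and is not part of the Run:AI tooling: it builds a kubectl auth can-i call asking whether the current identity may perform any verb on any resource in all namespaces, which only approximates cluster-admin. The helper names and the kubeconfig path are hypothetical.

```python
from typing import List


def can_i_everything_cmd(kubeconfig: str) -> List[str]:
    """Build a kubectl call that asks: may I do anything, anywhere?"""
    return [
        "kubectl", "--kubeconfig", kubeconfig,
        "auth", "can-i", "*", "*", "--all-namespaces",
    ]


def is_allowed(kubectl_stdout: str) -> bool:
    """kubectl auth can-i prints 'yes' or 'no'; treat anything else as no."""
    return kubectl_stdout.strip().lower() == "yes"


# Hypothetical path; a script would run the command and feed stdout to is_allowed().
print(" ".join(can_i_everything_cmd("/root/.kube/config")))
```

An automation script could refuse to continue when is_allowed returns False, rather than failing partway through with rejected API requests.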

Why Incorrect

A. The runai-adm CLI is a high-level abstraction for Run:AI objects but still relies on authenticated communication with the Kubernetes API server; it does not bypass it or replace kubectl for all node-level tasks.

B. Manually allocating specific GPUs is a granular control feature, not an essential prerequisite for automation. A key benefit of Run:AI is automating the scheduling and allocation of GPU resources.

D. The Run:AI CLI is a cross-platform tool available for Windows, Linux, and macOS. Scripting and automation can be performed on any supported operating system, not just Windows.

References

1. NVIDIA Run:AI Documentation, "Install the Run:AI Command-Line Interface": This official guide specifies the prerequisites for installing and using the administrator CLI. It states, "To install and use the Run:AI administrator CLI you must have a Kubernetes configuration file with cluster administrative rights." This directly supports the necessity of a properly configured kubeconfig file with administrative privileges. (Found in the Administrator Setup section of the official Run:AI product documentation).

2. Kubernetes Documentation, "Organizing Cluster Access Using kubeconfig Files": This document explains that tools interacting with a Kubernetes cluster, like the Run:AI CLI, use kubeconfig files to find cluster information and credentials. For administrative tasks, the user or service account specified in the config must be bound to a role with sufficient permissions, such as cluster-admin. (See sections on Contexts and Users).

Q: 10

A system administrator of a high-performance computing (HPC) cluster that uses an InfiniBand fabric for high-speed interconnects between nodes received reports from researchers that they are experiencing unusually slow data transfer rates between two specific compute nodes. The system administrator needs to ensure the path between these two nodes is optimal. What command should be used?

Options

Discussion

Luke I. Jul 14, 2026 1:33 am

Option A

Ishaan I. Jul 19, 2026 4:39 am

A. The question wants to find the actual path packets take between two nodes, which is exactly what ibtracert does, like traceroute for InfiniBand. D (ibnetdiscover) just gives you the general topology, not the node-to-node path. Easy trap there if you're not careful! Open to any counterpoints but pretty sure it's A.

Owen B. Jul 19, 2026 5:48 pm

Option A

ArjunO Jul 23, 2026 8:40 am

A. Official guide and hands-on lab examples both point to ibtracert for path analysis between nodes, not just overall mapping.

Ishaan P. Jul 18, 2026 3:12 am

I've seen similar in exam reports, it's A for node-to-node path checks.

Layla O. Jul 6, 2026 5:08 pm

A. Yeah, for just connectivity testing ibping (C) is quick, but here you need to see the actual path between nodes. ibtracert shows all the hops in the InfiniBand fabric which helps spot if routing or cabling is off. Pretty sure A fits best for troubleshooting slow routes. Correct me if I missed something.

Luke N. Jul 21, 2026 8:32 pm

Why not D? It looks like a topology discovery trap, but for tracing specific node-to-node paths A is what you want.

ChrisC Jul 15, 2026 3:11 pm

Maybe A. Had something like this in a mock and ibtracert was needed for actual path tracing not just general status.

DirectReviewer6820 Jul 7, 2026 5:10 am

A tbh. D is tempting but it's more for topology, not node-to-node path trace.

Ajay Jul 22, 2026 12:37 pm

Don’t think it’s B or D. A is the only one that actually traces the path between two hosts, which is what you want for slow data between specific nodes. D gives you the whole topology, but not hop-by-hop between nodes. Correct me if I’m off here.

Correct Answer:

Explanation

The ibtracert command is the InfiniBand equivalent of the TCP/IP traceroute utility. It is specifically designed to discover and display the path that packets take from a source port to a destination port within the InfiniBand fabric. By tracing the hops (switches) between the two specified compute nodes, a system administrator can verify the exact route, identify potential bottlenecks, and determine if the path is sub-optimal. This directly addresses the need to ensure the path is optimal and troubleshoot slow transfer rates between two specific endpoints.
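A small Python sketch of how such a trace might be scripted is shown here. It only builds the command line; the LIDs 4 and 16 are placeholder values, the helper name is hypothetical, and -n is used on the assumption that it selects ibtracert's simpler output format as described in the infiniband-diags man page.

```python
from typing import List


def ibtracert_cmd(src_lid: int, dst_lid: int, simple_output: bool = False) -> List[str]:
    """Build an ibtracert call that traces the path between two LIDs."""
    cmd = ["ibtracert"]
    if simple_output:
        cmd.append("-n")  # assumed: simple output format without extra info
    cmd += [str(src_lid), str(dst_lid)]
    return cmd


# Placeholder LIDs; real values come from the two compute nodes' HCA ports.
print(" ".join(ibtracert_cmd(4, 16)))
```

Running the trace in both directions (swapping the two LIDs) is a simple way to check whether the forward and return routes cross the same switches.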

Why Incorrect

B. ibstatus: This command queries the status of the local Host Channel Adapter (HCA) and does not provide any information about paths to other nodes in the fabric.

C. ibping: This tool verifies basic point-to-point connectivity and measures latency, but it does not reveal the network path taken between the two nodes.

D. ibnetdiscover: This command performs a full scan of the fabric to discover and display the entire network topology, which is not specific to a single path between two nodes.

References

1. NVIDIA Mellanox OFED for Linux User Manual (LTS v5.8-1.1.2.1), Chapter 10: Diagnostic Tools, Section 10.10, "ibtracert". The manual states, "ibtracert uses SMPs to trace the path from a source to a destination... It discovers the path and prints the hops from the source to the destination." This confirms its function for path tracing.

2. Ohio Supercomputer Center (OSC) Documentation, "Verifying InfiniBand Fabric". This university-affiliated supercomputing center guide explains the usage of InfiniBand diagnostic tools. It describes ibtracert as the command to "trace the path from the current node to another node in the fabric," validating its use for the scenario described.

3. Linux man pages (infiniband-diags), ibtracert(8). The official man page describes the utility's purpose: "trace the path from a source GID/LID to a destination GID/LID." This is the primary function required by the administrator in the question.

Q: 11

You are monitoring the resource utilization of a DGX SuperPOD cluster using NVIDIA Base Command Manager (BCM). The system is experiencing slow performance, and you need to identify the cause. What is the most effective way to monitor GPU usage across nodes?

Options

Discussion

Sara Jul 8, 2026 11:44 pm

Had a similar scenario in practice, isn't it B since Base View gives you the whole cluster view?

Luke I. Jul 20, 2026 6:52 am

Probably B since the dashboard lets you see GPU stats for the whole cluster in real-time. D is a classic approach but way too manual for monitoring a SuperPOD. Base View just gives better visibility here, correct me if I'm missing something.

Owen Y. Jul 16, 2026 1:17 pm

D imo. nvidia-smi is pretty much the go-to for GPU stats, and if you want actual numbers per node, it's reliable. The dashboard might be nice for a big picture but honestly, running nvidia-smi gives you the live GPU utilization directly where the jobs run. Maybe not as centralized, but more granular I think. Let me know if that's off.

Grace P. Jul 10, 2026 5:41 am

B makes sense, the dashboard gives you real-time cluster stats so you’re not checking every node one by one.

Skyler Y. Jul 14, 2026 11:38 am

D, nvidia-smi on each node has always shown me exact GPU usage before.

Zoe R. Jul 9, 2026 7:20 pm

Has anyone here actually used the Base View dashboard in BCM for a live cluster? Wondering if it shows real-time GPU stats for all nodes at once or if you still need to check per node sometimes. Practice exams usually point toward dashboards but curious about your real-world experience with this tool.

Riley X. Jul 21, 2026 11:11 am

B for sure, since Base View dashboard gives that real-time cluster-wide GPU usage. D is too hands-on for a big setup.

Vikram U. Jul 6, 2026 11:45 pm

Honestly I would've picked D at first since nvidia-smi is the classic GPU check for each node. But for cluster-wide real-time visibility, that's too manual. So tempting trap here with D.

Grace M. Jul 11, 2026 5:25 pm

Nah, it's B here. Base View dashboard does all nodes in real time. D is just per-node, easy to miss that.

Correct Answer:

Explanation

NVIDIA Base Command Manager (BCM) is the designated software for managing and monitoring DGX SuperPOD clusters. Its Base View dashboard is specifically designed to provide a centralized, real-time, graphical overview of key performance metrics across all nodes. This includes GPU utilization, GPU memory, CPU usage, and system memory. Using this dashboard is the most effective and efficient method for identifying system-wide performance bottlenecks, such as those caused by GPU resource contention, without needing to manually access individual nodes.

Why Incorrect

A. Checking Slurm logs is useful for diagnosing job scheduling or resource allocation errors, but it does not provide real-time utilization metrics for active hardware.

C. The top command monitors CPU and memory usage but provides no information about GPU utilization, which is a critical component in a DGX cluster.

D. Using nvidia-smi on each node is a valid way to check GPU status, but it is highly inefficient and not scalable for a multi-node SuperPOD cluster.

References

1. NVIDIA Base Command Manager User Guide (v3.2.0): In Chapter 3, "Dashboards," the "Base View" section describes this dashboard as the primary tool for a "high-level overview of the cluster, including GPU, CPU, and memory utilization... essential for quickly identifying system-wide performance bottlenecks."

2. NVIDIA DGX SuperPOD Reference Architecture: The "Management and Monitoring" section details the software stack, emphasizing that NVIDIA Base Command Manager provides the unified, cluster-wide monitoring capabilities necessary for operating the system. It highlights the dashboards for visualizing aggregate system health and performance.

Q: 12

You need to do maintenance on a node. What should you do first?

Options

Discussion

PracticalSec7738 Jul 19, 2026 2:45 am

A. Draining with scontrol update is the safe move so jobs aren't cut off. B and C look like trick options since taking a node down too soon can kill running tasks.

Taylor Jul 17, 2026 5:13 am

A. The official guide and hands-on labs both emphasize draining with scontrol first so you don’t disrupt running jobs. B and C get chosen a lot, but they're only right for immediate shutdown cases I think. Anyone disagree?

Sean P. Jul 17, 2026 5:13 am

Feels like A, draining with scontrol is the correct first step for standard maintenance. B and C are tempting but setting to down too soon would just kill any running processes, which usually isn't what you want unless it's an emergency. Pretty sure that's what Slurm docs recommend too, but correct me if I'm missing a recent change.

Nathan Jul 20, 2026 9:24 am

I'd pick B here.

OwenB Jul 18, 2026 8:59 am

Option A, draining with scontrol update, is the safe call in normal maintenance. That lets current jobs finish before you take the node down, so no running stuff gets lost. B and C jump straight to DOWN and might kill jobs, which is a common mistake folks make here. Pretty sure it's A unless they want urgent shutdown?

SteadyLearner6087 Jul 10, 2026 1:59 pm

I keep seeing B in some practice sets and the official guide, so I went with B.

DirectArchitect1873 Jul 13, 2026 5:11 pm

A for sure here. Draining the node lets jobs finish before you do anything else, so nothing important gets interrupted.

Quinn Y. Jul 25, 2026 6:40 pm

C tbh, since both B and C say set to down before maintenance. If you need jobs to stop immediately, setting to down ends them right away, which can be necessary for urgent patching. But I might be missing a nuance with scontrol drain vs down.

Emma C. Jul 24, 2026 5:09 pm

Why are B and C both listed if they say the same thing? Looks like a distractor.

Aaron Jul 24, 2026 8:21 am

Need to prevent new jobs hitting the node first, so A. That way existing jobs finish and Slurm handles it smoothly.

Correct Answer:

Explanation

The correct first step for planned maintenance on a compute node is to place it in the DRAIN state using the scontrol update command. This is the standard, non-disruptive procedure in Slurm. The DRAIN state prevents new jobs from being allocated to the node while allowing any currently running jobs to complete. Once all jobs have finished, the node's state transitions to IDLE+DRAIN, indicating it is safe to take offline for maintenance without interrupting user workloads.
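The drain-then-resume sequence can be sketched in Python as two command builders. The node name node01 and the function names are hypothetical; the scontrol update fields (NodeName, State, Reason) are the ones used in the Slurm documentation cited below.

```python
from typing import List


def drain_cmd(node: str, reason: str) -> List[str]:
    """Build the scontrol call that drains a node before maintenance."""
    if not reason.strip():
        # Slurm expects a reason when a node is drained.
        raise ValueError("a non-empty reason is required to drain a node")
    return ["scontrol", "update", f"NodeName={node}", "State=DRAIN", f"Reason={reason}"]


def resume_cmd(node: str) -> List[str]:
    """Build the scontrol call that returns the node to service afterwards."""
    return ["scontrol", "update", f"NodeName={node}", "State=RESUME"]


# Hypothetical node name.
print(" ".join(drain_cmd("node01", "maintenance")))
print(" ".join(resume_cmd("node01")))
```

Passing the arguments as a list (for example to subprocess.run) avoids shell quoting problems when the reason contains spaces.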

Why Incorrect

B/C. Setting the node state to down is an abrupt action typically reserved for unexpected node failures. It can cause running jobs to be terminated and requeued, which is disruptive and not ideal for planned maintenance.

D. Disabling job scheduling on all compute nodes is excessive and unnecessary for maintenance on a single node. This action would halt the productivity of the entire cluster.

References

1. NVIDIA DGX SuperPOD with NVIDIA DGX A100 Deployment Guide (v1.1)

Section 5.1.1. Taking Nodes Offline for Maintenance: "To take a node offline for maintenance, the best practice is to drain the node. Draining a node prevents new jobs from being scheduled on the node but allows currently running jobs to complete. To drain a node, run the following command: scontrol update nodename= state=drain reason="maintenance""

2. SchedMD (Official Slurm Documentation), scontrol Man Page

Section: update command, NodeName parameter: "State=DRAIN - Indicate that a node is unavailable for use. No new jobs will be allocated to the node. Jobs already running on the node will be allowed to complete. The node state will be changed to idle when all of the jobs on it have completed." This contrasts with State=DOWN, which is described as for a node that is "unavailable for use" and often set automatically when a node fails.

Q: 13

A DGX H100 system in a cluster is showing performance issues when running jobs. Which command should be run to generate system logs related to the health report?

Options

Discussion

Robin W. Jul 8, 2026 6:13 pm

C. nvsm dump health collects a full health report with detailed diagnostics, not just regular logs. I think this is what support usually asks for when there are DGX performance issues. Let me know if you’ve seen it used differently.

Taylor C. Jul 8, 2026 8:41 pm

Option C but honestly not super sure on this one. Anyone else seen this before?

MethodicalReviewer8900 Jul 14, 2026 1:09 am

C, nvsm dump health grabs the full diagnostics with health-related logs. B only gets basic logs, but for actual health report troubleshooting C is more complete. I think that's what the question's after.

Morgan Jul 16, 2026 12:49 pm

D imo, nvsm health --dump-log looks like it would be the one to create logs tied directly to system health. Seems logical for troubleshooting health-related issues. Not fully sure since the command options can get confusing, but that's what I'd use.

Jamie Jul 13, 2026 6:37 pm

It's C, though B tricks you since it looks like a straight log pull but won't grab the diagnostic package.

Kevin M. Jul 25, 2026 4:20 pm

Feels like B. That command grabs system logs so seems like a fit if you just want to check logs for health-related issues.

Nina F. Jul 7, 2026 6:40 am

C tbh

Meera R. Jul 6, 2026 7:08 pm

C. Saw a similar question in other practice sets and nvsm dump health is the one that bundles health diagnostics plus logs in a package for support, not just raw logs like B. Easy to miss since B looks tempting but it's not as comprehensive. Open to corrections if anyone's got docs showing otherwise.

Casey N. Jul 6, 2026 9:45 pm

Anybody tried using D here? I swear the official guide listed something like nvsm health --dump-log for health reports. Practice tests might help confirm this.

Amelia H. Jul 16, 2026 8:01 am

It's B. I remember seeing nvsm get logs for pulling system logs during troubleshooting, not option C.

Correct Answer:

Explanation

The nvsm dump health command is the correct utility for collecting a comprehensive set of system health and diagnostic information on an NVIDIA DGX system. When executed, it gathers logs, hardware status, software configuration, and performance data, packaging it into a single compressed archive file. This file lets support teams and system administrators perform an in-depth, offline analysis of system issues, such as performance degradation, without needing direct access to the live system.

Why Incorrect

A. nvsm show logs --save: This command is used to display specific system logs (e.g., event logs) to the console. While the output can be redirected, it does not generate the comprehensive diagnostic bundle that nvsm dump health creates.

B. nvsm get logs: This is not a valid command or syntax within the NVIDIA System Management (NVSM) command-line interface. The standard verbs are show, dump, run, etc.

D. nvsm health --dump-log: This is not a valid NVSM command. The command to actively run a health check is nvsm run health, which provides a summary of the system's status, not a full log dump.

References

1. NVIDIA DGX OS User Guide (Version 6.1.0):

Section 5.5, "Dumping System Information": This section explicitly details the nvsm dump command. It states, "nvsm dump health: Collects health information and saves it to a file." This confirms that this is the designated command for creating a health report archive for troubleshooting.

Section 5.9, "Running System Checks": This section describes nvsm run health, clarifying its purpose is to execute a health check and display a summary, which is different from dumping detailed logs.

Section 5.10, "Showing System Information": This section details the nvsm show logs command, confirming its function is to display logs rather than create a diagnostic package.

Q: 14

A Slurm user needs to submit a batch job script for execution tomorrow. Which command should be used to complete this task?

Options

Discussion

Taylor N. Jul 26, 2026 8:41 pm

Option A

QuietLearner2330 Jul 22, 2026 4:38 pm

Option A, since D (srun) is more for interactive and traps you if you miss script requirements.

Nora P. Jul 15, 2026 11:56 pm

D for me, srun allows the --begin flag too so it should work as well. Might be mixing up interactive and batch job details though. Correct me if sbatch is the only valid pick here.

Ben B. Jul 22, 2026 9:59 pm

D, srun can take the --begin option too so could work here.

Avery G. Jul 11, 2026 8:04 am

sbatch is specifically for batch scripts so A fits best here. srun is more for interactive runs. Pretty sure that's what the question's after, but open to correction if anyone's seen a slurm update change this!

Parker M. Jul 18, 2026 12:58 pm

Why does NVIDIA keep tossing in fake commands like "submit" on these? Anyway, for scheduling a batch script to run tomorrow with Slurm, it's the sbatch --begin=tomorrow syntax. Only A matches the actual workflow you'd use. Correct me if I'm missing something but that's what I've always seen in exam reports.

Aisha P. Jul 14, 2026 3:41 pm

A for sure. Saw a nearly identical question on a practice test, and sbatch --begin=tomorrow is the way to schedule a batch job to start the next day in Slurm. B/C/D are for different purposes. Agree?

FocusedLearner5603 Jul 27, 2026 1:40 am

Probably A. Had something like this in a mock, and sbatch with --begin lets you delay batch jobs. The other options aren't for batch script submission. Let me know if you think otherwise.

Rowan Jul 9, 2026 8:02 pm

A is right, not D. Official guide and scheduler docs both point to sbatch for batch job scripts specifically.

Maya D. Jul 21, 2026 7:38 pm

It's A since sbatch is specifically for batch job scripts and supports the --begin option. srun and salloc are mainly for interactive uses, so pretty confident A best fits the requirement here. Disagree if you see it differently.

Correct Answer:

Explanation

The sbatch command is the standard utility in Slurm for submitting a batch script for later execution. To control when the job becomes eligible to run, the --begin (or -b) option is used. This option accepts various time formats, including the specific keyword tomorrow, which defers the job's start time until the beginning of the next day (00:00:00). This command combination precisely fulfills the user's requirement to submit a script now that will be scheduled to run on the following day.
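The following Python sketch shows the submission described above, along with an explicit-timestamp alternative. The script name train.sh and the helper names are hypothetical. The next_midnight helper mirrors the explanation's statement that tomorrow resolves to 00:00:00 on the next day, and the explicit form assumes Slurm's YYYY-MM-DDTHH:MM:SS time format.

```python
from datetime import datetime, timedelta
from typing import List


def sbatch_begin_cmd(script: str, begin: str = "tomorrow") -> List[str]:
    """Build an sbatch call whose job is not eligible to start before `begin`."""
    return ["sbatch", f"--begin={begin}", script]


def next_midnight(now: datetime) -> datetime:
    """Return 00:00:00 on the day after `now`."""
    return (now + timedelta(days=1)).replace(hour=0, minute=0, second=0, microsecond=0)


def explicit_begin(now: datetime) -> str:
    """Format the next midnight as an explicit --begin timestamp."""
    return next_midnight(now).strftime("%Y-%m-%dT%H:%M:%S")


# Hypothetical batch script name.
print(" ".join(sbatch_begin_cmd("train.sh")))
print(" ".join(sbatch_begin_cmd("train.sh", explicit_begin(datetime.now()))))
```

The keyword form is shorter, while the explicit timestamp makes the intended start time visible in submission logs.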

Why Incorrect

B. submit --begin=tomorrow: submit is not a valid command within the Slurm workload manager for job submission.

C. salloc --begin=tomorrow: salloc is used to allocate resources for interactive jobs, not for submitting a non-interactive batch script for deferred execution.

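D. srun --begin=tomorrow: srun launches tasks (job steps), either inside an existing sbatch or salloc allocation or by requesting its own, and it stays attached to the terminal while it waits. It accepts --begin, but it is not the tool for submitting a batch script for deferred, unattended execution.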

References

1. SchedMD LLC, Slurm Workload Manager Documentation for sbatch: "The sbatch utility is used to submit a job script for later execution... --begin= Submit the batch script to the Slurm controller but defer its eligibility for scheduling until the specified time. The timespec can be a keyword such as now, midnight, noon, or tomorrow." (Reference: sbatch man page, --begin option description).

2. NVIDIA Deep Learning Institute (DLI), "Slurm for Data Scientists" Courseware: The course materials explicitly cover job submission and scheduling. Module 3, "Submitting and Managing Jobs," details the use of sbatch for submitting batch scripts and highlights the --begin flag for scheduling jobs to run at a future time, providing tomorrow as a usage example. (Reference: DLI Courseware, Module 3, Section on Job Submission Options).

Q: 15

You have noticed that users can access all GPUs on a node even when they request only one GPU in their job script using --gres=gpu:1. This is causing resource contention and inefficient GPU usage. What configuration change would you make to restrict users’ access to only their allocated GPUs?

Options

Discussion

Nathan O. Jul 22, 2026 1:53 am

Option B. Seen this on other clusters: you have to set ConstrainDevices=yes in cgroup.conf or Slurm won't restrict GPU access. The other options don't deal with device isolation directly.

TaylorB Jul 15, 2026 3:35 pm

Check the official guide or a lab environment, both show B and cgroup.conf for scenarios like this.

Reese Jul 25, 2026 11:41 pm

D. Modifying the job script to ask for CPUs and GPUs might help resource allocation, but unless the system actually enforces device access, users can still see all GPUs. Pretty sure that's a trap, but it seems logical at first.

Karan Q. Jul 18, 2026 9:42 am

D. If you specify both GPUs and CPUs in the script, you're telling the scheduler to allocate more resources per job so maybe there's better overall isolation. Not totally certain, but I think that helps reduce resource contention. Anyone disagree?

Ryan W. Jul 22, 2026 6:46 pm

B, saw a similar thing in the official guide and lab environments.

NinaP Jul 7, 2026 12:26 am

Maybe D, since adding more resource requests in the script can help isolation, but B is the real trap here.

SteadyLearner8262 Jul 22, 2026 12:55 am

Not D, pretty sure it has to be B. Only B (ConstrainDevices in cgroup.conf) actually restricts GPU device access at the OS level; changing the job script in D doesn't stop jobs from seeing all GPUs. D's a common distractor here.

Skyler H. Jul 25, 2026 7:11 am

It's B here, since enabling ConstrainDevices in cgroup.conf is exactly how you tell Slurm to restrict a user's job to only the GPUs it was allocated. None of the others actually enforce device access. I remember this coming up in official training material too, but if there’s a different method folks have used, let me know.

Karan A. Jul 22, 2026 6:46 pm

Totally agree, B. Setting ConstrainDevices=yes is how you stop jobs from hogging all GPUs on the node.

Jamie U. Jul 8, 2026 6:43 am

B, only ConstrainDevices in cgroup.conf actually locks jobs to just their assigned GPUs. The other options don't enforce that kind of device isolation at all. Seen this fix used in actual Slurm configs before, but let me know if anyone's tried something different.

Correct Answer:

Explanation

The problem describes a lack of resource isolation where a user's job can access GPUs not allocated to it. This is solved by enforcing resource containment using Linux Control Groups (cgroups). In the Slurm workload manager, which is commonly used in HPC environments with NVIDIA GPUs, the cgroup.conf file manages this integration. Setting the ConstrainDevices parameter to yes instructs Slurm to leverage the cgroup devices subsystem. This subsystem creates a specific allowlist of device files for the job, ensuring it can only access the exact GPU devices assigned by the scheduler, thereby preventing resource contention.
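A quick way to audit this setting is to parse cgroup.conf and check the flag, as in the Python sketch below. The function names and the sample file contents are illustrative only, and the parser handles just simple key=value lines with # comments. It also assumes the cluster already uses the task/cgroup task plugin in slurm.conf, since the constraint is enforced through that plugin.

```python
from typing import Dict


def parse_conf(text: str) -> Dict[str, str]:
    """Parse simple key=value lines, ignoring blanks and # comments."""
    settings: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip().lower()] = value.strip()  # keys compared case-insensitively
    return settings


def devices_constrained(cgroup_conf_text: str) -> bool:
    """True when ConstrainDevices is explicitly set to yes."""
    return parse_conf(cgroup_conf_text).get("constraindevices", "no").lower() == "yes"


# Illustrative cgroup.conf contents.
sample = "ConstrainCores=yes\nConstrainDevices=yes  # isolate allocated GPUs\n"
print(devices_constrained(sample))
```

With the flag enabled, a job submitted with --gres=gpu:1 should only be able to open the one GPU device file the scheduler assigned to it.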

Why Incorrect

A. Increasing memory allocation is unrelated to device access control and will not prevent a process from accessing unallocated GPU devices.

C. Modifying job priority only changes the scheduling order; it does not enforce resource isolation for jobs that are already running.

D. Requesting additional CPU cores does not impose any restrictions on which GPU devices the job is permitted to access.

References

1. Official Vendor Documentation (Slurm): The official Slurm documentation for the cgroup.conf file defines the ConstrainDevices parameter. It describes setting the parameter to "yes" as constraining a job's allowed devices based on the GRES resources allocated to it, using the cgroup devices subsystem, and gives "no" as the default. This directly addresses the question's scenario.

Source: SchedMD, Slurm Workload Manager, cgroup.conf documentation.

Reference: Parameter description for ConstrainDevices at https://slurm.schedmd.com/cgroup.conf.html.

2. Official Vendor Documentation (NVIDIA): NVIDIA's deployment guides for DGX systems, which are reference architectures for AI infrastructure, recommend using cgroups for proper resource isolation in a Slurm environment. This confirms it as a best practice.

Source: NVIDIA DGX SuperPOD with NVIDIA DGX A100 Deployment Guide.

Reference: Section 4.2.1, "Slurm Configuration," discusses the setup of cgroup.conf for resource management.

3. University Courseware/Documentation: Reputable university High-Performance Computing (HPC) centers document this configuration as essential for multi-user GPU environments. For example, the University of Florida Research Computing documentation explains that cgroups are necessary for proper GPU isolation.

Source: University of Florida Research Computing, HiPerGator User Docs.

Reference: Section "GPU Isolation" on the "Using GPUs on HiPerGator" page.
