Yeah, C fits here. The main thing is that failover handles the automatic transfer when a node fails, not load balancing or manual steps. Option A trips people up since load balancing is also HA-related but it doesn't kick in when a node actually drops. Anyone disagree?
Q: 1
You are managing a high availability (HA) cluster that hosts mission-critical applications. One of the
nodes in the cluster has failed, but the application remains available to users.
What mechanism is responsible for ensuring that the workload continues to run without
interruption?
Options
Discussion
Going with C, it's correct. Saw a nearly identical question in a mock; the failover mechanism takes over automatically with no user interruption. Data replication (D) helps with integrity but not the instant switch. Open to counterpoints if I missed something.
Probably C. Pretty sure that's the failover part, right? Can someone confirm I'm not missing something obvious here?
It's C, saw similar cluster HA questions in exam reports where failover is the main mechanism.
Q: 2
You are tasked with deploying a deep learning framework container from NVIDIA NGC on a standalone GPU-enabled server.
What must you complete before pulling the container? (Choose two.)
Options
Discussion
Maybe C and D. I'm thinking you need to have TensorFlow or PyTorch on the server, otherwise the container won't run properly, and logging into NGC with an API key is also mandatory for pulls. Not totally sure about skipping Docker install if it's standalone though. Anyone disagree?
You definitely need A and D for this setup. The server has to be ready with Docker plus the NVIDIA Container Toolkit for GPU access, and you can't pull from NGC without logging in using an API key. I'm pretty sure that's all you need; let me know if there's some weird case I missed.
A and D
C or D? I think C makes sense since you'd want the framework installed first, but D also seems needed. Not sure, anyone else see this?
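For anyone who wants to try this end to end, the usual sequence on a standalone box looks roughly like the below. This assumes Docker and the NVIDIA Container Toolkit are already installed, and the PyTorch tag is just an example; pick whatever framework and tag you actually need.

# Log in to the NGC registry: username is the literal string $oauthtoken,
# password is your NGC API key.
docker login nvcr.io

# Pull a framework container from NGC (example tag).
docker pull nvcr.io/nvidia/pytorch:24.01-py3

# Sanity check that the container can see the GPUs.
docker run --rm --gpus all nvcr.io/nvidia/pytorch:24.01-py3 nvidia-smi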
Q: 3
A data scientist is training a deep learning model and notices slower than expected training times.
The data scientist alerts a system administrator to inspect the issue. The system administrator
suspects the disk IO is the issue.
What command should be used?
Options
Discussion
I remember a similar scenario from labs and practice exams; pretty sure it's B here.
Option B since iostat directly shows disk IO performance, but if the workload is network-based or on a cloud volume, this might not catch everything. In some edge setups htop might hint at IO-wait spikes though.
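To make B concrete: iostat ships with the sysstat package and reports per-device throughput and utilization. Something like this (interval and count are just examples):

# Extended per-device stats, refreshed every 2 seconds, 5 reports.
# High %util or large await/r_await/w_await values point to a disk I/O bottleneck.
iostat -dx 2 5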
Q: 4
A system administrator wants to run these two commands in Base Command Manager.
main
showprofile device status apc01
What command should the system administrator use from the management node system shell?
Options
Discussion
C/D? Had something like this in a mock; pretty sure A is the one that works from the shell without interactive mode.
I'm thinking B works too if cmsh actually takes -p for passing command strings, but I could be wrong.
C vs D, but is "best" asking for non-interactive only? If they want interactive then that changes things since D is a trap here.
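Whatever letters these map to, the non-interactive form everyone is circling is cmsh's -c flag, which chains cmsh commands separated by semicolons. Assuming the two commands in the question are "main showprofile" and "device status apc01", the shell invocation would look something like:

# Run both cmsh commands from the management node shell without
# entering interactive mode.
cmsh -c "main showprofile; device status apc01"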
Q: 5
You are managing a Kubernetes cluster running AI training jobs using TensorFlow. The jobs require
access to multiple GPUs across different nodes, but inter-node communication seems slow,
impacting performance.
What is a potential networking configuration you would implement to optimize inter-node
communication for distributed training?
Options
Discussion
For GPU-heavy distributed training, InfiniBand (D) seems like the best shot. It gives you way lower latency and much better bandwidth than Ethernet, which matters when syncing model weights all the time. Pretty sure that's what you'd see in most real-world AI clusters, but someone chime in if they've made B work at scale.
Definitely D here. InfiniBand is what you'd typically see in HPC clusters for distributed AI training because of its low latency and high bandwidth, which really matters more than jumbo frames on Ethernet. Pretty sure I saw a similar question in practice exams too. Agree?
It's D. InfiniBand is built for this sort of low-latency, high-throughput workload, so it beats standard Ethernet networking here.
B
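Side note: once the fabric is in place, distributed TensorFlow/PyTorch jobs typically move tensors between nodes via NCCL, so a quick way to confirm the training traffic is actually going over InfiniBand is to turn on NCCL's logging. The variables below are standard NCCL environment variables; the HCA name mlx5_0 is just an example from my setup.

# Print NCCL's transport selection at startup; look for NET/IB lines in the job log.
export NCCL_DEBUG=INFO
# Optionally pin NCCL to a specific InfiniBand HCA.
export NCCL_IB_HCA=mlx5_0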
Q: 6
You are managing an on-premises cluster using NVIDIA Base Command Manager (BCM) and need to
extend your computational resources into AWS when your local infrastructure reaches peak capacity.
What is the most effective way to configure cloudbursting in this scenario?
Options
Discussion
D imo, Cluster Extension is built for this. Options B and C both want you to do manual work which defeats the point of cloudbursting. A is a distractor, since BCM's load balancer alone can't handle auto-provisioning into AWS. Seen similar in practice questions, so pretty sure D is right.
C or D. I think D is right because Cluster Extension sounds like it would handle the provisioning part automatically, but I've only seen manual setups before, so not totally sure. Anyone have hands-on experience with BCM cloudbursting?
Probably D, since BCM's Cluster Extension automates cloud resource provisioning instead of manual intervention. Not 100% sure but makes sense based on hybrid cluster management.
Q: 7
You are managing a Slurm cluster with multiple GPU nodes, each equipped with different types of
GPUs. Some jobs are being allocated GPUs that should be reserved for other purposes, such as
display rendering.
How would you ensure that only the intended GPUs are allocated to jobs?
Options
Discussion
Pretty sure it's A, since Slurm relies on gres.conf to control which GPUs are visible for scheduling. Manual steps like nvidia-smi (B) or reinstalling drivers won’t restrict allocation; you have to exclude the display GPUs via config. Agree?
A tbh. Only config changes in gres.conf/slurm.conf will lock down allocation like this.
A. D is a trap; only gres.conf and slurm.conf config can enforce the right GPU allocation for jobs.
Not D, A. The only way you actually prevent Slurm from touching display GPUs is by excluding them in gres.conf, not by job-script tweaks or reinstalling drivers.
Probably A here. Similar questions on official practice tests refer to correctly setting up gres.conf and slurm.conf so only the right GPUs get scheduled for jobs. Manual tools like nvidia-smi don't control Slurm's GPU selection. Official admin guide explains this pretty well.
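To make A concrete, here's roughly what the config looks like. Node name, GPU type, and device paths are made-up examples; the point is that the display GPU's /dev/nvidia device simply isn't listed in gres.conf, so Slurm never hands it to a job.

# gres.conf on the compute node: expose only the four compute GPUs
# (/dev/nvidia0-3) and leave the display GPU (/dev/nvidia4) out.
NodeName=gpunode01 Name=gpu Type=a100 File=/dev/nvidia[0-3]

# slurm.conf: declare the GRES type and give the node a matching GPU count
# (the rest of the NodeName line stays whatever it already was).
GresTypes=gpu
NodeName=gpunode01 Gres=gpu:a100:4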
Q: 8
An organization has multiple containers and wants to view STDIN, STDOUT, and STDERR I/O streams
of a specific container.
What command should be used?
Options
Discussion
Option D is what I was thinking.
docker inspect shows everything about the container, so you might find the logs there. Not 100% sure, maybe missing something obvious with inspect output. If anyone has tried this recently let me know.
A is wrong, it's C. docker logs is what you use to check STDOUT/STDERR of a running container. docker inspect gives details but not the I/O streams directly.
C
Official docs and doing labs help with these command questions.
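If the intended answer is docker logs, as argued above, typical usage looks like this (the container name is a placeholder):

# Show the captured STDOUT/STDERR of a specific container and follow new output.
docker logs --follow my-training-container

# For comparison, docker attach is the one that also wires up STDIN
# by attaching your terminal to the running container's streams.
docker attach my-training-container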
Q: 9
You are an administrator managing a large-scale Kubernetes-based GPU cluster using Run:AI.
To automate repetitive administrative tasks and efficiently manage resources across multiple nodes,
which of the following is essential when using the Run:AI Administrator CLI for environments where
automation or scripting is required?
Options
Discussion
C makes sense here, you need a kubeconfig with admin rights or the CLI just can't talk to all the nodes for automation. Pretty standard setup for K8s admin tools. If someone disagrees let me know, but I think that's spot on.
C. Saw this or something close in an exam report; an admin kubeconfig is required for scripting with runai-adm.
It's C, you definitely need the kubeconfig with admin rights before automating anything with the CLI.
Probably C; can’t automate or script across nodes with the Run:AI CLI unless you have a kubeconfig set with admin rights. B sounds tempting but doesn’t handle permissions. Seen similar confusion on practice tests.
C imo, since the CLI needs a kubeconfig with admin rights to hit the whole cluster. Without that config, automation with runai-adm just won’t work right. Pretty standard for Kubernetes tools, correct me if I’m missing something.
My vote is C. Admin kubeconfig is a must for cluster-wide automation with runai-adm CLI.
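For what it's worth, the practical prerequisite everyone is describing is just that the shell running runai-adm points at a kubeconfig with cluster-admin rights. A quick sanity check before scripting (the path is made up; kubectl auth can-i is standard kubectl):

# Point the session at the admin kubeconfig the CLI will use.
export KUBECONFIG=/path/to/admin-kubeconfig.yaml

# Verify the credentials really do have cluster-wide admin rights.
kubectl auth can-i '*' '*' --all-namespaces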
Q: 10
A system administrator of a high-performance computing (HPC) cluster that uses an InfiniBand fabric
for high-speed interconnects between nodes received reports from researchers that they are
experiencing unusually slow data transfer rates between two specific compute nodes. The system
administrator needs to ensure the path between these two nodes is optimal.
What command should be used?
Options
Discussion
A
Yeah, for just connectivity testing ibping (C) is quick, but here you need to see the actual path between nodes. ibtracert shows all the hops in the InfiniBand fabric which helps spot if routing or cabling is off. Pretty sure A fits best for troubleshooting slow routes. Correct me if I missed something.
A tbh, ibtracert actually shows the full route packets take between two InfiniBand nodes. Needed here to find any bottlenecks along the path. D (ibnetdiscover) is great for mapping, but doesn't trace actual traffic flow. Anyone see it differently?
Not sure, but I’d pick D here. ibnetdiscover sounds like it’d show how the nodes are connected, so you could spot any weird topology affecting traffic. Maybe not the most direct, but it gives a map at least. Agree?
Re D: ibnetdiscover only maps the fabric topology; it doesn't trace actual data paths like A does.
If the question wanted just a quick connectivity check instead of full path tracing, would C (ibping) be better than A?
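For reference, ibtracert takes a source and destination address (LIDs by default) and prints every hop along the route, which is how you'd spot a bad link or an unexpected path. The LID values below are made up; ibstat (or ibaddr) is the usual way to find each node's LID.

# On each node, find the local port's base LID.
ibstat

# Trace the InfiniBand path hop by hop from LID 4 to LID 17.
ibtracert 4 17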