Yeah, C fits here. The main thing is that failover handles the automatic transfer when a node fails, not load balancing or manual steps. Option A trips people up since load balancing is also HA-related but it doesn't kick in when a node actually drops. Anyone disagree?
Q: 1
You are managing a high availability (HA) cluster that hosts mission-critical applications. One of the
nodes in the cluster has failed, but the application remains available to users.
What mechanism is responsible for ensuring that the workload continues to run without
interruption?
Options
Discussion
Going with C, it's correct. Saw a nearly identical question in a mock; the failover mechanism takes over automatically with no user interruption. Data replication (D) helps with integrity but not the instant switch. Open to counterpoints if I missed something.
Probably C. Pretty sure that's the failover part, right? Can someone confirm I'm not missing something obvious here?
It's C, saw similar cluster HA questions in exam reports where failover is the main mechanism.
Q: 2
You are tasked with deploying a deep learning framework container from NVIDIA NGC on a standalone GPU-enabled server.
What must you complete before pulling the container? (Choose two.)
Options
Discussion
Maybe C and D. I'm thinking you need to have TensorFlow or PyTorch on the server, otherwise the container won't run properly, and logging into NGC with an API key is also mandatory for pulls. Not totally sure about skipping Docker install if it's standalone though. Anyone disagree?
You definitely need A and D for this setup. The server has to be ready with Docker plus the NVIDIA Container Toolkit for GPU access, and you can't pull from NGC without logging in using an API key. I'm pretty sure that's all you need; let me know if there's some weird case I missed.
A and D
C or D? I think C makes sense since you'd want the framework installed first, but D also seems needed. Not sure, anyone else see this?
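For anyone who wants to try this end to end, the usual sequence on a standalone box looks roughly like the below. This assumes Docker and the NVIDIA Container Toolkit are already installed, and the PyTorch tag is just an example; pick whatever framework and tag you actually need.

# Log in to the NGC registry: username is the literal string $oauthtoken,
# password is your NGC API key.
docker login nvcr.io

# Pull a framework container from NGC (example tag).
docker pull nvcr.io/nvidia/pytorch:24.01-py3

# Sanity check that the container can see the GPUs.
docker run --rm --gpus all nvcr.io/nvidia/pytorch:24.01-py3 nvidia-smi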
Q: 3
A data scientist is training a deep learning model and notices slower than expected training times.
The data scientist alerts a system administrator to inspect the issue. The system administrator
suspects the disk IO is the issue.
What command should be used?
Options
Discussion
I remember a similar scenario from labs and practice exams; pretty sure it's B here.
Option B since iostat directly shows disk IO performance, but if the workload is network-based or on a cloud volume, this might not catch everything. In some edge setups htop might hint at IO-wait spikes though.
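To make B concrete: iostat ships with the sysstat package and reports per-device throughput and utilization. Something like this (interval and count are just examples):

# Extended per-device stats, refreshed every 2 seconds, 5 reports.
# High %util or large await/r_await/w_await values point to a disk I/O bottleneck.
iostat -dx 2 5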
Q: 4
A system administrator wants to run these two commands in Base Command Manager.
main
showprofile device status apc01
What command should the system administrator use from the management node system shell?
Options
Discussion
C/D? Had something like this in a mock; pretty sure A is the one that works from the shell without interactive mode.
I'm thinking B works too if cmsh actually takes -p for passing command strings, but I could be wrong.
C vs D, but is "best" asking for non-interactive only? If they want interactive then that changes things since D is a trap here.
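Whatever letters these map to, the non-interactive form everyone is circling is cmsh's -c flag, which chains cmsh commands separated by semicolons. Assuming the two commands in the question are "main showprofile" and "device status apc01", the shell invocation would look something like:

# Run both cmsh commands from the management node shell without
# entering interactive mode.
cmsh -c "main showprofile; device status apc01"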
Q: 5
You are managing a Kubernetes cluster running AI training jobs using TensorFlow. The jobs require
access to multiple GPUs across different nodes, but inter-node communication seems slow,
impacting performance.
What is a potential networking configuration you would implement to optimize inter-node
communication for distributed training?
Options
Discussion
For GPU-heavy distributed training, InfiniBand (D) seems like the best shot. It gives you way lower latency and much better bandwidth than Ethernet, which matters when syncing model weights all the time. Pretty sure that's what you'd see in most real-world AI clusters, but someone chime in if they've made B work at scale.
Definitely D here. InfiniBand is what you'd typically see in HPC clusters for distributed AI training because of its low latency and high bandwidth, which really matters more than jumbo frames on Ethernet. Pretty sure I saw a similar question in practice exams too. Agree?
It's D. InfiniBand is built for this sort of low-latency, high-throughput workload, so it beats standard Ethernet networking here.
B
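Side note: once the fabric is in place, distributed TensorFlow/PyTorch jobs typically move tensors between nodes via NCCL, so a quick way to confirm the training traffic is actually going over InfiniBand is to turn on NCCL's logging. The variables below are standard NCCL environment variables; the HCA name mlx5_0 is just an example from my setup.

# Print NCCL's transport selection at startup; look for NET/IB lines in the job log.
export NCCL_DEBUG=INFO
# Optionally pin NCCL to a specific InfiniBand HCA.
export NCCL_IB_HCA=mlx5_0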
Q: 6
You are managing an on-premises cluster using NVIDIA Base Command Manager (BCM) and need to
extend your computational resources into AWS when your local infrastructure reaches peak capacity.
What is the most effective way to configure cloudbursting in this scenario?
Options
Discussion
D imo, Cluster Extension is built for this. Options B and C both want you to do manual work which defeats the point of cloudbursting. A is a distractor, since BCM's load balancer alone can't handle auto-provisioning into AWS. Seen similar in practice questions, so pretty sure D is right.
C or D. I think D is right because Cluster Extension sounds like it would handle the provisioning part automatically, but I've only seen manual setups before, so not totally sure. Anyone have hands-on experience with BCM cloudbursting?
Probably D, since BCM's Cluster Extension automates cloud resource provisioning instead of manual intervention. Not 100% sure but makes sense based on hybrid cluster management.
Q: 7
You are managing a Slurm cluster with multiple GPU nodes, each equipped with different types of
GPUs. Some jobs are being allocated GPUs that should be reserved for other purposes, such as
display rendering.
How would you ensure that only the intended GPUs are allocated to jobs?
Options
Discussion
Pretty sure it's A, since Slurm relies on gres.conf to control which GPUs are visible for scheduling. Manual steps like nvidia-smi (B) or reinstalling drivers won’t restrict allocation; you have to exclude the display GPUs via config. Agree?
A tbh. Only config changes in gres.conf/slurm.conf will lock down allocation like this.
A. D is a trap; only gres.conf and slurm.conf config can enforce the right GPU allocation for jobs.
Not D, A. The only way you actually prevent Slurm from touching display GPUs is by excluding them in gres.conf, not by job-script tweaks or reinstalling drivers.
Probably A here. Similar questions on official practice tests refer to correctly setting up gres.conf and slurm.conf so only the right GPUs get scheduled for jobs. Manual tools like nvidia-smi don't control Slurm's GPU selection. Official admin guide explains this pretty well.
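To make A concrete, here's roughly what the config looks like. Node name, GPU type, and device paths are made-up examples; the point is that the display GPU's /dev/nvidia device simply isn't listed in gres.conf, so Slurm never hands it to a job.

# gres.conf on the compute node: expose only the four compute GPUs
# (/dev/nvidia0-3) and leave the display GPU (/dev/nvidia4) out.
NodeName=gpunode01 Name=gpu Type=a100 File=/dev/nvidia[0-3]

# slurm.conf: declare the GRES type and give the node a matching GPU count
# (the rest of the NodeName line stays whatever it already was).
GresTypes=gpu
NodeName=gpunode01 Gres=gpu:a100:4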
Q: 8
An organization has multiple containers and wants to view STDIN, STDOUT, and STDERR I/O streams
of a specific container.
What command should be used?
Options
Discussion
Option D is what I was thinking.
docker inspect shows everything about the container, so you might find the logs there. Not 100% sure, maybe missing something obvious with inspect output. If anyone has tried this recently let me know.
A is wrong, it's C. docker logs is what you use to check STDOUT/STDERR of a running container. docker inspect gives details but not the I/O streams directly.
C
Official docs and doing labs help with these command questions.
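If the intended answer is docker logs, as argued above, typical usage looks like this (the container name is a placeholder):

# Show the captured STDOUT/STDERR of a specific container and follow new output.
docker logs --follow my-training-container

# For comparison, docker attach is the one that also wires up STDIN
# by attaching your terminal to the running container's streams.
docker attach my-training-container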
Q: 9
You are an administrator managing a large-scale Kubernetes-based GPU cluster using Run:AI.
To automate repetitive administrative tasks and efficiently manage resources across multiple nodes,
which of the following is essential when using the Run:AI Administrator CLI for environments where
automation or scripting is required?
Options
Discussion
C makes sense here, you need a kubeconfig with admin rights or the CLI just can't talk to all the nodes for automation. Pretty standard setup for K8s admin tools. If someone disagrees let me know, but I think that's spot on.
C. Saw this or something close in an exam report; an admin kubeconfig is required for scripting with runai-adm.
It's C, you definitely need the kubeconfig with admin rights before automating anything with the CLI.
Probably C; can’t automate or script across nodes with the Run:AI CLI unless you have a kubeconfig set with admin rights. B sounds tempting but doesn’t handle permissions. Seen similar confusion on practice tests.
C imo, since the CLI needs a kubeconfig with admin rights to hit the whole cluster. Without that config, automation with runai-adm just won’t work right. Pretty standard for Kubernetes tools, correct me if I’m missing something.
My vote is C. Admin kubeconfig is a must for cluster-wide automation with runai-adm CLI.
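For what it's worth, the practical prerequisite everyone is describing is just that the shell running runai-adm points at a kubeconfig with cluster-admin rights. A quick sanity check before scripting (the path is made up; kubectl auth can-i is standard kubectl):

# Point the session at the admin kubeconfig the CLI will use.
export KUBECONFIG=/path/to/admin-kubeconfig.yaml

# Verify the credentials really do have cluster-wide admin rights.
kubectl auth can-i '*' '*' --all-namespaces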
Q: 10
A system administrator of a high-performance computing (HPC) cluster that uses an InfiniBand fabric
for high-speed interconnects between nodes received reports from researchers that they are
experiencing unusually slow data transfer rates between two specific compute nodes. The system
administrator needs to ensure the path between these two nodes is optimal.
What command should be used?
Options
Discussion
A
Yeah, for just connectivity testing ibping (C) is quick, but here you need to see the actual path between nodes. ibtracert shows all the hops in the InfiniBand fabric which helps spot if routing or cabling is off. Pretty sure A fits best for troubleshooting slow routes. Correct me if I missed something.
A tbh, ibtracert actually shows the full route packets take between two InfiniBand nodes. Needed here to find any bottlenecks along the path. D (ibnetdiscover) is great for mapping, but doesn't trace actual traffic flow. Anyone see it differently?
Not sure, but I’d pick D here. ibnetdiscover sounds like it’d show how the nodes are connected, so you could spot any weird topology affecting traffic. Maybe not the most direct, but it gives a map at least. Agree?
Re D: ibnetdiscover only maps the fabric topology; it doesn't trace actual data paths like A does.
If the question wanted just a quick connectivity check instead of full path tracing, would C (ibping) be better than A?
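For reference, ibtracert takes a source and destination address (LIDs by default) and prints every hop along the route, which is how you'd spot a bad link or an unexpected path. The LID values below are made up; ibstat (or ibaddr) is the usual way to find each node's LID.

# On each node, find the local port's base LID.
ibstat

# Trace the InfiniBand path hop by hop from LID 4 to LID 17.
ibtracert 4 17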