You are managing an on-premises cluster using NVIDIA Base Command Manager (BCM) and need to extend your computational resources into AWS when your local infrastructure reaches peak capacity.
What is the most effective way to configure cloudbursting in this scenario?
You are a Solutions Architect designing a data center infrastructure for a cloud-based AI application that requires high-performance networking, storage, and security. You need to choose a software framework to program the NVIDIA BlueField DPUs that will be used in the infrastructure. The framework must support the development of custom applications and services, as well as enable tailored solutions for specific workloads. Additionally, the framework should allow for the integration of storage services such as NVMe over Fabrics (NVMe-oF) and elastic block storage.
Which framework should you choose?
A new researcher needs access to GPU resources but should not have permission to modify cluster settings or manage other users.
What role should you assign them in Run:ai?
A system administrator is troubleshooting a Docker container that crashes unexpectedly due to a segmentation fault. They want to generate and analyze core dumps to identify the root cause of the crash.
Why would generating core dumps be a critical step in troubleshooting this issue?
An administrator wants to check if the BlueMan service can access the DPU.
How can this be done?
You are setting up a Kubernetes cluster on NVIDIA DGX systems using BCM, and you need to initialize the control-plane nodes.
What is the most important step to take before initializing these nodes?
A cloud engineer is looking to provision a virtual machine for machine learning using the NVIDIA Virtual Machine Image (VMI) and Rapids.
What technology stack will be set up for the development team automatically when the VMI is deployed?
A system administrator needs to collect the information below:
GPU behavior monitoring
GPU configuration management
GPU policy oversight
GPU health and diagnostics
GPU accounting and process statistics
NVSwitch configuration and monitoring
What single tool should be used?
An administrator is troubleshooting a bottleneck in a deep learning run time and needs consistent data feed rates to GPUs.
Which storage metric should be used?
What is the primary purpose of assigning a provisioning role to a node in NVIDIA Base Command Manager (BCM)?
You are an administrator managing a large-scale Kubernetes-based GPU cluster using Run:AI.
To automate repetitive administrative tasks and efficiently manage resources across multiple nodes, which of the following is essential when using the Run:AI Administrator CLI for environments where automation or scripting is required?
A system administrator is troubleshooting a Docker container that is repeatedly failing to start. They want to gather more detailed information about the issue by generating debugging logs.
Why would generating debugging logs be an important step in resolving this issue?
A data scientist is training a deep learning model and notices slower than expected training times. The data scientist alerts a system administrator to inspect the issue. The system administrator suspects the disk IO is the issue.
What command should be used?
A system administrator wants to run these two commands in Base Command Manager.
main
showprofile device status apc01
What command should the system administrator use from the management node system shell?
You are configuring cloudbursting for your on-premises cluster using BCM, and you plan to extend the cluster into both AWS and Azure.
What is a key requirement for enabling cloudbursting across multiple cloud providers?
You are managing a high availability (HA) cluster that hosts mission-critical applications. One of the nodes in the cluster has failed, but the application remains available to users.
What mechanism is responsible for ensuring that the workload continues to run without interruption?
In a high availability (HA) cluster, you need to ensure that split-brain scenarios are avoided.
What is a common technique used to prevent split-brain in an HA cluster?
You are managing multiple edge AI deployments using NVIDIA Fleet Command. You need to ensure that each AI application running on the same GPU is isolated from others to prevent interference.
Which feature of Fleet Command should you use to achieve this?
You are monitoring the resource utilization of a DGX SuperPOD cluster using NVIDIA Base Command Manager (BCM). The system is experiencing slow performance, and you need to identify the cause.
What is the most effective way to monitor GPU usage across nodes?