
NVIDIA NCP-AIO NVIDIA AI Operations Exam Practice Test

Demo: 19 questions
Total: 66 questions

NVIDIA AI Operations Questions and Answers

Question 1

You are managing an on-premises cluster using NVIDIA Base Command Manager (BCM) and need to extend your computational resources into AWS when your local infrastructure reaches peak capacity.

What is the most effective way to configure cloudbursting in this scenario?

Options:

A. Use BCM's built-in load balancer to distribute workloads evenly between on-premises and cloud resources without any pre-configuration.

B. Manually provision additional cloud nodes in AWS when the on-premises cluster reaches its limit.

C. Set up a standby deployment in AWS and manually switch workloads to the cloud during peak times.

D. Use BCM's Cluster Extension feature to automatically provision AWS resources when local resources are exhausted.

Question 2

You are a Solutions Architect designing a data center infrastructure for a cloud-based AI application that requires high-performance networking, storage, and security. You need to choose a software framework to program the NVIDIA BlueField DPUs that will be used in the infrastructure. The framework must support the development of custom applications and services, as well as enable tailored solutions for specific workloads. Additionally, the framework should allow for the integration of storage services such as NVMe over Fabrics (NVMe-oF) and elastic block storage.

Which framework should you choose?

Options:

A. NVIDIA TensorRT

B. NVIDIA CUDA

C. NVIDIA Nsight

D. NVIDIA DOCA

Question 3

A new researcher needs access to GPU resources but should not have permission to modify cluster settings or manage other users.

What role should you assign them in Run:ai?

Options:

A. L1 Researcher

B. Department Administrator

C. Application Administrator

D. Research Manager

Question 4

A system administrator is troubleshooting a Docker container that crashes unexpectedly due to a segmentation fault. They want to generate and analyze core dumps to identify the root cause of the crash.

Why would generating core dumps be a critical step in troubleshooting this issue?

Options:

A. Core dumps prevent future crashes by stopping any further execution of the faulty process.

B. Core dumps provide real-time logs that can be used to monitor ongoing application performance.

C. Core dumps restore the process to its previous state, often fixing the error-causing crash.

D. Core dumps capture the memory state of the process at the time of the crash.
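
A minimal sketch of how core dumps for a crashing containerized process might be enabled and inspected, assuming a Linux host where the administrator can set the kernel core pattern; the image name, container name, and paths below are illustrative:

    # Allow unlimited core file size for the container (illustrative image/name)
    docker run --ulimit core=-1 --name crashing-app my-image:latest
    # On the host, write core files to a known location (path is an assumption)
    echo '/var/crash/core.%e.%p' | sudo tee /proc/sys/kernel/core_pattern
    # After the crash, load the dump with gdb against the matching binary
    gdb /path/to/app /var/crash/core.app.12345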

Question 5

An administrator wants to check if the BlueMan service can access the DPU.

How can this be done?

Options:

A. Via system logs

B. Via the DOCA Telemetry Service (DTS)

C. Via a lightweight database operating in the DPU server

D. Via Linux dump files

Question 6

You are setting up a Kubernetes cluster on NVIDIA DGX systems using BCM, and you need to initialize the control-plane nodes.

What is the most important step to take before initializing these nodes?

Options:

A. Set up a load balancer before initializing any control-plane node.

B. Disable swap on all control-plane nodes before initializing them.

C. Ensure that Docker is installed and running on all control-plane nodes.

D. Configure each control-plane node with its own external IP address.
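
A minimal sketch of disabling swap on a control-plane node before initialization, assuming kubeadm is used to bootstrap the Kubernetes control plane (the endpoint value is illustrative):

    # Turn swap off immediately
    sudo swapoff -a
    # Keep it off across reboots by commenting out swap entries in /etc/fstab
    sudo sed -i '/ swap / s/^/#/' /etc/fstab
    # Then initialize the control plane
    sudo kubeadm init --control-plane-endpoint "lb.example.com:6443"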

Question 7

A cloud engineer is looking to provision a virtual machine for machine learning using the NVIDIA Virtual Machine Image (VMI) and RAPIDS.

What technology stack will be set up automatically for the development team when the VMI is deployed?

Options:

A. Ubuntu Server, Docker-CE, NVIDIA Container Toolkit, CSP CLI, NGC CLI, NVIDIA Driver

B. CentOS, Docker-CE, NVIDIA Container Toolkit, CSP CLI, NGC CLI

C. Ubuntu Server, Docker-CE, NVIDIA Container Toolkit, CSP CLI, NGC CLI, NVIDIA Driver, RAPIDS

D. Ubuntu Server, Docker-CE, NVIDIA Container Toolkit, CSP CLI, NGC CLI

Question 8

A system administrator needs a tool that provides the following capabilities:

    GPU behavior monitoring

    GPU configuration management

    GPU policy oversight

    GPU health and diagnostics

    GPU accounting and process statistics

    NVSwitch configuration and monitoring

What single tool should be used?

Options:

A. nvidia-smi

B. CUDA Toolkit

C. DCGM

D. Nsight Systems
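
For context, a few representative DCGM invocations, assuming the dcgmi client and the DCGM host engine are installed on the node (the group ID is an assumption):

    # List GPUs and NVSwitches visible to DCGM
    dcgmi discovery -l
    # Check health of GPU group 0
    dcgmi health -g 0 -c
    # Run a quick diagnostic pass
    dcgmi diag -r 1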

Question 9

An administrator is troubleshooting a bottleneck in a deep learning training run and needs to ensure consistent data feed rates to the GPUs.

Which storage metric should be used?

Options:

A. Disk I/O operations per second (IOPS)

B. Disk free space

C. Sequential read speed

D. Disk utilization in performance manager
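
A minimal sketch of measuring sustained sequential read throughput, assuming fio is installed and /data is the dataset volume (both are assumptions):

    # 1 MiB sequential reads, bypassing the page cache, for 60 seconds
    fio --name=seqread --filename=/data/fio.testfile --rw=read --bs=1M \
        --size=4G --direct=1 --runtime=60 --time_based --group_reporting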

Question 10

What is the primary purpose of assigning a provisioning role to a node in NVIDIA Base Command Manager (BCM)?

Options:

A. To configure the node as a container orchestration manager

B. To enable the node to monitor GPU utilization across the cluster

C. To allow the node to manage software images and provision other nodes

D. To assign the node as a storage manager for certified storage

Question 11

You are an administrator managing a large-scale Kubernetes-based GPU cluster using Run:ai.

To automate repetitive administrative tasks and manage resources efficiently across multiple nodes, which of the following is essential when using the Run:ai Administrator CLI in environments where automation or scripting is required?

Options:

A. Use the runai-adm command to directly update Kubernetes nodes without requiring kubectl.

B. Use the CLI to manually allocate specific GPUs to individual jobs for better resource management.

C. Ensure that the Kubernetes configuration file is set up with cluster administrative rights before using the CLI.

D. Install the CLI on Windows machines to take advantage of its scripting capabilities.
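
A quick way to confirm that the kubeconfig in use carries cluster-administrative rights, assuming kubectl is configured on the same machine that runs the Run:ai Administrator CLI:

    # Show which cluster/context the CLI would operate against
    kubectl config current-context
    # Verify the active context is allowed to perform any action on any resource
    kubectl auth can-i '*' '*' --all-namespaces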

Question 12

A system administrator is troubleshooting a Docker container that is repeatedly failing to start. They want to gather more detailed information about the issue by generating debugging logs.

Why would generating debugging logs be an important step in resolving this issue?

Options:

A. Debugging logs disable other logging mechanisms, reducing noise in the output.

B. Debugging logs provide detailed insights into the Docker daemon's internal operations.

C. Debugging logs prevent the container from being removed after it stops, allowing for easier inspection.

D. Debugging logs fix issues related to container performance and resource allocation.
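
A minimal sketch of turning on daemon debug logging, assuming a systemd-managed Docker Engine and the default daemon.json location (merge the key into any existing configuration rather than overwriting it):

    # Enable debug output in /etc/docker/daemon.json
    echo '{ "debug": true }' | sudo tee /etc/docker/daemon.json
    sudo systemctl restart docker
    # Follow the daemon's detailed logs while reproducing the failure
    sudo journalctl -u docker.service -f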

Question 13

A data scientist is training a deep learning model and notices slower-than-expected training times. The data scientist alerts a system administrator, who suspects disk I/O is the cause.

What command should be used?

Options:

A. tcpdump

B. iostat

C. nvidia-smi

D. htop
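
A typical way to watch per-device disk I/O while the training job runs, assuming the sysstat package is installed:

    # Extended per-device statistics, refreshed every 2 seconds
    iostat -x 2
    # Optionally correlate with GPU utilization to confirm the GPUs are being starved
    nvidia-smi dmon -s u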

Question 14

A system administrator wants to run these two commands in Base Command Manager.

main showprofile

device status apc01

What command should the system administrator use from the management node system shell?

Options:

A. cmsh -c "main showprofile; device status apc01"

B. cmsh -p "main showprofile; device status apc01"

C. system -c "main showprofile; device status apc01"

D. cmsh-system -c "main showprofile; device status apc01"
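
For context, cmsh can run a semicolon-separated string of cluster-manager commands non-interactively via its -c flag; the object and command names in this illustrative invocation are assumptions:

    # Run two cmsh commands from the management node system shell
    cmsh -c "device list; category list"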

Question 15

You are configuring cloudbursting for your on-premises cluster using BCM, and you plan to extend the cluster into both AWS and Azure.

What is a key requirement for enabling cloudbursting across multiple cloud providers?

Options:

A. You only need to configure credentials for one cloud provider, as BCM will automatically replicate them across other providers.

B. You need to set up a single set of credentials that works across both AWS and Azure for seamless integration.

C. You must configure separate credentials for each cloud provider in BCM to enable their use in the cluster extension process.

D. BCM automatically detects and configures credentials for all supported cloud providers without requiring admin input.

Question 16

You are managing a high availability (HA) cluster that hosts mission-critical applications. One of the nodes in the cluster has failed, but the application remains available to users.

What mechanism is responsible for ensuring that the workload continues to run without interruption?

Options:

A. Load balancing across all nodes in the cluster.

B. Manual intervention by the system administrator to restart services.

C. The failover mechanism that automatically transfers workloads to a standby node.

D. Data replication between nodes to ensure data integrity.

Question 17

In a high availability (HA) cluster, you need to ensure that split-brain scenarios are avoided.

What is a common technique used to prevent split-brain in an HA cluster?

Options:

A. Configuring manual failover procedures for each node.

B. Using multiple load balancers to distribute traffic evenly across nodes.

C. Implementing a heartbeat network between cluster nodes to monitor their health.

D. Replicating data across all nodes in real time.

Question 18

You are managing multiple edge AI deployments using NVIDIA Fleet Command. You need to ensure that each AI application running on the same GPU is isolated from others to prevent interference.

Which feature of Fleet Command should you use to achieve this?

Options:

A. Remote Console

B. Secure NFS support

C. Multi-Instance GPU (MIG) support

D. Over-the-air updates
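
For background on the GPU-side mechanism, a sketch of enabling MIG with nvidia-smi, assuming a MIG-capable GPU (for example A100 or H100) and that the profile ID matches the installed GPU (the ID below is an assumption):

    # Enable MIG mode on GPU 0 (may require draining workloads and resetting the GPU)
    sudo nvidia-smi -i 0 -mig 1
    # Create two 3g.20gb GPU instances with their default compute instances
    sudo nvidia-smi mig -cgi 9,9 -C
    # List the resulting GPU instances
    sudo nvidia-smi mig -lgi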

Question 19

You are monitoring the resource utilization of a DGX SuperPOD cluster using NVIDIA Base Command Manager (BCM). The system is experiencing slow performance, and you need to identify the cause.

What is the most effective way to monitor GPU usage across nodes?

Options:

A. Check the job logs in Slurm for any errors related to resource requests.

B. Use the Base View dashboard to monitor GPU, CPU, and memory utilization in real-time.

C. Run the top command on each node to check CPU and memory usage.

D. Use nvidia-smi on each node to monitor GPU utilization manually.
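
For a per-node cross-check of whatever the dashboard reports, a hedged example of sampling GPU utilization on a single node:

    # Sample utilization and memory every 5 seconds in CSV form
    nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total --format=csv -l 5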
