NVIDIA System Management User Guide

Introduction to NVSM

NVIDIA® System Management (NVSM) is a software framework designed for monitoring NVIDIA DGX™ nodes within a data center environment. This guide details its capabilities, including active health monitoring, system alerts, and log generation for DGX Servers. For DGX Station, NVSM primarily facilitates health checks and diagnostic information retrieval via the command-line interface (CLI).

System administrators can also use NVSM as a standalone utility. The framework has three parts: the NVSM API services, the DGX System Health Monitors (DSHM), which track component health, and the NVSM CLI, which provides user interaction and control.

Key Features and Usage

This document explores various aspects of NVSM, including:

  • Verifying NVSM API Services
  • Configurable DSHM Features such as Health Monitor Alerts and Policies
  • Detailed DSHM Alert List with descriptions, IDs, severity, and recommended actions
  • Using the NVSM CLI, both interactively and non-interactively
  • Examining system health through various commands
  • System monitoring configuration, including email alerts and policy settings
  • Performing system management tasks
  • Configuring NVSM security
  • Utilizing NVSM Call Home functionality

Release Information

The guide covers Release 20.09 of NVSM, including details on bug fixes and known issues for this version.

Further Information

For additional details and support, users can refer to the official NVIDIA documentation available at www.nvidia.com.

File Info : application/pdf, 70 Pages, 340.29KB
