Introduction to NVSM
NVIDIA® System Management (NVSM) is a software framework designed for monitoring NVIDIA DGX™ nodes within a data center environment. This guide details its capabilities, including active health monitoring, system alerts, and log generation for DGX Servers. For DGX Station, NVSM primarily facilitates health checks and diagnostic information retrieval via the command-line interface (CLI).
System administrators can use NVSM as a standalone utility. The framework has three parts: the NVSM API services, the DGX System Health Monitors (DSHM), which track component health, and the NVSM CLI, which provides user interaction and control.
Key Features and Usage
This guide covers the following topics:
- Verifying NVSM API Services
- Configuring DSHM features such as health monitor alerts and policies
- Reviewing the DSHM alert list, with descriptions, IDs, severities, and recommended actions
- Using the NVSM CLI, both interactively and non-interactively
- Examining system health through various commands
- Configuring system monitoring, including email alerts and policy settings
- Performing system management tasks
- Configuring NVSM security
- Using the NVSM Call Home functionality
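As an illustration of non-interactive CLI use, the following minimal Python sketch wraps a single NVSM command in a subprocess call. It assumes the `nvsm` binary is on the PATH, that the command needs root privileges via `sudo`, and that `show health` is the health summary command on your release; check these against your installed version. The helper names `build_nvsm_command` and `run_nvsm` are hypothetical and are not part of NVSM.

```python
import subprocess
from typing import Callable, List, Sequence, Tuple


def build_nvsm_command(args: Sequence[str], use_sudo: bool = True) -> List[str]:
    """Build the argument list for one non-interactive NVSM CLI call.

    Hypothetical helper: prefixes `sudo` on the assumption that the
    NVSM CLI must run with root privileges.
    """
    prefix = ["sudo"] if use_sudo else []
    return prefix + ["nvsm"] + list(args)


def run_nvsm(
    args: Sequence[str],
    runner: Callable = subprocess.run,
    use_sudo: bool = True,
) -> Tuple[int, str]:
    """Run one NVSM command and return (exit code, captured stdout).

    `runner` is injectable so the function can be exercised on a
    machine where NVSM is not installed.
    """
    cmd = build_nvsm_command(args, use_sudo)
    # check=False: leave interpretation of the exit code to the caller.
    result = runner(cmd, capture_output=True, text=True, check=False)
    return result.returncode, result.stdout
```

On a DGX system, `run_nvsm(["show", "health"])` would return the exit code and the text of the health summary, which a script could then log or forward. The sketch does not parse the output, because the output format is not described in this section.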
Release Information
This guide covers NVSM Release 20.09 and lists the bug fixes and known issues for that release.
Further Information
For additional details and support, see the official NVIDIA documentation at www.nvidia.com.