Introduction to NVIDIA Mission Control
This manual describes the NVIDIA Mission Control features integrated within NVIDIA Base Command Manager (BCM) version 11. It is intended for cluster administrators who install, configure, and manage these capabilities on NVIDIA B200 and GB200 platforms.
NVIDIA Mission Control extends BCM with Building Management System (BMS) integration, advanced leak detection, autonomous hardware recovery, NMX telemetry for NVLink monitoring, and rack management for DGX GB200 systems. It also covers power management and firmware updates.
For the latest documentation and support, NVIDIA recommends visiting NVIDIA Docs.
Key Features and Management
- NMX Settings for NVLink Monitoring: Configure and monitor NMX telemetry services for NVLinks and NVLink switches.
- Rack Management: Efficiently manage data center racks and their components, including nodes, switches, and power shelves, with commands like `rackoverview` and `display`.
- BCM Power Shelf Integration: Manage power shelves, including networking, access configuration, settings, metrics, and firmware updates.
- NVIDIA Autonomous Hardware Recovery: Automate hardware management to enhance cluster uptime.
- DGX GB200 Measurables: Access detailed metrics for DGX GB200 systems, covering circuit information, leak detection, NVLink, power, cooling, GPU performance, and Redfish data.
Support and Services
For technical assistance, contact NVIDIA support through their enterprise support portal: NVIDIA Enterprise Support.
Professional services can be explored via the NVIDIA Enterprise Services page.