NVIDIA DGX OS Server Release 4.9
Release Notes and Update Guide
DA-08260-490_v01 | July 2021
Primary Changes in Release 4.9
The following are the primary new features of DGX OS Server Release 4.9 since Release 4.8:
- Updated the NVIDIA Release 418 GPU driver to 418.211.00
- Updated the NVIDIA Release 450 GPU driver to 450.142.00
Delivery and Update Mechanisms
Initial 4.9 Release
DGX OS Server Release 4.9, version 4.9.0, is provided as an ISO image, which is available from NVIDIA Enterprise Support in the event the server needs to be re-imaged. Version 4.9.0 is also provided as an "over-the-network" update, which requires an internet connection and the ability to access the NVIDIA public repositories.
Refer to the DGX-2 User Guide (https://docs.nvidia.com/dgx/dgx2-user-guide/index.html) and DGX-1 User Guide (https://docs.nvidia.com/dgx/dgx1-user-guide/index.html) for the following instructions:
- How to re-image the system with the ISO image
- How to install the software on air-gapped systems
Update Advisement
NVIDIA GPU Cloud Containers
In conjunction with DGX OS Server v4.9, customers should update their NVIDIA GPU Cloud containers to the latest container release.
Ubuntu Security Updates
Customers are responsible for keeping the DGX server up to date with the latest Ubuntu security updates using the 'apt full-upgrade' procedure. See the Ubuntu Wiki Upgrades web page for more information. Also, the Ubuntu Security Notice site lists known Common Vulnerabilities and Exposures (CVEs), including those that can be resolved by updating the DGX OS software.
Version History
This section lists the changes made in each released version of DGX OS Release 4.9. See DGX OS Server Software Content for the software component list and versions.
Version 4.9.0
- Initial Release 4.9 version.
- Changes since Version 4.8.0:
- Updated the NVIDIA Release 418 GPU driver to 418.211.00
- Updated the NVIDIA Release 450 GPU driver to 450.142.00
- Updated NVSM (for Release 450 driver package) to 20.09.33
- Updated DCGM (for Release 450 driver package) to 2.2.8
- Updated DGX-2 KVM image
- Updated NVIDIA Container Toolkit (nvidia-container-runtime – 3.3.0)
- Improved process for updating the ConnectX card firmware. If updating from DGX OS 4.8 or later, the firmware for all cards is now updated in parallel instead of one card at a time, significantly reducing the time to update all cards.
See the NVIDIA Deep Learning Frameworks documentation website (http://docs.nvidia.com/deeplearning/dgx/index.htm) for information on the latest container releases as well as https://docs.nvidia.com/deeplearning/dgx/user-guide/index.html for instructions on how to access them.
DGX OS Server Software Content
The following tables provide version information for software included in the DGX OS Server ISO image as well as software installed on the system after getting subsequent updates.
Package Versions in Version 4.9.0
The following table shows the version information for software included in the DGX OS Server version 4.9.0.
Component | Version (R418 package) | Version (R450 package) |
---|---|---|
GPU Driver | 418.211.00 (includes CUDA update to 10.1, if previously installed separately) | 450.142.00 (includes CUDA update to 11.0.3, if previously installed separately) |
Fabric Manager | N/A | 450.142.00 |
NVIDIA System Health Monitor (NVSM) | 20.03.6 | 20.09.33 |
Data Center GPU Management (DCGM) | 1.7.4 | 2.2.8 |
NVIDIA Container Toolkit | nvidia-container-runtime 3.3.0 | |
Ubuntu | 18.04.4 LTS | |
Ubuntu kernel | 4.15.0-147 (see the note on over-the-network updates below) | |
Docker Engine | 19.03.15 | |
Mellanox OFED | MLNX4.9-2.2.6.0 |
KVM Package Components (DGX-2 only)
Component | Version |
---|---|
dgx-kvm-sw | 19.07.0 |
dgx-kvm-host-utils | 21.01.0 |
dgx-kvm-host-conf | 20.12.0 |
dgx-kvm-image | dgx-kvm-image-4-9-0_4.9.0~210615-153549.0_amd64.deb |
If updating over-the-network, your kernel version may be a later version depending on when the update is performed.
DGX Server Firmware Version Reference
The Mellanox firmware is updated as part of the DGX OS update. The following are the updated versions for each product:
Product | Network Card | Version |
---|---|---|
NVIDIA DGX-1 | ConnectX-4 | 12.28.2006 |
NVIDIA DGX-1 | ConnectX-5 | 16.28.2006 |
NVIDIA DGX-2 | ConnectX-5 | 16.28.2006 |
NVIDIA DGX-2 | ConnectX-6 | 20.28.2006 |
For other firmware, see the DGX-2 System Firmware Update Container Version 20.10.7.2 and DGX-1 System Firmware Update Container Version 20.10.2.1 release notes for the corresponding firmware versions available at the time of this DGX OS release.
Updating the Software
These instructions explain how to update the DGX OS server software through an internet connection to the NVIDIA public repository. The process updates a DGX system image to the latest versions of the entire DGX software stack, including the drivers. Perform the updates using commands on the DGX server console.
Preparing for Updating the Software
Connecting to the DGX server Console
Connect to the DGX server console using either a direct connection or a remote connection through the BMC.
Note: SSH can be used to perform the update. However, if the Ethernet port is configured for DHCP, there is the potential that the IP address can change after the DGX server is rebooted during the update, resulting in loss of connection. If this happens, connect using either a direct connection or through the BMC to continue the update process.
Warning: Connect directly to the DGX server console if the DGX is connected to a 172.17.xx.xx subnet.
DGX OS Server software installs Docker CE which uses the 172.17.xx.xx subnet by default for Docker containers. If the DGX server is on the same subnet, you will not be able to establish a network connection to the DGX server.
Refer to the appropriate DGX-1 or DGX-2 User Guide for instructions on how to change the default Docker network settings after performing the update.
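Before starting, it can help to confirm whether the DGX server address falls in the default Docker range. The following is a minimal sketch, not part of the DGX OS software; the function name is hypothetical, and it assumes the address is a plain dotted IPv4 string.

```shell
# Succeeds (exit status 0) if the given IPv4 address is in 172.17.0.0/16,
# the default Docker container subnet described above.
# The function name is illustrative, not a DGX OS utility.
in_docker_default_subnet() {
    case "$1" in
        172.17.*.*) return 0 ;;
        *)          return 1 ;;
    esac
}
```

For example, `in_docker_default_subnet 172.17.4.20 && echo "Use the console or BMC"` prints the reminder, while an address such as 10.0.0.5 prints nothing.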
Direct Connection
- Connect a display to the VGA connector and a keyboard to any one of the USB ports.
- Power on the DGX server.
Remote Connection through the BMC
Refer to the appropriate user guide (DGX-1 or DGX-2) for instructions on establishing a remote connection to the BMC.
Verifying the DGX Server Connection to the Repositories
Before attempting to perform the update, verify that the DGX server network connection can access the public repositories and that the connection is not blocked by a firewall or proxy.
On DGX-1 Systems if Upgrading from Version 2.x.
Enter the following on the DGX-1 system.
$ wget -O f1-changelogs http://changelogs.ubuntu.com/meta-release-lts
$ wget -O f2-archive http://archive.ubuntu.com/ubuntu/dists/xenial/Release
$ wget -O f3-usarchive http://us.archive.ubuntu.com/ubuntu/dists/xenial/Release
$ wget -O f4-security http://security.ubuntu.com/ubuntu/dists/xenial/Release
$ wget -O f5-download https://download.docker.com/linux/ubuntu/dists/xenial/Release
$ wget -O f6-international http://international.download.nvidia.com/dgx/repos/dists/xenial/Release
All the wget commands should be successful and there should be six files in the directory with non-zero content.
On DGX-2 and DGX-1 Systems
Enter the following on the DGX system.
$ wget -O f1-changelogs http://changelogs.ubuntu.com/meta-release-lts
$ wget -O f2-archive http://archive.ubuntu.com/ubuntu/dists/bionic/Release
$ wget -O f3-usarchive http://us.archive.ubuntu.com/ubuntu/dists/bionic/Release
$ wget -O f4-security http://security.ubuntu.com/ubuntu/dists/bionic/Release
$ wget -O f5-international http://international.download.nvidia.com/dgx/repos/bionic/dists/bionic/Release
$ wget -O f6-international http://international.download.nvidia.com/dgx/repos/bionic/dists/bionic-r418+cuda10.1/Release
$ wget -O f7-international http://international.download.nvidia.com/dgx/repos/bionic/dists/bionic-r450+cuda11.0/Release
All the wget commands should be successful and there should be seven files in the directory with non-zero content.
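The check for non-zero content can be scripted. The following is a minimal sketch that assumes the files were saved with the f1- through f7- (or f1- through f6-) names used in the wget commands above; the function name is hypothetical.

```shell
# Counts the files named f<digit>-* in a directory that exist and are
# non-empty, and succeeds only if the count equals the expected number.
# $1 = directory holding the wget output, $2 = expected number of files
check_release_files() {
    dir=$1
    expected=$2
    found=0
    for f in "$dir"/f[0-9]-*; do
        # -s is true only for an existing file with a size greater than zero
        if [ -s "$f" ]; then
            found=$((found + 1))
        elif [ -e "$f" ]; then
            echo "empty download: $f" >&2
        fi
    done
    [ "$found" -eq "$expected" ]
}
```

For example, `check_release_files . 7 && echo "Repositories reachable"` reports success only when all seven downloads in the current directory have content.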
Performing the Updates
Update Path Instructions
Follow the instructions corresponding to your current DGX OS server software.
- Updating from Release 4.1 and later: Follow the instructions at Updating from Release 4.1 and later.
- Updating from Release 4.0 (Version 4.0.1 or later only): Follow the instructions at Updating from 4.0.1 (or Later).
- Updating from Release 3.1: Follow the instructions at Updating from Release 3.1.
- Updating from Release 2.x:
- Update from Release 2.x to the latest Release 3.1 as described in the DGX OS 3.1.8 Release Notes.
- Update from Release 3.1.
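The choice of update path can be expressed as a small lookup. The following is a minimal sketch; the function name and the labels it prints are hypothetical, and it assumes you already know the installed version string (where the system records that string is not covered here).

```shell
# Maps an installed DGX OS version (for example 4.0.1) to the update path
# listed above. The printed labels are illustrative only.
update_path_for() {
    case "$1" in
        4.0.0)      echo "reimage" ;;        # 4.0.0 must be re-imaged
        4.0.*)      echo "release-4.0" ;;    # version 4.0.1 or later
        4.[1-9]*)   echo "release-4.1+" ;;   # Release 4.1 and later
        3.1|3.1.*)  echo "release-3.1" ;;
        2.*)        echo "release-2.x" ;;    # update to 3.1 first, then from 3.1
        *)          echo "unknown" ;;
    esac
}
```

For example, `update_path_for 4.8.0` selects the Release 4.1 and later instructions, and `update_path_for 2.1.1` indicates the two-stage update through Release 3.1.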
Updating from Release 4.1 and Later
See the section Connecting to the DGX Console for guidance on connecting to the console to perform the update.
Caution: These instructions update all software for which updates are available from your configured software sources, including applications that you installed yourself. If you want to prevent an application from being updated, you can instruct the Ubuntu package manager to keep the current version. For more information, see Introduction to Holding Packages on the Ubuntu Community Help Wiki.
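Holding a package is typically done with the standard Ubuntu apt-mark tool. The following is a minimal sketch that only prints the commands for review and does not run them; the function name and the package name my-app are hypothetical.

```shell
# Prints, without running them, the apt-mark commands that would hold the
# named packages at their current versions, followed by the command that
# lists all packages currently on hold.
print_hold_commands() {
    for pkg in "$@"; do
        printf 'sudo apt-mark hold %s\n' "$pkg"
    done
    printf 'apt-mark showhold\n'
}
```

For example, `print_hold_commands my-app` prints the hold command for my-app. A held package can later be released with `sudo apt-mark unhold my-app`.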
Update Instructions
- If you have not already done so, verify that your DGX system can access the public repositories as explained in Verifying the DGX Server Connection to the Repositories.
- (Optional) Skip this step to stay with the R418 package; however, to move to the R450 package, issue the following.
$ sudo apt update
$ sudo apt install -y dgx-bionic-r450+cuda11.0-repo
- Update the list of available packages and their versions.
$ sudo apt update
- Review the packages that will be updated.
$ sudo apt full-upgrade -s
To prevent an application from being updated, instruct the Ubuntu package manager to keep the current version. See Introduction to Holding Packages.
- Upgrade to version 4.9.0.
$ sudo apt full-upgrade
Answer any questions that appear.
- Most questions require a Yes or No response. When asked to select the grub configuration to use, select the current one on the system.
- Other questions will depend on what other packages were installed before the update and how those packages interact with the update.
- If a message appears indicating that nvidia-docker.service failed to start, you can disregard it and continue with the next step. The service will start normally at that time.
- Reboot the system.
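The simulated upgrade output from the review step can be long. The following is a minimal sketch that reduces it to package names, assuming the simulation prints one line beginning with "Inst" for each package to be installed or upgraded; the function name is hypothetical.

```shell
# Reads apt simulation output on stdin and prints the name of every package
# that would be installed or upgraded (lines whose first field is "Inst").
list_simulated_upgrades() {
    awk '$1 == "Inst" { print $2 }'
}
```

For example, `sudo apt full-upgrade -s 2>/dev/null | list_simulated_upgrades` lists the affected packages one per line, which makes it easier to spot applications you may want to hold.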
Recovering from an Interrupted or Failed Update
If the script is interrupted during the update, such as from a loss of power or loss of network connection, then restore power or restore the network connection, whichever caused the interruption.
- If the system encounters a kernel panic after you restore power and reboot the DGX-2, you will not be able to perform the over-the-network update. You will need to re-image the DGX-2 with the latest image (see the DGX-2 User Guide for instructions) and then perform the network update.
- If you are successfully returned to the Linux command line, continue following the instructions from step 2 in the Updating from Release 4.1 and later update instructions.
Updating from 4.0.1 (or later)
For Release 4.0, only updates from versions 4.0.1 and later are supported with these instructions. To update from version 4.0.0, you must re-image the system.
See the section Connecting to the DGX Console for guidance on connecting to the console to perform the update.
Caution: These instructions update all software for which updates are available from your configured software sources, including applications that you installed yourself. If you want to prevent an application from being updated, you can instruct the Ubuntu package manager to keep the current version. For more information, see Introduction to Holding Packages on the Ubuntu Community Help Wiki.
Update Instructions
- If you have not already done so, verify that your DGX system can access the public repositories as explained in Verifying the DGX Server Connection to the Repositories.
- Update the list of available packages and their versions.
$ sudo apt update
- Install the 4.1.0 components from the repository.
$ sudo apt install -y dgx-bionic-r418+cuda10.1-repo
- (Optional) Skip this step to stay with the R418 package; however, to move to the R450 package, issue the following.
$ sudo apt install -y dgx-bionic-r450+cuda11.0-repo
- Update the new list of packages and their versions.
$ sudo apt update
- Review the packages that will be updated.
$ sudo apt full-upgrade -s
To prevent an application from being updated, instruct the Ubuntu package manager to keep the current version. See Introduction to Holding Packages.
- Upgrade to version 4.9.0.
$ sudo apt full-upgrade
Answer any questions that appear.
- Most questions require a Yes or No response. When asked to select the grub configuration to use, select the current one on the system.
- Other questions will depend on what other packages were installed before the update and how those packages interact with the update.
- If a message appears indicating that nvidia-docker.service failed to start, you can disregard it and continue with the next step. The service will start normally at that time.
- Reboot the system.
Recovering from an Interrupted or Failed Update
If the script is interrupted during the update, such as from a loss of power or loss of network connection, then restore power or restore the network connection, whichever caused the interruption.
- If the system encounters a kernel panic after you restore power and reboot the DGX-2, you will not be able to perform the over-the-network update. You will need to re-image the DGX-2 with the latest image (see the DGX-2 User Guide for instructions) and then perform the network update.
- If you are successfully returned to the Linux command line, continue following the instructions from step 2 in the Updating from Version 4.0.1 (or Later) update instructions.
Updating from 3.1.x
See the section Connecting to the DGX Console for guidance on connecting to the console to perform the update.
Caution: These instructions update all software for which updates are available from your configured software sources, including applications that you installed yourself. If you want to prevent an application from being updated, you can instruct the Ubuntu package manager to keep the current version. For more information, see Introduction to Holding Packages on the Ubuntu Community Help Wiki.
Update Instructions
- If you have not already done so, verify that your DGX-1 system can access the public repositories as explained in Verifying the DGX Server Connection to the Repositories.
- Update the list of available packages and their versions.
$ sudo apt update
- Install any updates.
$ sudo apt -y full-upgrade
- Install dgx-release-upgrade.
$ sudo apt install -y dgx-release-upgrade
- Begin the update process.
$ sudo dgx-release-upgrade
If you are using a proxy server, then add the -E option to sudo to keep your proxy environment variables.
Example:
$ sudo -E dgx-release-upgrade
- After starting the update process, respond to the presented options as follows:
- Press y if you are logged in to the DGX server remotely through secure shell (SSH) and are asked if you want to continue running under SSH.
Continue running under SSH?
This session appears to be running under ssh. It is not recommended to perform a upgrade over ssh currently because in case of failure it is harder to recover.
If you continue, an additional ssh daemon will be started at port '1022'.
Do you want to continue?
Continue [yN]
An additional sshd daemon is started.
Press Enter in response to the following message.
Starting additional sshd
To make recovery in case of failure easier, an additional sshd will be started on port '1022'. If anything goes wrong with the running ssh you can still connect to the additional one.
If you run a firewall, you may need to temporarily open this port. As this is potentially dangerous it's not done automatically. You can open the port with e.g.:
'iptables -I INPUT -p tcp --dport 1022 -j ACCEPT'
To continue please press [ENTER]
- Press Enter in response to the message warning you that third-party sources are disabled.
Third party sources disabled
Some third party entries in your sources.list were disabled. You can re-enable them after the upgrade with the 'software-properties' tool or your package manager.
To continue please press [ENTER]
- Press N if prompted about dgx.list configuration choices.
Configuration file '/etc/apt/sources.list.d/dgx.list'
==> Modified (by you or by a script) since installation.
==> Package distributor has shipped an updated version.
What would you like to do about it ? Your options are:
- Y or I : install the package maintainer's version
- N or O : keep your currently-installed version
- D : show the differences between the versions
- Z : start a shell to examine the situation
The default action is to keep your current version.
dgx.list (Y/I/N/O/D/Z) [default=N] ?
- When prompted to resolve other configuration files, evaluate the changes before accepting the package maintainer's version, keeping the local version, or manually resolving the difference. You are also asked to confirm that you want to remove obsolete packages.
- At the prompt to confirm starting the upgrade, press Y to begin.
Do you want to start the upgrade?
Installing the upgrade can take several hours. Once the download has finished, the process cannot be canceled.
Continue [yN] Details [d]
- Press Y to proceed with the final reboot.
System upgrade is complete.
Restart required
To finish the upgrade, a restart is required.
If you select 'y' the system will be restarted.
Continue [yN]
After this reboot, the update process will take several minutes to perform some final installation steps.
Your system is now updated to the latest DGX OS 4 release.
- (Optional) Follow the instructions at Updating from Release 4.1 and Later if you want to install the R450 driver package.
Known Issues
This chapter captures the issues related to the DGX OS software or DGX hardware at the time of the software release.
Known Software Issues
The following are known issues with the software.
- DCGM Service Labelled as Deprecated
- NVSM May Raise 'md1 is corrupted' Alert
- nvsm show health Reports Empty /proc/driver Folders
- NVSM Reports "Unknown" for Number of logical CPU cores on non-English system
- InfiniBand Bandwidth Drops for KVM Guest VMs
DCGM Service Labelled as Deprecated
Issue
When querying the status of dcgm.service, it is reported as deprecated.
$ sudo systemctl status dcgm.service
dcgm.service DEPRECATED. Please use nvidia-dcgm.service
Explanation
The message can be ignored. dcgm.service is, indeed, deprecated, but can still be used without issue. The name of the DCGM service is in the process of migrating from dcgm.service to nvidia-dcgm.service. During the transition, both are included in DCGM 2.2.8.
A later version of DGX OS 4 will enable nvidia-dcgm.service by default. You can enable nvidia-dcgm.service manually (even though there is no functional difference) as follows:
$ sudo systemctl stop dcgm.service
$ sudo systemctl disable dcgm.service
$ sudo systemctl start nvidia-dcgm.service
$ sudo systemctl enable nvidia-dcgm.service
NVSM May Raise 'md1 is corrupted' Alert
Issue
On a system where one OS drive is used for the EFI boot partition and one is used for the root file system (each configured as RAID 1), NVSM raises 'md1 is corrupted' alerts.
Explanation
The OS RAID 1 drives are running in a non-standard configuration, resulting in erroneous alert messages. If you alter the default configuration, you must let NVSM know so that the utility does not flag the configuration as an error, and so that NVSM can continue to monitor the health of the drives.
To configure NVSM to support a custom drive partitioning, perform the following.
- Stop NVSM services.
$ systemctl stop nvsm
- Edit /etc/nvsm/nvsm.config and set the "use_standard_config_storage" parameter to false.
"use_standard_config_storage":false
- Remove the NVSM database.
$ sudo rm /var/lib/nvsm/sqlite/nvsm.db
- Restart NVSM.
$ systemctl restart nvsm
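The configuration edit can also be made non-interactively. The following is a minimal sketch, assuming the parameter appears in the file on a single line as "use_standard_config_storage" followed by a colon and true; the function name is hypothetical. If the parameter is absent or already false, the file content is left unchanged.

```shell
# Sets "use_standard_config_storage" to false in the given config file.
# $1 = path to the config file (for NVSM, /etc/nvsm/nvsm.config)
# A backup of the original is kept next to the file with a .bak suffix.
set_custom_storage_config() {
    cfg=$1
    cp "$cfg" "$cfg.bak" || return 1
    sed 's/\("use_standard_config_storage"[[:space:]]*:[[:space:]]*\)true/\1false/' \
        "$cfg.bak" > "$cfg"
}
```

Editing the real file requires root privileges, and the edit still belongs between stopping and restarting NVSM as in the steps above.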
nvsm show health Reports Empty /proc/driver Folders
Issue
When issuing nvsm show health, the nvsmhealth_log.txt log file reports that the /proc/driver folders are empty.
Example from a DGX-1
2020-09-01 20:03:05,204 INFO: Found empty path glob "/proc/driver/nvidia/*/gpus/*/information"
2020-09-01 20:03:06,206 INFO: Found empty path glob "/proc/driver/nvidia/*/registry"
2020-09-01 20:03:09,742 INFO: Found empty path glob "/proc/driver/nvidia/*/params"
2020-09-01 20:03:10,743 INFO: Found empty path glob "/proc/driver/nvidia/*/registry"
2020-09-01 20:03:11,745 INFO: Found empty path glob "/proc/driver/nvidia/*/version"
2020-09-01 20:03:12,747 INFO: Found empty path glob "/proc/driver/nvidia/*/warnings/*"
Explanation
This is an erroneous message as the folder content is actually loaded during the software installation. The message can be ignored. This will be resolved in a future NVSM release.
NVSM Reports "Unknown" for Number of logical CPU cores on non-English system
Issue
On systems set up for a non-English locale, the nvsm show health command lists the number of logical CPU cores as Unknown.
Number of logical CPU cores [None] Unknown
Resolution
This issue will be resolved in a later version of the DGX OS software.
InfiniBand Bandwidth Drops for KVM Guest VMs
Issue
The InfiniBand bandwidth when running on multi-GPU guest VMs is lower than when running on bare metal.
Explanation
Currently, performance when using GPUDirect within a guest VM will be lower than when used on a bare-metal system.
Known DGX-2 System Issues
The following are known issues specific to the DGX-2 server.
DGX KVM: nvidia-vm health-check May Fail
Issue
When running nvidia-vm health-check to check the health of specific GPUs used by the DGX KVM guest VM, the command may fail.
Example:
$ sudo nvidia-vm health-check --gpu-count 1 --gpu-index 0 --fulltest run
ERROR: Unexpected response from blacklist "connection"
ERROR: Unexpected response from blacklist "to"
ERROR: Unexpected response from blacklist "the"
ERROR: Unexpected response from blacklist "host"
ERROR: Unexpected response from blacklist "engine"
ERROR: Unexpected response from blacklist "is"
ERROR: Unexpected response from blacklist "not"
ERROR: Unexpected response from blacklist "valid"
ERROR: Unexpected response from blacklist "any"
ERROR: Unexpected response from blacklist "longer"
ERROR: No healthy/unhealthy data returned from blacklist command
Explanation and Resolution
This occurs because the health-check VM is created from an image based on the DGX OS ISO, which uses the R418 driver package, but the host was updated to the R450 driver package. The two packages use different DCGM releases which cannot communicate with each other, resulting in the error.
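A mismatch of this kind can be anticipated by comparing driver branches before running the health check. The following is a minimal sketch with hypothetical function names; it treats the part of the version string before the first dot as the driver branch.

```shell
# Prints the driver branch of a version string: the part before the first
# dot (450.142.00 gives 450).
driver_branch() {
    echo "${1%%.*}"
}

# Succeeds when two driver versions belong to the same release branch.
same_driver_branch() {
    [ "$(driver_branch "$1")" = "$(driver_branch "$2")" ]
}
```

For example, `same_driver_branch 418.211.00 450.142.00 || echo "Driver branches differ"` prints the warning. The host driver version can be read with `nvidia-smi --query-gpu=driver_version --format=csv,noheader`.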
NVSM Does not Detect Downgraded GPU PCIe Link
Issue
If the GPU PCIe link is downgraded to Gen1, NVSM still reports the GPU health status as OK.
Explanation and Resolution
The NVSM software currently does not check for this condition. The check will be added in a future software release.
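Until NVSM performs this check, the link generation can be inspected manually. The following is a minimal sketch with a hypothetical function name; it assumes input lines of the form "index, current generation, maximum generation", as produced by the nvidia-smi query shown after the code.

```shell
# Reads lines of the form "index, current_gen, max_gen" on stdin and prints
# the index of every GPU whose current PCIe generation is below its maximum.
find_downgraded_links() {
    awk -F', *' '$2 + 0 < $3 + 0 { print $1 }'
}
```

For example, `nvidia-smi --query-gpu=index,pcie.link.gen.current,pcie.link.gen.max --format=csv,noheader | find_downgraded_links` lists the affected GPU indexes. An idle GPU may report a reduced link generation for power saving, so a result taken while the GPUs are under load is more meaningful.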
Known DGX-1 System Issues
The following are known issues specific to the DGX-1 server.
nvidia-nvswitch Version Mismatch Message Appears when Running DCGM
Issue
When starting the DCGM service, a version mismatch error message similar to the following will appear:
[78075.772392] nvidia-nvswitch: Version mismatch, kernel version 450.80.02 user version 450.51.06
Explanation
This occurs with GPU driver versions later than 450.51.06. The version check occurs on all DGX systems, but applies only to NVSwitch systems, so the message can be ignored on non-NVSwitch systems such as the DGX Station or DGX-1.
Forced Reboot Hangs the OS
Issue
When issuing reboot -f (forced reboot), I/O error messages appear on the console and then the system hangs.
The system reboots normally when issuing reboot.
Resolution
This issue will be resolved in a future version of the DGX OS server.
Known Issues Related to Ubuntu / Linux Kernel
The following are known issues related to the Ubuntu OS or the Linux kernel that affect the DGX server.
System May Slow Down When Using mpirun
Issue
Customers running Message Passing Interface (MPI) workloads may experience the OS becoming very slow to respond. When this occurs, a log message similar to the following would appear in the kernel log:
kernel BUG at /build/linux-fQ94TU/linux-4.4.0/fs/ext4/inode.c:1899!
Explanation
Due to the current design of the Linux kernel, the condition may be triggered when get_user_pages is used on a file that is on persistent storage. For example, this can happen when cudaHostRegister is used on a file path that is stored in an ext4 filesystem. DGX systems implement /tmp on a persistent ext4 filesystem.
Workaround
Note: If you performed this workaround on a previous DGX OS software version, you do not need to do it again after updating to the latest DGX OS version.
In order to avoid using persistent storage, MPI can be configured to use shared memory at /dev/shm (this is a temporary filesystem).
If you are using Open MPI, then you can solve the issue by configuring the Modular Component Architecture (MCA) parameters so that mpirun uses the temporary file system in memory.
For details on how to accomplish this, see the Knowledge Base Article DGX System Slows Down When Using mpirun (requires login to the NVIDIA Enterprise Support portal).
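To see which paths are backed by persistent storage, the mount table can be inspected. The following is a minimal sketch with a hypothetical function name; it reads input in /proc/mounts format and does not change any MPI settings.

```shell
# Reads a mount table in /proc/mounts format on stdin and prints the
# filesystem type of the given mount point, if it is listed.
# $1 = mount point, for example /dev/shm
# If a mount point is listed more than once, the last entry wins.
fstype_of_mount() {
    awk -v mp="$1" '$2 == mp { t = $3 } END { if (t != "") print t }'
}
```

For example, `fstype_of_mount /dev/shm < /proc/mounts` is expected to print tmpfs on a typical Linux system. A directory such as /tmp that is not a separate mount point produces no output, because it belongs to the filesystem of its parent mount.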
Known Limitations
This section lists known limitations and other issues that will not be fixed.
- [DGX-2] srp_daemon Causes NVIDIA KVM Update Failure
- [DGX-2] Hot-plugging of Storage NVMe Drives is not Supported
- [DGX-2] Serial Over LAN Does not Work After Cold Resetting the BMC
- [DGX-2] Some BMC Dashboard Quick Links Appear Erroneously
- [DGX-2] Applications Cannot be Run Immediately Upon Powering on the DGX-2
- [DGX-2] PKCS Errors Appear When the System Boots
- [DGX-2 KVM] Logfile Setup Error When Creating a VM
- [DGX-2 KVM] nvidia-vm vmshow Command Does Not Work for Running VMs
- [DGX-1] Script Cannot Recreate RAID Array After Re-inserting a Known Good SSD
[DGX-2] srp_daemon Causes NVIDIA KVM Update Failure
Issue
When performing an over-the-network update on the NVIDIA KVM, the update fails with a “Package mlnx-ofed-all is not configured yet” message.
The issue does not occur if you have installed the DGX OS from the ISO.
Explanation
This issue is the result of the srp_daemon within the Mellanox driver. The daemon is used to discover and connect to InfiniBand SCSI RDMA Protocol (SRP) targets.
If you are not using RDMA, then disable the srp_daemon as follows.
sudo systemctl disable srp_daemon.service
sudo systemctl disable srptools.service
[DGX-2] Hot-plugging of Storage NVMe Drives is not Supported
Issue
Hot-plugging or hot-swapping one of the storage non-volatile memory express (NVMe) drives might result in system instability or incorrect device reporting.
Workaround and Resolution
Turn off the system before removing and replacing any of the storage NVMe drives.
[DGX-2] Serial Over LAN Does not Work After Cold Resetting the BMC
Issue
After performing a cold reset on the BMC (ipmitool mc reset cold) while serial over LAN (SOL) is active, you cannot restart a SOL session.
Workaround
To reactivate SOL, either:
- Reboot the system, or
- Kill and then restart the process as follows.
a) Identify the Process ID of the SOL TTY process by running the following.
ps -ef | grep "/sbin/agetty -o -p \u --keep-baud 115200,38400,9600 ttyS0 vt220"
b) Kill the process.
kill <PID>
where <PID> is the Process ID returned by the previous command.
c) Either wait for the cron job to respawn the process or manually restart the process by running the following.
/sbin/agetty -o -p \u --keep-baud 115200,38400,9600 ttyS0 vt220
[DGX-2] Some BMC Dashboard Quick Links Appear Erroneously
Issue
On the BMC dashboard, the following Quick Links appear by mistake and should not be used.
- Maintenance->Firmware Update
- Settings->NvMeManagement->NvMe P3700Vpd Info
[DGX-2] Applications Cannot be Run Immediately Upon Powering on the DGX-2
Issue
When attempting to run an application that uses the GPUs immediately upon powering on the DGX-2 system, you may encounter the following error.
CUDA_ERROR_SYSTEM_NOT_READY
Explanation and Workaround
The DGX-2 uses a fabric manager service to manage communication between all the GPUs in the system. When the DGX-2 system is powered on, the fabric manager initializes all the GPUs. This can take approximately 45 seconds. Until the GPUs are initialized, applications that attempt to use them will fail.
If you encounter the error, wait and launch the application again.
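The wait-and-relaunch advice can be automated with a simple retry loop. The following is a minimal sketch; the function name is hypothetical, and it assumes the application exits with a non-zero status when the GPUs are not yet ready.

```shell
# Runs a command until it succeeds, up to a maximum number of attempts,
# sleeping between attempts. Returns 1 if every attempt fails.
# $1 = maximum attempts, $2 = seconds between attempts, rest = the command
retry_until_ready() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while ! "$@"; do
        if [ "$n" -ge "$attempts" ]; then
            return 1
        fi
        n=$((n + 1))
        sleep "$delay"
    done
    return 0
}
```

For example, `retry_until_ready 6 15 ./my_cuda_app` (where my_cuda_app is a placeholder) tries up to six times with 15 seconds between attempts, which covers the approximately 45-second initialization period with some margin.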
[DGX-2] PKCS Errors Appear When the System Boots
Issue
When the DGX system boots, "PKCS#7 signature not signed with a trusted key" messages appear on the console and in the system logs.
Explanation
DGX OS Server installs Ubuntu 18.04, which checks all kernel modules for signatures even though Secure Boot is not enabled. Since the NVIDIA drivers are not part of the Ubuntu kernel, the drivers will be flagged with the message when the system boots. This does not affect the system nor indicate a problem with system software.
[DGX-2 KVM] Logfile Setup Error When Creating a VM
Issue
The following error may appear while creating a VM:
..Error setting up logfile: No write access to directory
/home/$USER/.cache/virt-manager
Workaround
To avoid the error, remove the /home/$USER/.cache/virt-manager directory after installing KVM packages or before running the first nvidia-vm command.
[DGX-2 KVM] nvidia-vm vmshow Command Does Not Work for Running VMs
Issue
When running nvidia-vm vmshow, the information for running guest VMs is reported as "Unknown".
[DGX-1] Script Cannot Recreate RAID Array After Re-inserting a Known Good SSD
Issue
When a good SSD is removed from the DGX-1 RAID 0 array and then re-inserted, the script to recreate the array fails.
Explanation and Workaround
After the SSD is re-inserted into the system, the RAID controller sets the array to offline and marks the re-inserted SSD as Unconfigured_Bad (UBad). The script fails when attempting to rebuild an array in which one or more of the SSDs are marked UBad.
To recreate the array in this case,
- Set the drive back to a good state.
$ sudo /opt/MegaRAID/storcli/storcli /c0/e<enclosure_id>/s<drive_slot> set good
- Run the script to recreate the array.
$ sudo /usr/bin/configure_raid_array.py -c -f
Appendix A. Third Party License Notice
This NVIDIA product contains third party software that is being made available to you under their respective open source software licenses. Some of those licenses also require specific legal information to be included in the product. This section provides such information.
msecli
The msecli utility (https://www.micron.com/products/solid-state-storage/storage-executive-software) is provided under the following terms:
Micron Technology, Inc. Software License Agreement
PLEASE READ THIS LICENSE AGREEMENT ("AGREEMENT") FROM MICRON TECHNOLOGY, INC. ("MTI") CAREFULLY: BY INSTALLING, COPYING OR OTHERWISE USING THIS SOFTWARE AND ANY RELATED PRINTED MATERIALS ("SOFTWARE"), YOU ARE ACCEPTING AND AGREEING TO THE TERMS OF THIS AGREEMENT. IF YOU DO NOT AGREE WITH THE TERMS OF THIS AGREEMENT, DO NOT INSTALL THE SOFTWARE.
LICENSE: MTI hereby grants to you the following rights: You may use and make one backup copy the Software subject to the terms of this Agreement.
You must maintain all copyright notices on all copies of the Software.
You agree not to modify, adapt, decompile, reverse engineer, disassemble, or otherwise translate the Software. MTI may make changes to the Software at any time without notice to you.
In addition MTI is under no obligation whatsoever to update, maintain, or provide new versions or other support for the Software.
OWNERSHIP OF MATERIALS: You acknowledge and agree that the Software is proprietary property of MTI (and/or its licensors) and is protected by United States copyright law and international treaty provisions. Except as expressly provided herein, MTI does not grant any express or implied right to you under any patents, copyrights, trademarks, or trade secret information. You further acknowledge and agree that all right, title, and interest in and to the Software, including associated proprietary rights, are and shall remain with MTI (and/or its licensors). This Agreement does not convey to you an interest in or to the Software, but only a limited right to use and copy the Software in accordance with the terms of this Agreement. The Software is licensed to you and not sold.
DISCLAIMER OF WARRANTY: THE SOFTWARE IS PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND. MTI EXPRESSLY DISCLAIMS ALL WARRANTIES EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO, NONINFRINGEMENT OF THIRD PARTY RIGHTS, AND ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. MTI DOES NOT WARRANT THAT THE SOFTWARE WILL MEET YOUR REQUIREMENTS, OR THAT THE OPERATION OF THE SOFTWARE WILL BE UNINTERRUPTED OR ERROR-FREE. FURTHERMORE, MTI DOES NOT MAKE ANY REPRESENTATIONS REGARDING THE USE OR THE RESULTS OF THE USE OF THE SOFTWARE IN TERMS OF ITS CORRECTNESS, ACCURACY, RELIABILITY, OR OTHERWISE. THE ENTIRE RISK ARISING OUT OF USE OR PERFORMANCE OF THE SOFTWARE REMAINS WITH YOU. IN NO EVENT SHALL MTI, ITS AFFILIATED COMPANIES OR THEIR SUPPLIERS BE LIABLE FOR ANY DIRECT, INDIRECT, CONSEQUENTIAL, INCIDENTAL, OR SPECIAL DAMAGES (INCLUDING, WITHOUT LIMITATION, DAMAGES FOR LOSS OF PROFITS, BUSINESS INTERRUPTION, OR LOSS OF INFORMATION) ARISING OUT OF YOUR USE OF OR INABILITY TO USE THE SOFTWARE, EVEN IF MTI HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. Because some jurisdictions prohibit the exclusion or limitation of liability for consequential or incidental damages, the above limitation may not apply to you.
TERMINATION OF THIS LICENSE: MTI may terminate this license at any time if you are in breach of any of the terms of this Agreement. Upon termination, you will immediately destroy all copies the Software.
GENERAL: This Agreement constitutes the entire agreement between MTI and you regarding the subject matter hereof and supersedes all previous oral or written communications between the parties. This Agreement shall be governed by the laws of the State of Idaho without regard to its conflict of laws rules.
CONTACT: If you have any questions about the terms of this Agreement, please contact MTI's legal department at (208) 368-4500.
By proceeding with the installation of the Software, you agree to the terms of this Agreement. You must agree to the terms in order to install and use the Software.
Mellanox (OFED)
MLNX_OFED (http://www.mellanox.com/) is provided under the following terms:
Copyright (c) 2006 Mellanox Technologies. All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
- Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.