Atlas 300T Training Card
Technical White Paper (Model 9000)
Issue: 09
Date: 2023-01-09
Manufacturer: HUAWEI TECHNOLOGIES CO., LTD.
About This Document
Purpose
This document describes the Atlas 300T training card (model 9000) in detail, including its appearance, performance parameters, configurations, and application scenarios.
Intended Audience
- Presales engineers
- Technical support engineers
- Maintenance engineers
Disclaimer
The technical specifications described in this document include but are not limited to parameters and performance indicators and vary depending on the actual release. This technical white paper does not constitute a commitment or guarantee on technical specifications of related products. Huawei may update relevant information from time to time. Huawei reserves the right to update or correct the information about related products or solutions. Updates are described in detail in the latest release notes or introduction.
Symbol Conventions
Symbol | Description |
---|---|
[DANGER] | Indicates a hazard with a high level of risk which, if not avoided, will result in death or serious injury. |
[WARNING] | Indicates a hazard with a medium level of risk which, if not avoided, could result in death or serious injury. |
[CAUTION] | Indicates a hazard with a low level of risk which, if not avoided, could result in minor or moderate injury. |
[NOTICE] | Indicates a potentially hazardous situation which, if not avoided, could result in equipment damage, data loss, performance deterioration, or unanticipated results. NOTICE is used to address practices not related to personal injury. |
[NOTE] | Supplements the important information in the main text. NOTE is used to address information not related to personal injury, equipment damage, and environment deterioration. |
Change History
Issue | Release Date | Description |
---|---|---|
09 | 2023-01-09 | This issue is the ninth official release. |
08 | 2022-07-15 | This issue is the eighth official release. Modified 3.1 Basic Specifications. |
07 | 2022-02-22 | This issue is the seventh official release. Updated 5.2 Out-of-Band Management. |
06 | 2021-12-24 | This issue is the sixth official release. Modified 1.1 Overview. |
05 | 2021-04-19 | This issue is the fifth official release. Modified 3.1 Basic Specifications. |
04 | 2020-12-10 | This issue is the fourth official release. Modified 1.1 Overview, 1.2 Front Panel, and 3.1 Basic Specifications. |
03 | 2020-10-10 | This issue is the third official release. Modified 3.1 Basic Specifications. |
02 | 2020-09-23 | This issue is the second official release. Modified 3.1 Basic Specifications. |
01 | 2020-06-10 | This issue is the first official release. |
1 Product Description
1.1 Overview
The Huawei Atlas 300T training card (model 9000) is an AI accelerator card that works with servers to provide powerful computing power for data centers. A single card provides up to 220 TFLOPS of FP16 computing power to accelerate deep learning training. The card features superior computing power, high integration, and high bandwidth, meeting the AI training requirements of the Internet, carrier, and finance industries and the computing power requirements of high-performance computing.
1.2 Front Panel
Textual description of Figure 1-1 Appearance: A visual representation of the Huawei Atlas 300T training card, showcasing its sleek, metallic casing with red accents.
Figure 1-2 shows the front panel of an Atlas 300T training card (model 9000). Table 1-1 describes the indicators on the front panel.
Textual description of Figure 1-2 Front panel: Diagram of the Atlas 300T training card's front panel, featuring multiple network ports (indicated by '+') and two sets of indicator lights labeled 'LINK/ACT' and 'SPEED', numbered 1 and 2.
No. | Silkscreen | Meaning | Color | State Description |
---|---|---|---|---|
1 | LINK/ACT indicator | Network port status indicator | Green | |
2 | SPEED indicator | | Green | |
[NOTE] Only indicators of group 1 on the left of the port are supported.
Figure 1-3 shows the port. Table 1-2 describes the port.
Textual description of Figure 1-3 Port: A detailed view of the network port area on the Atlas 300T training card, showing the QSFP-DD port and associated indicator lights.
Name | Type | Quantity | Description |
---|---|---|---|
QSFP-DD port | QSFP-DD | 1 | The current driver of a PCIe training card supports only one 100GE port. The capability of extending to two 100GE ports is reserved. |
1.3 System Architecture
Textual description of Figure 1-4 System architecture: A block diagram illustrating the internal architecture of the Atlas 300T training card. It highlights the Ascend 910 AI Processor as the core component, connected to DDR (ECC) memory, a PCIe 4.0 x16 interface, a QSFP-DD network interface, and power management. An Intelligent Baseboard Management Controller (iBMC) and an MCU are also depicted.
- As the core of the Atlas 300T training card (model 9000), the Ascend 910 AI Processor supports a 2-rank DDRC interface with a maximum rate of 2400 Mbit/s, and supports 64-bit DDR4 SDRAMs with a maximum capacity of 16 GB.
- The Intelligent Baseboard Management Controller (iBMC) obtains the PCB, BOM version, board temperature, power consumption, and power voltage information from the MCU.
- The Ascend 910 AI Processor is powered by a multi-phase power supply with a high energy efficiency ratio and Huawei-developed PSIP.
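As a rough worked example of the DDR figures in the first bullet, the following sketch estimates the theoretical peak DDR4 bandwidth. It assumes that the 2400 Mbit/s rate is a per-pin data rate and that a single 64-bit interface is active. The function name is illustrative only, and sustained bandwidth in practice would be lower than this upper bound.

```python
def peak_ddr_bandwidth_gb_s(rate_mbit_s_per_pin: float, bus_width_bits: int) -> float:
    """Theoretical peak bandwidth in GB/s (decimal) for one DDR interface.

    Assumption: every data pin transfers rate_mbit_s_per_pin megabits per second.
    """
    total_mbit_s = rate_mbit_s_per_pin * bus_width_bits  # all data pins together
    total_mbyte_s = total_mbit_s / 8                      # bits to bytes
    return total_mbyte_s / 1000                           # MB/s to GB/s


if __name__ == "__main__":
    # Figures quoted in section 1.3: 2400 Mbit/s, 64-bit DDR4 SDRAM.
    print(f"{peak_ddr_bandwidth_gb_s(2400, 64):.1f} GB/s")
```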
2 Features
2.1 Performance
- High Integration: Three-in-one integration of AI computing, general computing, and I/O capabilities. Thirty Huawei Da Vinci AI Cores, sixteen TaiShan cores, and one 100GE RoCE v2 NIC for processors.
- Supreme computing power: Thirty built-in Da Vinci AI Cores. Industry-leading 220 TFLOPS FP16 computing power.
- High-speed network bandwidth: PCIe 4.0 and 1 x 100GE RoCE high-speed interface, with a total egress bandwidth of 56.5 GB/s. 10-70% improvement in the efficiency of data training and gradient synchronization, without the need for external NICs.
2.2 Maintainability
- Supports in-band online upgrades to facilitate routine maintenance.
- Allows users to obtain device status information such as the temperature, voltage, and power consumption in in-band or out-of-band mode.
- Provides comprehensive command line management functions for users to perform routine device management by using various commands.
- Supports in-band and out-of-band asset management and provides such information as serial numbers to facilitate asset management.
2.3 Typical Application Scenarios
The Atlas 300T training card (model 9000) is typically used in man-machine interactions in an AI training scenario, as shown in Figure 2-1.
Textual description of Figure 2-1 Typical single-node application scenario: A diagram illustrating a typical AI training workflow involving an Algorithm engineer who interacts with an AI server. Equipment production personnel and a System administrator are also shown, managing and monitoring the system, indicating roles in deployment and operation.
- System administrator: uses the iBMC to manage devices in out-of-band mode, including OS installation, firmware upgrade, server system information query, and troubleshooting.
- Equipment production personnel: use the equipment system to interact with the iBMC (out-of-band) and OS (in-band).
- Algorithm engineers: use an AI framework such as TensorFlow to develop network models, debug training code, import training data sets, start training, observe the training process (including the loss trends of multiple iterations), and export trained models.
3 Specifications
3.1 Basic Specifications
Table 3-1 lists the basic specifications.
Item | Specifications |
---|---|
Form factor | FHFL dual-slot (10.5 inches) |
AI processor | Ascend 910 AI Processor; thirty Huawei Da Vinci AI Cores and sixteen TaiShan cores integrated |
Memory | |
AI computing power (a) | |
Encoding/Decoding capability | 16-channel 4K (or 64-channel 1080p) 60 FPS H.264/H.265 |
Virtual instance specifications | One Ascend AI Processor can be divided into several virtual NPUs in virtualization mode. Each virtual NPU supports 2, 4, 8, or 16 AI Cores, and other hardware resources (such as memory) are divided proportionally. |
PCIe port | PCIe x16 Gen4.0 |
PCI IDs | Vendor ID: 0x19E5; Device ID: 0xD801; Subsystem vendor ID: 0x0200; Subsystem device ID: 0x0100 |
Network | 1 x 100GE QSFP-DD port, supporting RoCE |
Power consumption | A maximum of 300 W |
Heat dissipation mode | Passive air cooling |
Dimensions (L x W x H) | 266.7 mm x 111.15 mm x 39.04 mm |
Weight | 1.2 kg |
OS | For details, see the Computing Product Compatibility Checker. |
a: stable, maximum computing power.
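The virtual instance row of Table 3-1 can be illustrated with a small sketch. It assumes that "divided proportionally" means each virtual NPU receives a share equal to its AI Core count divided by the 30 physical AI Cores, and that any combination of the listed sizes is allowed as long as the total does not exceed 30. Both points are assumptions, and the function names are hypothetical; the actual virtualization templates may differ.

```python
TOTAL_AI_CORES = 30                 # AI Cores per Ascend 910, from Table 3-1
VNPU_CORE_OPTIONS = (2, 4, 8, 16)   # AI Core counts allowed per virtual NPU


def vnpu_share(ai_cores: int, total_resource: float) -> float:
    """Share of a divisible resource (for example memory) for one virtual NPU.

    Assumption: the split is linear in the number of AI Cores.
    """
    if ai_cores not in VNPU_CORE_OPTIONS:
        raise ValueError(f"unsupported AI Core count: {ai_cores}")
    return total_resource * ai_cores / TOTAL_AI_CORES


def plan_fits(core_counts: list[int]) -> bool:
    """Check that a set of virtual NPUs uses valid sizes and at most 30 AI Cores."""
    valid = all(c in VNPU_CORE_OPTIONS for c in core_counts)
    return valid and sum(core_counts) <= TOTAL_AI_CORES


if __name__ == "__main__":
    print(plan_fits([16, 8, 4, 2]))   # 30 AI Cores in total
    print(plan_fits([16, 16]))        # 32 AI Cores, more than the chip has
```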
3.2 Environmental Specifications
Table 3-2 lists the hardware application environment conditions.
Item | Specifications |
---|---|
Temperature | |
Relative humidity | |
Maximum altitude | ≤ 3,050 m (10,006.56 ft) [NOTE] ASHRAE 2015 compliant: |
3.3 Clock Requirements
The Atlas 300T training card (model 9000) complies with PCI Express® Card Electromechanical Specification Revision 4.0. The entire card requires only the standard PCIe 4.0 clock, and the signal quality meets the PCIe specifications.
3.4 Hot Swap
The Atlas 300T training card (model 9000) supports neither orderly hot swap nor surprise hot swap.
3.5 Power Management
The Atlas 300T training card (model 9000) complies with PCI Express® Card Electromechanical Specification Revision 4.0. The maximum power consumption of the entire card is 300 W, which requires that the card slot provide a 5.5 A@12 V or 0.5 A@3.3 V standard power supply and the auxiliary power connector provide an 18.75 A@12 V power supply.
The pin definition of the auxiliary power connector is as follows.
No. | Signal Definition | Description |
---|---|---|
1 | GND | Grounded |
2 | GND | |
3 | GND | |
4 | GND | |
5 | 12 V | 12 V power cable |
6 | 12 V | |
7 | 12 V | |
8 | 12 V | |
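The current ratings in this section can be converted to power per rail with simple arithmetic, sketched below. The per-pin figure assumes that the auxiliary connector current is shared equally across its four 12 V pins; this is an assumption and is not stated in this document. The constant and function names are illustrative.

```python
# (volts, amps) pairs quoted in section 3.5
SLOT_12V = (12.0, 5.5)
SLOT_3V3 = (3.3, 0.5)
AUX_12V = (12.0, 18.75)
AUX_12V_PINS = 4  # pins 5 to 8 of the auxiliary power connector


def rail_power_w(volts: float, amps: float) -> float:
    """Power in watts that a rail delivers at its rated current."""
    return volts * amps


def per_pin_current_a(total_amps: float, pins: int) -> float:
    """Current per pin, assuming the load is shared equally (assumption)."""
    return total_amps / pins


if __name__ == "__main__":
    print("Slot 12 V rail:", rail_power_w(*SLOT_12V), "W")
    print("Auxiliary 12 V rail:", rail_power_w(*AUX_12V), "W")
    print("Auxiliary current per 12 V pin:",
          per_pin_current_a(AUX_12V[1], AUX_12V_PINS), "A")
```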
3.6 Heat Dissipation Specifications
3.6.1 Requirements
The Atlas 300T training card (model 9000) is used in an active heat dissipation environment with fans. It supports bidirectional air intake and air exhaust. The air volume must meet the heat dissipation requirements listed in Table 3-3.
Mean Temperature at the Air Intake Vent (°C) | Minimum Air Volume Required at the Air Intake Vent (CFM) | Pressure Drop (Pa) |
---|---|---|
25 | 15 | 68 |
30 | 16 | 178 |
35 | 19 | 225 |
40 | 23 | 279 |
45 | 29 | 341 |
[NOTE]
- The ambient temperature at the heat sink inlet refers to the mean temperature at the air intake vent.
- The required air volume is a recommended value. The air volume and temperature provided by different systems for the Atlas 300T training card (model 9000) may be different. Determine the air volume and temperature based on the actual system.
- When the Atlas 300T training card (model 9000) is powered on, the minimum air volume required for heat dissipation is 5.0 CFM.
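For intake temperatures that fall between the rows of Table 3-3, one possible approach is to interpolate linearly between the neighboring rows, as sketched below. The table lists only discrete points, so linear interpolation is an assumption, as is reusing the 25°C value for cooler air. The function name is hypothetical, and the result should be treated as an estimate to be validated against the actual system.

```python
from bisect import bisect_left

# (mean intake temperature in °C, minimum air volume in CFM) from Table 3-3
AIRFLOW_TABLE = [(25, 15), (30, 16), (35, 19), (40, 23), (45, 29)]


def min_airflow_cfm(inlet_temp_c: float) -> float:
    """Estimate the minimum air volume for a given mean intake temperature."""
    temps = [t for t, _ in AIRFLOW_TABLE]
    if inlet_temp_c > temps[-1]:
        raise ValueError("intake temperature is above the range of Table 3-3")
    if inlet_temp_c <= temps[0]:
        # Assumption: below the first row, keep the first row's requirement.
        return float(AIRFLOW_TABLE[0][1])
    i = bisect_left(temps, inlet_temp_c)        # first row at or above the input
    t0, f0 = AIRFLOW_TABLE[i - 1]
    t1, f1 = AIRFLOW_TABLE[i]
    return f0 + (f1 - f0) * (inlet_temp_c - t0) / (t1 - t0)


if __name__ == "__main__":
    for temp in (22, 30, 37.5, 45):
        print(temp, "°C:", min_airflow_cfm(temp), "CFM")
```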
3.6.2 Specifications
The air intake temperature supported by the Atlas 300T training card (model 9000) ranges from 5°C to 45°C. There is a temperature monitoring point inside the card. The Ascend 910 and storage chip can be monitored in real time in both in-band and out-of-band modes to ensure that the card temperature is lower than the specified threshold. See Table 3-4.
Specifications | Ascend 910 AI Core Temperature (°C) | Ascend 910 HBM Temperature (°C) |
---|---|---|
Power-off temperature | 115 | 105 |
Underclocking temperature | 105 | 95 |
Long-term operating temperature | ≤ 105 | ≤ 95 |
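The thresholds in Table 3-4 can be expressed as a simple classification, sketched below for monitoring scripts. The sketch assumes that a reading equal to a threshold already triggers the corresponding action; the table does not state whether the limits are inclusive. All names are illustrative, and the function only classifies readings that are passed in. It does not read any sensor.

```python
# Thresholds in °C from Table 3-4
CORE_LIMITS = {"underclocking": 105, "power-off": 115}
HBM_LIMITS = {"underclocking": 95, "power-off": 105}
SEVERITY = ["normal", "underclocking", "power-off"]  # least to most severe


def sensor_state(temp_c: float, limits: dict) -> str:
    """Classify one temperature reading against its thresholds."""
    if temp_c >= limits["power-off"]:
        return "power-off"
    if temp_c >= limits["underclocking"]:
        return "underclocking"
    return "normal"


def card_state(core_temp_c: float, hbm_temp_c: float) -> str:
    """Return the more severe of the AI Core and HBM states."""
    states = (sensor_state(core_temp_c, CORE_LIMITS),
              sensor_state(hbm_temp_c, HBM_LIMITS))
    return max(states, key=SEVERITY.index)


if __name__ == "__main__":
    print(card_state(80, 70))
    print(card_state(106, 70))
    print(card_state(100, 105))
```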
4 Hardware Compatibility
The Atlas 300T training card (model 9000) is compatible with the Atlas 800 inference server (model 3000) and the Atlas 800 inference server (model 3010).
5 Maintenance and Management
The Atlas 300T training card (model 9000) provides various maintenance and management functions, including in-band management command sets running in the OS and out-of-band management functions provided by the iBMC.
5.1 In-Band Management
- Online upgrade: supports online firmware upgrades to facilitate device maintenance.
- Device management: allows users to obtain device status information such as the temperature, voltage, and power consumption.
- Command line management: allows users to perform routine device management by using various commands.
- Asset management: provides information such as serial numbers to facilitate asset management. For details about how to manage assets, see the Atlas 300T Training Card npu-smi Command Reference (Model 9000).
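Before using the in-band functions above, it can be useful to confirm that the OS detects the card on the PCIe bus. The following sketch looks for the vendor and device IDs listed in Table 3-1 under the Linux sysfs PCI tree. It assumes a Linux host with the standard /sys/bus/pci/devices layout. The function name is hypothetical, and the sketch does not use or replace the npu-smi tool.

```python
from pathlib import Path

VENDOR_ID = 0x19E5  # from Table 3-1
DEVICE_ID = 0xD801  # from Table 3-1


def find_cards(sysfs_root: str = "/sys/bus/pci/devices") -> list[str]:
    """Return PCI addresses whose vendor and device IDs match Table 3-1."""
    matches = []
    root = Path(sysfs_root)
    if not root.is_dir():
        return matches
    for dev in sorted(root.iterdir()):
        try:
            # sysfs exposes the IDs as text such as "0x19e5"
            vendor = int((dev / "vendor").read_text().strip(), 16)
            device = int((dev / "device").read_text().strip(), 16)
        except (OSError, ValueError):
            continue  # skip entries that cannot be read or parsed
        if vendor == VENDOR_ID and device == DEVICE_ID:
            matches.append(dev.name)
    return matches


if __name__ == "__main__":
    cards = find_cards()
    print(cards if cards else "No matching PCI device found")
```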
5.2 Out-of-Band Management
The Atlas 300T training card (model 9000) provides the SMBus interface to support the out-of-band management of servers. The iBMC provides the out-of-band management function and asset information, and monitors the temperature, voltage, real-time power consumption, and chip status of the Atlas 300T training card (model 9000). In addition, the iBMC can manage alarms of the Atlas 300T training card (model 9000).
- For details about the out-of-band management functions of the Atlas 300T training card (model 9000), see the iBMC User Guide of the server you use.
- For details about alarms of the Atlas 300T training card (model 9000), see the iBMC Alarm Handling of the server you use.
6 Certifications
No. | Country/Region | Certification | Standard |
---|---|---|---|
1 | Europe | CE | Safety: EN IEC 63000:2018 |
2 | Australia | RCM EMC | |
3 | USA | FCC EMC | FCC CFR47 Part 15 Subpart B |
4 | Canada | ICES EMC | ICES-003 Issue 7: 2020; ICES Gen Issue 1: 2018 |
5 | UK | UKCA | Safety: BS EN 62368-1:2014+A11:2017 EMC: BS EN IEC 63000:2018 |
6 | Europe | RoHS | EN IEC 63000:2018 & BS EN IEC 63000:2018 |
7 | Europe | WEEE | 2012/19/EU |
8 | | Commodity Inspection | Refer to the product certification certificate. |
7 Warranty
For details, see Maintenance & Warranty.
A Acronyms and Abbreviations
Acronym | Full Name |
---|---|
AI | Artificial Intelligence |
iBMC | Intelligent Baseboard Management Controller |
CFM | Cubic Feet per Minute |
ECC | Error Checking and Correcting |
OS | Operating System |
PCIe | Peripheral Component Interconnect Express |
SMBus | System Management Bus |