DELL Technologies XE9680L Featuring AI Factory Rack Scale Architecture
Abstract
This technical whitepaper provides information about the iDRAC telemetry that is collected by OpenManage Enterprise and forwarded to AIOps Observability (formerly known as CloudIQ). The iDRAC telemetry feature is enabled by installing the AIOps plugin in OpenManage Enterprise. This enables AIOps Observability customers to view and report on various metrics associated with the system (for example, power, thermal, and utilization) and with various components (for example, Networking, Storage, and Graphics Processing Unit (GPU)) in a PowerEdge server.
December 2024
Revisions
Date | Description |
November 2021 | Initial release |
December 2024 | Updated release |
Acknowledgments
Authors:
- Muralidhar Kolli, Software Principal Engineer, Enterprise Systems Management
- Vijayasimha Naga, Software Senior Principal Engineer, Enterprise Systems Management
- Sudhir Shetty, Distinguished Engineer, Enterprise Systems Management
- Mahantesh Tippimath, Software Principal Engineer, Enterprise Systems Management
- Support: Mansi Manocha, Content Engineer 2
The information in this publication is provided “as is.” Dell Inc. makes no representations or warranties of any kind with respect to the information in this publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose.
Use, copying, and distribution of any software described in this publication requires an applicable software license.
This document may contain certain words that are not consistent with Dell’s current language guidelines. Dell plans to update the document over subsequent future releases to revise these words accordingly.
This document may contain language from third party content that is not under Dell’s control and is not consistent with Dell’s current guidelines for Dell’s own content. When such third party content is updated by the relevant third parties, this document will be revised accordingly.
Copyright © 2024 Dell Inc. or its subsidiaries. All Rights Reserved. Dell Technologies, Dell, and other trademarks are trademarks of Dell Inc. or its subsidiaries. Other trademarks may be trademarks of their respective owners. [12/20/2024] [Technical Whitepaper] [646]
Executive summary
OpenManage Enterprise (OME) is a management console that supports discovery and management of data center devices such as servers, storage, and network devices. It also retrieves telemetry data from supported devices to provide a consolidated view of their performance, efficiency, and utilization. This technical whitepaper provides an overview of the AIOps Observability plugin and the underlying OpenManage Enterprise infrastructure that collects metrics from PowerEdge servers and periodically sends the collected telemetry data to AIOps Observability.
AIOps Observability Overview
Dell PowerEdge servers, equipped with iDRAC, provide a variety of metrics using out-of-band management interfaces like WS-Man and Redfish. OpenManage Enterprise uses the iDRAC APIs to collect metrics for the PowerEdge servers and sends the collected data to AIOps Observability using Dell connectivity service.
The following diagram illustrates the flow of metrics from PowerEdge servers to AIOps Observability (formerly known as CloudIQ) using OpenManage Enterprise:
The different components in OpenManage Enterprise that facilitate this functionality are described here:
- AIOps Observability Plugin: This plugin enables OpenManage Enterprise to make the necessary configurations, as follows:
  - Define the groups of PowerEdge servers to be monitored in AIOps Observability.
  - Configure secure connectivity with AIOps Observability using Dell Connectivity Service.
- Metrics Collection Service: This service manages the periodic tasks that are responsible for:
  - Configuring iDRAC on the selected servers to generate and report metrics.
  - Collecting metrics from the selected servers using WS-Man or Redfish.
- Data Forwarding Service: This service manages the periodic tasks that are responsible for:
  - Forwarding the collected metrics to AIOps Observability.
  - Forwarding inventory, health, and alert information to AIOps Observability.
Prerequisites
iDRAC must have one of the following licenses installed for metrics to be collected using OpenManage Enterprise:
- Enterprise License
- OpenManage Enterprise Advanced License
- Datacenter License
For more information about the list of metrics for these licenses, see Appendix.
Configure the AIOps Observability plugin in OpenManage Enterprise
- The AIOps Observability plugin must be installed in OpenManage Enterprise and be in the active state before PowerEdge servers can be registered for monitoring and before OpenManage Enterprise connectivity with AIOps Observability can be configured.
- A connection must be established with AIOps Observability using the Dell connectivity service. The connection status must remain Connected to maintain a continuous flow of metrics to AIOps Observability.
- After OpenManage Enterprise is configured with the AIOps Observability plugin, add one or more server groups to the AIOps Observability managed groups. This allows the metrics collection service in OpenManage Enterprise to begin collecting metrics from the servers in these managed groups. For servers that support advanced Redfish Telemetry, the metrics collection service applies additional configuration on iDRAC so that metric reports are generated at specific intervals.
- The Metrics Collection Service runs a periodic task every 15 minutes to collect metrics from the registered PowerEdge servers. For more information about the complete list of metrics and the associated collection intervals, see Appendix. The collected metrics are saved in a time-series database.
- The Data Forwarding Service runs a task that continuously reads metric records from the time-series database and transfers them in a compressed format to AIOps Observability. After the data is transferred successfully, it is purged from the OpenManage Enterprise database.
- Note: If the telemetry feature is disabled in iDRAC manually, or if the metric report definition provisioned by OpenManage Enterprise on an individual server is deleted, data is temporarily lost. However, the metrics collection service identifies the condition and reconfigures iDRAC to resume the generation of metrics. For more information, see the Troubleshooting section.
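The store, compress, transfer, and purge sequence described above can be sketched as follows. This is a simplified illustration under stated assumptions: an in-memory SQLite table stands in for the time-series database, gzip-compressed JSON stands in for the compressed format, and send is a hypothetical callable that returns True on a successful transfer. None of these choices are confirmed details of OpenManage Enterprise.

```python
import gzip
import json
import sqlite3


def open_store():
    """Create an in-memory table that stands in for the time-series database."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE metrics ("
        "id INTEGER PRIMARY KEY, server TEXT, metric TEXT, ts TEXT, value REAL)"
    )
    return db


def save_samples(db, samples):
    """Persist (server, metric, timestamp, value) tuples from a collection cycle."""
    db.executemany(
        "INSERT INTO metrics (server, metric, ts, value) VALUES (?, ?, ?, ?)",
        samples,
    )
    db.commit()


def forward_batch(db, send, batch_size=1000):
    """Compress the oldest records, send them, and purge them only on success."""
    rows = db.execute(
        "SELECT id, server, metric, ts, value FROM metrics ORDER BY id LIMIT ?",
        (batch_size,),
    ).fetchall()
    if not rows:
        return 0
    records = [
        {"server": r[1], "metric": r[2], "ts": r[3], "value": r[4]} for r in rows
    ]
    payload = gzip.compress(json.dumps(records).encode("utf-8"))
    if not send(payload):
        return 0  # keep the rows so the backlog is retried on a later run
    db.executemany("DELETE FROM metrics WHERE id = ?", [(r[0],) for r in rows])
    db.commit()
    return len(rows)
```

Purging only after a successful send means that records accumulated during a connection outage remain available for later transfer.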
Monitor metric collection status in OpenManage Enterprise
Monitor the overall status of metrics collection in the OpenManage Enterprise user interface (UI) by following these steps:
- Click Home > Monitor > Jobs.
- Apply the filter by selecting the Job Type as Metrics_Task and the Source as System generated.
- Click View Details to view the individual status of the last few metrics collection cycles. This provides information about the time taken for each cycle and whether the cycle completed successfully for all the servers.
- To see the status of metric collection for each individual server within a cycle, click the individual rows. This provides information about the time taken to collect all the supported metrics for the server and a summary of the number of metrics collected.
The number of metric samples collected for each server varies depending on the licenses installed on the device and its hardware inventory.
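The same filter can be applied in a script once job records have been exported. The sketch below works on a list of dictionaries. The field names JobName, JobType, and LastRunStatus are assumptions about the shape of such records and should be checked against the actual export or API response. Only the Metrics_Task job type comes from the steps above.

```python
def failed_metric_cycles(jobs, job_type="Metrics_Task"):
    """Return names of jobs of the given type whose last run did not complete."""
    failed = []
    for job in jobs:
        type_name = job.get("JobType", {}).get("Name")
        status = job.get("LastRunStatus", {}).get("Name")
        if type_name == job_type and status != "Completed":
            failed.append(job.get("JobName", "<unnamed>"))
    return failed


# Hypothetical records used only to exercise the function.
SAMPLE_JOBS = [
    {"JobName": "Metrics cycle A", "JobType": {"Name": "Metrics_Task"},
     "LastRunStatus": {"Name": "Completed"}},
    {"JobName": "Metrics cycle B", "JobType": {"Name": "Metrics_Task"},
     "LastRunStatus": {"Name": "Failed"}},
    {"JobName": "Inventory refresh", "JobType": {"Name": "Inventory_Task"},
     "LastRunStatus": {"Name": "Failed"}},
]

if __name__ == "__main__":
    print(failed_metric_cycles(SAMPLE_JOBS))
```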
Monitor metric data transfer to AIOps Observability in OpenManage Enterprise
Monitor the status of metric data transfer to AIOps Observability on OpenManage Enterprise by performing the following:
- Click Home > Plugins > AIOps Observability > Transfer Log.
- Apply the filter by selecting Category as the telemetry type.
- Click individual rows to view the status of each metric transfer.
This provides information about the name of the compressed file, its size, and the time taken to transfer the file successfully.
View metrics on AIOps Observability
The Overview page on AIOps Observability UI displays a consolidated view of the systems that includes PowerEdge servers that are monitored using OpenManage Enterprise.
- Click Overview > Performance > System Performance.
- Click Server to view the Thermal & System utilization metric summary of individual servers.
- To view the detailed individual performance metrics, click one of the servers. This provides graphs for each of the metrics with a summary of the average, minimum, and maximum values reached in the last 24 hours. It also provides links to the related hardware inventory (for example, processors and memory).
- To create custom reports for other metrics, click Overview Page > Reports > Report Browser.
- Click Add Content and select the custom report.
- From the Product list, select the PowerEdge option. This provides the list of metric categories.
- Select either the Line Chart or the Table format for the graphical representation of the custom report.
Note: Not all metrics support both the Line Chart and Table view formats. For more information, see the online help documentation on the Dell AIOps Observability portal.
- Select a category from the System list. This displays the list of related metrics that can be selected, along with the applicable components, to generate a custom report.
- Based on the number of metrics and components selected, one or more graphs appear under the reports.
Troubleshooting
Data transfer failures
If the system performance details or other metrics are missing on Dell AIOps Observability portal for any of the PowerEdge servers that are monitored using OpenManage Enterprise, ensure that you perform the following steps:
- Verify that the criteria described in Configure the AIOps Observability plugin in OpenManage Enterprise are met.
- Check the connection status of OpenManage Enterprise with AIOps Observability.
- If the status is shown as Disconnected and is in Amber or Red for an extended period, contact technical support.
- Check the transfer logs as described in monitoring metric data transfer to AIOps Observability in the OpenManage Enterprise User’s guide available on the Dell support site.
If the errors are because of connection failure or intermittent service failures, then after restoration, ensure that you perform subsequent metric transfers to transmit the accumulated backlog of data.
Metrics collection failures
If the status of the connection between OpenManage Enterprise and AIOps Observability is successful, and there are no failures in data transfer, then there could potentially be failures in metric collection. The scenarios and recommended actions are described as follows:
- Scenario 1: Connection failure
Recommended Action: Ensure that the required device is powered on and detectable.
- Scenario 2: Authentication failure
Recommended Action: Re-run the discovery of PowerEdge servers in OpenManage Enterprise using the latest iDRAC credentials.
- Scenario 3: Missing, invalid, or expired license
Recommended Action: Reload the valid license. For more information about licenses, see the Appendix.
- Scenario 4: Incomplete metric retrieval
Recommended Action: When a server that supports Redfish Telemetry is newly registered for metric collection in OpenManage Enterprise, the metrics collection service configures metric report definitions on iDRAC. If a metric report definition is deleted manually on the server, OpenManage Enterprise metric collection reports errors. However, the basic metrics available using WS-Man can still be retrieved during that cycle. OpenManage Enterprise tries to automatically re-provision the metric report definitions. If successful, the next metrics collection cycle should retrieve the full set of metrics.
- Scenario 5: Telemetry disabled
Recommended Action: If the Telemetry feature is disabled in iDRAC because of a factory reset, or is manually disabled through direct access to iDRAC, OpenManage Enterprise reports errors. In such cases, OpenManage Enterprise automatically enables Telemetry. If successful, the next metrics collection cycle should run without errors.
- Scenario 6: iSM metrics not seen on AIOps Observability
Recommended Actions:
  - Remove the individual server from the AIOps plugin monitored groups on the associated OpenManage Enterprise, and then add it again.
  - If the EnableMetricInjection option is disabled, enable it using configuration compliance.
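Scenarios 4 and 5 can also be checked directly against the Redfish telemetry resources on a server. The sketch below inspects two JSON documents that have already been fetched and parsed: the TelemetryService resource, whose ServiceEnabled property is part of the DMTF Redfish schema, and the MetricReportDefinitions collection, whose Members entries carry @odata.id links. The function name and the expected_ids parameter are hypothetical, and the names of the definitions that OpenManage Enterprise provisions are not stated in this paper, so they must be supplied by the caller.

```python
def telemetry_findings(service_doc, definitions_doc, expected_ids):
    """Report disabled telemetry and missing metric report definitions."""
    findings = []
    if not service_doc.get("ServiceEnabled", False):
        findings.append("Telemetry disabled (Scenario 5)")
    # Take the last path segment of each member link as the definition name.
    present = {
        member.get("@odata.id", "").rstrip("/").rsplit("/", 1)[-1]
        for member in definitions_doc.get("Members", [])
    }
    missing = sorted(set(expected_ids) - present)
    if missing:
        findings.append(
            "Missing metric report definitions (Scenario 4): " + ", ".join(missing)
        )
    return findings


if __name__ == "__main__":
    service = {"ServiceEnabled": False}
    definitions = {"Members": [
        {"@odata.id": "/redfish/v1/TelemetryService/MetricReportDefinitions/CPUSensor"},
    ]}
    for finding in telemetry_findings(service, definitions, ["CPUSensor", "NICStatistics"]):
        print(finding)
```

An empty result suggests that neither condition applies, in which case the remaining scenarios are the more likely cause.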
Technical support and resources
- iDRAC whitepapers about Redfish Telemetry
https://downloads.dell.com/manuals/common/dell-emc-idrac9-telemetry-streaming-basics.pdf
https://downloads.dell.com/manuals/common/dell-emc-idrac9-telemetry-streaming-performance-report.pdf
- iDRAC User Guides and other manuals
http://www.dell.com/idracmanuals
- OpenManage Enterprise User’s Guide
https://www.dell.com/support/home/en-us/product-support/product/dell-openmanage-enterprise/docs
- OpenManage Enterprise AIOps Observability Plugin User’s Guide
Support for OpenManage Enterprise APEX AIOps Observability
- AIOps Observability whitepaper
https://www.delltechnologies.com/asset/en-us/products/storage/industry-market/h15691-emc-AIOpsObservability-overview.pdf
- Dell Technical Support
http://www.dell.com/support
Appendix
Licenses and metrics for PowerEdge servers in AIOps Observability
iDRAC License Type | iDRAC firmware | OpenManage Enterprise License Type | Basic Metrics* | Advanced Metrics** |
Enterprise | 13G PowerEdge servers with iDRAC8 2.75 or later. 14G PowerEdge servers with iDRAC9 3.34 to 4.40.00. 14G, 15G, or 16G PowerEdge servers with iDRAC9 4.40.10 or later. | No license required | Yes | No |
Basic, Express, or Enterprise | 13G PowerEdge servers with iDRAC8 2.75 or later. 14G PowerEdge servers with iDRAC9 3.34 to 4.40.00. 14G, 15G, or 16G PowerEdge servers with iDRAC9 4.40.10 or later*** | OpenManage Enterprise Advanced | Yes | No |
Data Center | 14G, 15G, or 16G PowerEdge servers with iDRAC9 4.40.10 or later | No license required | Yes | Yes |
* Basic Metrics include Power, Thermal, and Central Processing Unit (CPU). 15G PowerEdge servers have different Basic Metrics depending on whether the processor is AMD or Intel:
- Intel model Basic Metrics include Power, Thermal, CPU, Input/Output (IO), and Memory utilization.
- AMD model Basic Metrics include Power, Thermal, and CPU.
** Advanced Metrics include Network Interface Card (NIC), Fibre Channel, Graphics Processing Unit (GPU), and Storage.
*** Basic metrics using Redfish.
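The license table above can be read as a small decision rule. The following sketch encodes only the license columns and deliberately ignores the server generation and iDRAC firmware constraints, which would need separate checks. The function name metric_tiers and its parameters are illustrative assumptions.

```python
def metric_tiers(idrac_license, ome_advanced=False):
    """Map an iDRAC license (plus optional OME Advanced license) to metric tiers."""
    lic = idrac_license.strip().lower()
    if lic in ("data center", "datacenter"):
        # Data Center: basic and advanced metrics, no OME license required.
        return {"basic": True, "advanced": True}
    if lic == "enterprise":
        # Enterprise: basic metrics, no OME license required.
        return {"basic": True, "advanced": False}
    if lic in ("basic", "express") and ome_advanced:
        # Basic or Express qualifies only with OpenManage Enterprise Advanced.
        return {"basic": True, "advanced": False}
    return {"basic": False, "advanced": False}


if __name__ == "__main__":
    for name, ome in (("Data Center", False), ("Enterprise", False),
                      ("Express", True), ("Express", False)):
        print(name, ome, metric_tiers(name, ome))
```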
Supported devices:
- 13G, 14G, 15G, and 16G generations of Dell PowerEdge servers.
- Dell PowerEdge C series servers.
- Dell PowerEdge XE series.
- Dell PowerEdge XR series.
Overview of the table headers for the listed metric groups
Header Name | Description |
Metrics | Supported list of metrics. |
Collection function | Selected collection function is applied across a time interval and computes one single value. Possible values are Average, Minimum, Maximum, and Summation. |
Collection duration (Minutes) | Specifies the duration (in Minutes) over which the function is computed. |
Minimum supported platform | Minimum supported platform generation in which the listed metrics can be generated. |
Minimum iDRAC firmware version | Minimum supported iDRAC firmware version to generate the metrics listed. |
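As an illustration of how a collection function and a collection duration combine, the sketch below groups raw samples into fixed windows and reduces each window to a single value. The input format (minute offset and value pairs) and the function name aggregate are assumptions made for this example. Only the four function names come from the table above.

```python
from statistics import mean

# The four collection functions named in the table.
FUNCTIONS = {
    "Average": mean,
    "Minimum": min,
    "Maximum": max,
    "Summation": sum,
}


def aggregate(samples, duration_minutes, function):
    """Reduce (minute_offset, value) samples to one value per window.

    Returns a list of (window_start_minute, aggregated_value) tuples.
    """
    reducer = FUNCTIONS[function]
    windows = {}
    for minute, value in samples:
        windows.setdefault(minute // duration_minutes, []).append(value)
    return [
        (index * duration_minutes, reducer(values))
        for index, values in sorted(windows.items())
    ]


if __name__ == "__main__":
    readings = [(0, 10), (1, 20), (4, 30), (5, 40), (9, 60)]
    for name in FUNCTIONS:
        print(name, aggregate(readings, 5, name))
```

With a 5-minute duration, the first three readings fall into one window and the last two into the next, so each function yields two values.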
List of metrics supported by AIOps Observability
CPUSensor
Metrics | Collection Function | Collection Duration (Mins) | Minimum Platform Supported | Minimum iDRAC FW version | License required |
Temperature Reading | Average, minimum, and maximum | 15 | 14G | 4.40.10.00 | OpenManage Enterprise Advanced or Data Center |
SystemUsage
Metrics | Collection Function | Collection Duration (Mins) | Minimum Platform Supported | Minimum iDRAC FW version | License required |
CPUUsage, IOUsage, MemoryUsage | Average, minimum, and maximum | 5 | 14G | 4.40.10.00 | OpenManage Enterprise Advanced or Data Center |
SystemUsage | Average, minimum, and maximum | 5 | 14G | 4.40.10.00 | OpenManage Enterprise Advanced or Data Center |
FCPortStatistics
Metrics | Collection Function | Collection Duration (Mins) | Minimum Platform Supported | Minimum iDRAC FW version | License required |
FCInvalidCRCs, FCLinkFailures, FCRxKBCount, FCTxKBCount | Maximum | 5 | 14G | 4.40.10.00 | Data Center |
GPU Metrics
GPU Statistics
Metrics | Collection Function | Collection Duration (Mins) | Minimum Platform Supported | Minimum iDRAC FW version | License required |
DBE Retired Pages, SBE Retired Pages | Maximum | 15 | 14G | 4.40.10.00 | Data Center |
NIC Statistics
Metrics | Collection Function | Collection Duration (Mins) | Minimum Platform Supported | Minimum iDRAC FW version | License required |
DiscardedPkts, FCOELinkFailures, FCOEPktRxCount, FCOEPktTxCount, RDMARxTotalPackets, RDMATxTotalBytes, RDMATxTotalPackets, RxBytes, RxErrorPktFCSErrors, RxJabberPkt, TxBytes, TxErrorPktExcessiveCollision, TxErrorPktMultipleCollision | Maximum | 5 | 14G | 4.40.10.00 | Data Center |
NVMe SMART Data
Metrics | Collection Function | Collection Duration (Mins) | Minimum Platform Supported | Minimum iDRAC FW version | License required |
Available Spare Threshold, Composite Temperature, Critical Warning, Percentage Used | Maximum | 60 | 14G | 4.40.10.00 | Data Center |
Power Metrics
Metrics | Collection Function | Collection Duration (Mins) | Minimum Platform Supported | Minimum iDRAC FW version | License required |
Total CPU Power, Total Memory Power, Cumulative System Energy | Average, minimum, and maximum | 15 | 14G | 4.40.10.00 | OpenManage Enterprise Advanced or Data Center |
Storage Disk SMART Data
Metrics | Collection Function | Collection Duration (Mins) | Minimum Platform Supported | Minimum iDRAC FW version | License required |
Command Timeout, CRC Error Count, Drive Temperature, Erase Fail Count, Exception Mode Status, Media Write Count, Percent Drive Life Remaining, Power On Hours, Program Fail Count, Read Error Rate, Reallocated Block Count, Uncorrectable Error Count, Uncorrectable LBA Count, Volatile Memory Backup Source Failures | Maximum | 60 | 14G | 4.40.10.00 | Data Center |
Thermal Metrics
Metrics | Collection Function | Collection Duration (Mins) | Minimum Platform Supported | Minimum iDRAC FW version | License required |
SysNet Airflow | Average, minimum, and maximum | 15 | 14G | 4.40.10.00 | OpenManage Enterprise Advanced or Data Center |
Temperature Reading | Average, minimum, and maximum | 15 | 12G | 2.70 | Enterprise or OpenManage Enterprise Advanced or Data Center |
iSM CPU and Memory Metrics
Metrics | Collection Function | Collection Duration (Mins) | Minimum Platform Supported | Minimum iDRAC FW version | License required |
OS Processor Max Frequency, OS Total Virtual Memory, OS Processor Utilization Percentage, OS Processor Operating Frequency, OS Number of Processes, OS Free Physical Memory, OS Free Virtual Memory, OS Memory Utilization Percentage | Average, minimum, and maximum | 15 | 14G | 5.3.0 | Data Center |
OS Number of Processor Cores, OS Total Physical Memory | Maximum | 15 | 14G | 5.3.0 | Data Center |
iSM Storage Metrics
Metrics | Collection Function | Collection Duration (Mins) | Minimum Platform Supported | Minimum iDRAC FW version | License required |
OS Physical Drive Disk Size, OS Logical Drive Free Space, OS Logical Drive Total Size | Average, minimum, and maximum | 15 | 14G | 5.3.0 | Data Center |