
DELL Technologies XE9680L Featuring AI Factory Rack Scale Architecture


Abstract
This technical whitepaper provides information about the iDRAC telemetry that is collected by OpenManage Enterprise and forwarded to AIOps Observability (formerly known as CloudIQ). The iDRAC telemetry feature is enabled by installing the AIOps plugin in OpenManage Enterprise. This enables AIOps Observability customers to view and report on various system metrics (for example, power, thermal, and utilization) and on various components (for example, Networking, Storage, and Graphics Processing Unit (GPU)) in a PowerEdge server.
December 2024

Revisions

Date | Description
November 2021 | Initial release
December 2024 | Updated release

Acknowledgments

Authors:

  • Muralidhar Kolli, Software Principal Engineer, Enterprise Systems Management
  • Vijayasimha Naga, Software Senior Principal Engineer, Enterprise Systems Management
  • Sudhir Shetty, Distinguished Engineer, Enterprise Systems Management
  • Mahantesh Tippimath, Software Principal Engineer, Enterprise Systems Management
Support: Mansi Manocha, Content Engineer 2

The information in this publication is provided “as is.” Dell Inc. makes no representations or warranties of any kind with respect to the information in this publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose.

Use, copying, and distribution of any software described in this publication requires an applicable software license.

This document may contain certain words that are not consistent with Dell’s current language guidelines. Dell plans to update the document over subsequent future releases to revise these words accordingly.

This document may contain language from third party content that is not under Dell’s control and is not consistent with Dell’s current guidelines for Dell’s own content. When such third party content is updated by the relevant third parties, this document will be revised accordingly.

Copyright © 2024 Dell Inc. or its subsidiaries. All Rights Reserved. Dell Technologies, Dell, and other trademarks are trademarks of Dell Inc. or its subsidiaries. Other trademarks may be trademarks of their respective owners. [12/20/2024] [Technical Whitepaper] [646]

Executive summary

OpenManage Enterprise (OME), as a management console, supports discovery and management of various devices in a data center, such as servers, storage devices, and network devices. It also retrieves telemetry data from supported devices to provide a consolidated view of their performance, efficiency, and utilization. This technical whitepaper provides an overview of the AIOps Observability plugin and the underlying OpenManage Enterprise infrastructure that collects metrics from PowerEdge servers and periodically sends the collected telemetry data to AIOps Observability.

AIOps Observability Overview

Dell PowerEdge servers, equipped with iDRAC, provide a variety of metrics using out-of-band management interfaces such as WS-Man and Redfish. OpenManage Enterprise uses the iDRAC APIs to collect metrics for the PowerEdge servers and sends the collected data to AIOps Observability using Dell Connectivity Service.
The following diagram illustrates the flow of metrics from PowerEdge servers to AIOps Observability (formerly known as CloudIQ) using OpenManage Enterprise:

[Figure: Flow of metrics from PowerEdge servers to AIOps Observability using OpenManage Enterprise]

The different components in OpenManage Enterprise that facilitate this functionality are described here:

  • AIOps Observability Plugin: This plugin enables OpenManage Enterprise to make the necessary configurations as follows:
    • Define the groups of PowerEdge servers to be monitored in AIOps Observability.
    • Configure secure connectivity with AIOps Observability using Dell Connectivity Service.
  • Metrics Collection Service: This service manages the periodic tasks that are responsible for:
    • Configuring iDRAC on the selected servers to generate and report metrics.
    • Collecting metrics from the selected servers using WS-Man or Redfish.
  • Data Forwarding Service: This service manages the periodic tasks that are responsible for:
    • Forwarding the collected metrics to AIOps Observability.
    • Forwarding inventory, health, and alert information to AIOps Observability.
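
To make the collection step more concrete, the following is a minimal Python sketch of how a Redfish metric report could be flattened into individual samples. It assumes the payload follows the DMTF Redfish MetricReport schema, in which readings appear under MetricValues with MetricId, MetricValue, and Timestamp properties. The function name, the Sample type, and the example payload with its values are illustrative assumptions and are not taken from OpenManage Enterprise.

```python
import json
from datetime import datetime
from typing import List, NamedTuple


class Sample(NamedTuple):
    metric_id: str
    value: float
    timestamp: datetime


def parse_metric_report(payload: str) -> List[Sample]:
    """Flatten a Redfish MetricReport JSON document into numeric samples."""
    report = json.loads(payload)
    samples = []
    for entry in report.get("MetricValues", []):
        try:
            value = float(entry["MetricValue"])  # Redfish reports values as strings
            stamp = entry["Timestamp"].replace("Z", "+00:00")
            samples.append(
                Sample(entry["MetricId"], value, datetime.fromisoformat(stamp))
            )
        except (KeyError, TypeError, ValueError):
            continue  # skip readings that are incomplete or non-numeric
    return samples


# Hypothetical report used only to exercise the parser.
EXAMPLE_REPORT = json.dumps({
    "Id": "ThermalSensor",
    "MetricValues": [
        {"MetricId": "TemperatureReading", "MetricValue": "41",
         "Timestamp": "2024-12-20T10:00:00+00:00"},
        {"MetricId": "TemperatureReading", "MetricValue": "43",
         "Timestamp": "2024-12-20T10:05:00Z"},
        {"MetricId": "TemperatureReading", "MetricValue": "n/a",
         "Timestamp": "2024-12-20T10:10:00Z"},
    ],
})

for sample in parse_metric_report(EXAMPLE_REPORT):
    print(sample.metric_id, sample.value, sample.timestamp.isoformat())
```

Skipping unreadable entries rather than failing the whole report is a design choice of this sketch; a real collector might instead log or count them.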

Prerequisites
iDRAC must have one of the following licenses installed for metrics to be collected using OpenManage Enterprise:

  • Enterprise License
  • OpenManage Enterprise Advanced License
  • Datacenter License

For more information about the list of metrics for these licenses, see the Appendix.

Configure the AIOps Observability plugin in OpenManage Enterprise

  • The AIOps Observability plugin must be installed in OpenManage Enterprise and must be in the active state to allow the registration of PowerEdge servers for monitoring and to configure OpenManage Enterprise connectivity with AIOps Observability.
  • The connection with AIOps Observability must be established using Dell Connectivity Service. The connection status must be Connected to maintain the continuity of metric flow to AIOps Observability.
  • After OpenManage Enterprise is configured with AIOps Observability plugin, add one or more server groups to the AIOps Observability Managed groups. This allows the metrics collection service in OpenManage Enterprise to begin metric collection from the servers connected to these managed groups. For the servers that support advanced Redfish Telemetry, the metrics collection service requires additional configurations on iDRAC to generate metric reports at specific intervals.
  • The Metrics Collection Service performs a periodic task at an interval of 15 minutes to collect metrics from the registered PowerEdge servers. For more information about the complete list of metrics and the associated collection intervals, see Appendix. The collected metrics are saved in a time-series database.
  • The Data Forwarding Service performs a task that continuously reads metric records from the time-series database and transfers them in a compressed format to AIOps Observability. After the data is transferred successfully, it is purged from the OpenManage Enterprise database.
  • Note: If the telemetry feature is disabled in iDRAC manually, or if the metric report definition provisioned by OpenManage Enterprise on individual servers is deleted, it results in a temporary loss of data. However, the metrics collection service identifies the condition and reconfigures iDRAC to resume the generation of metrics. For more information, see the Troubleshooting section.
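
The save, forward, and purge behavior described above can be sketched in Python as follows. This is only an illustration of the pattern, not the OpenManage Enterprise implementation: the in-memory SQLite table stands in for the time-series database, gzip-compressed JSON stands in for the unspecified compressed format, and the function names and the send callback are hypothetical.

```python
import gzip
import json
import sqlite3
from typing import Callable, Iterable, Tuple

Row = Tuple[str, str, float, str]  # (server, metric, value, timestamp)


def open_store() -> sqlite3.Connection:
    """Create an in-memory stand-in for the time-series database."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE metrics (id INTEGER PRIMARY KEY, server TEXT, "
        "metric TEXT, value REAL, ts TEXT)"
    )
    return db


def save_metrics(db: sqlite3.Connection, rows: Iterable[Row]) -> None:
    """Persist the samples gathered in one collection cycle."""
    db.executemany(
        "INSERT INTO metrics (server, metric, value, ts) VALUES (?, ?, ?, ?)",
        rows,
    )
    db.commit()


def forward_metrics(db: sqlite3.Connection, send: Callable[[bytes], bool]) -> int:
    """Send stored records in compressed form and purge them only on success."""
    rows = db.execute(
        "SELECT id, server, metric, value, ts FROM metrics ORDER BY id"
    ).fetchall()
    if not rows:
        return 0
    records = [
        {"server": s, "metric": m, "value": v, "ts": t} for _, s, m, v, t in rows
    ]
    blob = gzip.compress(json.dumps(records).encode("utf-8"))
    if not send(blob):
        return 0  # keep the backlog so that a later attempt can transmit it
    db.execute("DELETE FROM metrics WHERE id <= ?", (rows[-1][0],))
    db.commit()
    return len(rows)
```

Because records are deleted only after the send callback reports success, a failed transfer leaves the backlog in place for the next attempt.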

Monitor metric collection status in OpenManage Enterprise

Monitor the overall status of metrics collection on the OpenManage Enterprise User Interface (UI) by following these steps:

  1. Click Home > Monitor > Jobs.
  2. Apply the filter by selecting the Job Type as Metrics_Task and Source as System generated.
  3. Click View Details to view the individual status of the last few metrics collection cycles. This provides information about the time taken for each cycle and whether the cycle completed successfully for all the servers.
  4. To see the status of metric collection for each individual server within a cycle, click individual rows. This provides information about the time taken for collecting all the supported metrics for the server and a summary of the number of metrics collected.

The number of metric samples collected for each server varies depending on the licenses installed on the device and on its hardware inventory.

Monitor metric data transfer to AIOps Observability in OpenManage Enterprise

Monitor the status of metric data transfer to AIOps Observability on OpenManage Enterprise by performing the following steps:

  1. Click Home > Plugins > AIOps Observability > Transfer Log.
  2. Apply the filter by selecting Category as the telemetry type.
  3. Click individual rows to view the status of each metric transfer.

This provides information about the name of the compressed file, its size, and the time taken to transfer the file successfully.

View metrics on AIOps Observability

The Overview page on AIOps Observability UI displays a consolidated view of the systems that includes PowerEdge servers that are monitored using OpenManage Enterprise.

  1. Click Overview > Performance > System Performance.
  2. Click Server to view the Thermal & System utilization metric summary of individual servers.
  3. To view the detailed individual performance metrics, click one of the servers. This provides graphs for each of the metrics with a summary of the average, minimum, and maximum values reached in the last 24 hours. It also provides links to the related hardware inventory (for example, processors and memory).
  4. To create custom reports for other metrics, click Overview Page > Reports > Report Browser.
  5. Click Add Content and select the custom report.
  6. From the Product list, select the PowerEdge option. This provides the list of metric categories.
  7. Select either the Line Chart or the Table format for graphical representation of the custom report.
    Note: Not all metrics support both the Line Chart and Table view formats. For more information, see the online help documentation on the Dell AIOps Observability portal.
  8. Select a category from the System list. This displays the list of related metrics that can be selected, along with the applicable components, to generate a custom report.
  9. Based on the number of metrics and components selected, one or more graphs appear under the reports.

Troubleshooting

Data transfer failures
If the system performance details or other metrics are missing on Dell AIOps Observability portal for any of the PowerEdge servers that are monitored using OpenManage Enterprise, ensure that you perform the following steps:

  1. Verify that the criteria described in Configure the AIOps Observability plugin in OpenManage Enterprise are met.
  2. Check the connection status of OpenManage Enterprise with AIOps Observability.
  3. If the status is shown as Disconnected and is in Amber or Red for an extended period, contact technical support.
  4. Check the transfer logs as described in Monitoring metric data transfer to AIOps Observability in the OpenManage Enterprise User’s Guide, available on the Dell support site.

If the errors are because of connection failure or intermittent service failures, then after restoration, ensure that you perform subsequent metric transfers to transmit the accumulated backlog of data.

Metrics collection failures
If the status of the connection between OpenManage Enterprise and AIOps Observability is successful, and there are no failures in data transfer, then there could potentially be failures in metric collection. The scenarios and recommended actions are described as follows:

  1. Scenario 1: Connection failure
    Recommended Action: Ensure that the required device is powered on and detectable.
  2. Scenario 2: Authentication failure
    Recommended Action: Re-run the discovery of PowerEdge servers in OpenManage Enterprise using latest iDRAC credentials.
  3. Scenario 3: Missing, invalid, or expired license
    Recommended Action: Reload the valid license. For more information about licenses, see the Appendix.
  4. Scenario 4: Incomplete metric retrieval
    Recommended Action: If a server that supports Redfish Telemetry is newly registered for metric collection in OpenManage Enterprise, and if the metric collection is performed before the metrics collection service, then it results in the configuration of metric report definitions on iDRAC. If the metric report definition is deleted manually on the server, it results in errors in OpenManage Enterprise metric collection. However, the basic metrics available using WS-Man can be retrieved during that cycle. OpenManage Enterprise will try to automatically re-provision the metric report definitions. If successful, the next metrics collection cycle should retrieve the full set of metrics.
  5. Scenario 5: Telemetry disabled
    Recommended Action: If the Telemetry feature is disabled in iDRAC because of a factory reset, or if it is manually disabled through direct access to iDRAC, it results in errors in OpenManage Enterprise. In such cases, OpenManage Enterprise will automatically enable Telemetry. If successful, the next metrics collection cycle should run without errors.
  6. Scenario 6: iSM metrics not seen on AIOps Observability
    Recommended Actions:
      1. Remove the individual server from the AIOps plugin monitored groups on the associated OpenManage Enterprise, and then add it again.
      2. Enable the EnableMetricInjection option using configuration compliance if it is disabled.
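
The order of checks in this Troubleshooting section can be summarized as a small decision helper. The Python sketch below only restates the guidance above in code form; the enum, the function name, and its parameters are hypothetical and do not correspond to any OpenManage Enterprise API.

```python
from enum import Enum
from typing import Optional


class CollectionError(Enum):
    CONNECTION = 1
    AUTHENTICATION = 2
    LICENSE = 3
    INCOMPLETE = 4
    TELEMETRY_DISABLED = 5
    ISM_MISSING = 6


# Recommended actions paraphrased from Scenarios 1 to 6.
RECOMMENDED_ACTIONS = {
    CollectionError.CONNECTION: "Ensure that the device is powered on and detectable.",
    CollectionError.AUTHENTICATION: "Re-run discovery using the latest iDRAC credentials.",
    CollectionError.LICENSE: "Reload a valid license.",
    CollectionError.INCOMPLETE: "Wait for the next cycle; metric report definitions are re-provisioned automatically.",
    CollectionError.TELEMETRY_DISABLED: "Wait for the next cycle; Telemetry is re-enabled automatically.",
    CollectionError.ISM_MISSING: "Re-add the server to the monitored groups and enable EnableMetricInjection.",
}


def next_step(connected: bool, transfers_ok: bool,
              error: Optional[CollectionError]) -> str:
    """Return the first applicable troubleshooting step."""
    if not connected:
        return "Check the connection status; contact technical support if it stays Disconnected."
    if not transfers_ok:
        return "Check the transfer log for failed metric transfers."
    if error is not None:
        return RECOMMENDED_ACTIONS[error]
    return "No action required."


print(next_step(True, True, CollectionError.LICENSE))
```

Connection and transfer problems are checked first because, as described above, collection failures are worth investigating only after the connection and data transfer are known to be healthy.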

Technical support and resources

Appendix

Licenses and metrics for PowerEdge servers in AIOps Observability 

iDRAC License Type | iDRAC firmware | OpenManage Enterprise License Type | Basic Metrics* | Advanced Metrics**
Enterprise | 13G PowerEdge servers with iDRAC8 2.75 or later; 14G PowerEdge servers with iDRAC9 3.34 to 4.40.00; 14G, 15G, or 16G PowerEdge servers with iDRAC9 4.40.10 or later | No license required | Yes | No
Basic, Express, or Enterprise | 13G PowerEdge servers with iDRAC8 2.75 or later; 14G PowerEdge servers with iDRAC9 3.34 to 4.40.00; 14G, 15G, or 16G PowerEdge servers with iDRAC9 4.40.10 or later*** | OpenManage Enterprise Advanced | Yes | No
Data Center | 14G, 15G, or 16G PowerEdge servers with iDRAC9 4.40.10 or later | No license required | Yes | Yes

* Basic Metrics include Power, Thermal, and Central Processing Unit (CPU). 15G PowerEdge servers have different Basic Metrics depending on whether the processor is AMD or Intel:
  • Intel model Basic Metrics include Power, Thermal, CPU, Input/Output (IO), and Memory utilization.
  • AMD model Basic Metrics include Power, Thermal, and CPU.
** Advanced Metrics include Network Interface Card (NIC), Fibre Channel, Graphics Processing Unit (GPU), and Storage.
*** Basic metrics using Redfish.
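
As an illustration, the license table above can be read as a simple lookup from installed licenses to the metric tiers that are collected. The Python sketch below encodes one reading of that table; the function name and its parameters are hypothetical, the firmware and server-generation requirements are deliberately ignored, and the assumption that a Basic or Express iDRAC license without OpenManage Enterprise Advanced yields no metrics is inferred from the table rather than stated in it.

```python
from typing import Set


def metric_tiers(idrac_license: str, ome_advanced: bool = False) -> Set[str]:
    """Return the metric tiers implied by the license table (firmware ignored)."""
    name = idrac_license.strip().lower().replace(" ", "")
    if name == "datacenter":
        return {"basic", "advanced"}  # Data Center row: Basic and Advanced
    if name == "enterprise":
        return {"basic"}              # Enterprise row: no OME license required
    if name in {"basic", "express"} and ome_advanced:
        return {"basic"}              # needs OpenManage Enterprise Advanced
    return set()


for license_name, advanced in [("Data Center", False), ("Enterprise", False),
                               ("Express", True), ("Express", False)]:
    print(license_name, advanced, sorted(metric_tiers(license_name, advanced)))
```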

Supported devices:

  • 13G, 14G, 15G, and 16G generations of Dell PowerEdge servers.
  • Dell PowerEdge C series servers.
  • Dell PowerEdge XE series.
  • Dell PowerEdge XR series.

Overview of the table headers for the listed metric groups

Header Name | Description
Metrics | Supported list of metrics.
Collection function | The selected collection function is applied across a time interval and computes a single value. Possible values are Average, Minimum, Maximum, and Summation.
Collection duration (Minutes) | Specifies the duration (in minutes) over which the function is computed.
Minimum supported platform | Minimum supported platform generation on which the listed metrics can be generated.
Minimum iDRAC firmware version | Minimum supported iDRAC firmware version required to generate the listed metrics.
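
The relationship between the collection function and the collection duration can be illustrated with a short Python sketch that reduces raw samples to a single value per window. This is an assumption-laden illustration rather than the iDRAC algorithm: the function name, the choice of epoch-aligned windows, and the sample values are all hypothetical.

```python
from datetime import datetime, timedelta, timezone
from statistics import mean

# The four collection functions named in the table above.
FUNCTIONS = {"Average": mean, "Minimum": min, "Maximum": max, "Summation": sum}


def aggregate(samples, function, duration_minutes):
    """Reduce (timestamp, value) pairs to one value per fixed-length window."""
    reducer = FUNCTIONS[function]
    width = duration_minutes * 60  # window length in seconds
    buckets = {}
    for stamp, value in samples:
        # Align each sample to the start of its window (epoch-aligned).
        start = int(stamp.timestamp()) // width * width
        buckets.setdefault(start, []).append(value)
    return {
        datetime.fromtimestamp(start, tz=timezone.utc): reducer(values)
        for start, values in sorted(buckets.items())
    }


base = datetime(2024, 12, 20, 10, 0, tzinfo=timezone.utc)
readings = [(base + timedelta(minutes=m), v)
            for m, v in [(0, 40.0), (5, 44.0), (10, 42.0), (15, 50.0)]]

for window_start, value in aggregate(readings, "Average", 15).items():
    print(window_start.isoformat(), value)
```

With a 15-minute duration, the first three readings fall into one window and the fourth starts the next, so each function yields one value per window.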

List of metrics supported by AIOps Observability

CPUSensor

Metrics | Collection Function | Collection Duration (Mins) | Minimum Platform Supported | Minimum iDRAC FW version | License required
Temperature Reading | Average, minimum, and maximum | 15 | 14G | 4.40.10.00 | OpenManage Enterprise Advanced or Data Center

SystemUsage

Metrics | Collection Function | Collection Duration (Mins) | Minimum Platform Supported | Minimum iDRAC FW version | License required
CPUUsage, IOUsage, MemoryUsage | Average, minimum, and maximum | 5 | 14G | 4.40.10.00 | OpenManage Enterprise Advanced or Data Center
SystemUsage | Average, minimum, and maximum | 5 | 14G | 4.40.10.00 | OpenManage Enterprise Advanced or Data Center

FCPortStatistics

Metrics | Collection Function | Collection Duration (Mins) | Minimum Platform Supported | Minimum iDRAC FW version | License required
FC Invalid CRCs, FC Link Failures, FC Rx KB Count, FC Tx KB Count | Maximum | 5 | 14G | 4.40.10.00 | Data Center

GPU Metrics

GPU Statistics

Metrics | Collection Function | Collection Duration (Mins) | Minimum Platform Supported | Minimum iDRAC FW version | License required
DBE Retired Pages, SBE Retired Pages | Maximum | 15 | 14G | 4.40.10.00 | Data Center

NIC Statistics

Metrics | Collection Function | Collection Duration (Mins) | Minimum Platform Supported | Minimum iDRAC FW version | License required
DiscardedPkts, FCOELinkFailures, FCOEPktRxCount, FCOEPktTxCount, RDMARxTotalPackets, RDMATxTotalBytes, RDMATxTotalPackets, RxBytes, RxErrorPktFCSErrors, RxJabberPkt, TxBytes, TxErrorPktExcessiveCollision, TxErrorPktMultipleCollision | Maximum | 5 | 14G | 4.40.10.00 | Data Center

NVMe SMART Data

Metrics | Collection Function | Collection Duration (Mins) | Minimum Platform Supported | Minimum iDRAC FW version | License required
Available Spare Threshold, Composite Temperature, Critical Warning, Percentage Used | Maximum | 60 | 14G | 4.40.10.00 | Data Center

Power Metrics

Metrics | Collection Function | Collection Duration (Mins) | Minimum Platform Supported | Minimum iDRAC FW version | License required
Total CPU Power, Total Memory Power, Cumulative System Energy | Average, minimum, and maximum | 15 | 14G | 4.40.10.00 | OpenManage Enterprise Advanced or Data Center

Storage Disk SMART Data

Metrics | Collection Function | Collection Duration (Mins) | Minimum Platform Supported | Minimum iDRAC FW version | License required
Command Timeout, CRC Error Count, Drive Temperature, Erase Fail Count, Exception Mode Status, Media Write Count, Percent Drive Life Remaining, Power On Hours, Program Fail Count, Read Error Rate, Reallocated Block Count, Uncorrectable Error Count, Uncorrectable LBA Count, Volatile Memory Backup Source Failures | Maximum | 60 | 14G | 4.40.10.00 | Data Center

Thermal Metrics

Metrics | Collection Function | Collection Duration (Mins) | Minimum Platform Supported | Minimum iDRAC FW version | License required
SysNet Airflow | Average, minimum, and maximum | 15 | 14G | 4.40.10.00 | OpenManage Enterprise Advanced or Data Center
Temperature Reading | Average, minimum, and maximum | 15 | 12G | 2.70 | Enterprise or OpenManage Enterprise Advanced or Data Center

iSM CPU and Memory Metrics

Metrics | Collection Function | Collection Duration (Mins) | Minimum Platform Supported | Minimum iDRAC FW version | License required
OS Processor Max Frequency, OS Total Virtual Memory, OS Processor Utilization Percentage, OS Processor Operating Frequency, OS Number of Processes, OS Free Physical Memory, OS Free Virtual Memory, OS Memory Utilization Percentage | Average, minimum, and maximum | 15 | 14G | 5.3.0 | Data Center
OS Number of Processor Cores, OS Total Physical Memory | Maximum | 15 | 14G | 5.3.0 | Data Center

iSM Storage Metrics

Metrics | Collection Function | Collection Duration (Mins) | Minimum Platform Supported | Minimum iDRAC FW version | License required
OS Physical Drive Disk Size, OS Logical Drive Free Space, OS Logical Drive Total Size | Average, minimum, and maximum | 15 | 14G | 5.3.0 | Data Center
