Performance Tuning for Cisco UCS M8 Platforms with AMD EPYC 4th Gen and 5th Gen Processors
White paper
Cisco public
Document purpose and scope
The Basic Input-Output System (BIOS) initializes hardware and boots the operating system. BIOS settings control system behavior, with some directly impacting performance. This document outlines BIOS settings for Cisco UCS M8 servers with AMD EPYC 4th and 5th Gen processors, focusing on optimizing performance and energy efficiency for Cisco UCS X215c M8 Compute Nodes, Cisco UCS C245 M8 Rack Servers, and Cisco UCS C225 M8 Rack Servers. It also discusses BIOS settings for various workloads on these servers. The settings provided are generic and not specific to particular firmware releases.
What you will learn
This document guides users through system BIOS performance settings, offering suggestions to achieve optimal performance on Cisco UCS M8 servers with 4th and 5th Gen AMD EPYC CPUs. It aims to demystify BIOS options, helping users balance power savings and performance.
AMD EPYC 9004 Series processors
The AMD EPYC 9004 Series processors feature Zen 4 cores and AMD Infinity architecture. They integrate compute cores, memory controllers, I/O controllers, Reliability, Availability, and Serviceability (RAS), and security features into a System on a Chip (SoC). This series utilizes a Multi-Chip Module (MCM) Chiplet architecture, enhancing the SoC components. The architecture includes Core Complex Dies (CCDs) containing Core Complexes (CCXs), which house the Zen 4-based cores. The Zen 4 core, built on a 5nm process, offers improved Instructions Per Cycle (IPC) and frequency over previous generations, with enhanced L2 cache effectiveness. Each core supports Simultaneous Multithreading (SMT), allowing two hardware threads to run independently. A Core Complex (CCX) supports up to eight Zen 4 cores sharing an L3 cache. With SMT enabled, a single CCX can support up to 16 concurrent hardware threads.
These processors incorporate AMD 3D V-Cache die-stacking technology for improved chiplet integration and up to 96MB of L3 cache per die. The industry-leading logic stacking enables high interconnect densities, leading to lower latency, higher bandwidth, and better power/thermal efficiency. The CCDs connect to memory, I/O, and each other via an updated I/O Die (IOD) using AMD Infinity Fabric. The IOD supports up to 4 xGMI (or G-links) with speeds up to 32Gbps and exposes DDR5 memory channels, PCIe Gen5, CXL 1.1+, and Infinity Fabric links. Each IOD provides twelve Unified Memory Controllers (UMCs) supporting DDR5 memory. Each UMC supports up to 2 DIMMs per channel, allowing for up to 24 DIMMs per socket. 4th Gen AMD EPYC processors support up to 6TB of DDR5 memory per socket, offering increased memory bandwidth. Memory interleaving across 2, 4, 6, 8, 10, and 12 channels optimizes performance for various workloads. Processors feature 4 P-links and 4 G-links, with G-links usable for connecting to a second processor or providing additional PCIe Gen5 lanes. 4th Gen AMD EPYC processors support up to 128 lanes of PCIe Gen5 in single-socket and up to 160 lanes in dual-socket configurations.
| Item | Specification |
|---|---|
| Core process technology | 5-nanometer (nm) Zen 4 |
| Maximum number of cores | 128 |
| Maximum memory speed | 4800 Mega-Transfers per second (MT/s) |
| Maximum memory channels | 12 per socket |
| Maximum memory capacity | 6 TB per socket |
| PCIe Gen 5 lanes | 128 lanes (maximum) for 1-socket |
| PCIe Gen 5 lanes | 160 lanes (maximum) for 2-socket |
For more information, refer to the Overview of AMD EPYC 9004 Series Processors Microarchitecture.
AMD EPYC 9005 Series processors
5th Gen AMD EPYC processors support IT initiatives for data-center consolidation and modernization, catering to demanding enterprise applications. They enable AI expansion, improve energy efficiency, and support high-density virtualization and cloud environments. These processors deliver significant uplifts in instruction-per-clock-cycle (IPC) performance, particularly for ML, HPC, and enterprise workloads, with the efficiency-optimized Zen 5c core powering CPUs with the highest core counts for virtualized and cloud workloads. The hybrid, multichip architecture allows for decoupled innovation paths. The Zen 5 and Zen 5c cores represent advancements with new support for complex machine-learning and inferencing applications.
Zen 5 core: Optimized for high performance, with up to eight cores forming a core complex (CCX) featuring a 32-MB shared L3 cache. Up to 16 CCDs can be configured into an EPYC 9005 processor, supporting up to 128 cores in the SP5 form factor. Compared to the previous generation, Zen 5 cores offer 20 percent higher integer performance and 34 percent higher floating-point performance in 64-core processors within the same 360W TDP range.
Zen 5c core: Optimized for density and efficiency, sharing register-transfer logic with the Zen 5 core but with a smaller physical footprint for improved performance per watt. The Zen 5c core complex includes up to 16 cores and a shared 32-MB L3 cache. Up to 12 CCDs can be combined with an I/O die (IOD) to deliver CPUs with up to 192 cores in an SP5 form factor.
| Item | Specification |
|---|---|
| Core process technology | 4-nanometer (nm) Zen 5 and 3-nanometer Zen 5c |
| Maximum number of cores | 192 |
| Maximum L3 cache | 512 MB |
| Maximum memory speed | 6000 Mega-Transfers per second (MT/s) |
| Maximum memory channels | 12 per socket |
| Maximum memory capacity | 6 TB per socket |
| PCIe Gen 5 lanes | 128 lanes (maximum) for 1-socket |
| PCIe Gen 5 lanes | 160 lanes (maximum) for 2-socket |
Note: Cisco UCS M8 platforms support Zen 5c processors only up to 160 cores and 400W TDP.
For more information, refer to the Overview of AMD EPYC 9005 Series Processors Microarchitecture.
Non-Uniform Memory Access (NUMA) topology
AMD EPYC 9004 and 9005 Series processors utilize a Non-Uniform Memory Access (NUMA) architecture, where memory access latency varies based on proximity to processor cores and I/O controllers. Utilizing resources within the same NUMA node ensures good performance, while cross-node access increases latency. The system's NUMA Nodes Per Socket (NPS) BIOS setting can be adjusted to optimize this topology for specific operating environments and workloads. For example, NPS=4 divides the processor into four quadrants, each with 3 CCDs, 3 UMCs, and 1 I/O hub. Proximity within a quadrant offers the shortest processor-memory I/O distance. The cross-diagonal or cross-socket distance is the furthest. Core, memory, and I/O hub locality within a NUMA system is crucial for performance tuning.
Figure 1. AMD EPYC 4th Gen processor block diagram with NUMA domains
Optimizations in 4th Gen EPYC processors' Infinity Fabric interconnects have further reduced latency. For applications requiring minute latency improvements, creating affinity between memory ranges and CPU dies (Zen 4 or Zen 4c) can boost performance. Figure 1 illustrates this: dividing the I/O die into four quadrants for NPS=4 configuration shows six DIMMs feeding three memory controllers, closely connected via Infinity Fabric (GMI) to up to three Zen 4 CPU dies (or 24 CPU cores).
Figure 2. AMD EPYC 5th Gen processor block diagram with NUMA domains
Improvements in 5th Gen EPYC processors' AMD Infinity Fabric interconnects have further reduced latency. For applications needing marginal latency gains, establishing affinity between memory ranges and CPU dies (Zen 5 or Zen 5c) can enhance performance. Figure 2 demonstrates this: dividing the I/O die into four quadrants for NPS=4 configuration shows six DIMMs feeding three memory controllers, connected via Infinity Fabric (GMI) to up to four Zen 5 CPU dies (or three Zen 5c CPU dies).
NPS1: Configures all memory channels into a single NUMA node, encompassing all processor cores, memory, and PCIe devices. Memory is interleaved across all channels into a single address space.
NPS2: Divides the processor into two NUMA domains, each containing half the cores and memory channels. Memory is interleaved across the six memory channels within each domain. PCIe devices are local to the NUMA node containing their root complex. This setting reports two NUMA nodes per socket.
NPS4: Partitions the processor into four NUMA nodes per socket, with each logical quadrant acting as a NUMA domain. Memory is interleaved across the three memory channels of each quadrant. PCIe devices are local to the NUMA domain with their root complex. This configuration is recommended for HPC and highly parallel workloads. NPS4 is required for booting Windows systems with CPU SMT enabled on processors with over 64 cores, as Windows limits CPU groups to 64 logical cores.
Note: For Windows systems, ensure the number of logical processors per NUMA node is <=64 by using NPS2 or NPS4 instead of the default NPS1.
NPS0 (not recommended)
A setting of NPS=0 creates a single NUMA domain for the entire system, interleaving memory across all channels into one address space. All processor cores, memory, and PCIe devices across all sockets reside within this single NUMA domain.
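After choosing an NPS value, it helps to confirm what the operating system actually sees. The commands below are a minimal Linux sketch, assuming the numactl package is installed; output details vary by distribution and kernel version. With NPS4 on a two-socket system, for example, eight NUMA nodes should be reported.

```
# Show each NUMA node with its CPUs, memory size, and inter-node distances
numactl --hardware

# Quick summary of NUMA node count and the CPU ranges assigned to each node
lscpu | grep -i numa
```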
Layer 3 cache as NUMA Domain
The Layer 3 Cache as NUMA (L3CAN) BIOS option exposes each Layer-3 cache (one per CCD) as its own NUMA node. For instance, a single processor with 8 CCDs would have 8 NUMA nodes. A two-socket system would have 16 NUMA nodes. This setting can improve performance for NUMA-optimized workloads by pinning them to cores within a CCX and leveraging shared L3 cache. When disabled, NUMA domains follow the NPS setting.
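When Layer 3 Cache as NUMA is enabled, each CCX appears as its own NUMA node, and NUMA-aware pinning keeps a workload within one shared L3 cache. The sketch below assumes a Linux host with numactl installed; my_app is a placeholder workload, and the node number 2 is purely illustrative.

```
# List logical CPUs with their NUMA node and cache IDs (last column shows L1d:L1i:L2:L3)
lscpu --extended=CPU,NODE,SOCKET,CORE,CACHE

# Pin a workload's threads and memory to NUMA node 2 (one CCX/L3 when L3CAN is enabled)
numactl --cpunodebind=2 --membind=2 ./my_app
```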
Processor settings
This section details configurable processor options.
CPU SMT Mode
The CPU Simultaneous Multithreading (CPU SMT) option allows enabling or disabling logical processor cores. When set to Auto (enabled), each physical core acts as two logical cores, facilitating multithreaded applications. For some workloads, including HPC, CPU SMT can yield neutral or negative performance. Disabling CPU SMT might be beneficial, especially if the operating system lacks x2APIC support for more than 255 threads. Testing with CPU SMT enabled and disabled in your specific environment is recommended. Disable CPU SMT for single-threaded applications.
| Setting | Options |
|---|---|
| CPU SMT control | Auto (enabled; default), Enabled, Disabled |
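BIOS is the authoritative place to change CPU SMT, but on recent Linux kernels the setting can also be inspected and toggled at runtime for quick A/B testing. This is a sketch, assuming the kernel exposes the SMT sysfs interface.

```
# Current SMT state: "on", "off", or "forceoff"
cat /sys/devices/system/cpu/smt/control

# Disable SMT until the next reboot (useful for quick performance comparisons)
echo off | sudo tee /sys/devices/system/cpu/smt/control
```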
Secure Virtual Machine (SVM) mode
The Secure Virtual Machine (SVM) mode enables processor virtualization features, allowing the platform to run multiple operating systems in independent partitions. SVM mode can be set to Enabled or Disabled. If virtualization is not required, disable AMD virtualization technology and the AMD IOMMU option to avoid latency differences in memory access.
| Setting | Options |
|---|---|
| SVM | Enabled (default), Disabled |
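A quick Linux check, shown below as a sketch, confirms whether AMD-V (SVM) is exposed to the operating system after the BIOS change; the kvm_amd module check applies only if the KVM hypervisor is in use.

```
# A non-zero count means the svm CPU flag (AMD-V) is visible to the OS
grep -c -w svm /proc/cpuinfo

# If using KVM, confirm the AMD virtualization module is loaded
lsmod | grep kvm_amd
```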
DF C-states
The AMD Infinity Fabric can enter low-power states when idle, but transitioning back to full power may cause latency jitter. For low-latency or bursty I/O workloads, disabling Data Fabric (DF) C-states can improve performance at the cost of higher power consumption.
| Setting | Options |
|---|---|
| DF C-states | Auto (enabled; default), Enabled, Disabled |
ACPI SRAT L3 Cache as NUMA Domain
When the ACPI SRAT L3 Cache as NUMA Domain setting is enabled, each Layer-3 cache (one per CCD) is exposed as a NUMA node. This can enhance performance for NUMA-optimized workloads if they can be pinned to cores within a CCX and benefit from shared L3 cache. When disabled, NUMA domains are identified by the NUMA NPS setting. Some operating systems and hypervisors may not perform Layer 3-aware scheduling, while others benefit from Layer 3 being declared as a NUMA domain.
| Setting | Options |
|---|---|
| ACPI SRAT L3 Cache As NUMA Domain | Auto (disabled; default), Enabled, Disabled |
Algorithm Performance Boost Disable (APBDIS)
The APBDIS setting controls the Algorithm Performance Boost (APB) for the SMU. By default, the AMD Infinity Fabric dynamically switches between full-power and low-power fabric and memory clocks based on usage. In latency-sensitive scenarios, this transition can cause adverse latency effects. Setting APBDIS to 1 and a fixed Infinity Fabric P-state of 0 forces full-power mode, eliminating latency jitter. Setting a fixed Infinity Fabric P-state of 1 may reduce memory latency at the cost of memory bandwidth, benefiting latency-sensitive applications.
| Setting | Options |
|---|---|
| APBDIS | Auto (0; default), 0, 1 |
Fixed SOC P-State SP5F 19h
This setting determines the SOC P-State (independent or dependent), as reported by the ACPI _PSD object, and changes the SOC P-State when APBDIS is enabled. 'F' refers to the processor family.
| Setting | Options |
|---|---|
| Fixed SOC P-State SP5F 19h | P0 (default), P1, P2 |
xGMI settings: connection between sockets
In two-socket systems, processors are interconnected via socket-to-socket xGMI links, part of the Infinity Fabric. NUMA-unaware workloads may require maximum xGMI bandwidth for cross-socket communication. NUMA-aware workloads might prefer to minimize xGMI power, potentially reducing cross-socket traffic and utilizing increased CPU boost. xGMI lane width can be reduced from x16 to x8 or x2, or an xGMI link can be disabled to conserve power.
xGMI link configuration and 4-link xGMI max speed (Cisco xGMI max Speed)
The number of xGMI links and maximum speed can be configured. Lowering the speed can save uncore power, potentially increasing core frequency or reducing overall power, but it decreases cross-socket bandwidth and increases latency. Cisco UCS C245 M8 Rack Server supports four xGMI links with a maximum speed of 32 Gbps. Enabling Cisco xGMI max speed sets xGMI Link Configuration to 4 and 4-Link xGMI Max Speed to 32 Gbps. Disabling it applies default values.
| Setting | Options |
|---|---|
| Cisco XGMI Max Speed | Enabled, Disabled |
| xGMI Link Configuration | |
| 4-Link xGMI Max Speed | Auto (32Gbps; default), 20Gbps, 25Gbps, 32Gbps |
| 3-Link xGMI Max Speed | |
Note: This BIOS feature applies only to Cisco UCS X215c M8 Compute Nodes and Cisco UCS C245 M8 Rack Servers with 2-socket configurations.
Enhanced CPU performance
This BIOS option allows users to adjust enhanced CPU performance settings. When enabled, it optimizes processor settings for aggressive operation, potentially improving overall CPU performance but increasing power consumption. Values can be Auto or Disabled. By default, this option is disabled.
Note: This BIOS feature applies only to Cisco UCS X215c M8 Compute Nodes and Cisco UCS C245 M8 Rack Servers. When enabled, setting the fan policy to maximum power is highly recommended. By default, this BIOS setting is Disabled.
Memory settings
This section covers memory configuration options.
NUMA Nodes Per Socket (NPS)
The NPS setting specifies the number of NUMA Nodes Per Socket, balancing local memory latency for NUMA-aware workloads against per-core memory bandwidth for non-NUMA-friendly workloads. Socket interleave (NPS0) attempts to interleave two sockets into one NUMA node. 4th Gen AMD EPYC processors support various NPS values depending on internal topology. NPS2 and NPS4 might not be available on all processors or memory configurations. For single-socket servers, NPS can be 1, 2, or 4. Performance for NUMA-optimized applications can improve with NPS values greater than 1. The default configuration (one NUMA domain per socket) is recommended for most workloads. NPS4 is recommended for High-Performance Computing (HPC) and highly parallel workloads. For 200-Gbps network adapters, NPS2 may offer a balance between memory latency and bandwidth for the Network Interface Card (NIC). This setting is independent of the ACPI SRAT L3 Cache as NUMA Domain setting. When ACPI SRAT L3 Cache as NUMA Domain is enabled, this setting determines memory interleaving granularity. With NPS1, all twelve memory channels of a socket are interleaved. With NPS2, six channels are interleaved within each NUMA node. With NPS4, three channels are interleaved within each NUMA node.
| Setting | Options |
|---|---|
| NUMA Nodes per Socket | Auto (NPS1; default), NPS0, NPS1, NPS2, NPS4 |
I/O Memory Management Unit (IOMMU)
The I/O Memory Management Unit (IOMMU) provides several benefits and is required for the x2 Advanced Programmable Interrupt Controller (x2APIC). Enabling IOMMU allows devices like the EPYC integrated SATA controller to issue separate interrupt requests (IRQs) for each device, instead of one IRQ for the subsystem. IOMMU also enhances operating system protection for Direct Memory Access (DMA)-capable I/O devices and helps filter and remap interrupts from peripheral devices.
| Setting | Options |
|---|---|
| IOMMU | Auto (enabled; default), Enabled, Disabled |
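To verify from Linux that the IOMMU is active after enabling the BIOS option, the following sketch checks the boot log and the IOMMU-group sysfs tree; message text varies by kernel version.

```
# AMD IOMMU initialization messages appear as "AMD-Vi" entries in the kernel log
sudo dmesg | grep -i -e "AMD-Vi" -e iommu

# IOMMU groups are populated only when the IOMMU is enabled
ls /sys/kernel/iommu_groups/
```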
Memory interleaving
Memory interleaving increases memory bandwidth by reading consecutive memory blocks from different memory banks, preventing wait times for memory transfers. AMD recommends populating all twelve memory channels per CPU socket with equal capacity for optimal performance in twelve-way interleaving mode.
| Setting | Options |
|---|---|
| Memory interleaving | Auto (enabled; default), Enabled, Disabled |
Power settings
This section covers power state settings.
Core performance boost
The Core performance boost feature allows the processor to exceed its base frequency based on power, thermal headroom, and active cores. This can cause jitter due to frequency transitions. For workloads not requiring maximum core frequency, setting a maximum core boost frequency can improve power efficiency. This setting limits the maximum boost frequency, not sets a fixed frequency. Actual boost performance depends on various factors and other settings.
| Setting | Options |
|---|---|
| Core performance boost | Auto (enabled; default), Disabled |
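On Linux, the effect of Core Performance Boost can be observed without rebooting. The sketch below assumes the acpi-cpufreq driver, which exposes a global boost switch; the path differs with other drivers such as amd_pstate, and the eight-core sample is illustrative.

```
# 1 = boost permitted, 0 = boost disabled (acpi-cpufreq driver)
cat /sys/devices/system/cpu/cpufreq/boost

# Sample the reported core frequencies for the first 8 logical CPUs while a workload runs
grep -m 8 "MHz" /proc/cpuinfo
```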
Global C-state control
C-states are processor core inactive power states, with C0 being the operational state and higher C-states being low-power idle states. Global C-state control enables or disables C-states. Auto (enabled) allows cores to enter lower power states, which can cause jitter due to frequency transitions. Disabled forces CPU cores to operate in the C0 and C1 states. C-states are exposed via ACPI objects and can be requested by software. The 4th Gen AMD EPYC processor core supports I/O-based C0, C1, and C2 states.
| Setting | Options |
|---|---|
| Global C-state control | Auto, Enabled, Disabled (default) |
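The C-states actually exposed to the operating system can be listed from Linux, which is useful for confirming that disabling Global C-state control had the intended effect. This is a sketch, assuming the cpupower utility is installed.

```
# List the idle (C) states the kernel knows about, with their exit latencies
cpupower idle-info

# Names of the idle states exposed for CPU 0 through the cpuidle sysfs interface
grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
```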
Layer-1 and Layer-2 stream hardware prefetchers
Layer-1 and Layer-2 stream hardware prefetchers (L1 Stream HW Prefetcher and L2 Stream HW Prefetcher) gather data to keep the core pipeline busy. Most workloads benefit from these, but some random workloads perform better with one or both disabled. By default, both are enabled.
| Setting | Options |
|---|---|
| L1 Stream HW Prefetcher | Auto (enabled; default), Enabled, Disabled |
| L2 Stream HW Prefetcher | Auto (enabled; default), Enabled, Disabled |
Determinism slider
The Determinism slider allows selection between uniform performance across identically configured systems (Performance setting) or maximum individual system performance with potential variation across the data center (Power setting). For the Performance setting, ensure configurable Thermal Design Power (cTDP) and Package Power Limit (PPL) are set to the same value. The default Auto setting typically favors Performance mode, allowing lower power operation with consistent performance. For maximum performance, set the Determinism slider to Power.
| Setting | Options |
|---|---|
| Determinism slider | Auto (Power; default), Power, Performance |
CPPC: Collaborative Processor Performance Control
Collaborative Processor Performance Control (CPPC), introduced with ACPI 5.0, facilitates communication of performance requirements between the operating system and hardware. It allows the OS to control turbo boost for energy efficiency. Not all operating systems support CPPC; Microsoft added support starting with Windows Server 2016.
| Setting | Options |
|---|---|
| CPPC | Auto (disabled; default), Enabled, Disabled |
Power profile selection F19h
The DF P-state selection in the profile policy is overridden by the DF P-state range BIOS options or the APBDIS BIOS option. 'F' denotes the processor family, and 'M' denotes the model.
| Setting | Options |
|---|---|
| Power profile selection F19h | High-performance mode (default), Efficiency mode, Maximum I/O performance mode, Balanced memory performance mode, Balanced core performance mode, Balanced core memory performance mode |
Fan control policy
Fan policy allows control over fan speed to reduce server power consumption and noise. Previously, fan speed increased automatically when component temperatures exceeded a threshold. To keep fan speeds low, thresholds were set high, which suited most configurations but could not address specific requirements. For maximum CPU performance, the CPUs must be cooled well below the threshold temperature, which requires high fan speeds and increases power consumption and noise. For minimal power consumption, fans must run very slowly, at the risk of components overheating, while other configurations need moderate fan speeds. The available fan policies are:
- Balanced: Default policy, suitable for most configurations but may not be ideal for servers with easily overheating PCIe cards.
- Low Power: Suitable for minimal-configuration servers without PCIe cards.
- High Power: For configurations requiring fan speeds from 60-85 percent, suitable for servers with easily overheating PCIe cards. Minimum fan speed varies by platform but is approximately 60-85 percent.
- Maximum Power: For configurations requiring extremely high fan speeds (70-100 percent), suitable for servers with easily overheating PCIe cards. Minimum fan speed varies by platform but is approximately 70-100 percent.
- Acoustic: Reduces fan speed for noise-sensitive environments. May cause short-term throttling for reduced noise, potentially impacting performance transiently.
Note: This policy is configurable for standalone Cisco UCS C-Series M8 servers through the Cisco Integrated Management Controller (IMC) console and Cisco IMC Supervisor. For Cisco Intersight-managed C-Series M8 servers, it is configurable through fan policies.
BIOS settings for Cisco UCS X215c M8 Compute Nodes, Cisco UCS C245 M8 Rack Servers, and Cisco UCS C225 M8 Rack Servers
Table 17 lists BIOS token names, defaults, and supported values for Cisco UCS M8 servers with AMD EPYC 4th and 5th Gen processor families.
| BIOS token name | Default value | Supported values |
|---|---|---|
| Processor | ||
| CPU SMT mode | Auto (enabled) | Auto, Enabled, Disabled |
| SVM mode | Enabled | Enabled, Disabled |
| DF C-states | Auto (enabled) | Auto, Enabled, Disabled |
| ACPI SRAT L3 Cache as NUMA Domain | Auto (disabled) | Auto, Enabled, Disabled |
| APBDIS | Auto (0) | Auto, 0, 1 |
| Fixed SOC P-State SP5F 19h | P0 | P0, P1, P2 |
| 4-link xGMI max speed* | Auto (32Gbps) | Auto, 20Gbps, 25Gbps, 32Gbps |
| Enhanced CPU performance* | Disabled | Auto, Disabled |
| Memory | ||
| NUMA nodes per socket | Auto (NPS1) | Auto, NPS0, NPS1, NPS2, NPS4 |
| IOMMU | Auto (enabled) | Auto, Enabled, Disabled |
| Memory interleaving | Auto (enabled) | Auto, Enabled, Disabled |
| Power/performance | ||
| Core performance boost | Auto (enabled) | Auto, Disabled |
| Global C-state control | Disabled | Auto, Enabled, Disabled |
| L1 Stream HW Prefetcher | Auto (enabled) | Auto, Enabled, Disabled |
| L2 Stream HW Prefetcher | Auto (enabled) | Auto, Enabled, Disabled |
| Determinism slider | Auto (power) | Auto, Power, Performance |
| CPPC | Auto (disabled) | Auto, Disabled, Enabled |
| Power profile selection F19h | High-performance mode | Balanced memory performance mode, efficiency mode, high-performance mode, maximum I/O performance mode, balanced core performance mode, balanced core memory performance mode |
BIOS recommendations for various general-purpose workloads
This section summarizes recommended BIOS settings for optimizing general-purpose workloads, categorized as:
- Computation-intensive
- I/O-intensive
- Energy efficiency
- Low latency
CPU-intensive workloads
For CPU-intensive workloads, the goal is to distribute work across multiple CPUs to minimize processing time. This involves running job portions in parallel, with CPUs exchanging information rapidly. These workloads benefit from processors or memory achieving maximum turbo frequency, with power management settings aiding frequency increases. Optimizations focus on increasing processor core and memory speed.
I/O-intensive workloads
I/O-intensive optimizations focus on maximizing throughput between I/O and memory. Processor utilization-based power management features affecting links between I/O and memory are disabled.
Energy-efficient workloads
Energy-efficiency optimizations are common, balanced settings that benefit most workloads while enabling power-management features that have minimal impact on performance. The applied settings favor good general application performance rather than aggressively prioritizing power savings. Processor power-management settings can affect performance with virtualization operating systems. These settings are recommended for users who do not typically tune BIOS settings.
Low-latency workloads
Workloads requiring low latency, such as financial trading and real-time processing, demand consistent system response and minimal computational latency. Maximum speed and throughput are often sacrificed for lower latency. Processor power management and other features that might introduce latency are disabled. Achieving low latency requires understanding system hardware configuration, including core count, threads per core, NUMA nodes, CPU/memory arrangements, and cache topology. BIOS options are generally OS-independent, but a tuned low-latency operating system is also necessary for deterministic performance.
| BIOS options | CPU intensive | I/O intensive | Energy efficiency | Low latency |
|---|---|---|---|---|
| Processor | ||||
| CPU SMT mode | Auto (enabled) | Auto | Auto | Disabled |
| SVM mode | Enabled | Enabled | Enabled | Disabled |
| DF C-states | Auto (enabled) | Auto | Disabled | Disabled |
| ACPI SRAT L3 Cache as NUMA Domain | Auto (disabled) | Enabled | Auto | Auto |
| APBDIS | Auto (0) | 1 | Auto | Auto |
| Fixed SOC P-State SP5F 19h | P0 | P0 | P2 | P0 |
| 4-link xGMI max speed | Auto (32Gbps) | Auto | Auto | Auto |
| Enhanced CPU performance | Disabled | Auto | Disabled | Disabled |
| Memory | ||||
| NUMA nodes per socket | Auto (NPS1) | NPS4 | NPS4 | Auto |
| IOMMU | Auto (enabled) | Auto* | Auto | Auto |
| Memory interleaving | Auto (enabled) | Auto* | Auto | Auto |
| Power/performance | ||||
| Core performance boost | Auto (enabled) | Auto | Auto | Disabled |
| Global C-State control | Disabled | Disabled | Enabled | Disabled |
| L1 Stream HW Prefetcher | Auto (enabled) | Auto | Auto | Disabled |
| L2 Stream HW Prefetcher | Auto (enabled) | Auto | Auto | Disabled |
| Determinism slider | Auto (power) | Auto | Auto | Performance |
| CPPC | Auto (disabled) | Auto | Auto | Enabled |
| Power profile selection F19h | High-performance mode | High-performance mode | Maximum I/O performance mode | Efficiency mode |
Note: BIOS tokens marked with * are applicable only to Cisco UCS X215c M8 Compute Nodes and Cisco UCS C245 M8 Rack Servers.
If your application scenario does not require virtualization, disable AMD virtualization technology. With virtualization disabled, also disable the AMD IOMMU option, as it can cause latency differences in memory access. See the AMD performance tuning guide for more information.
Additional BIOS recommendations for enterprise workloads
This section provides optimal BIOS settings for enterprise workloads, including:
- Virtualization
- Containers
- Relational Database (RDBMS)
- Analytical Database (Bigdata)
- HPC workloads
Virtualization workloads
AMD Virtualization Technology offers manageability, security, and flexibility for IT environments using software-based virtualization. It allows a single server to be partitioned into multiple independent servers, running different applications simultaneously. Enabling AMD Virtualization Technology in the BIOS is crucial for supporting virtualization workloads. CPUs supporting hardware virtualization enable running multiple operating systems in virtual machines, though this incurs some overhead compared to native OS performance. For more information, see AMD's VMware vSphere Tuning Guide.
Container workloads
Containerizing applications abstracts infrastructure and OS differences for efficiency. Each container includes an entire runtime environment, application dependencies, libraries, and configuration files. Production environments require management for consistent uptime, with automatic container restarts if one fails. Workloads that scale well on bare metal should exhibit similar scaling in a container environment with minimal performance overhead. Large overhead often indicates suboptimal application settings or container configuration. CPU load balancing by Kubernetes or other schedulers may differ from bare metal environments. For more information, see AMD's Kubernetes Container Tuning Guide.
Relational Database workloads
Integrating RDBMS like Oracle, MySQL, PostgreSQL, or Microsoft SQL Server with AMD EPYC processors can enhance database performance, particularly in high-concurrency, rapid query processing, and efficient resource utilization environments. The AMD EPYC processor architecture effectively leverages multiple cores and threads, benefiting transactional workloads, analytics, and large-scale data processing. Using AMD EPYC processors in RDBMS environments can significantly improve performance, scalability, and cost-efficiency. 4th Gen AMD EPYC processors deliver high Input/Output Operations Per Second (IOPS) and throughput for all databases. Selecting the right CPU is vital for optimal database application performance. For more information, see AMD's RDBMS Tuning Guide.
Big Data Analytics workloads
Big Data Analytics involves examining vast data to uncover patterns, correlations, and insights for better decision-making. This requires significant computational power, memory capacity, and I/O bandwidth, areas where AMD EPYC processors excel. AMD EPYC processors offer a robust platform for Big Data Analytics, providing the necessary computational power, memory capacity, and I/O bandwidth for large-scale data processing. Their scalability, cost efficiency, and energy efficiency make them a compelling choice for organizations building or upgrading their Big Data Analytics infrastructure.
HPC (High-performance computing) workloads
HPC refers to cluster-based computing using multiple interconnected nodes to process large datasets faster than single systems. HPC workloads are computation and network-I/O intensive, requiring high-quality CPU components and high-speed, low-latency network fabrics for Message Passing Interface (MPI) connections. Computing clusters have a head node for administration and a scheduler for managing jobs. HPC workloads often require large numbers of nodes with nonblocking MPI networks for scalability. High-bandwidth I/O networks are essential. Enabling Direct Cache Access (DCA) support allows network packets to go directly into the Layer 3 processor cache, reducing HPC I/O cycles and increasing system performance. For more information, see AMD's High-Performance Computing (HPC) Tuning Guide.
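For MPI-based HPC workloads, process placement should match the NUMA layout created by the NPS setting. The example below is a sketch using Open MPI option names; flags differ across MPI implementations and versions, hpc_app is a placeholder binary, and the rank count of 64 is illustrative.

```
# Distribute ranks round-robin across NUMA domains and bind each rank to a single core
mpirun --map-by numa --bind-to core -np 64 ./hpc_app

# Print the resulting CPU bindings so the placement can be verified
mpirun --map-by numa --bind-to core --report-bindings -np 64 ./hpc_app
```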
| BIOS options | Virtualization/ container | RDBMS | Big-data analytics | HPC |
|---|---|---|---|---|
| Processor | ||||
| CPU SMT mode | Enabled | Enabled | Disabled | Disabled |
| SVM mode | Enabled | Enabled | Enabled | Enabled |
| DF C-states | Auto (Enabled) | Disabled | Auto | Auto |
| ACPI SRAT L3 Cache as NUMA Domain | Auto (Disabled) | Auto | Auto | Auto |
| APBDIS | Auto (0) | 1 | 1 | 1 |
| Fixed SOC P-State SP5F 19h | P0 | P0 | P0 | P0 |
| 4-link xGMI max speed* | Auto (32Gbps) | Auto | Auto | Auto |
| Enhanced CPU performance* | Disabled | Disabled | Disabled | Auto |
| Memory | ||||
| NUMA nodes per socket | Auto (NPS1) | NPS4 | Auto | NPS4 |
| IOMMU | Auto (Enabled) | Auto | Auto | Auto |
| Memory interleaving | Auto (Enabled) | Auto | Auto | Auto |
| Power/performance | ||||
| Core performance boost | Auto (Enabled) | Auto | Auto | Auto |
| Global C-State control | Disabled | Enabled | Enabled | Enabled |
| L1 Stream HW Prefetcher | Auto (Enabled) | Auto | Auto | Auto |
| L2 Stream HW Prefetcher | Auto (Enabled) | Auto | Auto | Auto |
| Determinism slider | Auto (Power) | Auto | Auto | Auto |
| CPPC | Auto (Disabled) | Enabled | Auto | Enabled |
| Power profile selection F19h | High-performance mode | High-performance mode | Maximum I/O performance mode | High-performance mode |
Note: BIOS tokens marked with * are not applicable to single-socket optimized platforms such as the Cisco UCS C225 M8 1U Rack Server.
- If your workloads have few vCPUs per virtual machine (less than a quarter of the number of cores per socket), the following settings tend to provide the best performance: NUMA NPS (nodes per socket) = 4, LLC As NUMA turned on.
- If your workload virtual machines have a large number of vCPUs (greater than half the number of cores per socket), the following settings tend to provide the best performance: NUMA NPS (nodes per socket) = 1, LLC As NUMA turned off.
For more information, see the VMware vSphere Tuning Guide.
Operating system tuning guidance for high performance
Microsoft Windows, VMware ESXi, Red Hat Enterprise Linux, and SUSE Linux operating systems have default power management features. Tuning the operating system is necessary for optimal performance. For additional performance documentation, see the AMD EPYC performance tuning guides.
Linux (Red Hat and SUSE)
The CPUfreq governor defines system CPU power characteristics, influencing performance. Each governor has unique behavior and suitability for different workloads. The 'performance' governor forces the CPU to its highest clock frequency; the frequency is statically set and does not change, so it offers no power savings. It is suitable for heavy workloads when the CPU is rarely idle. The default 'ondemand' governor allows the CPU to reach maximum frequency under high load and minimum frequency when idle, adjusting power consumption at the cost of latency from frequency switching. The performance governor can be set using the cpupower frequency-set -g performance command.
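A minimal sketch of checking and setting the governor with the cpupower utility follows; package names and available governors depend on the distribution and the cpufreq driver in use (for example, amd_pstate or acpi-cpufreq).

```
# Show the currently active cpufreq policy and governor
cpupower frequency-info --policy

# Set the performance governor on all CPUs (not persistent across reboots)
sudo cpupower frequency-set -g performance
```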
For additional information:
- Red Hat Enterprise Linux: Set the performance CPUfreq governor.
- SUSE Enterprise Linux Server: Set the performance CPUfreq governor.
Microsoft Windows Server 2019 and 2022
For Microsoft Windows Server 2019 and 2022, the default 'Balanced' power plan conserves energy but can increase latency and cause performance issues for CPU-intensive applications. For maximum performance, set the power plan to 'High Performance'.
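One way to switch the plan from an elevated command prompt is shown below as a sketch; SCHEME_MIN is the built-in alias for the High performance plan, and the plan can also be changed through the Power Options control panel.

```
rem List the available power schemes and mark the active one
powercfg /list

rem Activate the built-in High performance power scheme
powercfg /setactive SCHEME_MIN
```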
For additional information, see the following link:
- Microsoft Windows and Hyper-V: Set the power policy to High Performance.
VMware ESXi
In VMware ESXi, host power management is designed to reduce power consumption. Set the power policy to 'High Performance' to achieve maximum performance.
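The policy is normally set in the vSphere Client under the host's power management settings. As a sketch, it can also be changed from the ESXi shell through the /Power/CpuPolicy advanced option; verify the option path and accepted values against your ESXi version.

```
# Show the current host power policy advanced option
esxcli system settings advanced list -o /Power/CpuPolicy

# Set the host power policy to High Performance
esxcli system settings advanced set -o /Power/CpuPolicy -s "High Performance"
```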
For additional information, see the following links:
- VMware ESXi: Set the power policy to High Performance.
Conclusion
When tuning system BIOS settings for performance, consider processor and memory options. Prioritize performance-optimizing options over power savings if peak performance is the goal. Experiment with settings like memory interleaving and CPU hyperthreading. Crucially, assess the impact of any settings on application performance requirements.
For more information
For more information about Cisco UCS M8 servers with AMD EPYC 4th Gen and 5th Gen processors, consult the following resources:
- IMM BIOS token guide: https://www.cisco.com/c/en/us/td/docs/unified_computing/ucs/Intersight/IMM_BIOS_Tokens_Guide/b_IMM_Server_BIOS_Tokens_Guide.pdf
- Cisco UCS X215c M8 Compute Node: https://www.cisco.com/c/en/us/products/collateral/servers-unified-computing/ucs-x-series-modular-system/ucs-x215c-m8-compute-node-aag.html
- Cisco UCS C245 M8 Rack Server: https://www.cisco.com/c/en/us/products/collateral/servers-unified-computing/ucs-c-series-rack-servers/ucs-c245-m8-rack-server-aag.html
- Cisco UCS C225 M8 Rack Server: https://www.cisco.com/c/en/us/products/collateral/servers-unified-computing/ucs-c-series-rack-servers/ucs-c225-m8-rack-server-aag.html
- AMD EPYC tuning guides:
- https://developer.amd.com/resources/epyc-resources/epyc-tuning-guides/
- https://www.amd.com/content/dam/amd/en/documents/epyc-technical-docs/tuning-guides/58015-epyc-9004-tg-architecture-overview.pdf
- https://www.amd.com/content/dam/amd/en/documents/epyc-technical-docs/white-papers/58649_amd-epyc-tg-low-latency.pdf
- https://www.amd.com/content/dam/amd/en/documents/epyc-technical-docs/tuning-guides/57996-epyc-9004-tg-rdbms.pdf
- https://www.amd.com/content/dam/amd/en/documents/epyc-technical-docs/tuning-guides/58002_amd-epyc-9004-tg-hpc.pdf
- https://www.amd.com/content/dam/amd/en/documents/epyc-technical-docs/tuning-guides/58008-epyc-9004-tg-containers-on-kubernetes.pdf
- https://www.amd.com/content/dam/amd/en/documents/epyc-technical-docs/tuning-guides/58013-epyc-9004-tg-hadoop.pdf
- https://www.amd.com/content/dam/amd/en/documents/epyc-technical-docs/tuning-guides/58007-epyc-9004-tg-mssql-server.pdf
- https://www.amd.com/content/dam/amd/en/documents/epyc-technical-docs/tuning-guides/58001_amd-epyc-9004-tg-vdi.pdf