Performance Tuning for Cisco UCS M8 Platforms with AMD EPYC 4th Gen and 5th Gen Processors
White paper
Cisco public
Document purpose and scope
The Basic Input-Output System (BIOS) initializes hardware and boots the operating system. BIOS settings control system behavior, with some directly impacting performance. This document outlines BIOS settings for Cisco UCS M8 servers with AMD EPYC 4th and 5th Gen processors, focusing on optimizing performance and energy efficiency for Cisco UCS X215c M8 Compute Nodes, Cisco UCS C245 M8 Rack Servers, and Cisco UCS C225 M8 Rack Servers. It also discusses BIOS settings for various workloads on these servers. The settings provided are generic and not specific to particular firmware releases.
What you will learn
This document guides users through system BIOS performance settings, offering suggestions to achieve optimal performance on Cisco UCS M8 servers with 4th and 5th Gen AMD EPYC CPUs. It aims to demystify BIOS options, helping users balance power savings and performance.
AMD EPYC 9004 Series processors
The AMD EPYC 9004 Series processors feature Zen 4 cores and AMD Infinity architecture. They integrate compute cores, memory controllers, I/O controllers, Reliability, Availability, and Serviceability (RAS), and security features into a System on a Chip (SoC). This series utilizes a Multi-Chip Module (MCM) Chiplet architecture, enhancing the SoC components. The architecture includes Core Complex Dies (CCDs) containing Core Complexes (CCXs), which house the Zen 4-based cores. The Zen 4 core, built on a 5nm process, offers improved Instructions Per Cycle (IPC) and frequency over previous generations, with enhanced L2 cache effectiveness. Each core supports Simultaneous Multithreading (SMT), allowing two hardware threads to run independently. A Core Complex (CCX) supports up to eight Zen 4 cores sharing an L3 cache. With SMT enabled, a single CCX can support up to 16 concurrent hardware threads.
These processors incorporate AMD 3D V-Cache die-stacking technology for improved chiplet integration and up to 96MB of L3 cache per die. The industry-leading logic stacking enables high interconnect densities, leading to lower latency, higher bandwidth, and better power/thermal efficiency. The CCDs connect to memory, I/O, and each other via an updated I/O Die (IOD) using AMD Infinity Fabric. The IOD supports up to 4 xGMI (or G-links) with speeds up to 32Gbps and exposes DDR5 memory channels, PCIe Gen5, CXL 1.1+, and Infinity Fabric links. Each IOD provides twelve Unified Memory Controllers (UMCs) supporting DDR5 memory. Each UMC supports up to 2 DIMMs per channel, allowing for up to 24 DIMMs per socket. 4th Gen AMD EPYC processors support up to 6TB of DDR5 memory per socket, offering increased memory bandwidth. Memory interleaving across 2, 4, 6, 8, 10, and 12 channels optimizes performance for various workloads. Processors feature 4 P-links and 4 G-links, with G-links usable for connecting to a second processor or providing additional PCIe Gen5 lanes. 4th Gen AMD EPYC processors support up to 128 lanes of PCIe Gen5 in single-socket and up to 160 lanes in dual-socket configurations.
| Item | Specification |
|---|---|
| Core process technology | 5-nanometer (nm) Zen 4 |
| Maximum number of cores | 128 |
| Maximum memory speed | 4800 Mega-Transfers per second (MT/s) |
| Maximum memory channels | 12 per socket |
| Maximum memory capacity | 6 TB per socket |
| PCIe Gen 5 lanes | 128 lanes (maximum) for 1-socket |
| PCIe Gen 5 lanes | 160 lanes (maximum) for 2-socket |
For more information, refer to the Overview of AMD EPYC 9004 Series Processors Microarchitecture.
AMD EPYC 9005 Series processors
5th Gen AMD EPYC processors support IT initiatives for data-center consolidation and modernization, catering to demanding enterprise applications. They enable AI expansion, improve energy efficiency, and support high-density virtualization and cloud environments. These processors deliver significant uplifts in instruction-per-clock-cycle (IPC) performance, particularly for ML, HPC, and enterprise workloads, with the efficiency-optimized Zen 5c core powering CPUs with the highest core counts for virtualized and cloud workloads. The hybrid, multichip architecture allows for decoupled innovation paths. The Zen 5 and Zen 5c cores represent advancements with new support for complex machine-learning and inferencing applications.
Zen 5 core: Optimized for high performance, with up to eight cores forming a core complex (CCX) featuring a 32-MB shared L3 cache. Up to 16 CCDs can be configured into an EPYC 9005 processor, supporting up to 128 cores in the SP5 form factor. Compared to the previous generation, Zen 5 cores offer 20 percent higher integer performance and 34 percent higher floating-point performance in 64-core processors within the same 360W TDP range.
Zen 5c core: Optimized for density and efficiency, sharing register-transfer logic with the Zen 5 core but with a smaller physical footprint for improved performance per watt. The Zen 5c core complex includes up to 16 cores and a shared 32-MB L3 cache. Up to 12 CCDs can be combined with an I/O die (IOD) to deliver CPUs with up to 192 cores in an SP5 form factor.
| Item | Specification |
|---|---|
| Core process technology | 4-nanometer (nm) Zen 5 and 3-nanometer Zen 5c |
| Maximum number of cores | 192 |
| Maximum L3 cache | 512 MB |
| Maximum memory speed | 6000 Mega-Transfers per second (MT/s) |
| Maximum memory channels | 12 per socket |
| Maximum memory capacity | 6 TB per socket |
| PCIe Gen 5 lanes | 128 lanes (maximum) for 1-socket |
| PCIe Gen 5 lanes | 160 lanes (maximum) for 2-socket |
Note: Cisco UCS M8 platforms support Zen 5c processors only up to 160 cores and 400W TDP.
For more information, refer to the Overview of AMD EPYC 9005 Series Processors Microarchitecture.
Non-Uniform Memory Access (NUMA) topology
AMD EPYC 9004 and 9005 Series processors utilize a Non-Uniform Memory Access (NUMA) architecture, where memory access latency varies based on proximity to processor cores and I/O controllers. Utilizing resources within the same NUMA node ensures good performance, while cross-node access increases latency. The system's NUMA Nodes Per Socket (NPS) BIOS setting can be adjusted to optimize this topology for specific operating environments and workloads. For example, NPS=4 divides the processor into four quadrants, each with 3 CCDs, 3 UMCs, and 1 I/O hub. Proximity within a quadrant offers the shortest processor-memory I/O distance. The cross-diagonal or cross-socket distance is the furthest. Core, memory, and I/O hub locality within a NUMA system is crucial for performance tuning.
Figure 1. AMD EPYC 4th Gen processor block diagram with NUMA domains
Optimizations in 4th Gen EPYC processors' Infinity Fabric interconnects have further reduced latency. For applications requiring minute latency improvements, creating affinity between memory ranges and CPU dies (Zen 4 or Zen 4c) can boost performance. Figure 1 illustrates this: dividing the I/O die into four quadrants for NPS=4 configuration shows six DIMMs feeding three memory controllers, closely connected via Infinity Fabric (GMI) to up to three Zen 4 CPU dies (or 24 CPU cores).
Figure 2. AMD EPYC 5th Gen processor block diagram with NUMA domains
Improvements in 5th Gen EPYC processors' AMD Infinity Fabric interconnects have further reduced latency. For applications needing marginal latency gains, establishing affinity between memory ranges and CPU dies (Zen 5 or Zen 5c) can enhance performance. Figure 2 demonstrates this: dividing the I/O die into four quadrants for NPS=4 configuration shows six DIMMs feeding three memory controllers, connected via Infinity Fabric (GMI) to up to four Zen 5 CPU dies (or three Zen 5c CPU dies).
NPS1: Configures all memory channels into a single NUMA node, encompassing all processor cores, memory, and PCIe devices. Memory is interleaved across all channels into a single address space.
NPS2: Divides the processor into two NUMA domains, each containing half the cores and memory channels. Memory is interleaved across the six memory channels within each domain. PCIe devices are local to the NUMA node containing their root complex. This setting reports two NUMA nodes per socket.
NPS4: Partitions the processor into four NUMA nodes per socket, with each logical quadrant acting as a NUMA domain. Memory is interleaved across the three memory channels of each quadrant. PCIe devices are local to the NUMA domain with their root complex. This configuration is recommended for HPC and highly parallel workloads. NPS4 is required for booting Windows systems with CPU SMT enabled on processors with over 64 cores, as Windows limits CPU groups to 64 logical cores.
Note: For Windows systems, ensure the number of logical processors per NUMA node is <=64 by using NPS2 or NPS4 instead of the default NPS1.
NPS0 (not recommended)
A setting of NPS=0 creates a single NUMA domain for the entire system, interleaving memory across all channels into one address space. All processor cores, memory, and PCIe devices across all sockets reside within this single NUMA domain.
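After choosing an NPS value, it helps to confirm what the operating system actually sees. The commands below are a minimal Linux sketch, assuming the numactl package is installed; output details vary by distribution and kernel version. With NPS4 on a two-socket system, for example, eight NUMA nodes should be reported.

```
# Show each NUMA node with its CPUs, memory size, and inter-node distances
numactl --hardware

# Quick summary of NUMA node count and the CPU ranges assigned to each node
lscpu | grep -i numa
```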
Layer 3 cache as NUMA Domain
The Layer 3 Cache as NUMA (L3CAN) BIOS option exposes each Layer-3 cache (one per CCD) as its own NUMA node. For instance, a single processor with 8 CCDs would have 8 NUMA nodes. A two-socket system would have 16 NUMA nodes. This setting can improve performance for NUMA-optimized workloads by pinning them to cores within a CCX and leveraging shared L3 cache. When disabled, NUMA domains follow the NPS setting.
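When Layer 3 Cache as NUMA is enabled, each CCX appears as its own NUMA node, and NUMA-aware pinning keeps a workload within one shared L3 cache. The sketch below assumes a Linux host with numactl installed; my_app is a placeholder workload, and the node number 2 is purely illustrative.

```
# List logical CPUs with their NUMA node and cache IDs (last column shows L1d:L1i:L2:L3)
lscpu --extended=CPU,NODE,SOCKET,CORE,CACHE

# Pin a workload's threads and memory to NUMA node 2 (one CCX/L3 when L3CAN is enabled)
numactl --cpunodebind=2 --membind=2 ./my_app
```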
Processor settings
This section details configurable processor options.
CPU SMT Mode
The CPU Simultaneous Multithreading (CPU SMT) option allows enabling or disabling logical processor cores. When set to Auto (enabled), each physical core acts as two logical cores, facilitating multithreaded applications. For some workloads, including HPC, CPU SMT can yield neutral or negative performance. Disabling CPU SMT might be beneficial, especially if the operating system lacks x2APIC support for more than 255 threads. Testing with CPU SMT enabled and disabled in your specific environment is recommended. Disable CPU SMT for single-threaded applications.
| Setting | Options |
|---|---|
| CPU SMT control | Auto (enabled; default), Enabled, Disabled |
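BIOS is the authoritative place to change CPU SMT, but on recent Linux kernels the setting can also be inspected and toggled at runtime for quick A/B testing. This is a sketch, assuming the kernel exposes the SMT sysfs interface.

```
# Current SMT state: "on", "off", or "forceoff"
cat /sys/devices/system/cpu/smt/control

# Disable SMT until the next reboot (useful for quick performance comparisons)
echo off | sudo tee /sys/devices/system/cpu/smt/control
```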
Secure Virtual Machine (SVM) mode
The Secure Virtual Machine (SVM) mode enables processor virtualization features, allowing the platform to run multiple operating systems in independent partitions. SVM mode can be set to Enabled or Disabled. If virtualization is not required, disable AMD virtualization technology and the AMD IOMMU option to avoid latency differences in memory access.
| Setting | Options |
|---|---|
| SVM | Enabled (default), Disabled |
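A quick Linux check, shown below as a sketch, confirms whether AMD-V (SVM) is exposed to the operating system after the BIOS change; the kvm_amd module check applies only if the KVM hypervisor is in use.

```
# A non-zero count means the svm CPU flag (AMD-V) is visible to the OS
grep -c -w svm /proc/cpuinfo

# If using KVM, confirm the AMD virtualization module is loaded
lsmod | grep kvm_amd
```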
DF C-states
The AMD Infinity Fabric can enter low-power states when idle, but transitioning back to full power may cause latency jitter. For low-latency or bursty I/O workloads, disabling Data Fabric (DF) C-states can improve performance at the cost of higher power consumption.
| Setting | Options |
|---|---|
| DF C-states | Auto (enabled; default), Enabled, Disabled |
ACPI SRAT L3 Cache as NUMA Domain
When the ACPI SRAT L3 Cache as NUMA Domain setting is enabled, each Layer-3 cache (one per CCD) is exposed as a NUMA node. This can enhance performance for NUMA-optimized workloads if they can be pinned to cores within a CCX and benefit from shared L3 cache. When disabled, NUMA domains are identified by the NUMA NPS setting. Some operating systems and hypervisors may not perform Layer 3-aware scheduling, while others benefit from Layer 3 being declared as a NUMA domain.
| Setting | Options |
|---|---|
| ACPI SRAT L3 Cache As NUMA Domain | Auto (disabled; default), Enabled, Disabled |
Algorithm Performance Boost Disable (APBDIS)
The APBDIS setting controls the Algorithm Performance Boost (APB) for the SMU. By default, the AMD Infinity Fabric dynamically switches between full-power and low-power fabric and memory clocks based on usage. In latency-sensitive scenarios, this transition can cause adverse latency effects. Setting APBDIS to 1 and a fixed Infinity Fabric P-state of 0 forces full-power mode, eliminating latency jitter. Setting a fixed Infinity Fabric P-state of 1 may reduce memory latency at the cost of memory bandwidth, benefiting latency-sensitive applications.
| Setting | Options |
|---|---|
| APBDIS | Auto (0; default), 0, 1 |
Fixed SOC P-State SP5F 19h
This setting determines the SOC P-State (independent or dependent), as reported by the ACPI _PSD object, and changes the SOC P-State when APBDIS is enabled. 'F' refers to the processor family.
| Setting | Options |
|---|---|
| Fixed SOC P-State SP5F 19h | P0 (default), P1, P2 |
xGMI settings: connection between sockets
In two-socket systems, processors are interconnected via socket-to-socket xGMI links, part of the Infinity Fabric. NUMA-unaware workloads may require maximum xGMI bandwidth for cross-socket communication. NUMA-aware workloads might prefer to minimize xGMI power, potentially reducing cross-socket traffic and utilizing increased CPU boost. xGMI lane width can be reduced from x16 to x8 or x2, or an xGMI link can be disabled to conserve power.
xGMI link configuration and 4-link xGMI max speed (Cisco xGMI max Speed)
The number of xGMI links and maximum speed can be configured. Lowering the speed can save uncore power, potentially increasing core frequency or reducing overall power, but it decreases cross-socket bandwidth and increases latency. Cisco UCS C245 M8 Rack Server supports four xGMI links with a maximum speed of 32 Gbps. Enabling Cisco xGMI max speed sets xGMI Link Configuration to 4 and 4-Link xGMI Max Speed to 32 Gbps. Disabling it applies default values.
| Setting | Options |
|---|---|
| Cisco XGMI Max Speed | Enabled, Disabled |
| xGMI Link Configuration | |
| 4-Link xGMI Max Speed | Auto (32Gbps; default), 20Gbps, 25Gbps, 32Gbps |
| 3-Link xGMI Max Speed | |
Note: This BIOS feature applies only to Cisco UCS X215c M8 Compute Nodes and Cisco UCS C245 M8 Rack Servers with 2-socket configurations.
Enhanced CPU performance
This BIOS option allows users to adjust enhanced CPU performance settings. When enabled, it optimizes processor settings for aggressive operation, potentially improving overall CPU performance but increasing power consumption. Values can be Auto or Disabled. By default, this option is disabled.
Note: This BIOS feature applies only to Cisco UCS X215c M8 Compute Nodes and Cisco UCS C245 M8 Rack Servers. When enabled, setting the fan policy to maximum power is highly recommended. By default, this BIOS setting is Disabled.
Memory settings
This section covers memory configuration options.
NUMA Nodes Per Socket (NPS)
The NPS setting specifies the number of NUMA Nodes Per Socket, balancing local memory latency for NUMA-aware workloads against per-core memory bandwidth for non-NUMA-friendly workloads. Socket interleave (NPS0) attempts to interleave two sockets into one NUMA node. 4th Gen AMD EPYC processors support various NPS values depending on internal topology. NPS2 and NPS4 might not be available on all processors or memory configurations. For single-socket servers, NPS can be 1, 2, or 4. Performance for NUMA-optimized applications can improve with NPS values greater than 1. The default configuration (one NUMA domain per socket) is recommended for most workloads. NPS4 is recommended for High-Performance Computing (HPC) and highly parallel workloads. For 200-Gbps network adapters, NPS2 may offer a balance between memory latency and bandwidth for the Network Interface Card (NIC). This setting is independent of the ACPI SRAT L3 Cache as NUMA Domain setting. When ACPI SRAT L3 Cache as NUMA Domain is enabled, this setting determines memory interleaving granularity. With NPS1, all twelve memory channels of a socket are interleaved. With NPS2, six channels are interleaved within each NUMA node. With NPS4, three channels are interleaved within each NUMA node.
| Setting | Options |
|---|---|
| NUMA Nodes per Socket | Auto (NPS1; default), NPS0, NPS1, NPS2, NPS4 |
I/O Memory Management Unit (IOMMU)
The I/O Memory Management Unit (IOMMU) provides several benefits and is required for the x2 Advanced Programmable Interrupt Controller (x2APIC). Enabling IOMMU allows devices like the EPYC integrated SATA controller to issue separate interrupt requests (IRQs) for each device, instead of one IRQ for the subsystem. IOMMU also enhances operating system protection for Direct Memory Access (DMA)-capable I/O devices and helps filter and remap interrupts from peripheral devices.
| Setting | Options |
|---|---|
| IOMMU | Auto (enabled; default), Enabled, Disabled |
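To verify from Linux that the IOMMU is active after enabling the BIOS option, the following sketch checks the boot log and the IOMMU-group sysfs tree; message text varies by kernel version.

```
# AMD IOMMU initialization messages appear as "AMD-Vi" entries in the kernel log
sudo dmesg | grep -i -e "AMD-Vi" -e iommu

# IOMMU groups are populated only when the IOMMU is enabled
ls /sys/kernel/iommu_groups/
```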
Memory interleaving
Memory interleaving increases memory bandwidth by reading consecutive memory blocks from different memory banks, preventing wait times for memory transfers. AMD recommends populating all twelve memory channels per CPU socket with equal capacity for optimal performance in twelve-way interleaving mode.
| Setting | Options |
|---|---|
| Memory interleaving | Auto (enabled; default), Enabled, Disabled |
Power settings
This section covers power state settings.
Core performance boost
The Core performance boost feature allows the processor to exceed its base frequency based on power, thermal headroom, and active cores. This can cause jitter due to frequency transitions. For workloads not requiring maximum core frequency, setting a maximum core boost frequency can improve power efficiency. This setting limits the maximum boost frequency, not sets a fixed frequency. Actual boost performance depends on various factors and other settings.
| Setting | Options |
|---|---|
| Core performance boost | Auto (enabled; default), Disabled |
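On Linux, the effect of Core Performance Boost can be observed without rebooting. The sketch below assumes the acpi-cpufreq driver, which exposes a global boost switch; the path differs with other drivers such as amd_pstate, and the eight-core sample is illustrative.

```
# 1 = boost permitted, 0 = boost disabled (acpi-cpufreq driver)
cat /sys/devices/system/cpu/cpufreq/boost

# Sample the reported core frequencies for the first 8 logical CPUs while a workload runs
grep -m 8 "MHz" /proc/cpuinfo
```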
Global C-state control
C-states are processor core inactive power states, with C0 being the operational state and higher C-states being low-power idle states. Global C-state control enables or disables C-states. Auto (enabled) allows cores to enter lower power states, which can cause jitter due to frequency transitions. Disabled forces CPU cores to operate in the C0 and C1 states. C-states are exposed via ACPI objects and can be requested by software. The 4th Gen AMD EPYC processor core supports I/O-based C0, C1, and C2 states.
| Setting | Options |
|---|---|
| Global C-state control | Auto, Enabled, Disabled (default) |
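The C-states actually exposed to the operating system can be listed from Linux, which is useful for confirming that disabling Global C-state control had the intended effect. This is a sketch, assuming the cpupower utility is installed.

```
# List the idle (C) states the kernel knows about, with their exit latencies
cpupower idle-info

# Names of the idle states exposed for CPU 0 through the cpuidle sysfs interface
grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
```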
Layer-1 and Layer-2 stream hardware prefetchers
Layer-1 and Layer-2 stream hardware prefetchers (L1 Stream HW Prefetcher and L2 Stream HW Prefetcher) gather data to keep the core pipeline busy. Most workloads benefit from these, but some random workloads perform better with one or both disabled. By default, both are enabled.
| Setting | Options |
|---|---|
| L1 Stream HW Prefetcher | Auto (enabled; default), Enabled, Disabled |
| L2 Stream HW Prefetcher | Auto (enabled; default), Enabled, Disabled |
Determinism slider
The Determinism slider allows selection between uniform performance across identically configured systems (Performance setting) or maximum individual system performance with potential variation across the data center (Power setting). For the Performance setting, ensure configurable Thermal Design Power (cTDP) and Package Power Limit (PPL) are set to the same value. The default Auto setting typically favors Performance mode, allowing lower power operation with consistent performance. For maximum performance, set the Determinism slider to Power.
| Setting | Options |
|---|---|
| Determinism slider | Auto (Power; default), Power, Performance |
CPPC: Collaborative Processor Performance Control
Collaborative Processor Performance Control (CPPC), introduced with ACPI 5.0, facilitates communication of performance requirements between the operating system and hardware. It allows the OS to control turbo boost for energy efficiency. Not all operating systems support CPPC; Microsoft added support starting with Windows Server 2016.
| Setting | Options |
|---|---|
| CPPC | Auto (disabled; default), Enabled, Disabled |
Power profile selection F19h
The DF P-state selection in the profile policy is overridden by the DF P-state range BIOS options or the APBDIS BIOS option. 'F' denotes the processor family, and 'M' denotes the model.
| Setting | Options |
|---|---|
| Power profile selection F19h | High-performance mode (default), Efficiency mode, Maximum I/O performance mode, Balanced memory performance mode, Balanced core performance mode, Balanced core memory performance mode |
Fan control policy
Fan policy allows control over fan speed to reduce server power consumption and noise. Previously, fan speed increased automatically when component temperatures exceeded a threshold. To keep fan speeds low, thresholds were set high, which suited most configurations but could not address specific requirements. For maximum CPU performance, the CPUs must be cooled well below the threshold temperature, which requires high fan speeds and increases power consumption and noise. For minimal power consumption, fans must run very slowly, at the risk of components overheating, while other configurations need moderate fan speeds. The available fan policies are:
- Balanced: Default policy, suitable for most configurations but may not be ideal for servers with easily overheating PCIe cards.
- Low Power: Suitable for minimal-configuration servers without PCIe cards.
- High Power: For configurations requiring fan speeds from 60-85 percent, suitable for servers with easily overheating PCIe cards. Minimum fan speed varies by platform but is approximately 60-85 percent.
- Maximum Power: For configurations requiring extremely high fan speeds (70-100 percent), suitable for servers with easily overheating PCIe cards. Minimum fan speed varies by platform but is approximately 70-100 percent.
- Acoustic: Reduces fan speed for noise-sensitive environments. May cause short-term throttling for reduced noise, potentially impacting performance transiently.
Note: This policy is configurable for standalone Cisco UCS C-Series M8 servers through the Cisco Integrated Management Controller (IMC) console and Cisco IMC Supervisor. For Cisco Intersight-managed C-Series M8 servers, it is configurable through fan policies.
BIOS settings for Cisco UCS X215c M8 Compute Nodes, Cisco UCS C245 M8 Rack Servers, and Cisco UCS C225 M8 Rack Servers
Table 17 lists BIOS token names, defaults, and supported values for Cisco UCS M8 servers with AMD EPYC 4th and 5th Gen processor families.
| BIOS token name | Default value | Supported values |
|---|---|---|
| Processor | ||
| CPU SMT mode | Auto (enabled) | Auto, Enabled, Disabled |
| SVM mode | Enabled | Enabled, Disabled |
| DF C-states | Auto (enabled) | Auto, Enabled, Disabled |
| ACPI SRAT L3 Cache as NUMA Domain | Auto (disabled) | Auto, Enabled, Disabled |
| APBDIS | Auto (0) | Auto, 0, 1 |
| Fixed SOC P-State SP5F 19h | P0 | P0, P1, P2 |
| 4-link xGMI max speed* | Auto (32Gbps) | Auto, 20Gbps, 25Gbps, 32Gbps |
| Enhanced CPU performance* | Disabled | Auto, Disabled |
| Memory | ||
| NUMA nodes per socket | Auto (NPS1) | Auto, NPS0, NPS1, NPS2, NPS4 |
| IOMMU | Auto (enabled) | Auto, Enabled, Disabled |
| Memory interleaving | Auto (enabled) | Auto, Enabled, Disabled |
| Power/performance | ||
| Core performance boost | Auto (enabled) | Auto, Disabled |
| Global C-state control | Disabled | Auto, Enabled, Disabled |
| L1 Stream HW Prefetcher | Auto (enabled) | Auto, Enabled, Disabled |
| L2 Stream HW Prefetcher | Auto (enabled) | Auto, Enabled, Disabled |
| Determinism slider | Auto (power) | Auto, Power, Performance |
| CPPC | Auto (disabled) | Auto, Disabled, Enabled |
| Power profile selection F19h | High-performance mode | Balanced memory performance mode, efficiency mode, high-performance mode, maximum I/O performance mode, balanced core performance mode, balanced core memory performance mode |
BIOS recommendations for various general-purpose workloads
This section summarizes recommended BIOS settings for optimizing general-purpose workloads, categorized as:
- Computation-intensive
- I/O-intensive
- Energy efficiency
- Low latency
CPU-intensive workloads
For CPU-intensive workloads, the goal is to distribute work across multiple CPUs to minimize processing time. This involves running job portions in parallel, with CPUs exchanging information rapidly. These workloads benefit from processors or memory achieving maximum turbo frequency, with power management settings aiding frequency increases. Optimizations focus on increasing processor core and memory speed.
I/O-intensive workloads
I/O-intensive optimizations focus on maximizing throughput between I/O and memory. Processor utilization-based power management features affecting links between I/O and memory are disabled.
Energy-efficient workloads
Energy-efficiency optimizations are common, balanced settings that benefit most workloads while enabling power-management features that have minimal impact on performance. The applied settings favor good general application performance rather than aggressively prioritizing power savings. Processor power-management settings can affect performance with virtualization operating systems. These settings are recommended for users who do not typically tune BIOS settings.
Low-latency workloads
Workloads requiring low latency, such as financial trading and real-time processing, demand consistent system response and minimal computational latency. Maximum speed and throughput are often sacrificed for lower latency. Processor power management and other features that might introduce latency are disabled. Achieving low latency requires understanding system hardware configuration, including core count, threads per core, NUMA nodes, CPU/memory arrangements, and cache topology. BIOS options are generally OS-independent, but a tuned low-latency operating system is also necessary for deterministic performance.
| BIOS options | CPU intensive | I/O intensive | Energy efficiency | Low latency |
|---|---|---|---|---|
| Processor | ||||
| CPU SMT mode | Auto (enabled) | Auto | Auto | Disabled |
| SVM mode | Enabled | Enabled | Enabled | Disabled |
| DF C-states | Auto (enabled) | Auto | Disabled | Disabled |
| ACPI SRAT L3 Cache as NUMA Domain | Auto (disabled) | Enabled | Auto | Auto |
| APBDIS | Auto (0) | 1 | Auto | Auto |
| Fixed SOC P-State SP5F 19h | P0 | P0 | P2 | P0 |
| 4-link xGMI max speed | Auto (32Gbps) | Auto | Auto | Auto |
| Enhanced CPU performance | Disabled | Auto | Disabled | Disabled |
| Memory | ||||
| NUMA nodes per socket | Auto (NPS1) | NPS4 | NPS4 | Auto |
| IOMMU | Auto (enabled) | Auto* | Auto | Auto |
| Memory interleaving | Auto (enabled) | Auto* | Auto | Auto |
| Power/performance | ||||
| Core performance boost | Auto (enabled) | Auto | Auto | Disabled |
| Global C-State control | Disabled | Disabled | Enabled | Disabled |
| L1 Stream HW Prefetcher | Auto (enabled) | Auto | Auto | Disabled |
| L2 Stream HW Prefetcher | Auto (enabled) | Auto | Auto | Disabled |
| Determinism slider | Auto (power) | Auto | Auto | Performance |
| CPPC | Auto (disabled) | Auto | Auto | Enabled |
| Power profile selection F19h | High-performance mode | High-performance mode | Maximum I/O performance mode | Efficiency mode |
Note: BIOS tokens marked with * are applicable only to Cisco UCS X215c M8 Compute Nodes and Cisco UCS C245 M8 Rack Servers.
If your application scenario does not require virtualization, disable AMD virtualization technology. With virtualization disabled, also disable the AMD IOMMU option, as it can cause latency differences in memory access. See the AMD performance tuning guide for more information.
Additional BIOS recommendations for enterprise workloads
This section provides optimal BIOS settings for enterprise workloads, including:
- Virtualization
- Containers
- Relational Database (RDBMS)
- Analytical Database (Bigdata)
- HPC workloads
Virtualization workloads
AMD Virtualization Technology offers manageability, security, and flexibility for IT environments using software-based virtualization. It allows a single server to be partitioned into multiple independent servers, running different applications simultaneously. Enabling AMD Virtualization Technology in the BIOS is crucial for supporting virtualization workloads. CPUs supporting hardware virtualization enable running multiple operating systems in virtual machines, though this incurs some overhead compared to native OS performance. For more information, see AMD's VMware vSphere Tuning Guide.
Container workloads
Containerizing applications abstracts infrastructure and OS differences for efficiency. Each container includes an entire runtime environment, application dependencies, libraries, and configuration files. Production environments require management for consistent uptime, with automatic container restarts if one fails. Workloads that scale well on bare metal should exhibit similar scaling in a container environment with minimal performance overhead. Large overhead often indicates suboptimal application settings or container configuration. CPU load balancing by Kubernetes or other schedulers may differ from bare metal environments. For more information, see AMD's Kubernetes Container Tuning Guide.
Relational Database workloads
Integrating RDBMS like Oracle, MySQL, PostgreSQL, or Microsoft SQL Server with AMD EPYC processors can enhance database performance, particularly in high-concurrency, rapid query processing, and efficient resource utilization environments. The AMD EPYC processor architecture effectively leverages multiple cores and threads, benefiting transactional workloads, analytics, and large-scale data processing. Using AMD EPYC processors in RDBMS environments can significantly improve performance, scalability, and cost-efficiency. 4th Gen AMD EPYC processors deliver high Input/Output Operations Per Second (IOPS) and throughput for all databases. Selecting the right CPU is vital for optimal database application performance. For more information, see AMD's RDBMS Tuning Guide.
Big Data Analytics workloads
Big Data Analytics involves examining vast data to uncover patterns, correlations, and insights for better decision-making. This requires significant computational power, memory capacity, and I/O bandwidth, areas where AMD EPYC processors excel. AMD EPYC processors offer a robust platform for Big Data Analytics, providing the necessary computational power, memory capacity, and I/O bandwidth for large-scale data processing. Their scalability, cost efficiency, and energy efficiency make them a compelling choice for organizations building or upgrading their Big Data Analytics infrastructure.
HPC (High-performance computing) workloads
HPC refers to cluster-based computing using multiple interconnected nodes to process large datasets faster than single systems. HPC workloads are computation and network-I/O intensive, requiring high-quality CPU components and high-speed, low-latency network fabrics for Message Passing Interface (MPI) connections. Computing clusters have a head node for administration and a scheduler for managing jobs. HPC workloads often require large numbers of nodes with nonblocking MPI networks for scalability. High-bandwidth I/O networks are essential. Enabling Direct Cache Access (DCA) support allows network packets to go directly into the Layer 3 processor cache, reducing HPC I/O cycles and increasing system performance. For more information, see AMD's High-Performance Computing (HPC) Tuning Guide.
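For MPI-based HPC workloads, process placement should match the NUMA layout created by the NPS setting. The example below is a sketch using Open MPI option names; flags differ across MPI implementations and versions, hpc_app is a placeholder binary, and the rank count of 64 is illustrative.

```
# Distribute ranks round-robin across NUMA domains and bind each rank to a single core
mpirun --map-by numa --bind-to core -np 64 ./hpc_app

# Print the resulting CPU bindings so the placement can be verified
mpirun --map-by numa --bind-to core --report-bindings -np 64 ./hpc_app
```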
| BIOS options | Virtualization/ container | RDBMS | Big-data analytics | HPC |
|---|---|---|---|---|
| Processor | ||||
| CPU SMT mode | Enabled | Enabled | Disabled | Disabled |
| SVM mode | Enabled | Enabled | Enabled | Enabled |
| DF C-states | Auto (Enabled) | Disabled | Auto | Auto |
| ACPI SRAT L3 Cache as NUMA Domain | Auto (Disabled) | Auto | Auto | Auto |
| APBDIS | Auto (0) | 1 | 1 | 1 |
| Fixed SOC P-State SP5F 19h | P0 | P0 | P0 | P0 |
| 4-link xGMI max speed* | Auto (32Gbps) | Auto | Auto | Auto |
| Enhanced CPU performance* | Disabled | Disabled | Disabled | Auto |
| Memory | ||||
| NUMA nodes per socket | Auto (NPS1) | NPS4 | Auto | NPS4 |
| IOMMU | Auto (Enabled) | Auto | Auto | Auto |
| Memory interleaving | Auto (Enabled) | Auto | Auto | Auto |
| Power/performance | ||||
| Core performance boost | Auto (Enabled) | Auto | Auto | Auto |
| Global C-State control | Disabled | Enabled | Enabled | Enabled |
| L1 Stream HW Prefetcher | Auto (Enabled) | Auto | Auto | Auto |
| L2 Stream HW Prefetcher | Auto (Enabled) | Auto | Auto | Auto |
| Determinism slider | Auto (Power) | Auto | Auto | Auto |
| CPPC | Auto (Disabled) | Enabled | Auto | Enabled |
| Power profile selection F19h | High-performance mode | High-performance mode | Maximum I/O performance mode | High-performance mode |
Note: BIOS tokens marked with * are not applicable to single-socket optimized platforms such as the Cisco UCS C225 M8 1U Rack Server.
- If your workloads have few vCPUs per virtual machine (less than a quarter of the number of cores per socket), the following settings tend to provide the best performance: NUMA NPS (nodes per socket) = 4, LLC As NUMA turned on.
- If your workload virtual machines have a large number of vCPUs (greater than half the number of cores per socket), the following settings tend to provide the best performance: NUMA NPS (nodes per socket) = 1, LLC As NUMA turned off.
For more information, see the VMware vSphere Tuning Guide.
Operating system tuning guidance for high performance
Microsoft Windows, VMware ESXi, Red Hat Enterprise Linux, and SUSE Linux operating systems have default power management features. Tuning the operating system is necessary for optimal performance. For additional performance documentation, see the AMD EPYC performance tuning guides.
Linux (Red Hat and SUSE)
The CPUfreq governor defines system CPU power characteristics, influencing performance. Each governor has unique behavior and suitability for different workloads. The 'performance' governor forces the CPU to its highest clock frequency; the frequency is statically set and does not change, so it offers no power savings. It is suitable for heavy workloads when the CPU is rarely idle. The default 'ondemand' governor allows the CPU to reach maximum frequency under high load and minimum frequency when idle, adjusting power consumption at the cost of latency from frequency switching. The performance governor can be set using the cpupower frequency-set -g performance command.
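A minimal sketch of checking and setting the governor with the cpupower utility follows; package names and available governors depend on the distribution and the cpufreq driver in use (for example, amd_pstate or acpi-cpufreq).

```
# Show the currently active cpufreq policy and governor
cpupower frequency-info --policy

# Set the performance governor on all CPUs (not persistent across reboots)
sudo cpupower frequency-set -g performance
```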
For additional information:
- Red Hat Enterprise Linux: Set the performance CPUfreq governor.
- SUSE Enterprise Linux Server: Set the performance CPUfreq governor.
Microsoft Windows Server 2019 and 2022
For Microsoft Windows Server 2019 and 2022, the default 'Balanced' power plan conserves energy but can increase latency and cause performance issues for CPU-intensive applications. For maximum performance, set the power plan to 'High Performance'.
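One way to switch the plan from an elevated command prompt is shown below as a sketch; SCHEME_MIN is the built-in alias for the High performance plan, and the plan can also be changed through the Power Options control panel.

```
rem List the available power schemes and mark the active one
powercfg /list

rem Activate the built-in High performance power scheme
powercfg /setactive SCHEME_MIN
```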
For additional information, see the following link:
- Microsoft Windows and Hyper-V: Set the power policy to High Performance.
VMware ESXi
In VMware ESXi, host power management is designed to reduce power consumption. Set the power policy to 'High Performance' to achieve maximum performance.
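The policy is normally set in the vSphere Client under the host's power management settings. As a sketch, it can also be changed from the ESXi shell through the /Power/CpuPolicy advanced option; verify the option path and accepted values against your ESXi version.

```
# Show the current host power policy advanced option
esxcli system settings advanced list -o /Power/CpuPolicy

# Set the host power policy to High Performance
esxcli system settings advanced set -o /Power/CpuPolicy -s "High Performance"
```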
For additional information, see the following links:
- VMware ESXi: Set the power policy to High Performance.
Conclusion
When tuning system BIOS settings for performance, consider processor and memory options. Prioritize performance-optimizing options over power savings if peak performance is the goal. Experiment with settings like memory interleaving and CPU hyperthreading. Crucially, assess the impact of any settings on application performance requirements.
For more information
For more information about Cisco UCS M8 servers with AMD EPYC 4th Gen and 5th Gen processors, consult the following resources:
- IMM BIOS token guide: https://www.cisco.com/c/en/us/td/docs/unified_computing/ucs/Intersight/IMM_BIOS_Tokens_Guide/b_IMM_Server_BIOS_Tokens_Guide.pdf
- Cisco UCS X215c M8 Compute Node: https://www.cisco.com/c/en/us/products/collateral/servers-unified-computing/ucs-x-series-modular-system/ucs-x215c-m8-compute-node-aag.html
- Cisco UCS C245 M8 Rack Server: https://www.cisco.com/c/en/us/products/collateral/servers-unified-computing/ucs-c-series-rack-servers/ucs-c245-m8-rack-server-aag.html
- Cisco UCS C225 M8 Rack Server: https://www.cisco.com/c/en/us/products/collateral/servers-unified-computing/ucs-c-series-rack-servers/ucs-c225-m8-rack-server-aag.html
- AMD EPYC tuning guides:
- https://developer.amd.com/resources/epyc-resources/epyc-tuning-guides/
- https://www.amd.com/content/dam/amd/en/documents/epyc-technical-docs/tuning-guides/58015-epyc-9004-tg-architecture-overview.pdf
- https://www.amd.com/content/dam/amd/en/documents/epyc-technical-docs/white-papers/58649_amd-epyc-tg-low-latency.pdf
- https://www.amd.com/content/dam/amd/en/documents/epyc-technical-docs/tuning-guides/57996-epyc-9004-tg-rdbms.pdf
- https://www.amd.com/content/dam/amd/en/documents/epyc-technical-docs/tuning-guides/58002_amd-epyc-9004-tg-hpc.pdf
- https://www.amd.com/content/dam/amd/en/documents/epyc-technical-docs/tuning-guides/58008-epyc-9004-tg-containers-on-kubernetes.pdf
- https://www.amd.com/content/dam/amd/en/documents/epyc-technical-docs/tuning-guides/58013-epyc-9004-tg-hadoop.pdf
- https://www.amd.com/content/dam/amd/en/documents/epyc-technical-docs/tuning-guides/58007-epyc-9004-tg-mssql-server.pdf
- https://www.amd.com/content/dam/amd/en/documents/epyc-technical-docs/tuning-guides/58001_amd-epyc-9004-tg-vdi.pdf