GDC2024 AMD Ryzen Processor Software Optimization
AMD RYZEN PROCESSOR SOFTWARE OPTIMIZATION
KEN MITCHELL

AGENDA
· Abstract · Speaker Biography · Products · Data Flow · Microarchitecture · Best Practices · Optimizations

AMD PUBLIC | GDC24 | AMD RYZEN PROCESSOR SOFTWARE OPTIMIZATION | MARCH 2024


ABSTRACT

· Break through CPU bottlenecks to reach higher frames-per-second!
· Dive into data flow, simultaneous multithreading, resource sharing, instruction set evolution, cache hierarchies, and coherency.
· Unlock powerful profiling tools and application analysis techniques.
· Discover best practices and lessons learned.
· Attack valuable code optimization opportunities.
· Examples include C/C++, assembly, and hardware performance-monitoring counters.


SPEAKER BIOGRAPHY
· Ken Mitchell is a Fellow and Technical Lead in the AMD Software Performance Engineering team, where he collaborates with Microsoft® Windows® and AMD engineers to optimize AMD processors for better performance-per-watt. He began working at AMD in 2005. His previous work includes helping game developers utilize AMD processors efficiently, analyzing PC applications for performance projections of future AMD products, and developing system benchmarks. Ken earned a Bachelor of Science degree in Computer Science at the University of Texas at Austin.
· Kenneth.Mitchell@amd.com


PRODUCTS


FORMER CODE NAMES

CPU Architecture | Mobile (Laptop) | Desktop | Workstation
"Zen 4" | "Hawk Point" AMD Ryzen 8040 Series; "Phoenix" AMD Ryzen 7040 Series | "Raphael AM5" AMD Ryzen 7000 Series | "Storm Peak" AMD Threadripper 7000 Series
"Zen 3" | "Rembrandt" AMD Ryzen 6000 Series; "Cezanne" AMD Ryzen 5000 Series | "Vermeer" AMD Ryzen 5000 Series | "Chagall PRO" AMD Threadripper 5000 Series
"Zen 2" | "Renoir" AMD Ryzen 4000 Series | "Matisse" AMD Ryzen 3000 Series | "Castle Peak" AMD Threadripper 3000 Series
"Zen" | "Picasso" AMD Ryzen 3000 Series; "Raven Ridge" AMD Ryzen 2000 Series | "Pinnacle Ridge" AMD Ryzen 2000 Series; "Summit Ridge" AMD Ryzen 1000 Series | "Threadripper" AMD Threadripper 1000 Series

· Table shown does not include all former code names for each CPU architecture.

"HAWK POINT" AMD RYZEN 8040 SERIES PROCESSORS

Mobile Model | Cores / Threads | Boost / Base Frequency | GPU Compute Units | AMD Ryzen AI | TDP
AMD Ryzen 9 8945HS | 8 / 16 | Up to 5.2GHz / 4.0GHz | 12 | Yes | 45W
AMD Ryzen 7 8845HS | 8 / 16 | Up to 5.1GHz / 3.8GHz | 12 | Yes | 45W
AMD Ryzen 7 8840U | 8 / 16 | Up to 5.1GHz / 3.3GHz | 12 | Yes | 28W
AMD Ryzen 7 8840HS | 8 / 16 | Up to 5.1GHz / 3.3GHz | 12 | Yes | 28W
AMD Ryzen 5 8645HS | 6 / 12 | Up to 5.0GHz / 4.3GHz | 8 | Yes | 45W
AMD Ryzen 5 8640U | 6 / 12 | Up to 4.9GHz / 3.5GHz | 8 | Yes | 28W
AMD Ryzen 5 8640HS | 6 / 12 | Up to 4.9GHz / 3.5GHz | 8 | Yes | 28W
AMD Ryzen 5 8540U | 6 / 12 | Up to 4.9GHz / 3.2GHz | 4 | No | 28W
AMD Ryzen 3 8440U | 4 / 8 | Up to 4.7GHz / 3.0GHz | 4 | No | 28W

"RAPHAEL AM5" AMD RYZEN 7000 SERIES PROCESSORS

Desktop Model | Cores / Threads | Boost / Base Frequency | GPU Compute Units | AMD Ryzen AI | TDP
AMD Ryzen 9 7950X3D | 16 / 32 | Up to 5.7GHz / 4.2GHz | 2 | No | 120W
AMD Ryzen 9 7950X | 16 / 32 | Up to 5.7GHz / 4.5GHz | 2 | No | 170W
AMD Ryzen 9 7900X3D | 12 / 24 | Up to 5.6GHz / 4.4GHz | 2 | No | 120W
AMD Ryzen 9 7900X | 12 / 24 | Up to 5.6GHz / 4.7GHz | 2 | No | 170W
AMD Ryzen 9 7900 | 12 / 24 | Up to 5.4GHz / 3.7GHz | 2 | No | 65W
AMD Ryzen 7 7800X3D | 8 / 16 | Up to 5.0GHz / 4.2GHz | 2 | No | 120W
AMD Ryzen 7 7700X | 8 / 16 | Up to 5.4GHz / 4.5GHz | 2 | No | 105W
AMD Ryzen 7 7700 | 8 / 16 | Up to 5.3GHz / 3.8GHz | 2 | No | 65W
AMD Ryzen 5 7600X | 6 / 12 | Up to 5.3GHz / 4.7GHz | 2 | No | 105W
AMD Ryzen 5 7600 | 6 / 12 | Up to 5.1GHz / 3.8GHz | 2 | No | 65W
AMD Ryzen 5 7500F | 6 / 12 | Up to 5.0GHz / 3.7GHz | 0 | No | 65W

"STORM PEAK" AMD THREADRIPPER PRO 7000 WX-SERIES PROCESSORS

Workstation Model | Cores / Threads | Boost / Base Frequency | GPU Compute Units | AMD Ryzen AI | TDP
AMD Ryzen Threadripper PRO 7995WX | 96 / 192 | Up to 5.1GHz / 2.5GHz | 0 | No | 350W
AMD Ryzen Threadripper PRO 7985WX | 64 / 128 | Up to 5.1GHz / 3.2GHz | 0 | No | 350W
AMD Ryzen Threadripper PRO 7975WX | 32 / 64 | Up to 5.3GHz / 4.0GHz | 0 | No | 350W
AMD Ryzen Threadripper PRO 7965WX | 24 / 48 | Up to 5.3GHz / 4.2GHz | 0 | No | 350W
AMD Ryzen Threadripper PRO 7955WX | 16 / 32 | Up to 5.3GHz / 4.5GHz | 0 | No | 350W
AMD Ryzen Threadripper PRO 7945WX | 12 / 24 | Up to 5.3GHz / 4.7GHz | 0 | No | 350W

DATAFLOW

"HAWK POINT"

[Data flow diagram; labels as extracted]
· Core (cclk): 32B fetch from the 32K I-Cache (8-way); 3*32B load and 2*32B store to the 32K D-Cache (8-way).
· L2: 1024K I+D Cache, 8-way; 32B/cycle links shown on the L1 side.
· L3: 16M I+D Cache, 16-way (l3clk); 32B/cycle.
· Data Fabric (fclk); 32B/cycle.
· Unified Memory Controller (uclk): 32B/cycle. DRAM Channel (memclk): 8B/cycle.
· Other blocks shown: RDNA3 (4x32B/cycle), Media (32B/cycle), NPU (32B/cycle), IO Hub (64B/cycle, lclk).

· AMD Ryzen 9 8945HS, 35-54W TDP, 8 cores, 16 threads, up to 5.2 GHz max boost clock, 4.0 GHz base clock with 2 channels of DDR5 memory.
· Integrated AMD RDNA 3 graphics and Neural Processing Unit (NPU).

"RAPHAEL AM5"

[Data flow diagram: two CCDs and one IOD; labels as extracted]
· Core (cclk): 32B fetch from the 32K I-Cache (8-way); 3*32B load and 2*32B store to the 32K D-Cache (8-way).
· L2: 1024K I+D Cache, 8-way; 32B/cycle links shown on the L1 side.
· L3: 32M I+D Cache, 16-way (l3clk); 32B/cycle on the L2 side; 32B/cycle read and 16B/cycle write on the Data Fabric side.
· Data Fabric (fclk).
· Unified Memory Controller (uclk): 32B/cycle. DRAM Channel (memclk): 2x8B/cycle.
· Other blocks shown: RDNA2 (2x32B/cycle), Media (32B/cycle), IO Hub (64B/cycle, lclk).

· AMD Ryzen 9 7950X, 170W TDP, 16 cores, 32 threads, up to 5.7 GHz max boost clock, 4.5 GHz base clock with 2 channels of DDR5 memory.
· Two Core Complex Dies (CCDs). Each CCD has one 32M L3 cache.

"STORM PEAK"

[Data flow diagram: CCDs arranged around the IOD; labels as extracted]
· Core (cclk): 32B fetch from the 32K I-Cache (8-way); 3*32B load and 2*32B store to the 32K D-Cache (8-way).
· L2: 1024K I+D Cache, 8-way; 32B/cycle links shown on the L1 side.
· L3: 32M I+D Cache, 16-way (l3clk); 32B/cycle on the L2 side; 32B/cycle read and 16B/cycle write on the Data Fabric side.
· Data Fabric (fclk).
· Unified Memory Controller (uclk): 32B/cycle. DRAM Channel (memclk): 2x8B/cycle.
· IO Hub (64B/cycle, lclk).

· AMD Ryzen Threadripper PRO 7995WX, 350W TDP, 96 cores, 192 threads, up to 5.1 GHz boost, 2.5 GHz base with 8 channels of DDR5 memory.
· Three CCDs per Data Fabric quadrant shown.

MICROARCHITECTURE

AMD "ZEN 4"

[Block diagram; labels as extracted]
· Front end: Branch Prediction; 32K I-Cache, 8-way; Decode, 4 instructions/cycle; Op Cache, 9 macro ops/cycle; Op Queue; Dispatch, 6 macro ops/cycle dispatched.
· Integer: Integer Rename; four Schedulers; Integer Register File; execution units labeled ALU (x4), AGU (x3), and BR (x2).
· Floating point: Floating Point Rename; two Schedulers; FP/SIMD Register File; execution units labeled MUL/MAC (x2), ADD (x2), and F2I/ST (x2).
· Load/store: 3 loads per cycle, 2 stores per cycle; Load/Store Queues; 32K D-Cache, 8-way; 1M L2 (I+D) Cache, 8-way.

· ~13% higher IPC for desktop.
· Increased op cache from 4K to 6.75K ops.
· Increased L2 cache from 512 KB to 1024 KB.
· Improved load store.
· Improved branch prediction.
· Added AVX-512 instruction support.

SIMULTANEOUS MULTI-THREADING

[Diagram: program threads A and B run as Thread #1 and Thread #2 on one core. Each thread has its own Program Counter and Architectural Register Set. Core resources shown beneath both threads: Scheduler, Register Files, Execution Units.]

· Single-threaded applications do not always occupy all resources of the processor.
· The processor can take advantage of the unused resources to execute a second thread concurrently.
· Although each thread has a program counter and architectural register set, core resources may be shared while operating in two-threaded mode.

CORE RESOURCE SHARING DEFINITIONS

Category | Definition
Competitively shared | Resource entries are assigned on demand. A thread may use all resource entries.
Watermarked | Resource entries are assigned on demand. When in two-threaded mode, a thread may not use more resource entries than are specified by a watermark threshold.
Statically partitioned | Resource entries are partitioned when entering two-threaded mode. A thread may not use more resource entries than are available in its partition.

AMD "ZEN 4" CORE RESOURCE SHARING

Columns: Resource | Competitively Shared | Watermarked | Statically Partitioned

· Marked X under Competitively Shared or Watermarked: Integer Scheduler, Integer Register File, Load Queue, Floating Point Physical Register, Floating Point Scheduler, Memory Request Buffers, Write Combining Buffer.
· Marked X under Statically Partitioned: Op Queue, Store Queue, Retire Queue.

INSTRUCTION SET EVOLUTION

Instruction set extensions compared (1 = supported, 0 = not supported): AVX512*, GFNI, VAES, VPCLMUL, CLWB, ADX, CLFLUSHOPT, RDSEED, SHA, XGETBV, XSAVEC, XSAVES, AVX2, BMI2, MOVBE, RDRND, FSGSBASE, XSAVEOPT, BMI, FMA, F16C, AES, AVX, OSXSAVE, PCLMUL, SSE4.1, SSE4.2, XSAVE, SSSE3, MONITORX, CLZERO.

· AMD "Zen 4": all of the listed extensions.
· AMD "Zen 3": all except AVX512* and GFNI.
· AMD "Zen 2": all except AVX512*, GFNI, VAES, and VPCLMUL.
· AMD "Zen 1": all except AVX512*, GFNI, VAES, VPCLMUL, and CLWB.
· "Jaguar": only MOVBE, XSAVEOPT, BMI, F16C, AES, AVX, OSXSAVE, PCLMUL, SSE4.1, SSE4.2, XSAVE, and SSSE3.

AVX512 INSTRUCTION SET EVOLUTION

AVX512 subsets compared (1 = supported, 0 = not supported): AVX512_BF16, AVX512_VPOPCNTDQ, AVX512_BITALG, AVX512_VNNI, AVX512_VBMI2, AVX512_VBMI, AVX512VL, AVX512BW, AVX512CD, AVX512_IFMA, AVX512DQ, AVX512F.

· AMD "Zen 4": all of the listed subsets.
· AMD "Zen 3", AMD "Zen 2", AMD "Zen 1", and "Jaguar": none.

SOFTWARE PREFETCH INSTRUCTIONS

[Diagram: cache hierarchy of L1D (32 KB), L2 (1024 KB), L3 (32768 KB), and Memory (gigabytes). Prefetch (T0)|(NTA) fills lines into L1D. Prefetch (T1)|(T2) fills lines into L2. Prefetch NTA lines are evicted aggressively.]

· Use Software Prefetch instructions on linked data structures experiencing cache misses.
· Use NTA on use-once data.
· While in two-threaded mode, beware that too many software prefetches may evict the working set of the other thread from their shared caches.
· Prefetch(T0)|(NTA) fills into L1.
· Prefetch(T1)|(T2) fills into L2.
  · New for AMD "Zen 4"!
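As a rough illustration of the first bullet, the sketch below issues a T0 software prefetch for the next node of a linked list while the current node is being processed. The `Node` type, the `sum_list` function, and the `PREFETCH_T0` macro are hypothetical names chosen for this example, not code from the talk. Whether a prefetch helps depends on how much work is done per node, so treat this as something to profile rather than a guaranteed win.

```cpp
#if defined(__x86_64__) || defined(_M_X64)
#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_T0
#define PREFETCH_T0(p) _mm_prefetch(reinterpret_cast<const char*>(p), _MM_HINT_T0)
#else
#define PREFETCH_T0(p) ((void)(p))  // no-op on non-x86-64 targets
#endif

// Hypothetical linked data structure whose nodes may be scattered in memory.
struct Node {
    Node* next;
    float value;
};

// Walks the list and sums the values. Before working on the current node,
// it hints that the next node should be filled into L1 (T0 hint).
float sum_list(const Node* head) {
    float sum = 0.0f;
    for (const Node* n = head; n != nullptr; n = n->next) {
        if (n->next != nullptr) {
            PREFETCH_T0(n->next);
        }
        sum += n->value;  // stand-in for the real per-node work
    }
    return sum;
}
```

For data that is used only once, the same call with `_MM_HINT_NTA` corresponds to the NTA bullet above.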

HARDWARE PREFETCHERS L1

Category | Definition
L1 Stream | Uses history of memory access patterns to fetch additional sequential lines in ascending or descending order.
L1 Stride | Uses memory access history of individual instructions to fetch additional lines when each access is a constant distance from the previous.
L1 Region | Uses memory access history to fetch additional lines when the data access for a given instruction tends to be followed by a consistent pattern of other accesses within a localized region.

HARDWARE PREFETCHERS L2

Category | Definition
L2 Stream | Uses history of memory access patterns to fetch additional sequential lines in ascending or descending order.
L2 Up/Down | Uses memory access history to determine whether to fetch the next or previous line for all memory accesses.

STREAMING HARDWARE PREFETCHER
· Uses history of memory access patterns to fetch additional sequential lines in ascending or descending order.

[Diagram: memory addresses 0x0 to 0x800 in 0x40-byte (one cache line) steps; accesses 1 through 10 touch consecutive lines, a +1 stream.]

    alignas(64) float a[LEN];
    // ...
    float sum = 0.0f;
    for (size_t i = 0; i < LEN; i++) {
        sum += a[i]; // streaming prefetch
    }

STRIDE HARDWARE PREFETCHER
· Uses memory access history of individual instructions to fetch additional lines when each access is a constant distance from the previous.

[Diagram: memory addresses 0x0 to 0x800 in 0x40-byte (one cache line) steps; two interleaved access sequences, each advancing with a constant stride of +5 lines.]

    struct S {
        double x1, y1, z1, w1;
        char name[256];
        double x2, y2, z2, w2;
    }; // sizeof(S) is 320 bytes, which is 5 cache lines
    alignas(64) S a[LEN];
    // ...
    double sumX1 = 0.0, sumX2 = 0.0;
    for (size_t i = 0; i < LEN; i++) {
        sumX1 += a[i].x1; // stride prefetch 0
        sumX2 += a[i].x2; // stride prefetch 1
    }

DESKTOP CACHE HIERARCHY EVOLUTION

Core | uOP/Core (K) | L1I/Core (KB) | L1D/Core (KB) | L2/Core (KB) | L3/CCX (MB)
AMD "Zen 4" | 6.75 | 32 | 32 | 1024 | 32*
AMD "Zen 3" | 4 | 32 | 32 | 512 | 32*
AMD "Zen 2" | 4 | 32 | 32 | 512 | 16
AMD "Zen 1" | 2 | 64 | 32 | 512 | 8

CACHE-COHERENCY PROTOCOL
· The AMD cache-coherency protocol is MOESI (Modified, Owned, Exclusive, Shared, Invalid).
· Instruction-execution activity and external-bus transactions may change the cache's MOESI state.
· Read hits do not cause a MOESI-state change.
· Write hits generally cause a MOESI-state change into the modified state.
· If the cache line is already in the modified state, a write hit does not change its state.

CACHE-TO-CACHE TRANSFERS

[Diagram: Memory connects to the Data Fabric, which connects to CCX0 and CCX1. Each CCX has a 32MB L3 cache with shadow tags and eight cores, Core0 to Core7.]

· Each CPU Complex (CCX) has an L3 cache shared by up to eight cores.
· The L3 cache has shadow tags for each L2 cache within its complex.
· Shadow tags determine if a "fast" cache-to-cache transfer between cores within the CCX is possible.
· Cache-coherency probe responses may be slower from cores in another CCX.

CACHE-COHERENCY EFFICIENCY

[Diagram: CCX0 and CCX1, each with Core0 to Core3, connected through the Data Fabric; a modified (M) cache line moves between cores.]

· Minimize ping-ponging modified cache lines between cores, especially in another CCX!
· Minimize using Read-Modify-Write instructions.
  · Use a single atomic add with a local sum rather than many atomic increment operations.
· Improve lock efficiency.
  · "Test and Test-and-Set" in user spin locks with a pause instruction.
  · Replace user spin locks with modern sync APIs.
· Use a memory allocator optimized for multi-threading.
  · Try mimalloc or jemalloc.
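The "single atomic add with a local sum" advice can be sketched as follows. Each worker counts into a thread-local variable and touches the shared atomic exactly once, so the cache line holding the counter changes owner once per thread instead of once per element. The function name `count_matches` and the chunking scheme are assumptions made for this illustration.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Counts how many elements equal `needle`, splitting the work across threads.
std::uint64_t count_matches(const std::vector<int>& data, int needle,
                            unsigned num_threads) {
    if (num_threads == 0) num_threads = 1;
    std::atomic<std::uint64_t> total{0};
    std::vector<std::thread> workers;
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(data.size(), t * chunk);
        const std::size_t end = std::min(data.size(), begin + chunk);
        workers.emplace_back([&data, &total, needle, begin, end] {
            std::uint64_t local = 0;  // private to this thread, no coherency traffic
            for (std::size_t i = begin; i < end; ++i) {
                if (data[i] == needle) ++local;
            }
            // One read-modify-write per thread rather than one per match.
            total.fetch_add(local, std::memory_order_relaxed);
        });
    }
    for (auto& w : workers) w.join();
    return total.load();
}
```

The slower alternative would call `total.fetch_add(1)` inside the inner loop, which forces the modified line to bounce between cores on every match.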

AMD "PREFERRED CORE"

[Chart: SchedulingClass (higher is better) for each logical processor, 0 to 30, comparing Default and EffectivePowerModeGameMode.]

· Some AMD products have cores that are faster than other cores.
· Windows® may use SchedulingClass or EfficiencyClass during thread scheduling. These values may change during runtime.
· Thread affinity masks may interfere with thread scheduling and power management optimizations on Windows PCs.
· Testing done by AMD performance labs January 22, 2023 on an AMD reference motherboard equipped with 16GB DDR5-6000MHz, Ryzen 9 7950X3D with Nvidia RTX 4090, Win11 Pro x64 22621.1105. Actual results may vary.

BEST PRACTICES

PREFER SHIPPING CONFIGURATION BUILDS FOR CPU PROFILING

[Chart: UE5.1 City Sample, DX12, 1080p; average FPS by build configuration (higher is better). Shipping: 77. Development: 46.]

· Disable debug features before you ship!
· Debug and development builds may reduce performance.
· Stats collection may cause cache pollution.
· Logging may create serialization points.
· Debug builds may disable multithreading optimizations.
· Performance of UE5.1 binaries compiled with Microsoft® Visual Studio 2022 v17.4.4.
· Testing done by AMD technology labs, January 30, 2023 on the following system. Test configuration: AMD Ryzen Threadripper PRO 5995WX, Cooler Master MasterLiquid ML360 RGB TR4 Edition, 256GB (8 x 32GB 2R RDDR4-3200 at 24-22-22-52) memory, AMD Radeon RX 7900 XTX GPU with driver 23.1.1 (January 11, 2023), 2TB M.2 NVME SSD, AMD Reference Motherboard, Windows® 11 version 22H2, 1920x1080 resolution. Actual results may vary.

DISABLE ANTI-TAMPER WHILE CPU PROFILING
· Anti-Tamper and Anti-Cheat technologies may prevent CPU debugging and profiling tools from working correctly, especially while loading and retrieving symbol information.
· Create a CPU-profiling-friendly build configuration similar to the Shipping configuration but with Anti-Tamper and Anti-Cheat technologies disabled.
  · Add this build as a launch option during development.
  · Remove this build before release.

TEST COLD SHADER CACHE FIRST TIME USER EXPERIENCE
    rem Run as administrator
    rem Disable Steam shader pre-caching before running this script
    rem Reboot after running this script to clear any shaders still in system memory
    setlocal enableextensions
    cd /d "%~dp0"
    rmdir /s /q "%LOCALAPPDATA%\D3DSCache"
    rmdir /s /q "%LOCALAPPDATA%\AMD\DX9Cache"
    rmdir /s /q "%LOCALAPPDATA%\AMD\DxCache"
    rmdir /s /q "%LOCALAPPDATA%\AMD\DxcCache"
    rmdir /s /q "%LOCALAPPDATA%\AMD\OglCache"
    rmdir /s /q "%LOCALAPPDATA%\AMD\VkCache"
    rmdir /s /q "%LOCALAPPDATA%\NVIDIA\DXCache"
    rmdir /s /q "%ProgramFiles(x86)%\Steam\steamapps\shadercache"

USE THE LATEST COMPILER AND WINDOWS® SDK

    Msbuild.exe UE4.sln -target:Engine\UE4:Rebuild -property:Configuration=Shipping -property:Platform=Win64

[Chart: rebuild time in seconds by Visual Studio Build Tools version (less is better). 2017 v15.9.43: 205. 2019 v16.11.9: 121. 2022 v17.05: 119.]

· Get the latest build and link time improvements.
· Get the latest library and runtime optimizations.
· Performance of UE4.27.2 binaries compiled with Microsoft® Visual Studio.
· Testing done by AMD technology labs, February 5, 2022 on the following system. Test configuration: AMD Ryzen Threadripper PRO 5995WX, Enermax LIQTECH TR4 II series 360mm liquid cooler, 256GB (8 x 32GB 2R RDDR4-3200 at 24-22-22-52) memory, AMD Radeon RX 6800 XT GPU with driver 21.10.2 (October 25, 2021), 2TB M.2 NVME SSD, AMD Reference Motherboard, Windows® 11 x64 version 21H2, 1920x1080 resolution. Actual results may vary.

ADD VIRUS AND THREAT PROTECTION EXCLUSIONS

    Msbuild.exe UE5.sln -target:Engine\UE5:Rebuild -property:Configuration=Shipping -property:Platform=Win64

[Chart: rebuild time in seconds by folder exclusions (less is better). None: 224. C:\UnrealEngine-5.1: 182.]

· WARNING: Not recommended for CI/CD systems. Exclusions may make your device vulnerable to threats.
· Add project folders to virus and threat protection settings exclusions for faster build times.
· Faster rebuild time after optimization!
· Performance of UE5.1 binaries compiled with Microsoft® Visual Studio 2022 v17.4.4.
· Testing done by AMD technology labs, January 28, 2023 on the following system. Test configuration: AMD Ryzen Threadripper PRO 5995WX, Cooler Master MasterLiquid ML360 RGB TR4 Edition, 256GB (8 x 32GB 2R RDDR4-3200 at 24-22-22-52) memory, AMD Radeon RX 7900 XTX GPU with driver 23.1.1 (January 11, 2023), 2TB M.2 NVME SSD, AMD Reference Motherboard, Windows® 11 version 22H2, 1920x1080 resolution. Actual results may vary.

REDUCE BUILD TIMES

    Msbuild.exe UE4.sln -target:Engine\UE4:Rebuild -property:Configuration=Shipping -property:Platform=Win64

[Chart: rebuild time in seconds by system configuration (less is better). VS2017, Without Virus Exclusion Folders: 231. VS2022, With Virus Exclusion Folders: 119.]

· Performance of UE4.27.2 binaries compiled with Microsoft® Visual Studio.
· Testing done by AMD technology labs, February 5, 2022 on the following system. Test configuration: AMD Ryzen Threadripper PRO 5995WX, Enermax LIQTECH TR4 II series 360mm liquid cooler, 256GB (8 x 32GB 2R RDDR4-3200 at 24-22-22-52) memory, AMD Radeon RX 6800 XT GPU with driver 21.10.2 (October 25, 2021), 2TB M.2 NVME SSD, AMD Reference Motherboard, Windows® 11 x64 version 21H2, 1920x1080 resolution. Actual results may vary.

USE AVX OR AVX2 IF CPU MINIMUM REQUIREMENTS ALLOW

[Chart: Steam Hardware & Software Survey, January 2024; share of surveyed systems supporting each ISA (higher is better). SSE2: 100%. AVX: 97%. AVX2: 92%. AVX512: 11%.]

· A binary may have better code generation using AVX or later ISA by using the Microsoft® Visual C++ compiler option /arch:[AVX|AVX2|AVX512].
· Minimum hardware requirements:
  · Windows 10 = SSE2
  · Windows 11 = SSE4.1
· The Windows 10 supported processor list includes AMD products which support AVX but not AVX2.
· The Windows 10 supported processor list may include products from other CPU vendors which do not support AVX.
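When minimum requirements do not allow building the whole binary with a newer /arch option, a common complement (not taken from the slides) is to query CPU support at run time and select a code path accordingly. The sketch below decodes the CPUID feature bits for AVX, AVX2, and AVX512F and checks that the OS has enabled the matching register state through XCR0. The `SimdSupport` struct and both function names are hypothetical; the bit positions are the architectural ones (leaf 1 ECX bit 27 OSXSAVE and bit 28 AVX; leaf 7 EBX bit 5 AVX2 and bit 16 AVX512F).

```cpp
#include <cstdint>

#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
#include <intrin.h>
#include <immintrin.h>
#define HAVE_X86_CPUID 1
#elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#define HAVE_X86_CPUID 1
#endif

struct SimdSupport {
    bool avx;
    bool avx2;
    bool avx512f;
};

// Pure decoder: combines CPUID feature bits with the OS-enabled state in XCR0.
SimdSupport decode_simd_support(std::uint32_t leaf1_ecx, std::uint32_t leaf7_ebx,
                                std::uint64_t xcr0) {
    const bool osxsave = ((leaf1_ecx >> 27) & 1u) != 0;
    const bool ymm_enabled = osxsave && (xcr0 & 0x6) == 0x6;         // SSE + AVX state
    const bool zmm_enabled = ymm_enabled && (xcr0 & 0xE0) == 0xE0;   // opmask + ZMM state
    SimdSupport s{};
    s.avx = ymm_enabled && ((leaf1_ecx >> 28) & 1u) != 0;
    s.avx2 = s.avx && ((leaf7_ebx >> 5) & 1u) != 0;
    s.avx512f = zmm_enabled && ((leaf7_ebx >> 16) & 1u) != 0;
    return s;
}

// Reads the values from the running CPU; reports no support on non-x86 builds.
SimdSupport query_simd_support() {
#if defined(HAVE_X86_CPUID)
    std::uint32_t ecx1 = 0, ebx7 = 0;
    std::uint64_t xcr0 = 0;
#if defined(_MSC_VER)
    int r[4];
    __cpuid(r, 0);
    const int max_leaf = r[0];
    __cpuid(r, 1);
    ecx1 = static_cast<std::uint32_t>(r[2]);
    if (max_leaf >= 7) {
        __cpuidex(r, 7, 0);
        ebx7 = static_cast<std::uint32_t>(r[1]);
    }
    if ((ecx1 >> 27) & 1u) xcr0 = _xgetbv(0);
#else
    unsigned a = 0, b = 0, c = 0, d = 0;
    if (__get_cpuid(1, &a, &b, &c, &d)) ecx1 = c;
    if (__get_cpuid_count(7, 0, &a, &b, &c, &d)) ebx7 = b;
    if ((ecx1 >> 27) & 1u) {
        unsigned lo = 0, hi = 0;
        __asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
        xcr0 = (static_cast<std::uint64_t>(hi) << 32) | lo;
    }
#endif
    return decode_simd_support(ecx1, ebx7, xcr0);
#else
    return SimdSupport{};
#endif
}
```

A caller might run `query_simd_support()` once at startup and store the result, then pick between a baseline routine and one compiled separately for AVX2.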

ENABLE AVX512 IN DEVELOPMENT TOOLS

    embree-3.13.5.x64.vc14.windows pathtracer_ispc.exe -c asian_dragon.ecs --fullscreen --print-frame-rate

[Chart: FPS (higher is better). AVX512 Disabled: 23. AVX512 Enabled: 27.]

· Development tools may benefit from AVX512.
· Examples:
  · Light Baking.
  · Texture Compression.
  · Mesh to Signed Distance Fields.
· Testing done by AMD technology labs, January 29, 2023 on the following system. Test configuration: AMD Ryzen 7950X, NZXT Kraken X62 cooler, 32GB (2 x 16GB DDR5-6000 30-38-38-96) memory, AMD Radeon RX 7900 XTX GPU with driver 23.1.1 (January 11, 2023), 2TB M.2 NVME SSD, AMD Reference Motherboard, Windows® 11 x64 build 22H2, 1920x1080 resolution. Actual results may vary.

AUDIT CONTENT
· Ask artists to recommend profiling scenes of interest!
  · For example, an indoor dungeon, an outdoor city, an outdoor forest, large crowds, or a specific time of day.
· Run Unreal Engine MapCheck!
  · It may find some performance issues.
  · https://docs.unrealengine.com/en-US/BuildingWorlds/LevelEditor/MapErrors/index.html
· Use Unity AssetPostprocessor!
  · Enforce minimum standards.
  · https://docs.unity3d.com/Manual/BestPracticeUnderstandingPerformanceInUnity4.html
· Check stats before CPU profiling!
  · If a scene far exceeds its draw budget or has many duplicate objects, report the issue to its artists and consider profiling a different scene. Otherwise, you risk profiling hot spots which may not be hot after the art issues are resolved.

SUPPORT HYBRID GRAPHICS

· Use IDXGIFactory6::EnumAdapterByGpuPreference with DXGI_GPU_PREFERENCE_HIGH_PERFORMANCE for game applications.
· The user may change preferences per application in Graphics settings.
· Testing done by AMD performance labs January 24, 2022 on a Dell G5 15 SE laptop equipped with 16GB DDR4-3200MHz, Ryzen 9 4900H with Radeon RX 5600M, Win11 Pro x64 22000.434.

USE PREFERRED VIDEO AND AUDIO CODECS
· Prefer H264 video and AAC audio codecs as recommended by the Unreal Engine Electra Media Player.
· Hardware accelerated codecs may increase hours of battery life and reduce CPU work.
· AMD Radeon graphics devices released since 2022 no longer accelerate WMV3 decoding.
· See amd.com product specifications for supported rendering formats.

OPTIMIZATIONS

SYNC APIS

USE MODERN SYNC APIS
[Chart: Exclusive Lock Test; total CPU utilization at start of test, 0% to 100% (less is better), with Core Isolation Memory Integrity off and on.]

· Avoid user spin locks that starve payload work on other ready threads, consume excessive power, and drain laptop batteries.
· Performance of binaries compiled with Microsoft® Visual Studio 2022 v17.8.6.
· Testing done by AMD technology labs, January 6, 2024 on the following system. Test configuration: AMD Ryzen Threadripper 7995WX 96-Cores, NZXT Kraken 360 cooler, 256GB (8 x 32GB RDDR5-4800) memory, AMD Radeon RX 580 GPU with driver 31.0.12027.9001 (March 20, 2023), 1TB M.2 NVME SSD, AMD Reference Motherboard, Windows® 11 x64 version 23H2, 1920x1080 resolution. Actual results may vary.

USE MODERN SYNC APIS

[Chart: Exclusive Lock Test; elapsed time in milliseconds, 0 to 200,000 (less is better), with Core Isolation Memory Integrity off and on.]

· Prefer std::mutex, which has good performance and low CPU utilization.
· Legacy sync APIs like WaitForSingleObject may rely on expensive syscall instructions.
· Modern sync APIs like std::mutex may rely on lower-cost mwaitx instructions that can execute at any privilege level.
· Performance of binaries compiled with Microsoft® Visual Studio 2022 v17.8.6.
· Testing done by AMD technology labs, January 6, 2024 on the following system. Test configuration: AMD Ryzen Threadripper 7995WX 96-Cores, NZXT Kraken 360 cooler, 256GB (8 x 32GB RDDR5-4800) memory, AMD Radeon RX 580 GPU with driver 31.0.12027.9001 (March 20, 2023), 1TB M.2 NVME SSD, AMD Reference Motherboard, Windows® 11 x64 version 23H2, 1920x1080 resolution. Actual results may vary.

USE MODERN SYNC APIS: SHARED CODE

    #include <intrin.h>
    #include <algorithm>  // std::fill
    #include <chrono>
    #include <cstdlib>    // strtof, EXIT_SUCCESS
    #include <cwchar>     // wprintf
    #include <numeric>
    #include <thread>
    #include <vector>
    #include <mutex>
    #include <Windows.h>
    #define LEN 128
    alignas(64) float b[LEN][4][4];
    alignas(64) float c[LEN][4][4];

    void fn(); // worker; each lock variant defines its own version

    int main(int argc, char* argv[]) {
        using namespace std::chrono;
        float b0 = (argc > 1) ? strtof(argv[1], NULL) : 1.0f;
        float c0 = (argc > 2) ? strtof(argv[2], NULL) : 2.0f;
        std::fill((float*)b, (float*)(b + LEN), b0);
        std::fill((float*)c, (float*)(c + LEN), c0);
        // One worker thread per logical processor across all processor groups.
        size_t num_threads = GetMaximumProcessorCount(ALL_PROCESSOR_GROUPS);
        wprintf(L"num_threads: %zu\n", num_threads);
        std::vector<std::thread> threads = {};
        auto t0 = high_resolution_clock::now();
        for (size_t i = 0; i < num_threads; ++i) {
            threads.push_back(std::thread(fn));
        }
        for (size_t i = 0; i < num_threads; ++i) {
            threads[i].join();
        }
        auto t1 = high_resolution_clock::now();
        wprintf(L"time (ms): %lli\n",
                duration_cast<milliseconds>(t1 - t0).count());
        return EXIT_SUCCESS;
    }

USE MODERN SYNC APIS: BAD USER SPIN LOCK

    namespace MyLock {
        typedef unsigned LOCK, *PLOCK;
        enum { LOCK_IS_FREE = 0, LOCK_IS_TAKEN = 1 };
        void Lock(PLOCK pl) {
            // Bad: every spin iteration is a locked read-modify-write, with no pause.
            while (LOCK_IS_TAKEN ==
                   _InterlockedCompareExchange(reinterpret_cast<long*>(pl),
                                               LOCK_IS_TAKEN, LOCK_IS_FREE)) {
            }
        }
        void Unlock(PLOCK pl) {
            _InterlockedExchange(reinterpret_cast<long*>(pl), LOCK_IS_FREE);
        }
    }
    MyLock::LOCK gLock;

    void fn() {
        alignas(64) float a[LEN][4][4];
        std::fill((float*)a, (float*)(a + LEN), 0.0f);
        float r = 0.0f;
        for (size_t iter = 0; iter < 100000; iter++) {
            MyLock::Lock(&gLock);
            for (int m = 0; m < LEN; m++)
                for (int i = 0; i < 4; i++)
                    for (int j = 0; j < 4; j++)
                        for (int k = 0; k < 4; k++)
                            a[m][i][j] += b[m][i][k] * c[m][k][j];
            r += std::accumulate((float*)a, (float*)(a + LEN), 0.0f);
            MyLock::Unlock(&gLock);
        }
        wprintf(L"result: %f\n", r);
    }

USE MODERN SYNC APIS: IMPROVED USER SPIN LOCK

    namespace MyLock {
        typedef unsigned LOCK, *PLOCK;
        enum { LOCK_IS_FREE = 0, LOCK_IS_TAKEN = 1 };
        void Lock(PLOCK pl) {
            // Test and Test-and-Set: spin on a plain read first, and attempt the
            // locked exchange only when the lock looks free. Pause between attempts.
            while ((LOCK_IS_TAKEN == *static_cast<volatile LOCK*>(pl)) ||
                   (LOCK_IS_TAKEN ==
                    _InterlockedExchange(reinterpret_cast<long*>(pl), LOCK_IS_TAKEN))) {
                _mm_pause();
            }
        }
        void Unlock(PLOCK pl) {
            _InterlockedExchange(reinterpret_cast<long*>(pl), LOCK_IS_FREE);
        }
    }
    alignas(64) MyLock::LOCK gLock; // keep the lock on its own cache line

    void fn() {
        alignas(64) float a[LEN][4][4];
        std::fill((float*)a, (float*)(a + LEN), 0.0f);
        float r = 0.0f;
        for (size_t iter = 0; iter < 100000; iter++) {
            MyLock::Lock(&gLock);
            for (int m = 0; m < LEN; m++)
                for (int i = 0; i < 4; i++)
                    for (int j = 0; j < 4; j++)
                        for (int k = 0; k < 4; k++)
                            a[m][i][j] += b[m][i][k] * c[m][k][j];
            r += std::accumulate((float*)a, (float*)(a + LEN), 0.0f);
            MyLock::Unlock(&gLock);
        }
        wprintf(L"result: %f\n", r);
    }

USE MODERN SYNC APIS: WAITFORSINGLEOBJECT

    // MyLock not required. Let the OS do the work!
    HANDLE hMutex;

    int main(int argc, char* argv[]) {
        hMutex = CreateMutex(NULL, FALSE, NULL);
        // otherwise main is the same as before.
        // ...
    }

    void fn() {
        alignas(64) float a[LEN][4][4];
        std::fill((float*)a, (float*)(a + LEN), 0.0f);
        float r = 0.0f;
        for (size_t iter = 0; iter < 100000; iter++) {
            WaitForSingleObject(hMutex, INFINITE);
            for (int m = 0; m < LEN; m++)
                for (int i = 0; i < 4; i++)
                    for (int j = 0; j < 4; j++)
                        for (int k = 0; k < 4; k++)
                            a[m][i][j] += b[m][i][k] * c[m][k][j];
            r += std::accumulate((float*)a, (float*)(a + LEN), 0.0f);
            ReleaseMutex(hMutex);
        }
        wprintf(L"result: %f\n", r);
    }

USE MODERN SYNC APIS: STD::MUTEX

    // MyLock not required. Let the OS do the work!
    std::mutex mutex;

    void fn() {
        alignas(64) float a[LEN][4][4];
        std::fill((float*)a, (float*)(a + LEN), 0.0f);
        float r = 0.0f;
        for (size_t iter = 0; iter < 100000; iter++) {
            mutex.lock();
            for (int m = 0; m < LEN; m++)
                for (int i = 0; i < 4; i++)
                    for (int j = 0; j < 4; j++)
                        for (int k = 0; k < 4; k++)
                            a[m][i][j] += b[m][i][k] * c[m][k][j];
            r += std::accumulate((float*)a, (float*)(a + LEN), 0.0f);
            mutex.unlock();
        }
        wprintf(L"result: %f\n", r);
    }

USE MODERN SYNC APIS
· Prefer functions using mwaitx
  · std::mutex
  · std::shared_mutex
  · AcquireSRWLockExclusive
  · AcquireSRWLockShared
  · SleepConditionVariableSRW
  · SleepConditionVariableCS
  · EnterCriticalSection
· Avoid functions using syscall
  · WaitForSingleObject
  · WaitForMultipleObjects
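The earlier slides show std::mutex; as a small complement, here is a sketch of std::shared_mutex from the same preferred list, for data that is read often and written rarely. The `AssetTable` class and its members are hypothetical names for illustration. Readers take a shared lock and may proceed concurrently, while a writer takes an exclusive lock.

```cpp
#include <mutex>
#include <shared_mutex>
#include <string>
#include <unordered_map>

// Hypothetical read-mostly lookup table guarded by a reader/writer lock.
class AssetTable {
public:
    // Writers take the lock exclusively.
    void set(const std::string& name, int id) {
        std::unique_lock<std::shared_mutex> lock(mutex_);
        ids_[name] = id;
    }

    // Readers share the lock, so concurrent lookups do not block each other.
    bool get(const std::string& name, int& id_out) const {
        std::shared_lock<std::shared_mutex> lock(mutex_);
        const auto it = ids_.find(name);
        if (it == ids_.end()) return false;
        id_out = it->second;
        return true;
    }

private:
    mutable std::shared_mutex mutex_;
    std::unordered_map<std::string, int> ids_;
};
```

The RAII lock types release the mutex on every exit path, which avoids the manual lock() and unlock() pairing used in the benchmark code. This requires C++17.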

WINDOWS PERFORMANCE ANALYZER: SPIN LOCK

    wpr.exe start cpu
    rem run test
    pause
    wpr.exe stop log.etl

WINDOWS PERFORMANCE ANALYZER - STD::MUTEX
wpr.exe -start cpu
rem run test
pause
wpr.exe -stop log.etl

VISUAL STUDIO CONCURRENCY VISUALIZER - SPIN LOCK

VISUAL STUDIO CONCURRENCY VISUALIZER - STD::MUTEX

THREADING

TUNE THREAD POOL SIZE FOR INITIALIZATION AND GAME PLAY
· This advice is specific to AMD processors and is not general guidance for all processor vendors.
· Profile your game to determine the optimal thread pool size for both game initialization and play.
  · Utilizing all logical processors in SMT dual-thread mode may benefit game initialization.
  · Utilizing only physical cores, each in single-thread mode, may benefit game play for systems with at least 8 AMD Ryzen CPU cores.
  · See the core count code sample at https://gpuopen.com/learn/cpu-core-counts/.
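
The sizing rule above can be sketched as a small helper that gives a starting point for profiling. This is not the GPUOpen sample: pick_pool_size, default_pool_size, and the Phase enum are hypothetical names, the helper does not check the processor vendor, and it assumes every core exposes the same number of hardware threads.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>

enum class Phase { Initialization, GamePlay };

// Hypothetical helper, not from the talk or the GPUOpen sample.
// logical_processors: number of hardware threads visible to the process.
// threads_per_core:   2 when SMT is enabled, 1 otherwise (assumed uniform).
unsigned pick_pool_size(unsigned logical_processors,
                        unsigned threads_per_core,
                        Phase phase) {
    assert(threads_per_core >= 1);
    logical_processors = std::max(1u, logical_processors);
    const unsigned physical_cores =
        std::max(1u, logical_processors / threads_per_core);

    // Starting point from the slide: one thread per physical core during
    // game play on systems with at least 8 cores; otherwise use all
    // logical processors. Profiling should override this default.
    if (phase == Phase::GamePlay && physical_cores >= 8)
        return physical_cores;
    return logical_processors;
}

// Convenience wrapper. hardware_concurrency() may return 0 when the count is
// unknown and does not report physical cores, so SMT with 2 threads per core
// is assumed here.
unsigned default_pool_size(Phase phase) {
    return pick_pool_size(std::thread::hardware_concurrency(), 2, phase);
}
```

For a 16-core, 32-thread part, pick_pool_size(32, 2, Phase::GamePlay) yields 16 and pick_pool_size(32, 2, Phase::Initialization) yields 32. A 6-core, 12-thread part stays at 12 in both phases because it is below the 8-core threshold.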

AVOID HARD AFFINITY MASKS ON PC

· Hard affinity masks interfere with OS power management and thread scheduling.
· CPU Sets provide APIs to declare application affinity in a 'soft' manner that is compatible with OS power management.

Minimum OS    Affinity Type    Function
Windows XP    hard             SetThreadAffinityMask
Windows 7     hard             SetThreadGroupAffinity
Windows 10    soft             SetThreadSelectedCpuSets
Windows 11    soft             SetThreadSelectedCpuSetMasks

My thread hard affinity = none

        t0           t1           t2           t3
CPU0    My thread    Other app    idle         idle
CPU1    idle         My thread    My thread    My thread

My thread hard affinity = CPU0

        t0           t1           t2           t3
CPU0    My thread    Other app    My thread    My thread
CPU1    idle         idle         idle         idle

My thread soft affinity = CPU0

        t0           t1           t2           t3
CPU0    My thread    Other app    My thread    My thread
CPU1    idle         My thread    idle         idle


BEWARE OF PRIORITY BOOST

· Thread Priority describes the order in which threads are scheduled.
· Each thread has a dynamic priority.
· The system boosts the dynamic priority under certain conditions.
· Temporary priority-boosted threads may switch-in before threads intended to be higher priority by the developer.
· This feature can be disabled using SetProcessPriorityBoost and SetThreadPriorityBoost.

BASE PRIORITY AT THREAD PRIORITY LEVEL    NORMAL_PRIORITY_CLASS    ABOVE_NORMAL_PRIORITY_CLASS
THREAD_PRIORITY_IDLE                      1                        1
THREAD_PRIORITY_LOWEST                    6                        8
THREAD_PRIORITY_BELOW_NORMAL              7                        9
THREAD_PRIORITY_NORMAL                    8                        10
THREAD_PRIORITY_ABOVE_NORMAL              9                        11
THREAD_PRIORITY_HIGHEST                   10                       12
THREAD_PRIORITY_TIME_CRITICAL             15                       15


DATA ACCESS

ALIGN MEMCPY SOURCE AND DESTINATION POINTERS
· Update the compiler for the latest memcpy, memset, and other C runtime optimizations!
· memcpy behavior is undefined if dest and src overlap, but the compiler may generate Rep Move String (rep movs) instructions, which have defined overlapping behavior.
· alignas(64) may allow faster rep movs microcode.
· alignas(4096) may reduce store-to-load conflicts and benefit probe filtering on some processors.
  · PMCx024 LsBadStatus2 StliOther counts store-to-load conflicts where a load was unable to complete due to a non-forwardable conflict with an older store.
· Aligning to the bit_floor may provide a good balance of cache hits and alignment:
  · std::clamp(std::bit_floor(count), 4, 4096);
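
A minimal sketch of the bit_floor rule follows, assuming count is a buffer size in bytes. The names bit_floor_sz, pick_alignment, and aligned_span are hypothetical. The sketch hand-rolls the power-of-two step so it builds without C++20 and uses std::min/std::max with explicit std::size_t bounds, because std::clamp requires all three arguments to have the same type. Aligning inside an over-allocated std::vector with std::align is just one possible allocation strategy, chosen here for portability.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Largest power of two <= n (n > 0); same result as C++20 std::bit_floor.
std::size_t bit_floor_sz(std::size_t n) {
    assert(n > 0);
    std::size_t p = 1;
    while (p <= n / 2)
        p *= 2;
    return p;
}

// Alignment for a buffer of `count` bytes, clamped to the range [4, 4096].
std::size_t pick_alignment(std::size_t count) {
    const std::size_t floor_pow2 = bit_floor_sz(count);
    return std::min<std::size_t>(std::max<std::size_t>(floor_pow2, 4), 4096);
}

// Returns a pointer inside `storage` that is aligned for `count` bytes.
// `storage` is resized with enough slack and must outlive the pointer.
unsigned char* aligned_span(std::vector<unsigned char>& storage,
                            std::size_t count) {
    const std::size_t alignment = pick_alignment(count);
    storage.resize(count + alignment);  // slack so alignment always fits
    void* p = storage.data();
    std::size_t space = storage.size();
    void* aligned = std::align(alignment, count, p, space);
    assert(aligned != nullptr);
    return static_cast<unsigned char*>(aligned);
}
```

With these definitions a 100-byte copy gets 64-byte alignment, a 3-byte copy gets the 4-byte minimum, and anything of 4096 bytes or more is capped at page alignment.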

AVOID FALSE SHARING

Chart: False Sharing Test (less is better). Execution time in milliseconds: 23,975 before optimization, 2,557 after optimization.

· "False sharing" may occur when two or more cores modify different data within the same cache line.
· This microbenchmark showed its execution time reduced by about 90% after optimization using alignas(64)!
· Performance of binaries compiled with Microsoft® Visual Studio 2022 v17.8.6.
· Testing done by AMD technology labs, January 6, 2024 on the following system. Test configuration: AMD Ryzen Threadripper 7995WX 96 Cores, NZXT Kraken 360 cooler, 256GB (8 x 32GB RDDR5-4800) memory, AMD Radeon RX 580 GPU with driver 31.0.12027.9001 (March 20, 2023), 1TB M.2 NVME SSD, AMD Reference Motherboard, Windows® 11 x64 version 23H2, 1920x1080 resolution. Actual results may vary.


AVOID FALSE SHARING
#include <chrono>
#include <cstdlib>   // srand, rand, EXIT_SUCCESS, EXIT_FAILURE
#include <cwchar>    // wprintf
#include <numeric>
#include <thread>
#include <vector>
#include <Windows.h>

#if defined(APPLY_OPTIMIZATION)
/* 64 bytes */
struct alignas(64) ThreadData { unsigned long sum; };
#else
/* 4 bytes */
struct ThreadData { unsigned long sum; };
#endif

using namespace std::chrono;
#define NUM_ITER 100000000

// Each thread repeatedly updates only its own ThreadData element.
void fn(ThreadData* p, size_t seed) {
    srand(static_cast<unsigned int>(seed));
    p->sum = 0;
    for (int i = 0; i < NUM_ITER; i++)
        p->sum += rand() % 2;
}

int main(int argc, char* argv[]) {
    size_t num_threads = GetMaximumProcessorCount(ALL_PROCESSOR_GROUPS);
    wprintf(L"num_threads: %llu\n", num_threads);
    ThreadData* a = static_cast<ThreadData*>(
        _aligned_malloc(num_threads * sizeof(ThreadData), 64));
    if (nullptr == a)
        return EXIT_FAILURE;
    std::vector<std::thread> threads = {};
    auto t0 = high_resolution_clock::now();
    for (size_t i = 0; i < num_threads; ++i)
        threads.push_back(std::thread(fn, &a[i], i));
    for (size_t i = 0; i < num_threads; ++i)
        threads[i].join();
    auto t1 = high_resolution_clock::now();
    wprintf(L"time (ms): %lli\n", duration_cast<milliseconds>(t1 - t0).count());
    for (size_t i = 0; i < num_threads; ++i)
        wprintf(L"sum[%llu] = %lu\n", i, (*(a + i)).sum);
    _aligned_free(a);
    return EXIT_SUCCESS;
}
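
The padding idea can also be checked at compile time. The sketch below is a portable variant, not the benchmark from the slide: PaddedCounter, count_up, run_counters, and the fixed pool of 8 counters are hypothetical, and the 64-byte line size is an assumption that matches the alignas(64) used above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

constexpr std::size_t kCacheLine = 64;   // assumed cache-line size in bytes
constexpr std::size_t kMaxThreads = 8;   // hypothetical fixed pool size

// alignas pads the struct so each counter occupies its own cache line.
struct alignas(kCacheLine) PaddedCounter {
    unsigned long sum;
};
static_assert(sizeof(PaddedCounter) == kCacheLine,
              "each counter should fill exactly one cache line");

static PaddedCounter g_counters[kMaxThreads];

// Each thread writes only to its own counter.
void count_up(PaddedCounter* p, unsigned long n) {
    p->sum = 0;
    for (unsigned long i = 0; i < n; ++i)
        p->sum += 1;
}

// Runs `num_threads` workers and returns the sum of all their counters.
unsigned long run_counters(std::size_t num_threads, unsigned long n) {
    assert(num_threads <= kMaxThreads);
    std::vector<std::thread> threads;
    for (std::size_t i = 0; i < num_threads; ++i)
        threads.emplace_back(count_up, &g_counters[i], n);
    for (auto& t : threads)
        t.join();
    unsigned long total = 0;
    for (std::size_t i = 0; i < num_threads; ++i)
        total += g_counters[i].sum;
    return total;
}
```

Because sizeof(PaddedCounter) is 64, adjacent array elements start 64 bytes apart, so no two threads write to the same line. Removing the alignas keeps the results identical but packs several counters into one line, which is the case the "before" bar measures.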


USE SOFTWARE PREFETCH INSTRUCTIONS FOR LINKED DATA

Chart: Nvidia PhysX 4.1 KaplaDemo on AMD Ryzen 7 4700G, NVidia GeForce RTX 2080 (higher is better). Average FPS at start of demo: 125 before optimization, 210 after optimization.

· Over 60% faster after optimization!
· Performance of binaries compiled with Microsoft® Visual Studio 2019 v16.8.3.
· Testing done by AMD technology labs, January 4, 2021 on the following system. Test configuration: AMD Ryzen 7 4700G, AMD Wraith Spire Cooler, 16GB (2 x 8GB DDR4-3200 at 22-22-22-52) memory, NVidia GeForce RTX 2080 GPU with driver 460.89 (December 15, 2020), 512GB M.2 NVME SSD, AMD Ryzen Reference Motherboard, Windows® 10 x64 build 20H2, 1920x1080 resolution. Actual results may vary.


USE SOFTWARE PREFETCH INSTRUCTIONS FOR LINKED DATA

// Copyright (c) 2021 NVIDIA Corporation. All rights reserved
// ConvexRenderer.cpp from https://github.com/NVIDIAGameWorks/PhysX/tree/4.1/physx
void ConvexRenderer::updateTransformations() {
    for (int i = 0; i < (int)mGroups.size(); i++) {
        ConvexGroup *g = mGroups[i];
        if (g->texCoords.empty()) continue;
        float* tt = &g->texCoords[0];
        for (int j = 0; j < (int)g->convexes.size(); j++) {
            const Convex* c = g->convexes[j];
#if defined(APPLY_OPTIMIZATION)
            int distance = 4; // TODO find ideal number
            size_t future = (j + distance) % g->convexes.size();
            _mm_prefetch(0x0F8 + (char*)(g->convexes[future]), _MM_HINT_NTA); // mPxActor
            _mm_prefetch(0x100 + (char*)(g->convexes[future]), _MM_HINT_NTA); // mLocalPose
            _mm_prefetch(0x148 + (char*)(g->convexes[future]), _MM_HINT_NTA); // mMaterialOffset.x
            _mm_prefetch(0x14C + (char*)(g->convexes[future]), _MM_HINT_NTA); // mMaterialOffset.y
            _mm_prefetch(0x150 + (char*)(g->convexes[future]), _MM_HINT_NTA); // mMaterialOffset.z
            _mm_prefetch(0x164 + (char*)(g->convexes[future]), _MM_HINT_NTA); // mSurfaceMaterialId
            _mm_prefetch(0x160 + (char*)(g->convexes[future]), _MM_HINT_NTA); // mMaterialId
#endif
            PxMat44 pose(c->getGlobalPose());
            float* mp = (float*)pose.front();
            float* ta = tt;
            for (int k = 0; k < 16; k++) {
                *(tt++) = *(mp++);
            }
            PxVec3 matOff = c->getMaterialOffset();
            ta[3] = matOff.x;
            ta[7] = matOff.y;
            ta[11] = matOff.z;
            int idFor2DTex = c->getSurfaceMaterialId();
            int idFor3DTex = c->getMaterialId();
            const int MAX_3D_TEX = 8;
            ta[15] = (float)(idFor2DTex*MAX_3D_TEX + idFor3DTex);
        }
        glBindTexture(GL_TEXTURE_2D, g->matTex);
        glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, g->texSize, g->texSize, GL_RGBA, GL_FLOAT, &g->texCoords[0]);
        glBindTexture(GL_TEXTURE_2D, 0);
    }
}
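
The same pattern can be reduced to a standalone sketch: walk a contiguous array of pointers and request the object a few iterations ahead while working on the current one. Node, sum_with_prefetch, and the PREFETCH_NTA macro are hypothetical names. The macro maps to _mm_prefetch with _MM_HINT_NTA on MSVC and to __builtin_prefetch on GCC and Clang, and the best distance is workload dependent, as the TODO in the code above notes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

#if defined(_MSC_VER)
#include <intrin.h>
#define PREFETCH_NTA(p) \
    _mm_prefetch(reinterpret_cast<const char*>(p), _MM_HINT_NTA)
#else
// GCC/Clang builtin: second argument 0 = read, third 0 = non-temporal.
#define PREFETCH_NTA(p) __builtin_prefetch((p), 0, 0)
#endif

// Hypothetical stand-in for a separately allocated object.
struct Node {
    float value;
};

// Sums node values while prefetching the node `distance` iterations ahead.
float sum_with_prefetch(const std::vector<const Node*>& nodes,
                        std::size_t distance) {
    float total = 0.0f;
    const std::size_t n = nodes.size();
    for (std::size_t j = 0; j < n; ++j) {
        assert(nodes[j] != nullptr);
        // The pointer array is contiguous, so reading nodes[future] is
        // cheap; the prefetch targets the object it points to. The modulo
        // wraps the index so it always stays inside the array.
        const std::size_t future = (j + distance) % n;
        PREFETCH_NTA(nodes[future]);
        total += nodes[j]->value;
    }
    return total;
}
```

A prefetch is only a hint and does not change the result, so the function returns the same sum with any distance. Only profiling can show whether a given distance helps.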


AVOID PENALTY FOR MIXING SSE AND AVX INSTRUCTIONS

Chart: mesh_to_sdf.exe --maxload AVX2(8-wide) (less is better). Execution time in milliseconds: 39,623 before optimization, 12,589 after optimization.

· For AMD "Zen 2" and "Zen 3" CPUs, there is a significant penalty for mixing SSE and AVX instructions when the upper 128 bits of the YMM registers contain non-zero data.
· Benchmark execution time was reduced by over 60% after a VZeroUpper optimization.
· Performance of binaries compiled with Microsoft® Visual Studio 2022 v17.8.6.
· Testing done by AMD technology labs, February 8, 2024 on the following system. Test configuration: AMD Ryzen Threadripper PRO 5995WX, Enermax LIQTECH TR4 II series 360mm liquid cooler, 256GB (8 x 32GB 2R RDDR4-3200 at 24-22-22-52) memory, AMD Radeon RX 6700 XT GPU with driver 24.1.1 (January 11, 2024), 1TB M.2 NVME SSD, AMD Reference Motherboard, Windows® 11 x64 version 23H2, 1920x1080 resolution. Actual results may vary.


AVOID PENALTY FOR MIXING SSE AND AVX INSTRUCTIONS
· Use "SSE_AVX_STALLS" PMCx00E Floating Point Dispatch Faults > 0 to find code which may be missing VZeroUpper or VZeroAll instructions during AVX to SSE and SSE to AVX transitions.
· Optimization 1:
  · Use the /arch:AVX compiler flag.
  · AVX is supported by 97% of users according to the January 2024 Steam Hardware & Software Survey.
· Optimization 2:
  · Return a __m256 value using pass-by-reference in the function parameter list rather than the function return type.
· Optimization 3:
  · Use __forceinline on the function definition.

AVOID PENALTY FOR MIXING SSE AND AVX INSTRUCTIONS

// Before Optimization
__m256 udTriangle_sq_precalc_SIMD_8grid(
    const __m256 p_x,
    const __m256 p_y,
    const __m256 p_z,
    const tri_precalc_t &pc )
{
    // ...
    __m256 res = _mm256_blendv_ps( res1, res0, cmp );
    return res;
}

// After Optimization
void udTriangle_sq_precalc_SIMD_8grid(
    const __m256 p_x,
    const __m256 p_y,
    const __m256 p_z,
    const tri_precalc_t& pc,
    __m256 &ret )
{
    // ...
    ret = _mm256_blendv_ps( res1, res0, cmp );
}


Before the optimization, SSE_AVX_STALLS may occur because there is no VZeroUpper or VZeroAll instruction during the AVX to SSE transition.

After the optimization, SSE_AVX_STALLS have been reduced because there is a VZeroUpper instruction during the AVX to SSE transition.

DO YOU WANT TO KNOW MORE?

SOFTWARE OPTIMIZATION GUIDES AT AMD.COM

CPU Architecture    Publication No.    Name
AMD "Zen 4"         57647              Software Optimization Guide for the AMD "Zen4" Microarchitecture
AMD "Zen 3"         56665              Software Optimization Guide for AMD EPYC 7003 Processors (formerly Software Optimization Guide for AMD Family 19h Processors)
AMD "Zen 2"         56305              Software Optimization Guide for AMD EPYC 7002 Processors (formerly Software Optimization Guide for AMD Family 17h Models 30h and Greater Processors)
AMD "Zen 1"         55723              Software Optimization Guide for AMD Family 17h Processors


John.Hartwig@amd.com

Kenneth.Mitchell@amd.com


Design faster. Render faster. Iterate faster.

DISCLAIMER AND NOTICES
Disclaimer
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. Any computer system has risks of security vulnerabilities that cannot be completely prevented or mitigated. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
THIS INFORMATION IS PROVIDED "AS IS." AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY RELIANCE, DIRECT, INDIRECT, SPECIAL, OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
AMD is not responsible for any electronic virus or damage or losses therefrom that may be caused by changes or modifications that you make to your system, including but not limited to antivirus software. Changes to your system configurations and settings, including but not limited to antivirus software, is done at your sole discretion and under no circumstances will AMD be liable to you for any such changes. You assume all risk and are solely responsible for any damages that may arise from or are related to changes that you make to your system, including but not limited to antivirus software.
AMD, the AMD Arrow logo, Ryzen, Threadripper, Radeon, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies. Microsoft, Windows, and Visual Studio are registered trademarks of Microsoft Corporation in the US and/or other countries. Unreal® is a trademark or registered trademark of Epic Games, Inc. in the United States of America and elsewhere. NVIDIA is a trademark and/or registered trademark of NVIDIA Corporation in the U.S. and/or other countries. Steam is a trademark and/or registered trademark of Valve Corporation. PCIe is a registered trademark of PCI-SIG. AMD products or technologies may include hardware to accelerate encoding or decoding of certain video standards but require the use of additional programs/applications.
©2024 Advanced Micro Devices, Inc. All rights reserved.

DISCLAIMER AND NOTICES
· Code sample on slide 70 (ConvexRenderer::updateTransformations, under "USE SOFTWARE PREFETCH INSTRUCTIONS FOR LINKED DATA") is modified.
· Copyright (c) 2024 NVIDIA Corporation. All rights reserved. Code Sample is licensed subject to the following:
"Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE."
· MeshToSDF, Copyright 2024 Mikkel Gjoel under MIT License. https://github.com/pixelmager/MeshToSDF
· Infiltrator Demo and City Sample use the Unreal® Engine. Unreal® is a trademark or registered trademark of Epic Games, Inc. in the United States of America and elsewhere.
· Unreal® Engine, Copyright 1998-2024, Epic Games, Inc. All rights reserved.
· Intel® Embree is released as Open Source under the Apache 2.0 license.

DISCLAIMER AND NOTICES
· Claim "Zen 4" average 13% IPC uplift compared to "Zen 3" desktop processors
  · RPL-005: Testing as of 15 August, 2022, by AMD Performance Labs using the following hardware: AMD AM5 Reference Motherboard with AMD Ryzen 7 7700X with G.Skill DDR5-6000C30 (F5-6000J3038F16GX2-TZ5N) with AMD EXPO loaded, AMD AM4 Reference Motherboard with AMD Ryzen 7 5800X and DDR4-3600C16. Processors fixed to 4GHz frequency with 8C16 enabled and evaluated with 22 different workloads. ALL SYSTEMS configured with NZXT Kraken X63, open air test bench, Radeon RX 6950XT (driver 22.7.1 Optional), Windows® 11 22000.856, AMD Smart Access Memory/PCIe® Resizable Base Address Register ("ReBAR") ON, Virtualization-Based Security (VBS) OFF. Results may vary.
· Design faster. Render faster. Iterate faster. Create more, faster with AMD Ryzen processors
  · Testing by AMD Performance Labs as of September 23, 2020 using a Ryzen 9 5950X and Intel Core i9-10900K configured with DDR4-3600C16 and NVIDIA GeForce RTX 2080 Ti. Results may vary. R5K-039
· The information contained herein is for informational purposes only, and is subject to change without notice. Timelines, roadmaps, and/or product release dates shown in these slides are plans only and subject to change. "Navi", "Vega", "Polaris", "Zen", "Zen+", "Zen 2", "Zen 3", and "Zen 4" are codenames for AMD architectures, and are not product names. GD-122