AMD RYZEN PROCESSOR SOFTWARE OPTIMIZATION
KEN MITCHELL

AGENDA
· Abstract
· Speaker Biography
· Products
· Data Flow
· Microarchitecture
· Best Practices
· Optimizations

AMD PUBLIC | GDC24 | AMD RYZEN PROCESSOR SOFTWARE OPTIMIZATION | MARCH 2024

ABSTRACT
· Break through CPU bottlenecks to reach higher frames-per-second!
· Dive into data flow, simultaneous multithreading, resource sharing, instruction set evolution, cache hierarchies, and coherency.
· Unlock powerful profiling tools and application analysis techniques.
· Discover best practices and lessons learned.
· Attack valuable code optimization opportunities.
· Examples include C/C++, assembly, and hardware performance-monitoring counters.

SPEAKER BIOGRAPHY
· Ken Mitchell is a Fellow and Technical Lead in the AMD Software Performance Engineering team where he collaborates with Microsoft® Windows® and AMD engineers to optimize AMD processors for better performance-per-watt. He began working at AMD in 2005. His previous work includes helping game developers utilize AMD processors efficiently, analyzing PC applications for performance projections of future AMD products, as well as developing system benchmarks. Ken earned a Bachelor of Science in Computer Science degree at the University of Texas at Austin.
· Kenneth.Mitchell@amd.com

PRODUCTS

FORMER CODE NAMES
Segment | "Zen 4" | "Zen 3" | "Zen 2" | "Zen"
Mobile (Laptop) | "Hawk Point" AMD Ryzen 8040 Series; "Phoenix" AMD Ryzen 7040 Series | "Rembrandt" AMD Ryzen 6000 Series; "Cezanne" AMD Ryzen 5000 Series | "Renoir" AMD Ryzen 4000 Series | "Picasso" AMD Ryzen 3000 Series; "Raven Ridge" AMD Ryzen 2000 Series
Desktop | "Raphael AM5" AMD Ryzen 7000 Series | "Vermeer" AMD Ryzen 5000 Series | "Matisse" AMD Ryzen 3000 Series | "Pinnacle Ridge" AMD Ryzen 2000 Series; "Summit Ridge" AMD Ryzen 1000 Series
Workstation | "Storm Peak" AMD Threadripper 7000 Series | "Chagall PRO" AMD Threadripper 5000 Series | "Castle Peak" AMD Threadripper 3000 Series | "Threadripper" AMD Threadripper 1000 Series
· Table shown does not include all former code names for each CPU architecture.

"HAWK POINT" AMD RYZEN 8040 SERIES PROCESSORS (Mobile)
Model | Cores / Threads | Boost / Base Frequency | GPU Compute Units | AMD Ryzen AI | TDP
AMD Ryzen 9 8945HS | 8 / 16 | Up to 5.2GHz / 4.0GHz | 12 | Yes | 45W
AMD Ryzen 7 8845HS | 8 / 16 | Up to 5.1GHz / 3.8GHz | 12 | Yes | 45W
AMD Ryzen 7 8840U | 8 / 16 | Up to 5.1GHz / 3.3GHz | 12 | Yes | 28W
AMD Ryzen 7 8840HS | 8 / 16 | Up to 5.1GHz / 3.3GHz | 12 | Yes | 28W
AMD Ryzen 5 8645HS | 6 / 12 | Up to 5.0GHz / 4.3GHz | 8 | Yes | 45W
AMD Ryzen 5 8640U | 6 / 12 | Up to 4.9GHz / 3.5GHz | 8 | Yes | 28W
AMD Ryzen 5 8640HS | 6 / 12 | Up to 4.9GHz / 3.5GHz | 8 | Yes | 28W
AMD Ryzen 5 8540U | 6 / 12 | Up to 4.9GHz / 3.2GHz | 4 | No | 28W
AMD Ryzen 3 8440U | 4 / 8 | Up to 4.7GHz / 3.0GHz | 4 | No | 28W

"RAPHAEL AM5" AMD RYZEN 7000 SERIES PROCESSORS (Desktop)
Model | Cores / Threads | Boost / Base Frequency | GPU Compute Units | AMD Ryzen AI | TDP
AMD Ryzen 9 7950X3D | 16 / 32 | Up to 5.7GHz / 4.2GHz | 2 | No | 120W
AMD Ryzen 9 7950X | 16 / 32 | Up to 5.7GHz / 4.5GHz | 2 | No | 170W
AMD Ryzen 9 7900X3D | 12 / 24 | Up to 5.6GHz / 4.4GHz | 2 | No | 120W
AMD Ryzen 9 7900X | 12 / 24 | Up to 5.6GHz / 4.7GHz | 2 | No | 170W
AMD Ryzen 9 7900 | 12 / 24 | Up to 5.4GHz / 3.7GHz | 2 | No | 65W
AMD Ryzen 7 7800X3D | 8 / 16 | Up to 5.0GHz / 4.2GHz | 2 | No | 120W
AMD Ryzen 7 7700X | 8 / 16 | Up to 5.4GHz / 4.5GHz | 2 | No | 105W
AMD Ryzen 7 7700 | 8 / 16 | Up to 5.3GHz / 3.8GHz | 2 | No | 65W
AMD Ryzen 5 7600X | 6 / 12 | Up to 5.3GHz / 4.7GHz | 2 | No | 105W
AMD Ryzen 5 7600 | 6 / 12 | Up to 5.1GHz / 3.8GHz | 2 | No | 65W
AMD Ryzen 5 7500F | 6 / 12 | Up to 5.0GHz / 3.7GHz | 0 | No | 65W

"STORM PEAK" AMD THREADRIPPER PRO 7000 WX-SERIES PROCESSORS (Workstation)
Model | Cores / Threads | Boost / Base Frequency | GPU Compute Units | AMD Ryzen AI | TDP
AMD Ryzen Threadripper PRO 7995WX | 96 / 192 | Up to 5.1GHz / 2.5GHz | 0 | No | 350W
AMD Ryzen Threadripper PRO 7985WX | 64 / 128 | Up to 5.1GHz / 3.2GHz | 0 | No | 350W
AMD Ryzen Threadripper PRO 7975WX | 32 / 64 | Up to 5.3GHz / 4.0GHz | 0 | No | 350W
AMD Ryzen Threadripper PRO 7965WX | 24 / 48 | Up to 5.3GHz / 4.2GHz | 0 | No | 350W
AMD Ryzen Threadripper PRO 7955WX | 16 / 32 | Up to 5.3GHz / 4.5GHz | 0 | No | 350W
AMD Ryzen Threadripper PRO 7945WX | 12 / 24 | Up to 5.3GHz / 4.7GHz | 0 | No | 350W

DATAFLOW

"HAWK POINT"
[Diagram labels, in extraction order: 32B fetch; 32K I-Cache, 8-way; 3*32B load, 2*32B store; 32K D-Cache, 8-way (cclk); 32B/cycle; 32B/cycle; 1024K L2 I+D Cache, 8-way; 32B/cycle; 16M L3 I+D Cache, 16-way (l3clk); 32B/cycle; Data Fabric (fclk); 32B/cycle; Unified Memory Controller (uclk); 4x32B/cycle RDNA3; 32B/cycle Media; 32B/cycle NPU; 64B/cycle IO Hub (lclk); 8B/cycle DRAM Channel (memclk).]
· AMD Ryzen 9 8945HS, 35-54W TDP, 8 cores, 16 threads, up to 5.2 GHz max boost clock, 4.0 GHz base clock with 2 channels of DDR5 memory.
· Integrated AMD RDNA 3 graphics and Neural Processing Unit (NPU).

"RAPHAEL AM5"
[Diagram labels, in extraction order: CCD, CCD, IOD; 32B fetch; 32K I-Cache, 8-way; 3*32B load, 2*32B store; 32K D-Cache, 8-way (cclk); 32B/cycle; 32B/cycle; 1024K L2 I+D Cache, 8-way; 32B/cycle; 32M L3 I+D Cache, 16-way (l3clk); 32B/cycle read, 16B/cycle write; Data Fabric (fclk); 32B/cycle; Unified Memory Controller (uclk); 2x8B/cycle DRAM Channel (memclk); 2x32B/cycle RDNA2; 32B/cycle Media; 64B/cycle IO Hub (lclk).]
· AMD Ryzen 9 7950X, 170W TDP, 16 cores, 32 threads, up to 5.7 GHz max boost clock, 4.5 GHz base clock with 2 channels of DDR5 memory.
· Two Core Complex Die (CCD). Each CCD has one 32M L3 cache.

"STORM PEAK"
[Diagram labels, in extraction order: numbered CCDs arranged around the IOD; 32B fetch; 32K I-Cache, 8-way; 3*32B load, 2*32B store; 32K D-Cache, 8-way (cclk); 32B/cycle; 32B/cycle; 1024K L2 I+D Cache, 8-way; 32B/cycle; 32M L3 I+D Cache, 16-way (l3clk); 32B/cycle read, 16B/cycle write; Data Fabric (fclk); 32B/cycle; Unified Memory Controller (uclk); 2x8B/cycle DRAM Channel (memclk); 64B/cycle IO Hub (lclk).]
· AMD Ryzen Threadripper Pro 7995WX, 350W TDP, 96 cores, 192 threads, up to 5.1 GHz boost, 2.5 GHz base with 8 channels of DDR5 memory.
· Three CCDs per Data Fabric quadrant shown.

MICROARCHITECTURE

AMD "ZEN 4"
[Block diagram labels:
 Front end: Branch Prediction; 32K I-Cache, 8-way; Decode, 4 instructions/cycle; Op Cache, 9 macro ops/cycle; Op Queue; Dispatch, 6 macro ops/cycle dispatched.
 Integer: Integer Rename; four Schedulers; Integer Register File; execution units labeled ALU/BR, AGU, ALU, AGU, ALU, AGU, ALU/BR.
 Floating point: Floating Point Rename; two Schedulers; FP/SIMD Register File; pipes labeled MUL/MAC, ADD, and F2I/ST.
 Load/store: 3 loads per cycle; 2 stores per cycle; Load/Store Queues; 32K D-Cache, 8-way; 1M L2 (I+D) Cache, 8-way.]
· ~13% higher IPC for desktop.
· Increased op cache from 4K to 6.75K ops.
· Increased L2 cache from 512 KB to 1024 KB.
· Improved load store.
· Improved branch prediction.
· Added AVX-512 instruction support.

SIMULTANEOUS MULTI-THREADING
[Diagram: program threads A and B map to Thread #1 and Thread #2 of one core. Each thread has its own Program Counter and Architectural Register Set; the Scheduler, Register Files, and Execution Units are shared.]
· Single-threaded applications do not always occupy all resources of the processor.
· The processor can take advantage of the unused resources to execute a second thread concurrently.
· Although each thread has a program counter and architectural register set, core resources may be shared while operating in two-threaded mode.

CORE RESOURCE SHARING DEFINITIONS
Category | Definition
Competitively shared | Resource entries are assigned on demand. A thread may use all resource entries.
Watermarked | Resource entries are assigned on demand. When in two-threaded mode a thread may not use more resource entries than are specified by a watermark threshold.
Statically partitioned | Resource entries are partitioned when entering two-threaded mode. A thread may not use more resource entries than are available in its partition.

AMD "ZEN 4" CORE RESOURCE SHARING
Resource | Competitively Shared or Watermarked | Statically Partitioned
Integer Scheduler | X |
Integer Register File | X |
Load Queue | X |
Floating Point Physical Register | X |
Floating Point Scheduler | X |
Memory Request Buffers | X |
Op Queue | | X
Store Queue | | X
Write Combining Buffer | X |
Retire Queue | | X
(The extracted text does not preserve which of the first two columns each X falls in.)

INSTRUCTION SET EVOLUTION
Columns, left to right: AVX512*, GFNI, VAES, VPCLMUL, CLWB, ADX, CLFLUSHOPT, RDSEED, SHA, XGETBV, XSAVEC, XSAVES, AVX2, BMI2, MOVBE, RDRND, FSGSBASE, XSAVEOPT, BMI, FMA, F16C, AES, AVX, OSXSAVE, PCLMUL, SSE4.1, SSE4.2, XSAVE, SSSE3, MONITORX, CLZERO
AMD "Zen 4": 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
AMD "Zen 3": 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
AMD "Zen 2": 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
AMD "Zen 1": 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
"Jaguar":    0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 1 1 1 1 1 1 1 1 1 0 0

AVX512 INSTRUCTION SET EVOLUTION
Columns, left to right: AVX512_BF16, AVX512_VPOPCNTDQ, AVX512_BITALG, AVX512_VNNI, AVX512_VBMI2, AVX512_VBMI, AVX512VL, AVX512BW, AVX512CD, AVX512_IFMA, AVX512DQ, AVX512F
AMD "Zen 4": 1 1 1 1 1 1 1 1 1 1 1 1
AMD "Zen 3": 0 0 0 0 0 0 0 0 0 0 0 0
AMD "Zen 2": 0 0 0 0 0 0 0 0 0 0 0 0
AMD "Zen 1": 0 0 0 0 0 0 0 0 0 0 0 0
"Jaguar":    0 0 0 0 0 0 0 0 0 0 0 0

SOFTWARE PREFETCH INSTRUCTIONS
[Diagram labels: Prefetch(T0)|(NTA): fill lines; L1D 32 KB; L2 1024 KB; L3 32768 KB; aggressively evict Prefetch NTA lines.]
· Use Software Prefetch instructions on linked data structures experiencing cache misses.
· Use NTA on use once data.
· While in two-threaded mode, beware too many software prefetches may evict the working set of the other thread from their shared caches.
· Prefetch(T0)|(NTA) fills into L1.
· Prefetch(T1)|(T2) fills into L2.
  · New for AMD "Zen 4"!
[Diagram labels, continued: Prefetch(T1)|(T2): fill lines; Memory: gigabytes.]

HARDWARE PREFETCHERS L1
Category | Definition
L1 Stream | Uses history of memory access patterns to fetch additional sequential lines in ascending or descending order.
L1 Stride | Uses memory access history of individual instructions to fetch additional lines when each access is a constant distance from the previous.
L1 Region | Uses memory access history to fetch additional lines when the data access for a given instruction tends to be followed by a consistent pattern of other accesses within a localized region.

HARDWARE PREFETCHERS L2
Category | Definition
L2 Stream | Uses history of memory access patterns to fetch additional sequential lines in ascending or descending order.
L2 Up/Down | Uses memory access history to determine whether to fetch the next or previous line for all memory accesses.

STREAMING HARDWARE PREFETCHER
· Uses history of memory access patterns to fetch additional sequential lines in ascending or descending order.
[Diagram: memory addresses 0x0 to 0x800 in 0x40 steps; a "Stream +1" pattern touches consecutive lines 1 to 10.]

    alignas(64) float a[LEN];
    // ...
    float sum = 0.0f;
    for (size_t i = 0; i < LEN; i++) {
        sum += a[i]; // streaming prefetch
    }

STRIDE HARDWARE PREFETCHER
[Diagram axis: memory addresses 0x0 to 0x800 in 0x40 steps.]
· Uses memory access history of individual instructions to fetch additional lines when each access is a constant distance from the previous.
[Diagram: two access patterns over the memory addresses, each labeled "Stride +5"; the first touches lines 1 to 5 and the second lines 1 to 4.]

    struct S {
        double x1, y1, z1, w1;
        char name[256];
        double x2, y2, z2, w2;
    };
    alignas(64) S a[LEN];
    // ...
    double sumX1 = 0.0f, sumX2 = 0.0f;
    for (size_t i = 0; i < LEN; i++) {
        sumX1 += a[i].x1; // stride prefetch 0
        sumX2 += a[i].x2; // stride prefetch 1
    }

DESKTOP CACHE HIERARCHY EVOLUTION
Core | AMD "Zen 4" | AMD "Zen 3" | AMD "Zen 2" | AMD "Zen 1"
uOP/Core (K) | 6.75 | 4 | 4 | 2
L1I/Core (KB) | 32 | 32 | 32 | 64
L1D/Core (KB) | 32 | 32 | 32 | 32
L2/Core (KB) | 1024 | 512 | 512 | 512
L3/CCX (MB) | 32* | 32* | 16 | 8

CACHE-COHERENCY PROTOCOL
· The AMD cache-coherency protocol is MOESI (Modified, Owned, Exclusive, Shared, Invalid).
· Instruction-execution activity and external-bus transactions may change the cache's MOESI state.
· Read hits do not cause a MOESI-state change.
· Write hits generally cause a MOESI-state change into the modified state.
· If the cache line is already in the modified state, a write hit does not change its state.

CACHE-TO-CACHE TRANSFERS
[Diagram: Memory and Data Fabric above two complexes. CCX0 and CCX1 each have a 32MB L3$ with shadow tags and cores Core0 to Core7.]
· Each CPU Complex (CCX) has a L3 cache shared by up to eight cores.
· The L3 cache has shadow tags for each L2 cache within its complex.
· Shadow tags determine if a "fast" cache-to-cache transfer between cores within the CCX is possible.
· Cache-coherency probe latency responses may be slower from cores in another CCX.

CACHE-COHERENCY EFFICIENCY
[Diagram: Core0 to Core3 in CCX0 and Core0 to Core3 in CCX1, connected through the Data Fabric.]
· Minimize ping-ponging modified cache lines between cores especially in another CCX!
· Minimize using Read-Modify-Write instructions.
  · Use a single atomic add with a local sum rather than many atomic increment operations.
· Improve lock efficiency.
  · "Test and Test-and-Set" in user spin locks with a pause instruction.
  · Replace user spin locks with modern sync APIs.
· Use a memory allocator optimized for multi-threading.
  · Try mimalloc or jemalloc.

AMD "PREFERRED CORE"
[Chart: SchedulingClass (higher is better) per logical processor 0 to 30, for the Default and EffectivePowerModeGameMode series.]
· Some AMD products have cores that are faster than other cores.
· Windows® may use SchedulingClass or EfficiencyClass during thread scheduling. These values may change during runtime.
· Thread affinity masks may interfere with thread scheduling and power management optimizations on Windows PCs.
· Testing done by AMD performance labs January 22, 2023 on an AMD reference motherboard equipped with 16GB DDR5-6000MHz, Ryzen 9 7950X3D with Nvidia RTX 4090, Win11 Pro x64 22621.1105. Actual results may vary.

BEST PRACTICES

PREFER SHIPPING CONFIGURATION BUILDS FOR CPU PROFILING
Chart: average FPS, UE5.1 City Sample, DX12, 1080p (higher is better), by build configuration: Shipping 77; Development 46.
· Disable debug features before you ship!
· Debug and development builds may reduce performance.
  · Stats collection may cause cache pollution.
  · Logging may create serialization points.
  · Debug builds may disable multithreading optimizations.
· Performance of UE5.1 binaries compiled with Microsoft® Visual Studio 2022 v17.4.4.
· Testing done by AMD technology labs, January 30, 2023 on the following system.
Test configuration: AMD Ryzen Threadripper PRO 5995WX, Cooler Master MasterLiquid ML360 RGB TR4 Edition, 256GB (8 x 32GB 2R RDDR4-3200 at 24-22-22-52) memory, AMD Radeon RX 7900 XTX GPU with driver 23.1.1 (January 11, 2023), 2TB M.2 NVME SSD, AMD Reference Motherboard, Windows® 11 version 22H2, 1920x1080 resolution. Actual results may vary.

DISABLE ANTI-TAMPER WHILE CPU PROFILING
· Anti-tamper and Anti-Cheat technologies may prevent CPU debugging and profiling tools from working correctly, especially while loading and retrieving symbol information.
· Create a CPU-profiling-friendly build configuration similar to the Shipping configuration but with Anti-Tamper and Anti-Cheat technologies disabled.
· Add this build as a launch option during development.
· Remove this build before release.

TEST COLD SHADER CACHE FIRST TIME USER EXPERIENCE

    rem Run as administrator
    rem Disable Steam shader pre-caching before running this script
    rem Reboot after running this script to clear any shaders still in system memory
    setlocal enableextensions
    cd /d "%~dp0"
    rmdir /s /q "%LOCALAPPDATA%\D3DSCache"
    rmdir /s /q "%LOCALAPPDATA%\AMD\DX9Cache"
    rmdir /s /q "%LOCALAPPDATA%\AMD\DxCache"
    rmdir /s /q "%LOCALAPPDATA%\AMD\DxcCache"
    rmdir /s /q "%LOCALAPPDATA%\AMD\OglCache"
    rmdir /s /q "%LOCALAPPDATA%\AMD\VkCache"
    rmdir /s /q "%LOCALAPPDATA%\NVIDIA\DXCache"
    rmdir /s /q "%ProgramFiles(x86)%\Steam\steamapps\shadercache"

USE THE LATEST COMPILER AND WINDOWS® SDK
Chart: Msbuild.exe UE4.sln -target:Engine\UE4:Rebuild -property:Configuration=Shipping -property:Platform=Win64, rebuild time in seconds (less is better), by Visual Studio Build Tools version: 2017 v15.9.43, 2019 v16.11.9, 2022 v17.05; plotted values, in the same order: 205, 121, 119.
· Get the latest build and link time improvements.
· Get the latest library and runtime optimizations.
· Performance of UE4.27.2 binaries compiled with Microsoft® Visual Studio.
· Testing done by AMD technology labs, February 5, 2022 on the following system. Test configuration: AMD Ryzen Threadripper PRO 5995WX, Enermax LIQTECH TR4 II series 360mm liquid cooler, 256GB (8 x 32GB 2R RDDR4-3200 at 24-22-22-52) memory, AMD Radeon RX 6800 XT GPU with driver 21.10.2 (October 25, 2021), 2TB M.2 NVME SSD, AMD Reference Motherboard, Windows® 11 x64 version 21H2, 1920x1080 resolution. Actual results may vary.

ADD VIRUS AND THREAT PROTECTION EXCLUSIONS
Chart: Msbuild.exe UE5.sln -target:Engine\UE5:Rebuild -property:Configuration=Shipping -property:Platform=Win64, rebuild time in seconds (less is better), by folder exclusions: None 224; C:\UnrealEngine-5.1 182.
· WARNING: Not recommended for CI/CD systems. Exclusions may make your device vulnerable to threats.
· Add project folders to virus and threat protection settings exclusions for faster build times.
· Faster rebuild time after optimization!
· Performance of UE5.1 binaries compiled with Microsoft® Visual Studio 2022 v17.4.4.
· Testing done by AMD technology labs, January 28, 2023 on the following system. Test configuration: AMD Ryzen Threadripper PRO 5995WX, Cooler Master MasterLiquid ML360 RGB TR4 Edition, 256GB (8 x 32GB 2R RDDR4-3200 at 24-22-22-52) memory, AMD Radeon RX 7900 XTX GPU with driver 23.1.1 (January 11, 2023), 2TB M.2 NVME SSD, AMD Reference Motherboard, Windows® 11 version 22H2, 1920x1080 resolution. Actual results may vary.

REDUCE BUILD TIMES
Chart: Msbuild.exe UE4.sln -target:Engine\UE4:Rebuild -property:Configuration=Shipping -property:Platform=Win64, rebuild time in seconds (less is better), by system configuration: VS2017, Without Virus Exclusion Folders 231; VS2022, With Virus Exclusion Folders 119.
· Performance of UE4.27.2 binaries compiled with Microsoft® Visual Studio.
· Testing done by AMD technology labs, February 5, 2022 on the following system. Test configuration: AMD Ryzen Threadripper PRO 5995WX, Enermax LIQTECH TR4 II series 360mm liquid cooler, 256GB (8 x 32GB 2R RDDR4-3200 at 24-22-22-52) memory, AMD Radeon RX 6800 XT GPU with driver 21.10.2 (October 25, 2021), 2TB M.2 NVME SSD, AMD Reference Motherboard, Windows® 11 x64 version 21H2, 1920x1080 resolution. Actual results may vary.

USE AVX OR AVX2 IF CPU MINIMUM REQUIREMENTS ALLOW
Chart: Steam Hardware & Software Survey, January 2024 (higher is better): SSE2 100%; AVX 97%; AVX2 92%; AVX512 11%.
· A binary may have better code generation using AVX or later ISA by using the Microsoft® Visual C compiler option /arch:[AVX|AVX2|AVX512].
· Minimum hardware requirements:
  · Windows 10 = SSE2
  · Windows 11 = SSE4.1
· The Windows 10 supported processor list includes AMD products which support AVX but not AVX2.
· The Windows 10 supported processor list may include products from other CPU vendors which do not support AVX.

ENABLE AVX512 IN DEVELOPMENT TOOLS
Chart: embree-3.13.5.x64.vc14.windows, pathtracer_ispc.exe -c asian_dragon.ecs --fullscreen --print-frame-rate, FPS (higher is better).
· Development tools may benefit from AVX512.
· Examples:
  · Light Baking.
  · Texture Compression.
  · Mesh to Signed Distance Fields.
Chart values: AVX512 Disabled 23 FPS; AVX512 Enabled 27 FPS.
· Testing done by AMD technology labs, January 29, 2023 on the following system. Test configuration: AMD Ryzen 7950X, NZXT Kraken X62 cooler, 32GB (2 x 16GB DDR5-6000 30-38-38-96) memory, AMD Radeon RX 7900 XTX GPU with driver 23.1.1 (January 11, 2023), 2TB M.2 NVME SSD, AMD Reference Motherboard, Windows® 11 x64 build 22H2, 1920x1080 resolution. Actual results may vary.

AUDIT CONTENT
· Ask artists to recommend profiling scenes of interest!
  · For example, an indoor dungeon, an outdoor city, an outdoor forest, large crowds, or a specific time of day.
· Run Unreal Engine MapCheck!
  · It may find some performance issues.
  · https://docs.unrealengine.com/en-US/BuildingWorlds/LevelEditor/MapErrors/index.html
· Use Unity AssetPostprocessor!
  · Enforce minimum standards.
  · https://docs.unity3d.com/Manual/BestPracticeUnderstandingPerformanceInUnity4.html
· Check stats before CPU profiling!
  · If a scene far exceeds its draw budget or has many duplicate objects, report the issue to its artists and consider profiling a different scene. Otherwise, you may risk profiling hot spots which may not be hot after the art issues are resolved.

SUPPORT HYBRID GRAPHICS
· Use IDXGIFactory6::EnumAdapterByGpuPreference with DXGI_GPU_PREFERENCE_HIGH_PERFORMANCE for game applications.
· The user may change preferences per application in Graphics settings.
· Testing done by AMD performance labs January 24, 2022 on a Dell G5 15 SE laptop equipped with 16GB DDR4-3200MHz, Ryzen 9 4900H with Radeon RX 5600M, Win11 Pro x64 22000.434.

USE PREFERRED VIDEO AND AUDIO CODECS
· Prefer H264 video and AAC audio codecs as recommended by the Unreal Engine Electra Media Player.
· Hardware accelerated codecs may increase hours of battery life and reduce CPU work.
· AMD Radeon graphics devices released since 2022 no longer accelerate WMV3 decoding.
· See amd.com product specifications for supported rendering formats.

OPTIMIZATIONS

SYNC APIS

USE MODERN SYNC APIS
[Chart: Exclusive Lock Test, total CPU utilization at start of test, 0% to 100% (less is better), with Core Isolation Memory Integrity Off and On.]
· Avoid user spin locks that starve payload work on other ready threads, consume excessive power, and drain laptop batteries.
· Performance of binaries compiled with Microsoft® Visual Studio 2022 v17.8.6.
· Testing done by AMD technology labs, January 6, 2024 on the following system. Test configuration: AMD Ryzen Threadripper 7995WX 96-Cores, NZXT Kraken 360 cooler, 256GB (8 x 32GB RDDR5-4800) memory, AMD Radeon RX 580 GPU with 31.0.12027.9001 (March 20, 2023), 1TB M.2 NVME SSD, AMD Reference Motherboard, Windows® 11 x64 version 23H2, 1920x1080 resolution. Actual results may vary.

USE MODERN SYNC APIS
[Chart: Exclusive Lock Test, time in milliseconds, 0 to 200,000 (less is better), with Core Isolation Memory Integrity Off and On.]
· Prefer std::mutex which has good performance and low CPU utilization.
· Legacy sync APIs like WaitForSingleObject may rely on expensive syscall instructions.
· Modern sync APIs like std::mutex may rely on lower-cost mwaitx instructions that can execute at any privilege level.
· Performance of binaries compiled with Microsoft® Visual Studio 2022 v17.8.6.
· Testing done by AMD technology labs, January 6, 2024 on the following system.
Test configuration: AMD Ryzen Threadripper 7995WX 96-Cores, NZXT Kraken 360 cooler, 256GB (8 x 32GB RDDR5-4800) memory, AMD Radeon RX 580 GPU with 31.0.12027.9001 (March 20, 2023), 1TB M.2 NVME SSD, AMD Reference Motherboard, Windows® 11 x64 version 23H2, 1920x1080 resolution. Actual results may vary.

USE MODERN SYNC APIS: SHARED CODE

    #include <intrin.h>
    #include <algorithm>  // std::fill
    #include <chrono>
    #include <cstdlib>    // strtof, EXIT_SUCCESS
    #include <cwchar>     // wprintf
    #include <mutex>
    #include <numeric>    // std::accumulate
    #include <thread>
    #include <vector>
    #include <Windows.h>

    #define LEN 128
    alignas(64) float b[LEN][4][4];
    alignas(64) float c[LEN][4][4];

    void fn(); // worker; one definition per lock variant on the following slides

    int main(int argc, char* argv[]) {
        using namespace std::chrono;
        float b0 = (argc > 1) ? strtof(argv[1], NULL) : 1.0f;
        float c0 = (argc > 2) ? strtof(argv[2], NULL) : 2.0f;
        std::fill((float*)b, (float*)(b + LEN), b0);
        std::fill((float*)c, (float*)(c + LEN), c0);
        // One worker thread per logical processor.
        size_t num_threads = GetMaximumProcessorCount(ALL_PROCESSOR_GROUPS);
        wprintf(L"num_threads: %llu\n", num_threads);
        std::vector<std::thread> threads = {};
        auto t0 = high_resolution_clock::now();
        for (size_t i = 0; i < num_threads; ++i) {
            threads.push_back(std::thread(fn));
        }
        for (size_t i = 0; i < num_threads; ++i) {
            threads[i].join();
        }
        auto t1 = high_resolution_clock::now();
        wprintf(L"time (ms): %lli\n", duration_cast<milliseconds>(t1 - t0).count());
        return EXIT_SUCCESS;
    }

USE MODERN SYNC APIS: BAD USER SPIN LOCK

    namespace MyLock {
        typedef unsigned LOCK, *PLOCK;
        enum { LOCK_IS_FREE = 0, LOCK_IS_TAKEN = 1 };
        // Spins on a locked compare-exchange with no pause.
        void Lock(PLOCK pl) {
            while (LOCK_IS_TAKEN == _InterlockedCompareExchange(
                       reinterpret_cast<long*>(pl), LOCK_IS_TAKEN, LOCK_IS_FREE)) {
            }
        }
        void Unlock(PLOCK pl) {
            _InterlockedExchange(reinterpret_cast<long*>(pl), LOCK_IS_FREE);
        }
    }
    MyLock::LOCK gLock;

    void fn() {
        alignas(64) float a[LEN][4][4];
        std::fill((float*)a, (float*)(a + LEN), 0.0f);
        float r = 0.0;
        for (size_t iter = 0; iter < 100000; iter++) {
            MyLock::Lock(&gLock);
            for (int m = 0; m < LEN; m++)
                for (int i = 0; i < 4; i++)
                    for (int j = 0; j < 4; j++)
                        for (int k = 0; k < 4; k++)
                            a[m][i][j] += b[m][i][k] * c[m][k][j];
            r += std::accumulate((float*)a, (float*)(a + LEN), 0.0f);
            MyLock::Unlock(&gLock);
        }
        wprintf(L"result: %f\n", r);
    }

USE MODERN SYNC APIS: IMPROVED USER SPIN LOCK

    namespace MyLock {
        typedef unsigned LOCK, *PLOCK;
        enum { LOCK_IS_FREE = 0, LOCK_IS_TAKEN = 1 };
        // "Test and Test-and-Set": read the lock first, and only attempt the
        // locked exchange when it looks free. Pause between attempts.
        void Lock(PLOCK pl) {
            while ((LOCK_IS_TAKEN == *static_cast<volatile unsigned*>(pl)) ||
                   (LOCK_IS_TAKEN == _InterlockedExchange(
                        reinterpret_cast<long*>(pl), LOCK_IS_TAKEN))) {
                _mm_pause();
            }
        }
        void Unlock(PLOCK pl) {
            _InterlockedExchange(reinterpret_cast<long*>(pl), LOCK_IS_FREE);
        }
    }
    alignas(64) MyLock::LOCK gLock;

    void fn() {
        alignas(64) float a[LEN][4][4];
        std::fill((float*)a, (float*)(a + LEN), 0.0f);
        float r = 0.0;
        for (size_t iter = 0; iter < 100000; iter++) {
            MyLock::Lock(&gLock);
            for (int m = 0; m < LEN; m++)
                for (int i = 0; i < 4; i++)
                    for (int j = 0; j < 4; j++)
                        for (int k = 0; k < 4; k++)
                            a[m][i][j] += b[m][i][k] * c[m][k][j];
            r += std::accumulate((float*)a, (float*)(a + LEN), 0.0f);
            MyLock::Unlock(&gLock);
        }
        wprintf(L"result: %f\n", r);
    }

USE MODERN SYNC APIS: WAITFORSINGLEOBJECT

    // MyLock not required. Let the OS do the work!
    HANDLE hMutex;

    int main(int argc, char* argv[]) {
        hMutex = CreateMutex(NULL, FALSE, NULL);
        // otherwise main is the same as before.
        // ...
    }

    void fn() {
        alignas(64) float a[LEN][4][4];
        std::fill((float*)a, (float*)(a + LEN), 0.0f);
        float r = 0.0;
        for (size_t iter = 0; iter < 100000; iter++) {
            WaitForSingleObject(hMutex, INFINITE);
            for (int m = 0; m < LEN; m++)
                for (int i = 0; i < 4; i++)
                    for (int j = 0; j < 4; j++)
                        for (int k = 0; k < 4; k++)
                            a[m][i][j] += b[m][i][k] * c[m][k][j];
            r += std::accumulate((float*)a, (float*)(a + LEN), 0.0f);
            ReleaseMutex(hMutex);
        }
        wprintf(L"result: %f\n", r);
    }

USE MODERN SYNC APIS: STD::MUTEX

    // MyLock not required. Let the OS do the work!
    std::mutex mutex;

    void fn() {
        alignas(64) float a[LEN][4][4];
        std::fill((float*)a, (float*)(a + LEN), 0.0f);
        float r = 0.0;
        for (size_t iter = 0; iter < 100000; iter++) {
            mutex.lock();
            for (int m = 0; m < LEN; m++)
                for (int i = 0; i < 4; i++)
                    for (int j = 0; j < 4; j++)
                        for (int k = 0; k < 4; k++)
                            a[m][i][j] += b[m][i][k] * c[m][k][j];
            r += std::accumulate((float*)a, (float*)(a + LEN), 0.0f);
            mutex.unlock();
        }
        wprintf(L"result: %f\n", r);
    }

USE MODERN SYNC APIS
· Prefer functions using mwaitx
  · std::mutex
  · std::shared_mutex
  · AcquireSRWLockExclusive
  · AcquireSRWLockShared
  · SleepConditionVariableSRW
  · SleepConditionVariableCS
  · EnterCriticalSection
· Avoid functions using syscall
  · WaitForSingleObject
  · WaitForMultipleObjects

WINDOWS PERFORMANCE ANALYZER SPIN LOCK

    wpr.exe start cpu
    rem run test
    pause
    wpr.exe stop log.etl

WINDOWS PERFORMANCE ANALYZER STD::MUTEX

    wpr.exe start cpu
    rem run test
    pause
    wpr.exe stop log.etl

VISUAL STUDIO CONCURRENCY VISUALIZER SPIN LOCK
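
As a follow-up to the std::mutex variant, the manual lock()/unlock() pair can also be written with std::lock_guard, so the mutex is released on every exit path from the critical section. The block below is a minimal, portable sketch and not the benchmark from the slides: `gMutex`, `gTotal`, `add_work`, and `run_workers` are hypothetical names introduced here, and the matrix payload is reduced to a single addition to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical shared state, for illustration only.
static std::mutex gMutex;
static long long gTotal = 0;

// Each worker adds 1 to the shared total `iterations` times.
// std::lock_guard locks in its constructor and unlocks in its destructor,
// so the mutex is released at the end of each loop iteration.
void add_work(std::size_t iterations) {
    for (std::size_t i = 0; i < iterations; ++i) {
        std::lock_guard<std::mutex> guard(gMutex);
        gTotal += 1;
    }
}

// Runs `num_threads` workers and returns the final total.
long long run_workers(std::size_t num_threads, std::size_t iterations) {
    assert(num_threads > 0);
    {
        std::lock_guard<std::mutex> guard(gMutex);
        gTotal = 0; // reset so repeated calls start from zero
    }
    std::vector<std::thread> threads;
    threads.reserve(num_threads);
    for (std::size_t i = 0; i < num_threads; ++i) {
        threads.emplace_back(add_work, iterations);
    }
    for (auto& t : threads) {
        t.join();
    }
    std::lock_guard<std::mutex> guard(gMutex);
    return gTotal;
}
```

Calling `run_workers(4, 10000)` should return 40000 regardless of how the threads interleave. Whether std::lock_guard changes performance relative to explicit lock()/unlock() is not something the slides measure; the sketch only shows the scoping pattern.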
VISUAL STUDIO CONCURRENCY VISUALIZER STD::MUTEX AMD PUBLIC | GDC24 | AMD RYZEN PROCESSOR SOFTWARE OPTIMIZATION | MARCH 2024 56 THREADING AMD PUBLIC | GDC24 | AMD RYZEN PROCESSOR SOFTWARE OPTIMIZATION | MARCH 2024 57 TUNE THREAD POOL SIZE FOR INITIALIZATION AND GAME PLAY · This advice is specific to AMD processors and is not general guidance for all processor vendors. · Profile your game to determine the optimal thread pool size for both game initialization and play. · Utilizing all logical processors in SMT dual-thread mode may benefit game initialization. · Utilizing only physical cores, each in single-thread, mode may benefit game play. · for systems with at least 8 AMD Ryzen CPU cores. · See the core count code sample at https://gpuopen.com/learn/cpu-core-counts/. AMD PUBLIC | GDC24 | AMD RYZEN PROCESSOR SOFTWARE OPTIMIZATION | MARCH 2024 58 AVOID HARD AFFINITY MASKS ON PC · Hard affinity masks interfere with OS power management and thread scheduling. · CPU Sets provide APIs to declare application affinity in a 'soft' manner that is compatible with OS power management. Minimum OS Affinity Type Function Windows XP hard SetThreadAffinityMask Windows 7 hard SetThreadGroupAffinity Windows 10 soft SetThreadSelectedCpuSets Windows 11 soft SetThreadSelectedCpuSetMasks My thread hard affinity = none t0 t1 CPU0 My thread Other app CPU1 idle My thread t2 idle My thread t3 idle My thread My thread hard affinity = CPU0 t0 T1 CPU0 My thread Other app CPU1 idle Idle t2 My thread idle t3 My thread idle My thread soft affinity = CPU0 t0 T1 CPU0 My thread Other app CPU1 idle My thread t2 My thread idle t3 My thread idle AMD PUBLIC | GDC24 | AMD RYZEN PROCESSOR SOFTWARE OPTIMIZATION | MARCH 2024 59 BEWARE OF PRIORITY BOOST · Thread Priority describes the order in which threads are scheduled. · Each thread has a dynamic priority. 
· The system boosts the dynamic priority under certain conditions.
· Temporary priority-boosted threads may switch in before threads intended to be higher priority by the developer.
· This feature can be disabled using SetProcessPriorityBoost and SetThreadPriorityBoost.

BASE PRIORITY AT THREAD PRIORITY LEVEL

Thread priority level            NORMAL_PRIORITY_CLASS    ABOVE_NORMAL_PRIORITY_CLASS
THREAD_PRIORITY_IDLE             1                        1
THREAD_PRIORITY_LOWEST           6                        8
THREAD_PRIORITY_BELOW_NORMAL     7                        9
THREAD_PRIORITY_NORMAL           8                        10
THREAD_PRIORITY_ABOVE_NORMAL     9                        11
THREAD_PRIORITY_HIGHEST          10                       12
THREAD_PRIORITY_TIME_CRITICAL    15                       15

DATA ACCESS

ALIGN MEMCPY SOURCE AND DESTINATION POINTERS

· Update the compiler for the latest memcpy, memset, and other C runtime optimizations!
· memcpy behavior is undefined if dest and src overlap, but the compiler may generate Rep Move String (rep movs) instructions, which have defined overlapping behavior.
· alignas(64) may allow faster rep movs microcode.
· alignas(4096) may reduce store-to-load conflicts and benefit probe filtering on some processors.
· PMCx024 LsBadStatus2 StliOther counts store-to-load conflicts where a load was unable to complete due to a non-forwardable conflict with an older store.
· Aligning to the bit_floor of the copy size may provide a good balance of cache hits and alignment:
  · std::clamp<size_t>(std::bit_floor(count), 4, 4096);

AVOID FALSE SHARING

[Chart: False Sharing Test, execution time in milliseconds (less is better). Before optimization: 23,975. After optimization: 2,557.]

· "False sharing" may occur when two or more cores modify different data within the same cache line.
· This microbenchmark showed its execution time reduced by about 90% after optimization using alignas(64)!
· Performance of binaries compiled with Microsoft® Visual Studio 2022 v17.8.6.
· Testing done by AMD technology labs, January 6, 2024 on the following system. Test configuration: AMD Ryzen Threadripper 7995WX 96-Cores, NZXT Kraken 360 cooler, 256GB (8 x 32GB RDDR5-4800) memory, AMD Radeon RX 580 GPU with driver 31.0.12027.9001 (March 20, 2023), 1TB M.2 NVME SSD, AMD Reference Motherboard, Windows® 11 x64 version 23H2, 1920x1080 resolution. Actual results may vary.

AVOID FALSE SHARING

#include <chrono>
#include <cstdlib>   // srand, rand, EXIT_SUCCESS, EXIT_FAILURE
#include <cwchar>    // wprintf
#include <malloc.h>  // _aligned_malloc, _aligned_free
#include <numeric>
#include <thread>
#include <vector>
#include <Windows.h>

#if defined (APPLY_OPTIMIZATION)
/* 64 bytes */
struct alignas(64) ThreadData { unsigned long sum; };
#else
/* 4 bytes */
struct ThreadData { unsigned long sum; };
#endif

using namespace std::chrono;

#define NUM_ITER 100000000

// Each thread repeatedly updates its own ThreadData element.
void fn(ThreadData* p, size_t seed) {
    srand(static_cast<unsigned int>(seed));
    p->sum = 0;
    for (int i = 0; i < NUM_ITER; i++)
        p->sum += rand() % 2;
}

int main(int argc, char* argv[]) {
    size_t num_threads = GetMaximumProcessorCount(ALL_PROCESSOR_GROUPS);
    wprintf(L"num_threads: %llu\n", num_threads);
    ThreadData* a = static_cast<ThreadData*>(_aligned_malloc(
        num_threads * sizeof(ThreadData), 64));
    if (nullptr == a)
        return EXIT_FAILURE;
    std::vector<std::thread> threads = {};
    auto t0 = high_resolution_clock::now();
    for (size_t i = 0; i < num_threads; ++i)
        threads.push_back(std::thread(fn, &a[i], i));
    for (size_t i = 0; i < num_threads; ++i)
        threads[i].join();
    auto t1 = high_resolution_clock::now();
    wprintf(L"time (ms): %lli\n",
        duration_cast<milliseconds>(t1 - t0).count());
    for (size_t i = 0; i < num_threads; ++i)
        wprintf(L"sum[%llu] = %lu\n", i, (*(a + i)).sum);
    _aligned_free(a);
    return EXIT_SUCCESS;
}

USE SOFTWARE PREFETCH INSTRUCTIONS FOR LINKED DATA

[Chart: Nvidia PhysX 4.1 KaplaDemo, AMD Ryzen 7 4700G, NVidia GeForce RTX 2080, average FPS at start of demo (higher is better). Before optimization: 125. After optimization: 210.]

· Over 60% faster after optimization!
· Performance of binaries compiled with Microsoft® Visual Studio 2019 v16.8.3.
· Testing done by AMD technology labs, January 4, 2021 on the following system. Test configuration: AMD Ryzen 7 4700G, AMD Wraith Spire Cooler, 16GB (2 x 8GB DDR4-3200 at 22-22-22-52) memory, NVidia GeForce RTX 2080 GPU with driver 460.89 (December 15, 2020), 512GB M.2 NVME SSD, AMD Ryzen Reference Motherboard, Windows® 10 x64 build 20H2, 1920x1080 resolution. Actual results may vary.

USE SOFTWARE PREFETCH INSTRUCTIONS FOR LINKED DATA

// Copyright (c) 2021 NVIDIA Corporation.
// All rights reserved
// ConvexRenderer.cpp from https://github.com/NVIDIAGameWorks/PhysX/tree/4.1/physx
void ConvexRenderer::updateTransformations()
{
    for (int i = 0; i < (int)mGroups.size(); i++) {
        ConvexGroup *g = mGroups[i];
        if (g->texCoords.empty())
            continue;
        float* tt = &g->texCoords[0];
        for (int j = 0; j < (int)g->convexes.size(); j++) {
            const Convex* c = g->convexes[j];
#if defined(APPLY_OPTIMIZATION)
            int distance = 4; // TODO find ideal number
            size_t future = (j + distance) % g->convexes.size();
            _mm_prefetch(0x0F8 + (char*)(g->convexes[future]), _MM_HINT_NTA); // mPxActor
            _mm_prefetch(0x100 + (char*)(g->convexes[future]), _MM_HINT_NTA); // mLocalPose
            _mm_prefetch(0x148 + (char*)(g->convexes[future]), _MM_HINT_NTA); // mMaterialOffset.x
            _mm_prefetch(0x14C + (char*)(g->convexes[future]), _MM_HINT_NTA); // mMaterialOffset.y
            _mm_prefetch(0x150 + (char*)(g->convexes[future]), _MM_HINT_NTA); // mMaterialOffset.z
            _mm_prefetch(0x164 + (char*)(g->convexes[future]), _MM_HINT_NTA); // mSurfaceMaterialId
            _mm_prefetch(0x160 + (char*)(g->convexes[future]), _MM_HINT_NTA); // mMaterialId
#endif
            PxMat44 pose(c->getGlobalPose());
            float* mp = (float*)pose.front();
            float* ta = tt;
            for (int k = 0; k < 16; k++) {
                *(tt++) = *(mp++);
            }
            PxVec3 matOff = c->getMaterialOffset();
            ta[3] = matOff.x;
            ta[7] = matOff.y;
            ta[11] = matOff.z;
            int idFor2DTex = c->getSurfaceMaterialId();
            int idFor3DTex = c->getMaterialId();
            const int MAX_3D_TEX = 8;
            ta[15] = (float)(idFor2DTex*MAX_3D_TEX + idFor3DTex);
        }
        glBindTexture(GL_TEXTURE_2D, g->matTex);
        glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, g->texSize, g->texSize,
            GL_RGBA, GL_FLOAT, &g->texCoords[0]);
        glBindTexture(GL_TEXTURE_2D, 0);
    }
}

AVOID PENALTY FOR MIXING SSE AND AVX INSTRUCTIONS

[Chart: mesh_to_sdf.exe --maxload AVX2(8-wide), execution time in milliseconds (less is better). Before optimization: 39,623. After optimization: 12,589.]

· For AMD "Zen 2" and "Zen 3" CPUs, there is a significant penalty for mixing SSE and AVX instructions when the upper 128 bits of the YMM registers contain non-zero data.
· Benchmark execution time was reduced by over 60% after a VZeroUpper optimization.
· Performance of binaries compiled with Microsoft® Visual Studio 2022 v17.8.6.
· Testing done by AMD technology labs, February 8, 2024 on the following system. Test configuration: AMD Ryzen Threadripper PRO 5995WX, Enermax LIQTECH TR4 II series 360mm liquid cooler, 256GB (8 x 32GB 2R RDDR4-3200 at 24-22-22-52) memory, AMD Radeon RX 6700 XT GPU with driver 24.1.1 (January 11, 2024), 1TB M.2 NVME SSD, AMD Reference Motherboard, Windows® 11 x64 version 23H2, 1920x1080 resolution. Actual results may vary.

AVOID PENALTY FOR MIXING SSE AND AVX INSTRUCTIONS

· Use "SSE_AVX_STALLS" PMCx00E Floating Point Dispatch Faults > 0 to find code which may be missing VZeroUpper or VZeroAll instructions during AVX to SSE and SSE to AVX transitions.
· Optimization 1:
  · Use the /arch:AVX compiler flag.
  · AVX is supported by 97% of users according to the January 2024 Steam Hardware & Software Survey.
· Optimization 2:
  · Return a __m256 value using pass-by-reference in the function parameter list rather than the function return type.
· Optimization 3:
  · Use __forceinline on the function definition.

AVOID PENALTY FOR MIXING SSE AND AVX INSTRUCTIONS

// Before Optimization
__m256 udTriangle_sq_precalc_SIMD_8grid(
    const __m256 p_x, const __m256 p_y, const __m256 p_z,
    const tri_precalc_t &pc )
{
    // ...
    __m256 res = _mm256_blendv_ps( res1, res0, cmp );
    return res;
}

// After Optimization
void udTriangle_sq_precalc_SIMD_8grid(
    const __m256 p_x, const __m256 p_y, const __m256 p_z,
    const tri_precalc_t& pc, __m256 &ret )
{
    // ...
    ret = _mm256_blendv_ps( res1, res0, cmp );
}

Before the optimization, SSE_AVX_STALLS may occur because there is no VZeroUpper or VZeroAll instruction during the AVX to SSE transition.

After the optimization, SSE_AVX_STALLS have been reduced because there is a VZeroUpper instruction during the AVX to SSE transition.

DO YOU WANT TO KNOW MORE?

SOFTWARE OPTIMIZATION GUIDES AT AMD.COM

CPU Architecture    Publication No.    Name
AMD "Zen 4"         57647              Software Optimization Guide for the AMD "Zen4" Microarchitecture
AMD "Zen 3"         56665              Software Optimization Guide for AMD EPYC 7003 Processors (formerly Software Optimization Guide for AMD Family 19h Processors)
AMD "Zen 2"         56305              Software Optimization Guide for AMD EPYC 7002 Processors (formerly Software Optimization Guide for AMD Family 17h Models 30h and Greater Processors)
AMD "Zen 1"         55723              Software Optimization Guide for AMD Family 17h Processors

John.Hartwig@amd.com
Kenneth.Mitchell@amd.com

Design faster. Render faster. Iterate faster.

DISCLAIMER AND NOTICES

Disclaimer

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. Any computer system has risks of security vulnerabilities that cannot be completely prevented or mitigated. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

THIS INFORMATION IS PROVIDED "AS IS." AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY RELIANCE, DIRECT, INDIRECT, SPECIAL, OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

AMD is not responsible for any electronic virus or damage or losses therefrom that may be caused by changes or modifications that you make to your system, including but not limited to antivirus software. Changes to your system configurations and settings, including but not limited to antivirus software, is done at your sole discretion and under no circumstances will AMD be liable to you for any such changes. You assume all risk and are solely responsible for any damages that may arise from or are related to changes that you make to your system, including but not limited to antivirus software.
AMD, the AMD Arrow logo, Ryzen, Threadripper, Radeon, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies. Microsoft, Windows, and Visual Studio are registered trademarks of Microsoft Corporation in the US and/or other countries. Unreal® is a trademark or registered trademark of Epic Games, Inc. in the United States of America and elsewhere. NVIDIA is a trademark and/or registered trademark of NVIDIA Corporation in the U.S. and/or other countries. Steam is a trademark and/or registered trademark of Valve Corporation. PCIe is a registered trademark of PCI-SIG.

AMD products or technologies may include hardware to accelerate encoding or decoding of certain video standards but require the use of additional programs/applications.

©2024 Advanced Micro Devices, Inc. All rights reserved.

DISCLAIMER AND NOTICES

· Code sample on slide 70 (ConvexRenderer.cpp) is modified.
· Copyright (c) 2024 NVIDIA Corporation. All rights reserved. Code Sample is licensed subject to the following: "Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE."
· MeshToSDF, Copyright 2024 Mikkel Gjoel under MIT License. https://github.com/pixelmager/MeshToSDF
· Infiltrator Demo and City Sample use the Unreal® Engine. Unreal® is a trademark or registered trademark of Epic Games, Inc. in the United States of America and elsewhere.
· Unreal® Engine, Copyright 1998-2024, Epic Games, Inc. All rights reserved.
· Intel® Embree is released as Open Source under the Apache 2.0 license.

DISCLAIMER AND NOTICES

· Claim "Zen 4" average 13% IPC uplift compared to "Zen 3" desktop processors
· RPL-005: Testing as of 15 August, 2022, by AMD Performance Labs using the following hardware: AMD AM5 Reference Motherboard with AMD Ryzen 7 7700X with G.Skill DDR5-6000C30 (F5-6000J3038F16GX2-TZ5N) with AMD EXPO loaded, AMD AM4 Reference Motherboard with AMD Ryzen 7 5800X and DDR4-3600C16. Processors fixed to 4GHz frequency with 8C16 enabled and evaluated with 22 different workloads. ALL SYSTEMS configured with NZXT Kraken X63, open air test bench, Radeon RX 6950XT (driver 22.7.1 Optional), Windows® 11 22000.856, AMD Smart Access Memory/PCIe® Resizable Base Address Register ("ReBAR") ON, Virtualization-Based Security (VBS) OFF. Results may vary.
· Design faster. Render faster. Iterate faster. Create more, faster with AMD Ryzen processors
· Testing by AMD Performance Labs as of September 23, 2020 using a Ryzen 9 5950X and Intel Core i9-10900K configured with DDR4-3600C16 and NVIDIA GeForce RTX 2080 Ti. Results may vary. R5K-039
· The information contained herein is for informational purposes only, and is subject to change without notice. Timelines, roadmaps, and/or product release dates shown in these slides are plans only and subject to change. "Navi", "Vega", "Polaris", "Zen", "Zen+", "Zen 2", "Zen 3", and "Zen 4" are codenames for AMD architectures, and are not product names. GD-122