Weird znver4 performance hit compared to x86-64-v4 #359

danog · 2024-09-28T19:02:44Z

From https://gitlab.archlinux.org/archlinux/packaging/packages/php/-/merge_requests/3: as can be seen by the benchmarks, the new znver4 repos actually have worse performance than the x86-64-v4 repos (both OOTB with packages from the repo, and when self-building php with or without LTO).

This seems quite strange to me, as I've looked through GCC's source code, specifically the flag selection logic for the various arches, and I've verified znver4 is a strict superset of x86-64-v4:

x86-64-v4:

PTA_64BIT | PTA_MMX | PTA_SSE
  | PTA_SSE2 | PTA_FXSR
  | PTA_CX16 | PTA_POPCNT | PTA_SSE3 | PTA_SSE4_1 | PTA_SSE4_2 | PTA_SSSE3
  | PTA_AVX | PTA_AVX2 | PTA_BMI | PTA_BMI2 | PTA_F16C | PTA_FMA | PTA_LZCNT
  | PTA_MOVBE | PTA_XSAVE
  | PTA_AVX512F | PTA_AVX512BW | PTA_AVX512CD | PTA_AVX512DQ | PTA_AVX512VL

znver4:

PTA_64BIT | PTA_MMX | PTA_SSE | PTA_SSE2
  | PTA_SSE3 | PTA_SSE4A | PTA_CX16 | PTA_ABM | PTA_SSSE3 | PTA_SSE4_1
  | PTA_SSE4_2 | PTA_AES | PTA_PCLMUL | PTA_AVX | PTA_AVX2 | PTA_BMI | PTA_BMI2
  | PTA_F16C | PTA_FMA | PTA_PRFCHW | PTA_FXSR | PTA_XSAVE | PTA_XSAVEOPT
  | PTA_FSGSBASE | PTA_RDRND | PTA_MOVBE | PTA_MWAITX | PTA_ADX | PTA_RDSEED
  | PTA_CLZERO | PTA_CLFLUSHOPT | PTA_XSAVEC | PTA_XSAVES | PTA_SHA | PTA_LZCNT
  | PTA_POPCNT| PTA_CLWB | PTA_RDPID
  | PTA_WBNOINVD | PTA_VAES | PTA_VPCLMULQDQ
  | PTA_PKU | PTA_ZNVER3 | PTA_AVX512F | PTA_AVX512DQ
  | PTA_AVX512IFMA | PTA_AVX512CD | PTA_AVX512BW | PTA_AVX512VL
  | PTA_AVX512BF16 | PTA_AVX512VBMI | PTA_AVX512VBMI2 | PTA_GFNI
  | PTA_AVX512VNNI | PTA_AVX512BITALG | PTA_AVX512VPOPCNTDQ | PTA_EVEX512

And same goes for the processor info flags:

{"x86-64-v4", PROCESSOR_K8, CPU_GENERIC, PTA_X86_64_V4 | PTA_NO_TUNE, 0, P_NONE}

{"znver4", PROCESSOR_ZNVER4, CPU_ZNVER4, PTA_ZNVER4, M_CPU_SUBTYPE (AMDFAM19H_ZNVER4), P_PROC_AVX512F}

So I can't explain the weird performance hit of znver4...
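The superset claim above can be checked mechanically. The following is a minimal sketch, assuming the two flag lists are copied verbatim from this comment; the helper name `parse_pta` is hypothetical, and `PTA_ZNVER3` is treated as an opaque token rather than expanded.

```python
# Flag lists copied from the comment above (not re-read from the GCC sources).
X86_64_V4 = """
PTA_64BIT | PTA_MMX | PTA_SSE
  | PTA_SSE2 | PTA_FXSR
  | PTA_CX16 | PTA_POPCNT | PTA_SSE3 | PTA_SSE4_1 | PTA_SSE4_2 | PTA_SSSE3
  | PTA_AVX | PTA_AVX2 | PTA_BMI | PTA_BMI2 | PTA_F16C | PTA_FMA | PTA_LZCNT
  | PTA_MOVBE | PTA_XSAVE
  | PTA_AVX512F | PTA_AVX512BW | PTA_AVX512CD | PTA_AVX512DQ | PTA_AVX512VL
"""

ZNVER4 = """
PTA_64BIT | PTA_MMX | PTA_SSE | PTA_SSE2
  | PTA_SSE3 | PTA_SSE4A | PTA_CX16 | PTA_ABM | PTA_SSSE3 | PTA_SSE4_1
  | PTA_SSE4_2 | PTA_AES | PTA_PCLMUL | PTA_AVX | PTA_AVX2 | PTA_BMI | PTA_BMI2
  | PTA_F16C | PTA_FMA | PTA_PRFCHW | PTA_FXSR | PTA_XSAVE | PTA_XSAVEOPT
  | PTA_FSGSBASE | PTA_RDRND | PTA_MOVBE | PTA_MWAITX | PTA_ADX | PTA_RDSEED
  | PTA_CLZERO | PTA_CLFLUSHOPT | PTA_XSAVEC | PTA_XSAVES | PTA_SHA | PTA_LZCNT
  | PTA_POPCNT| PTA_CLWB | PTA_RDPID
  | PTA_WBNOINVD | PTA_VAES | PTA_VPCLMULQDQ
  | PTA_PKU | PTA_ZNVER3 | PTA_AVX512F | PTA_AVX512DQ
  | PTA_AVX512IFMA | PTA_AVX512CD | PTA_AVX512BW | PTA_AVX512VL
  | PTA_AVX512BF16 | PTA_AVX512VBMI | PTA_AVX512VBMI2 | PTA_GFNI
  | PTA_AVX512VNNI | PTA_AVX512BITALG | PTA_AVX512VPOPCNTDQ | PTA_EVEX512
"""


def parse_pta(expr: str) -> set[str]:
    """Split a '|'-separated PTA expression into a set of flag names."""
    return {tok.strip() for tok in expr.split("|") if tok.strip()}


v4 = parse_pta(X86_64_V4)
zn4 = parse_pta(ZNVER4)

missing_in_znver4 = v4 - zn4   # flags v4 has that znver4 lacks
extra_in_znver4 = zn4 - v4     # flags only znver4 has

print("missing in znver4:", sorted(missing_in_znver4))
print("extra in znver4:", sorted(extra_in_znver4))
print("strict superset:", v4 < zn4)
```

With the lists as quoted, no x86-64-v4 flag is missing from znver4, so the ISA flags alone do not account for the difference; tuning (`PTA_NO_TUNE` and `PROCESSOR_K8` vs. `PROCESSOR_ZNVER4`) is not covered by this check.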

Note that all tests were fully automated using Docker. The exact same Dockerfile was used for every run, switching out only the architecture in makepkg.conf and in the repos (and re-installing all packages after doing that).

ptr1337 · 2024-09-28T19:39:11Z

Hi,

Thanks for benchmarking this. I will also check this locally. Do you use the default config provided by Cachy?
Also, which CPU do you have?

I can only retest on a 9950X currently.

I also checked the compiled binary with bin-cpuflags-x86:

znver4:

bin-cpuflags-x86 /usr/bin/php
Format: Elf
Architecture: X86_64
Features: INTEL8086 INTEL186 INTEL286 INTEL386 INTEL486 X64 AVX AVX2 AVX512_VBMI AVX512BW AVX512DQ AVX512F AVX512VL BMI1 BMI2 CET_IBT CMOV CPUID FPU FPU287 LZCNT MOVBE MULTIBYTENOP PCLMULQDQ POPCNT SSE XSAVE 
Warning: CPUID usage detected. The program can switch instruction sets in runtime.

v4:

Format: Elf
Architecture: X86_64
Features: INTEL8086 INTEL186 INTEL286 INTEL386 INTEL486 X64 AVX AVX2 AVX512BW AVX512DQ AVX512F AVX512VL BMI1 BMI2 CET_IBT CMOV CPUID FPU FPU287 LZCNT MOVBE MULTIBYTENOP PCLMULQDQ POPCNT SSE XSAVE 
Warning: CPUID usage detected. The program can switch instruction sets in runtime.

According to bin-cpuflags-x86, AVX512_VBMI is additionally used in the znver4 binary. I'm not sure, though, whether it shows all applied flags.
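The comparison of the two `Features:` lines can be reproduced with a set difference. This is a small sketch, assuming the feature strings are pasted from the output above; the variable and function names are illustrative only.

```python
# "Features:" lines copied from the bin-cpuflags-x86 output above.
ZNVER4_FEATURES = (
    "INTEL8086 INTEL186 INTEL286 INTEL386 INTEL486 X64 AVX AVX2 AVX512_VBMI "
    "AVX512BW AVX512DQ AVX512F AVX512VL BMI1 BMI2 CET_IBT CMOV CPUID FPU "
    "FPU287 LZCNT MOVBE MULTIBYTENOP PCLMULQDQ POPCNT SSE XSAVE"
)
V4_FEATURES = (
    "INTEL8086 INTEL186 INTEL286 INTEL386 INTEL486 X64 AVX AVX2 "
    "AVX512BW AVX512DQ AVX512F AVX512VL BMI1 BMI2 CET_IBT CMOV CPUID FPU "
    "FPU287 LZCNT MOVBE MULTIBYTENOP PCLMULQDQ POPCNT SSE XSAVE"
)


def feature_set(line: str) -> set[str]:
    """Turn a whitespace-separated feature line into a set."""
    return set(line.split())


only_znver4 = feature_set(ZNVER4_FEATURES) - feature_set(V4_FEATURES)
only_v4 = feature_set(V4_FEATURES) - feature_set(ZNVER4_FEATURES)

print("only in znver4 build:", sorted(only_znver4))
print("only in v4 build:", sorted(only_v4))
```

This only compares what the tool reports; it says nothing about instructions the tool does not detect.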

vnepogodin · 2024-09-28T19:53:26Z

Well, what matters for us is whether LTO really introduces a regression with our php PKGBUILD.

The znver4 vs. v4 difference may be within the margin of error.

danog · 2024-09-28T19:58:38Z

Sure, LTO is the real regression, and the margin between znver4 and v4 is small, but it is still significant (and reproducible).
I'll publish the scripts and config used for the benchmarks in the coming days. In the meantime: I tested on a Ryzen 9 7950X.

ptr1337 · 2024-09-29T11:28:03Z

206cdf0

I also verified the LTO regression and disabled LTO for now, as Arch Linux does.

danog · 2024-10-02T09:46:39Z

@ptr1337 I've published the set of scripts used to make the benchmarks: https://github.com/nicelocal/microarch-benchmarks

ptr1337 · 2024-12-09T15:45:36Z

> @ptr1337 I've published the set of scripts used to make the benchmarks: nicelocal/microarch-benchmarks

Thanks! I'll try to reproduce them during my vacation :)

danog · 2024-12-09T16:30:30Z

Awesome! I've been thinking about it for a bit, and while I didn't look too deeply into it, the fact that -march=native performs better than both x86-64-v4 and znver4 (even though native resolves to -march=znver4 on this CPU) makes me think that the cause of the regression is some other flag (or lack thereof), maybe the ones tuning the CPU cache size (which are set by -march=native)...

danog · 2024-12-09T16:31:44Z

Currently I pass the output of https://github.com/hartwork/resolve-march-native as the makepkg flags, which gives higher performance than both znver4 and x86-64-v4 (both also self-built, not the repo versions).

ptr1337 · 2024-12-09T16:32:11Z

> Awesome! I've been thinking about it for a bit, and while I didn't look too much into it, the fact that -march=native behaves better than both x86-64-v4 and znver4 (even if the -march of native for the CPU is actually znver4), makes me think that the cause of the regression is some other flag (or lack thereof), maybe the ones tuning the CPU cache size (which are used with -march=native)...

Actually, we are passing -march=native to the znver4 repository, and x86-64-v4 to the v4 repository.
The CPU used to build the znver4 repository is a 7700.

The main reason for this is that -march=native on a Zen 4 CPU enables shstk (shadow stack), while -march=znver4 does not.
Other than that, the flags didn't look different.
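One way to compare what `-march=native` and `-march=znver4` actually enable is to diff the output of `gcc -Q --help=target` for both. The sketch below assumes a GCC recent enough to know `znver4` is on `PATH`; the function names are hypothetical, and the parser only handles the common `-moption   value` line shape.

```python
import subprocess


def parse_target_help(text: str) -> dict[str, str]:
    """Parse `gcc -Q --help=target` output into {option: value}."""
    options = {}
    for line in text.splitlines():
        parts = line.split()
        # Option lines look like "-mshstk  [enabled]" or "-march=  znver4".
        if len(parts) >= 2 and parts[0].startswith("-m"):
            options[parts[0]] = " ".join(parts[1:])
    return options


def target_options(march: str, gcc: str = "gcc") -> dict[str, str]:
    """Ask GCC which target options a given -march value resolves to."""
    result = subprocess.run(
        [gcc, f"-march={march}", "-Q", "--help=target"],
        capture_output=True, text=True, check=True,
    )
    return parse_target_help(result.stdout)


def diff_options(a: dict[str, str], b: dict[str, str]) -> dict[str, tuple]:
    """Return options whose values differ, as {option: (value_a, value_b)}."""
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }
```

On the build host, `diff_options(target_options("native"), target_options("znver4"))` would list every target option that differs. The cache-size parameters mentioned earlier in the thread are not target options; they would need a separate look at `gcc -march=native -Q --help=params`.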

danog · 2024-12-09T16:36:04Z

> we are passing -march=native to the znver4 repository

Aha! That might be the issue then...
