Weird znver4 performance hit compared to x86-64-v4 #359

Open
danog opened this issue Sep 28, 2024 · 10 comments

Comments

@danog
Contributor

danog commented Sep 28, 2024

From https://gitlab.archlinux.org/archlinux/packaging/packages/php/-/merge_requests/3: as can be seen by the benchmarks, the new znver4 repos actually have worse performance than the x86-64-v4 repos (both OOTB with packages from the repo, and when self-building php with or without LTO).

This seems quite strange to me, as I've looked through GCC's source code, specifically the flag selection logic for the various arches, and I've verified znver4 is a strict superset of x86-64-v4:

x86-64-v4:

PTA_64BIT | PTA_MMX | PTA_SSE
  | PTA_SSE2 | PTA_FXSR
  | PTA_CX16 | PTA_POPCNT | PTA_SSE3 | PTA_SSE4_1 | PTA_SSE4_2 | PTA_SSSE3
  | PTA_AVX | PTA_AVX2 | PTA_BMI | PTA_BMI2 | PTA_F16C | PTA_FMA | PTA_LZCNT
  | PTA_MOVBE | PTA_XSAVE
  | PTA_AVX512F | PTA_AVX512BW | PTA_AVX512CD | PTA_AVX512DQ | PTA_AVX512VL

znver4:

PTA_64BIT | PTA_MMX | PTA_SSE | PTA_SSE2
  | PTA_SSE3 | PTA_SSE4A | PTA_CX16 | PTA_ABM | PTA_SSSE3 | PTA_SSE4_1
  | PTA_SSE4_2 | PTA_AES | PTA_PCLMUL | PTA_AVX | PTA_AVX2 | PTA_BMI | PTA_BMI2
  | PTA_F16C | PTA_FMA | PTA_PRFCHW | PTA_FXSR | PTA_XSAVE | PTA_XSAVEOPT
  | PTA_FSGSBASE | PTA_RDRND | PTA_MOVBE | PTA_MWAITX | PTA_ADX | PTA_RDSEED
  | PTA_CLZERO | PTA_CLFLUSHOPT | PTA_XSAVEC | PTA_XSAVES | PTA_SHA | PTA_LZCNT
  | PTA_POPCNT| PTA_CLWB | PTA_RDPID
  | PTA_WBNOINVD | PTA_VAES | PTA_VPCLMULQDQ
  | PTA_PKU | PTA_ZNVER3 | PTA_AVX512F | PTA_AVX512DQ
  | PTA_AVX512IFMA | PTA_AVX512CD | PTA_AVX512BW | PTA_AVX512VL
  | PTA_AVX512BF16 | PTA_AVX512VBMI | PTA_AVX512VBMI2 | PTA_GFNI
  | PTA_AVX512VNNI | PTA_AVX512BITALG | PTA_AVX512VPOPCNTDQ | PTA_EVEX512
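The superset claim can be checked mechanically. The following is a minimal sketch that treats the two flag expressions quoted above as plain text, splits them on `|`, and compares the resulting sets. The names `parse_flags`, `X86_64_V4` and `ZNVER4` are illustrative and not part of GCC, and `PTA_ZNVER3` is treated as an opaque token rather than expanded.

```python
# Flag expressions copied from the lists above; they are parsed as text only.
X86_64_V4 = """
PTA_64BIT | PTA_MMX | PTA_SSE
  | PTA_SSE2 | PTA_FXSR
  | PTA_CX16 | PTA_POPCNT | PTA_SSE3 | PTA_SSE4_1 | PTA_SSE4_2 | PTA_SSSE3
  | PTA_AVX | PTA_AVX2 | PTA_BMI | PTA_BMI2 | PTA_F16C | PTA_FMA | PTA_LZCNT
  | PTA_MOVBE | PTA_XSAVE
  | PTA_AVX512F | PTA_AVX512BW | PTA_AVX512CD | PTA_AVX512DQ | PTA_AVX512VL
"""

ZNVER4 = """
PTA_64BIT | PTA_MMX | PTA_SSE | PTA_SSE2
  | PTA_SSE3 | PTA_SSE4A | PTA_CX16 | PTA_ABM | PTA_SSSE3 | PTA_SSE4_1
  | PTA_SSE4_2 | PTA_AES | PTA_PCLMUL | PTA_AVX | PTA_AVX2 | PTA_BMI | PTA_BMI2
  | PTA_F16C | PTA_FMA | PTA_PRFCHW | PTA_FXSR | PTA_XSAVE | PTA_XSAVEOPT
  | PTA_FSGSBASE | PTA_RDRND | PTA_MOVBE | PTA_MWAITX | PTA_ADX | PTA_RDSEED
  | PTA_CLZERO | PTA_CLFLUSHOPT | PTA_XSAVEC | PTA_XSAVES | PTA_SHA | PTA_LZCNT
  | PTA_POPCNT| PTA_CLWB | PTA_RDPID
  | PTA_WBNOINVD | PTA_VAES | PTA_VPCLMULQDQ
  | PTA_PKU | PTA_ZNVER3 | PTA_AVX512F | PTA_AVX512DQ
  | PTA_AVX512IFMA | PTA_AVX512CD | PTA_AVX512BW | PTA_AVX512VL
  | PTA_AVX512BF16 | PTA_AVX512VBMI | PTA_AVX512VBMI2 | PTA_GFNI
  | PTA_AVX512VNNI | PTA_AVX512BITALG | PTA_AVX512VPOPCNTDQ | PTA_EVEX512
"""


def parse_flags(expr):
    """Split a 'A | B | C' expression into a set of flag names."""
    return {tok.strip() for tok in expr.split("|") if tok.strip()}


v4 = parse_flags(X86_64_V4)
znver4 = parse_flags(ZNVER4)

missing = v4 - znver4   # flags in x86-64-v4 that znver4 lacks
extra = znver4 - v4     # flags only znver4 enables

print("missing from znver4:", sorted(missing))
print("only in znver4:", sorted(extra))
```

An empty `missing` set together with a non-empty `extra` set is what "strict superset" means here. This only compares ISA flag bits; it says nothing about tuning, which is selected separately by the processor entry.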

And same goes for the processor info flags:

{"x86-64-v4", PROCESSOR_K8, CPU_GENERIC, PTA_X86_64_V4 | PTA_NO_TUNE, 0, P_NONE}

{"znver4", PROCESSOR_ZNVER4, CPU_ZNVER4, PTA_ZNVER4, M_CPU_SUBTYPE (AMDFAM19H_ZNVER4), P_PROC_AVX512F}

So I can't explain the weird performance hit of znver4...

Note that all tests were fully automated using Docker; the exact same Dockerfile was used each time, switching out only the architecture in makepkg.conf and in the repos (and re-installing all packages after doing so).
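As an illustration of "switching out just the architecture", here is a minimal sketch that rewrites the `-march=` token in a CFLAGS string. The helper name `set_march` and the sample CFLAGS value are hypothetical; they are not taken from the benchmark scripts or from any real makepkg.conf.

```python
import re


def set_march(cflags, march):
    """Return cflags with every -march=<value> token replaced by -march=<march>."""
    return re.sub(r"-march=\S+", f"-march={march}", cflags)


# Hypothetical CFLAGS value, used only to demonstrate the substitution.
cflags = "-march=x86-64-v4 -O3 -pipe"
print(set_march(cflags, "znver4"))
```

Keeping every other flag identical between runs is what makes the two builds comparable.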

@ptr1337
Member

ptr1337 commented Sep 28, 2024

Hi,

Thanks for benchmarking this. I would also check this locally. Do you use the default provided config from Cachy?
Also, which CPU do you have?

I can only retest on a 9950X currently.

I also checked the compiled binary with bin-cpuflags-x86:

znver4:

bin-cpuflags-x86 /usr/bin/php
Format: Elf
Architecture: X86_64
Features: INTEL8086 INTEL186 INTEL286 INTEL386 INTEL486 X64 AVX AVX2 AVX512_VBMI AVX512BW AVX512DQ AVX512F AVX512VL BMI1 BMI2 CET_IBT CMOV CPUID FPU FPU287 LZCNT MOVBE MULTIBYTENOP PCLMULQDQ POPCNT SSE XSAVE 
Warning: CPUID usage detected. The program can switch instruction sets in runtime.

v4:

Format: Elf
Architecture: X86_64
Features: INTEL8086 INTEL186 INTEL286 INTEL386 INTEL486 X64 AVX AVX2 AVX512BW AVX512DQ AVX512F AVX512VL BMI1 BMI2 CET_IBT CMOV CPUID FPU FPU287 LZCNT MOVBE MULTIBYTENOP PCLMULQDQ POPCNT SSE XSAVE 
Warning: CPUID usage detected. The program can switch instruction sets in runtime.

According to bin-cpuflags-x86, AVX512_VBMI is additionally used in the znver4 build. I'm not sure, though, whether the tool shows all the flags that were applied.
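The difference between the two `Features:` lines can be confirmed with a set comparison. This is a small sketch over the two lines exactly as pasted above; the variable and function names are illustrative.

```python
# "Features:" lines copied from the two bin-cpuflags-x86 outputs above.
ZNVER4_FEATURES = (
    "INTEL8086 INTEL186 INTEL286 INTEL386 INTEL486 X64 AVX AVX2 AVX512_VBMI "
    "AVX512BW AVX512DQ AVX512F AVX512VL BMI1 BMI2 CET_IBT CMOV CPUID FPU "
    "FPU287 LZCNT MOVBE MULTIBYTENOP PCLMULQDQ POPCNT SSE XSAVE"
)
V4_FEATURES = (
    "INTEL8086 INTEL186 INTEL286 INTEL386 INTEL486 X64 AVX AVX2 "
    "AVX512BW AVX512DQ AVX512F AVX512VL BMI1 BMI2 CET_IBT CMOV CPUID FPU "
    "FPU287 LZCNT MOVBE MULTIBYTENOP PCLMULQDQ POPCNT SSE XSAVE"
)


def feature_set(line):
    """Turn a whitespace-separated feature line into a set."""
    return set(line.split())


only_znver4 = feature_set(ZNVER4_FEATURES) - feature_set(V4_FEATURES)
only_v4 = feature_set(V4_FEATURES) - feature_set(ZNVER4_FEATURES)

print("only in znver4 build:", sorted(only_znver4))
print("only in v4 build:", sorted(only_v4))
```

This compares only the instruction-set families the tool reported for each binary; it does not show how often each family is used or how the code was scheduled.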

@vnepogodin
Member

vnepogodin commented Sep 28, 2024

Well, what matters for us is whether LTO really introduces a regression with our php PKGBUILD.

The znver4 vs. v4 difference may be within the margin of error.

@danog
Contributor Author

danog commented Sep 28, 2024

Sure, LTO is the real regression, and the margin between znver4 and v4 is small, but it is still significant (and reproducible).
I'll publish the scripts and config used for the benchmarks in the coming days; in the meantime, note that I tested on a Ryzen 9 7950X.

@ptr1337
Member

ptr1337 commented Sep 29, 2024

206cdf0

I've also verified the LTO regression and disabled LTO for now, as Arch Linux does.

@danog
Contributor Author

danog commented Oct 2, 2024

@ptr1337 I've published the set of scripts used to make the benchmarks: https://github.com/nicelocal/microarch-benchmarks

@ptr1337
Member

ptr1337 commented Dec 9, 2024

> @ptr1337 I've published the set of scripts used to make the benchmarks: nicelocal/microarch-benchmarks

Thanks! I'll try to reproduce them during my vacation :)

@danog
Contributor Author

danog commented Dec 9, 2024

Awesome! I've been thinking about it for a bit, and while I didn't look too much into it, the fact that -march=native behaves better than both x86-64-v4 and znver4 (even though -march=native resolves to znver4 on this CPU) makes me think that the cause of the regression is some other flag (or lack thereof), maybe the ones tuning the CPU cache size (which are used with -march=native)...

@danog
Contributor Author

danog commented Dec 9, 2024

Currently I pass the output of https://github.com/hartwork/resolve-march-native to makepkg as compiler flags, which gives performance higher than both znver4 and x86-64-v4 (also self-built, not the repo versions).

@ptr1337
Member

ptr1337 commented Dec 9, 2024

> Awesome! I've been thinking about it for a bit, and while I didn't look too much into it, the fact that -march=native behaves better than both x86-64-v4 and znver4 (even though -march=native resolves to znver4 on this CPU) makes me think that the cause of the regression is some other flag (or lack thereof), maybe the ones tuning the CPU cache size (which are used with -march=native)...

Actually, we are passing -march=native to the znver4 repository. To the v4 repository we pass x86-64-v4.
The CPU used to build the znver4 repository is a 7700.

The main reason behind this is that -march=native on a Zen 4 CPU enables shstk (shadow stack), while -march=znver4 does not.
Outside of that, the flags didn't look different.
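One way to compare what the two settings actually enable is to diff the output of `gcc -Q --help=target` for each `-march` value. The sketch below assumes `gcc` is on the PATH and that each option line in that output consists of an option followed by a single value (such as `[enabled]` or an architecture name). The helper names `parse_target_help`, `diff_options` and `query_gcc` are illustrative, and this only covers target options, not `--param` values such as cache sizes.

```python
import shutil
import subprocess


def parse_target_help(text):
    """Parse `gcc -Q --help=target` output into {option: value}."""
    options = {}
    for line in text.splitlines():
        parts = line.split()
        # Keep only "<option> <value>" lines such as "-mshstk [enabled]".
        if len(parts) == 2 and parts[0].startswith("-m"):
            options[parts[0]] = parts[1]
    return options


def diff_options(a, b):
    """Return {option: (value_in_a, value_in_b)} for options that differ."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


def query_gcc(march):
    """Ask gcc which target options -march=<march> enables; None on failure."""
    gcc = shutil.which("gcc")
    if gcc is None:
        return None
    result = subprocess.run(
        [gcc, f"-march={march}", "-Q", "--help=target"],
        capture_output=True,
        text=True,
    )
    return parse_target_help(result.stdout) if result.returncode == 0 else None


if __name__ == "__main__":
    native, znver4 = query_gcc("native"), query_gcc("znver4")
    if native is None or znver4 is None:
        print("gcc not available or -march value unsupported; nothing to compare")
    else:
        for opt, (n, z) in diff_options(native, znver4).items():
            print(f"{opt:32} native={n} znver4={z}")
```

On a Zen 4 machine, the shstk difference described above would be expected to show up as a differing `-mshstk` line, though the exact output depends on the GCC version.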

@danog
Contributor Author

danog commented Dec 9, 2024

> we are passing -march=native to the znver4 repository

Aha! That might be the issue then...
