Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Newsvuvnviv taint api speedup #22662

Open
wants to merge 6 commits into
base: blead
Choose a base branch
from

Conversation

bulk88
Copy link
Contributor

@bulk88 bulk88 commented Oct 13, 2024

inspire by looking at bug #22653, and remarks in the past over these 3 fns being super
important esp for enterprise serialization/deserial/wire format decoding.

They are also sort of related to a very bad failed optimization (MSVC compiler went to "-O0" and added 100s of KBs of redundant code in perl541.dll and some KBs more in XS DLLs), done 2-3 years ago in perl core. But im still working on a fix/diag/analysis/solution for that. This branch of commits covers more about serial/deserial performance, and taking unique advantage that IV NV UVs are no-malloc SV and that they are bodyless.

Plus in one spot, my additional "bodyless" optimization from some years ago disappeared through code churn at
9155444 I put it back, since bodyless SVs
are very light weight. And newSV_type() is very heavy with many many branches inside.

related to "SvUV() macro 100% of time calls Perl_sv_2uv_flags"
Perl#22653

Until Perl#22653 is solved, clean up newSVuv() and remove branch
to "newSViv()" that is unexplained by git blame. BUT, keep original
intent and behaviour of "newSViv()" branch for now.

Add asserts to guard against 0x8000,0000 == SVf_IVisUV changing.
Value of SVf_IVisUV can change in the future, and there might be
(I didn't git blame), logic that sign flag and SVf_IVisUV are equal.
But these changes depend on SVf_IVisUV being 0x8000,0000 and must
be updated if SVf_IVisUV changes.

Change SvXXXV_set() to be an explicity bodyless SV head optimization. MSVC
2022 -O1 combined SET_SVANY_FOR_BODYLESS_IV() and SvIV_set(). But instead
of hopes and prayers on "UB" or "ISB/IDB" of CCs that could change at
random in any previous or future build number of a CC, do it explictly.
Bodyless SV head API is defined by P5P, not CC vendors.

9155444 3/20/2022 3:05:10 PM
Perl_newSViv: simplify by using (inline) newSV_type

Fix deoptimized Perl_newSViv(). In that commit it forgot about
Perl_newSVuv(). Since newSV_type() is a inline fn, and "inline" is
CC domain UB optimization. And newSV_type() is far more complex than
CPP macro new_SV(), and newSV_type() depends on 100% perfection from
CC's LTO engine and ".o" disk format, and possibly depends on the CC
breaking ISO C spec with -O3 or -O4. Which turn on extreme SEGV inducing
C variable aliasing rules that few C code bases tolerate. Quick examples,
a reddit comment (not credible), claims "uint8_t *" and "char *" can not
be casted since the CC or CPU has 9 bit bytes or a 9 wire data bus, and
ECC parity wire is 9 of 9 for "uint8_t" and 8 of 9 for "signed char" and
wire 9 for "char" is the ECC parity wire. The platform's libc's fwritef(), hides the
secretly converts 9 bit bytes, to standard 8 bit bytes, making
the CC "ISO C compliant".

My more realistic scenario, inside newSV_type(). How can the CC know,
what if Perl_more_sv() or Perl_more_bodies(), calls mprotect(), modifies
"static const struct body_details bodies_by_type [];",
calls mprotect() again, and returns execution to newSV_type()?

Just switch to new_SV(). Its a CPP macro, not subject to CC UB inlining,
and new_SV() only has 1 fn call and is super light weight.

Old P5P commits/ML/CPAN dev talk about this area of code being crucial to
(CPAN XS) deserializing perf in perl, so perf considerations,
with proper asserts, has priority over readability.

Links to old core commits in Perl#22653
briefly discuss deserializing perf as rational, so this patch also follows
that design idea.

Perl_vnewSVpvf(), "malloc(1ch);" which in reality is "malloc(16ch)" makes
no sense, since almost zero chance fmtstr+args+\0 <= 16, and perl malloc()
round up, is semi-UB/a build flag default on anyways. Using guesstimate
malloc(pat_len), increases chances far higher, that a realloc() inside
sv_vcatpvfn_flags(), OS realloc(), will realloc() in place, not changing
the ptr, esp assuming OS malloc() does bucket of power of 2
allocator algo. Assume, 40ch malloc() fmt string, bucket to 64ch by
OS malloc(), throw in a %u 32b, that is max +10ch-2ch for "%u".
So output is 48ch. realloc(48ch) is inplace, therefore it is a win.
@richardleach
Copy link
Contributor

newSV_type() is very heavy

The intention was always that calls to newSV_type() with a static type argument would be inlined. (Hence, your bodiless optimisation would not have disappeared.) Are you finding that this is not the case? Because of -O0?

@richardleach
Copy link
Contributor

newSV_type() is very heavy

The intention was always that calls to newSV_type() with a static type argument would be inlined. (Hence, your bodiless optimisation would not have disappeared.) Are you finding that this is not the case? Because of -O0?

Ah, I've seen there's some discussion on #p5p. I'll try to catch up on that tonight.

@bulk88 bulk88 force-pushed the newSVuv_nv_iv_taint_api_speedup branch from c7aaa28 to 2050add Compare October 14, 2024 23:11
@bulk88
Copy link
Contributor Author

bulk88 commented Oct 14, 2024

repushed branch, fixed -DNO_TAINT_SUPPORT build failure

@jkeenan
Copy link
Contributor

jkeenan commented Oct 16, 2024

This p.r. has repeatedly failed to build on one of our CI setups. Please see:
https://github.com/Perl/perl5/actions/runs/11336336622/job/31611461445?pr=22662

-design and rational in src comments, this patch forces MSVC 2022 x64 to
 use 64b integer math/CPU ops (regs RAX/RDX/RCX), vs 2 sequences/pairs of
 EAX/EBX/ECX register ops removing a couple CPU instructions in filling
 out the SV HEAD. This optimization will translate to all OSes. It is
 broken out into a separate commit for git bisect reasons since it
 touches the alignment topic. As with part 1, some members of the
 community care about rapidly creating massive amounts of
 SVIVs/SVUVs/SVNVs in deserializing wire/protocol/disk formations, or
 big data sci num crunching.
This reverts commit aae9cea.

Author note, hand editing required to revert since commit was from 2005
and it is 2024. Part 1 of ? to optimize and reduce overhead of SvTAINT()
macro inside all SV * allocator fncs.

Using taint feature is rare, and "push(sv), push(my_perl), call()" is alot
smaller machine code at the many call sites, than
"push(0), push(0), push(116), push(0), push(sv), push(my_perl), call()"
and using taint at runtime, means the user decided perf is irrelavent vs
security.

newSViv()/newSVuv()/newSVnv() are malloc()-free but Perl_sv_magicext()
contains "sv_upgrade(SVt_PVMG); calloc(1,0x30);" and not for taint-feat,
but also a 2nd "malloc(0x****)". Factor out all those sv_magic() calls
into a wrapper for the unlikely branch. SvTAINT() has many call sites in
hottest parts of perl.
-make Perl_sv_taint() return the SV *, useful for a future optimization
 previous it was void

This part 2, along with part 1. Shows improvement. Delta, after 1 & 2.

previous miniperl.exe Win64 .text section, VC 2022 -O1
0x12440C bytes long

after
0x1240AC bytes long

864 bytes of machine code were removed. A bin analysis tool shows
has Perl_sv_taint() 62 callers in miniperl.exe
"SvTAINT();" contains "if(PL_tainting && PL_tainted) sv_taint(sv);"
that is 2 One Byte reads and 2 branches. Collapse the 2 bool chars, to a
U16, so it is exactly 1 read, and 1 branch. Strips complexity from the
very bottom of the very hot newSVuv/newSViv/newSVuv, and other callers.
sv_taint(sv) has 62 callers, not sure how many do the 2 reads, 2 branches
SvTAINT(sv);, but the change decreased the size of miniperl.exe and
therefore perl541.dll, and branches were removed from the
newSVuv/newSViv/newSVuv trio.

Delta machine code bytes, between part 2 & 3 (this commit).

previous miniperl.exe Win64 .text section, VC 2022 -O1
0x1240AC bytes long

after
0x12408C bytes long
Perl_newSVnv/Perl_newSViv/Perl_newSVuv, currently have to save the fresh
SV *, either on C stack, or in non volatile registers, around the
possible Perl_sv_taint() fn call inside SvTAINT(). If Perl_sv_taint()
returns its SV * argument, and assigns it back to the same C var, now
these 3 performance critical SV allocator functions, after plucking
the SV head from the arena, these 3 function never ever have to
store the fresh SV * back to C stack for any reason during their execution.

This optimization removes pop/push pairs of the C compiler saving
non-volatile registers and restoring them at function entry and exit since
after SvTAINTTC() change, NO variables AT ALL, have to be saved around
any function calls in Perl_newSVnv/Perl_newSViv/Perl_newSVuv. Also the
SV head *, after being delinked/removed from an areana, can now be
stored through the whole function, in the x86 EAX/x64 RAX register, and
pass through to the caller, without a final (non vol) reg to
(vol retval reg) mov/copy cpu op.

Remember eax/rax/retval registers, are always wiped after each fn call,
but the refactoring of SvTAINTTC() conviently returns the
SV * back to us, in the ABI return register, and we let the fresh
SV * glide through on the "heavy" Perl_sv_taint() branch, from
Perl_sv_taint() to Perl_newSViv()'s caller, without touching it, 0
machine code ops.

Few code sites were changed from SvTAINT() to SvTAINTTC(), to keep this
patch smaller, and the Perl_sv_set*vXXX() category of functions, all have
void return types and can't be chained. Also the Perl_sv_taint() branch
can be tail called or converted to a JMP insted of CALL, if the CC/OS/ABI
wants to now.

This is the final part of speeding up
Perl_newSVnv/Perl_newSViv/Perl_newSVuv there is nothing else to remove or
optimze.
@bulk88 bulk88 force-pushed the newSVuv_nv_iv_taint_api_speedup branch from 2050add to c77e852 Compare October 17, 2024 08:32
@bulk88
Copy link
Contributor Author

bulk88 commented Oct 17, 2024

fixed asserts for 32b ptr builds

@bulk88
Copy link
Contributor Author

bulk88 commented Oct 17, 2024

macOS (Monterey) 12 (-Uusethreads) passed.

I did no changes except for moving a static assert that failed i386.

macOS (Monterey) 12 (-Uusethreads)

now

#   Failed test 'write: stat and lstat returned same values'
#   at t/stat.t line 44.
#     Structures begin differing at:
#          $got->[8] = '1729155292.99179'
#     $expected->[8] = '1729155292.76969'
# Looks like you failed 1 test of 43.
../dist/Time-HiRes/t/stat.t .......................................... 
Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/43 subtests 

is this a flip flop timing test that fails regularly?

 8 atime    last access time in seconds since the epoch

IIRC Win NT kernel API refuses to update access time any faster than 1 full second.

@bulk88
Copy link
Contributor Author

bulk88 commented Oct 17, 2024

khwilliamson@68c1b38

#19321

related bug tickets I found

for

#     Structures begin differing at:
#          $got->[8] = '1729155292.99179'
#     $expected->[8] = '1729155292.76969'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants