Faster swizzle code #1608
run time in seconds for each implementation (using MSVC):
int main() {
    int width = 2048;
    int height = 2048;
    int bpp = 1;
    int pitch = bpp * width;
    uint8_t* garbage_in = (uint8_t*)malloc(width * height * bpp);
    uint8_t* garbage_out = (uint8_t*)malloc(width * height * bpp);
    for (int i = 0; i < 1000; i++) {
        swizzle_rect(garbage_in, width, height, garbage_out, pitch, bpp);
        unswizzle_rect(garbage_out, width, height, garbage_in, pitch, bpp);
    }
}
Looks like a big improvement in this swizzle logic performance, nice! In addition to the minor feedback on code, a couple things:
- Similar to the benchmark, please also test to confirm that the swizzle/unswizzle results are the same in all cases: current impl, new faster sw impl, and new instruction impl
- Please determine how well this instruction will be supported on user systems. Find out which generation of x86 processors includes this instruction, and if it is relatively commonplace now, confirm that our release builds are actually building for it (related: Enable additional architectural optimizations in release #1492)
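One way to approach the equivalence check requested above is a small harness that feeds the same pseudo-random input to two implementations and compares the outputs byte for byte. This is only a sketch: the `rect_fn` signature is an assumption inferred from the benchmark call in this thread, and `copy_plain` / `copy_inverted` are hypothetical stand-ins so the example runs on its own. In practice the old and new `swizzle_rect` variants would be passed in instead.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Assumed signature, inferred from the benchmark call in this thread. */
typedef void (*rect_fn)(const uint8_t *src, unsigned int width,
                        unsigned int height, uint8_t *dst,
                        unsigned int pitch, unsigned int bpp);

/* Returns true when both implementations produce identical output
 * for the same pseudo-random input buffer. */
static bool same_output(rect_fn impl_a, rect_fn impl_b,
                        unsigned int width, unsigned int height,
                        unsigned int bpp)
{
    size_t size = (size_t)width * height * bpp;
    uint8_t *src = malloc(size);
    uint8_t *out_a = malloc(size);
    uint8_t *out_b = malloc(size);
    bool ok = false;

    if (src && out_a && out_b) {
        srand(1234); /* fixed seed so a failure is reproducible */
        for (size_t i = 0; i < size; i++) {
            src[i] = (uint8_t)rand();
        }
        memset(out_a, 0, size);
        memset(out_b, 0, size);
        impl_a(src, width, height, out_a, width * bpp, bpp);
        impl_b(src, width, height, out_b, width * bpp, bpp);
        ok = memcmp(out_a, out_b, size) == 0;
    }
    free(src);
    free(out_a);
    free(out_b);
    return ok;
}

/* Hypothetical stand-in: copies the buffer unchanged. */
static void copy_plain(const uint8_t *src, unsigned int width,
                       unsigned int height, uint8_t *dst,
                       unsigned int pitch, unsigned int bpp)
{
    (void)width;
    (void)bpp;
    memcpy(dst, src, (size_t)pitch * height);
}

/* Hypothetical stand-in: deliberately wrong, inverts every byte. */
static void copy_inverted(const uint8_t *src, unsigned int width,
                          unsigned int height, uint8_t *dst,
                          unsigned int pitch, unsigned int bpp)
{
    (void)width;
    (void)bpp;
    for (size_t i = 0; i < (size_t)pitch * height; i++) {
        dst[i] = (uint8_t)~src[i];
    }
}
```

Running `same_output` over a range of widths, heights and bytes per pixel, once for swizzle and once for unswizzle, would cover the cases listed in the review.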
hw/xbox/nv2a/swizzle.c
Outdated
mask = (mask ^ mv) | (mv >> (1 << i)); // Compress m.
mk = mk & ~mp;
}
nit: space
hw/xbox/nv2a/swizzle.c
Outdated
generate_swizzle_masks(width, height, depth, &mask_x, &mask_y, &mask_z);
expand_mask mask_x, mask_y, mask_z;
generate_swizzle_masks(width, height, depth, &mask_x.mask, &mask_y.mask, &mask_z.mask);
generate_expand_mask_moves(&mask_x);
Calculation unnecessary in case of pdep instruction?
hw/xbox/nv2a/swizzle.c
Outdated
* expand(0000abcd, 10011010) = a00bc0d0.
*
* Implementation from Hacker's delight chapter 7 "Expand"
* https://stackoverflow.com/questions/77834169/what-is-a-fast-fallback-algorithm-which-emulates-pdep-and-pext-in-software
Citing the book should be enough, don't need the stackoverflow.com link
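For readers unfamiliar with the operation, the `expand(0000abcd, 10011010) = a00bc0d0` example in the comment can be illustrated with a simple bit-by-bit reference. This is a minimal sketch of the expand (pdep) semantics, not the Hacker's Delight fixed-step algorithm used in the PR; the name `expand_ref` is hypothetical.

```c
#include <stdint.h>

/* Reference expand: scatter the low bits of x into the positions of
 * the set bits of mask, lowest bit first. One loop iteration per set
 * mask bit, so this is slower than a fixed-step method. */
static uint32_t expand_ref(uint32_t x, uint32_t mask)
{
    uint32_t result = 0;

    while (mask != 0) {
        uint32_t lowest = mask & (~mask + 1u); /* isolate lowest set bit */
        if (x & 1u) {
            result |= lowest;
        }
        x >>= 1;            /* move on to the next source bit */
        mask &= mask - 1u;  /* clear the mask bit just handled */
    }
    return result;
}
```

With a=1, b=0, c=1, d=1 the input is 1011 and the mask 10011010 places those bits at positions 7, 4, 3 and 1, giving 10001010. A reference like this is also convenient as the ground truth when testing a faster implementation.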
hw/xbox/nv2a/swizzle.c
Outdated
}

static uint32_t expand(uint32_t x, expand_mask* expand_mask) {
#ifdef __BMI2__
nit: Readability would be better with these macros unindented
hw/xbox/nv2a/swizzle.c
Outdated
uint32_t moves[5];
} expand_mask;

static void generate_expand_mask_moves(expand_mask* expand_mask) {
nit: Asterisk goes next to identifier
Based on https://fgiesen.wordpress.com/2011/01/17/texture-tiling-and-swizzling/
- Increment swizzled offsets, instead of calculating them from scratch for each pixel. This seems to be around 30X faster.
- Add/update comments, including high-level comment on what swizzle.c is for.
This improves FPS in certain areas in Fable: The Lost Chapters
af61911 to 2d8596c
I've changed the implementation to get the same performance without BMI2, and have updated the PR description
This speeds up swizzling/unswizzling textures.
fill_pattern showed up in profiling Fable: The Lost Chapters. It seems to implement the equivalent of the pdep instruction. It turns out there is a slightly faster method to do this in a fixed number of steps.
This increases fps a bit in certain scenes in Fable TLC.
Also use native pdep if available.
Instead of calculating these offsets from masks, we can increment through the offsets.
See https://fgiesen.wordpress.com/2011/01/17/texture-tiling-and-swizzling/
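The incrementing approach from the linked article can be sketched as follows. This is an illustration under assumptions, not the code in this PR: `next_masked`, `deposit_bits` and `count_mismatches` are hypothetical names, and the masks are simple 2D interleaved examples. The key step is `(offset - mask) & mask`, which advances a coordinate that lives only in the mask's bit positions.

```c
#include <stdint.h>

/* Slow per-pixel reference: scatter the low bits of x into the set
 * bits of mask (same semantics as pdep). */
static uint32_t deposit_bits(uint32_t x, uint32_t mask)
{
    uint32_t result = 0;

    for (uint32_t bit = 0; bit < 32; bit++) {
        uint32_t mask_bit = UINT32_C(1) << bit;
        if (mask & mask_bit) {
            if (x & 1u) {
                result |= mask_bit;
            }
            x >>= 1;
        }
    }
    return result;
}

/* Advance a swizzled coordinate by one. Subtracting the mask lets the
 * borrow ripple through the bits outside the mask, and the final AND
 * discards them again. Wraps to 0 after the last value. */
static uint32_t next_masked(uint32_t offset, uint32_t mask)
{
    return (offset - mask) & mask;
}

/* Walk a width x height rectangle by incrementing the swizzled x and y
 * parts, and count how often that disagrees with the reference. */
static unsigned int count_mismatches(unsigned int width, unsigned int height,
                                     uint32_t mask_x, uint32_t mask_y)
{
    unsigned int mismatches = 0;
    uint32_t off_y = 0;

    for (unsigned int y = 0; y < height; y++) {
        uint32_t off_x = 0;
        for (unsigned int x = 0; x < width; x++) {
            uint32_t reference = deposit_bits(x, mask_x) |
                                 deposit_bits(y, mask_y);
            if ((off_x | off_y) != reference) {
                mismatches++;
            }
            off_x = next_masked(off_x, mask_x);
        }
        off_y = next_masked(off_y, mask_y);
    }
    return mismatches;
}
```

For a 4x4 texture with `mask_x = 0x5` and `mask_y = 0xA`, the x part steps through 0, 1, 4, 5 and the y part through 0, 2, 8, 10, so the inner loop needs only a subtract and an AND per pixel rather than a full bit deposit.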
I wrote a small testbench for the different methods to ensure they all work the same way (not integrated into xemu).
It should be around 30X faster, and it matches both the swizzle and unswizzle behaviour of the original code.
Gives a few more FPS in certain situations in Fable TLC:
Fable TLC FPS comparison (before and after screenshots)