Normally, storing a 64-bit value "by hand", byte by byte, into a byte array is recognized and results in a single store (see the code generated for write_good below).
However, performing some integer operations on the value first breaks that optimization, resulting in 17 to 25 instructions instead of 3 to 4.
I originally assumed this was due to the shift being merged into the later shifts, but it also happens without the shift.
Checked on both aarch64 and x86_64.
GCC has optimized both cases into a single store since GCC 8.
void write_bad(unsigned char *buffer, unsigned char a, unsigned long long b)
{
    unsigned long long v = (a & 0xf) | (b << 5);
    buffer[0] = v;
    buffer[1] = v >> 8;
    buffer[2] = v >> 16;
    buffer[3] = v >> 24;
    buffer[4] = v >> 32;
    buffer[5] = v >> 40;
    buffer[6] = v >> 48;
    buffer[7] = v >> 56;
}

void write_good(unsigned char *buffer, unsigned long long v)
{
    buffer[0] = v;
    buffer[1] = v >> 8;
    buffer[2] = v >> 16;
    buffer[3] = v >> 24;
    buffer[4] = v >> 32;
    buffer[5] = v >> 40;
    buffer[6] = v >> 48;
    buffer[7] = v >> 56;
}
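For reference, the byte-by-byte pattern in both functions is a little-endian 64-bit store: byte i of the buffer receives bits 8*i to 8*i+7 of v. A minimal sketch of the same store in loop form follows; the name store_le64 is hypothetical and not part of the original report, and whether a given compiler merges this form into a single store has not been checked here.

```c
/* Hypothetical helper (not from the report): the same little-endian
 * 64-bit store as write_good, written as a loop. */
void store_le64(unsigned char *buffer, unsigned long long v)
{
    for (int i = 0; i < 8; i++) {
        /* Byte i receives bits 8*i .. 8*i+7 of v. */
        buffer[i] = (unsigned char)(v >> (8 * i));
    }
}
```

Because the byte order is spelled out explicitly, the result does not depend on the endianness of the host.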
Generated code according to godbolt.org:
write_bad:
and sil, 15
mov eax, edx
shl eax, 5
or al, sil
mov byte ptr [rdi], al
mov eax, edx
shr eax, 3
mov byte ptr [rdi + 1], al
mov eax, edx
shr eax, 11
mov byte ptr [rdi + 2], al
mov eax, edx
shr eax, 19
mov byte ptr [rdi + 3], al
mov rax, rdx
shr rax, 27
mov byte ptr [rdi + 4], al
mov rax, rdx
shr rax, 35
mov byte ptr [rdi + 5], al
mov rax, rdx
shr rax, 43
mov byte ptr [rdi + 6], al
shr rdx, 51
mov byte ptr [rdi + 7], dl
ret
write_good:
mov qword ptr [rdi], rsi
ret
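Until the pattern is recognized, one possible workaround is to compute the value and copy it with memcpy, which compilers typically lower to a single store for a fixed 8-byte size. This is only a sketch under stated assumptions: the name write_memcpy is hypothetical, and the result matches write_bad only on a little-endian host, since memcpy copies the bytes in host order.

```c
#include <string.h>

/* Hypothetical workaround (not from the report). Assumes a
 * little-endian host and an 8-byte unsigned long long; on a
 * big-endian host the bytes would land in the opposite order. */
void write_memcpy(unsigned char *buffer, unsigned char a, unsigned long long b)
{
    unsigned long long v = (a & 0xf) | (b << 5);
    /* Copy the object representation of v into the buffer. */
    memcpy(buffer, &v, sizeof v);
}
```

The trade-off is portability: the byte-by-byte form in write_bad is endian-independent, while this version is not.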