This repository:
- Establishes a terse syntax for describing a broad range of CRC32 implementations.
- Contains a code generator for turning terse descriptions into executable C code.
- Contains benchmarking apparatus for CRC32 implementations, along with a range of benchmark results.
- Presents some novel CRC32C implementations which are faster than the previous state of the art.
- Apple M1: 9.6% faster (77.55 -> 85.05 GB/s)
- Cascade Lake: 25% faster (25.32 -> 31.55 GB/s)
- Ice Lake: 56% faster (41.00 -> 63.98 GB/s)
- Sapphire Rapids: 19% faster (81.68 -> 97.30 GB/s)
Running make
will generate and compile and benchmark a handful of different implementations.
Alternatively, run make generate
and then use ./generate
to generate a particular implementation.
Terse syntax | gcc equivalent | Example CPUs |
---|---|---|
-i neon |
aarch64 -march=armv8-a+crypto+crc |
Apple M1, GCP Tau T2A (Ampere Altra Arm) |
-i neon_eor3 |
aarch64 -march=armv8.2-a+crypto+sha3 |
Apple M1 |
-i sse |
x86_64 -msse4.2 -mpclmul |
Any Intel or AMD CPU from the last 10 years |
-i avx512 |
x86_64 -mavx512f -mavx512vl |
Intel Skylake X and newer, AMD Zen 4 |
-i avx512_vpclmulqdq |
x86_64 -mavx512f -mavx512vl -mvpclmulqdq |
Intel Ice Lake and newer, AMD Zen 4 |
For CRC32, the key feature required of any instruction set is a vector carryless multiplication instruction, e.g.
pclmulqdq
on x86_64 or pmull
on aarch64. Three-way exclusive-or is also useful, e.g. vpternlogq
on x86_64 or
eor3
on aarch64.
Terse syntax | Polynomial | Hardware acceleration | Example applications |
---|---|---|---|
-p crc32 |
0x04C11DB7 | aarch64 | Ethernet, SATA, zlib, png |
-p crc32c |
0x1EDC6F41 | aarch64, x86_64 | iSCSI, Btrfs, ext4 |
-p crc32k |
0x741B8CD7 | no | |
-p crc32q |
0x814141AB | no | AIXM |
-p 0x87654321 |
0x87654321 | no | none |
Most new applications should probably use -p crc32c
. A large number of existing applications use -p crc32
. If you
really care about this, then you can also specify an arbitrary polynomial.
Some instruction sets contain scalar instructions for accelerating particular CRC32 polynomials. Notably, aarch64 has
scalar acceleration for -p crc32
and -p crc32c
, whereas x86_64 only has scalar acceleration for -p crc32c
.
Six different parameters control the generated algorithm:
- Number of vector accumulators to use (16 or 64 bytes each).
- Number of vector loads to perform per iteration (defaults to number of vector accumulators, and must be a multiple of the number of vector accumulators).
- Number of scalar accumulators to use (4 bytes each).
- Number of scalar 8-byte loads to perform per iteration (defaults to number of scalar accumulators, and must be a multiple of the number of scalar accumulators).
- The outer loop size, if any.
- Whether inner loop termination is length-based or pointer-based.
The number of vector accumulators is specified by the letter v
, followed by a decimal number. If the number of vector loads per iteration is non-default, this is followed by the letter x
, followed by the ratio of loads to accumulators as a decimal number. For example, v4
denotes four vector accumulators and four vector loads per iteration, and v3x2
denotes three vector accumulators and six vector loads per iteration.
The number of scalar accumulators is specified in a similar way, but with s
instead of v
. For example, s3
denotes three scalar accumulators and three scalar loads per iteration, and s5x3
denotes five scalar accumulators and 15 scalar loads per iteration.
An outer loop size (in bytes) can be specified by the letter k
, followed by a decimal number. Using an outer loop slightly decreases the complexity of the inner loop (because the inner loop trip count becomes constant), but makes the implementation less well suited to input sizes that aren't a multiple of the outer loop size.
The inner loop condition can be based on the remaining length (the default), or based on pointer comparisons (specify the letter e
). Length-based is easier for humans to comprehend. Pointer-based can make the inner loop slightly simpler, at the expense of slightly more logic outside the loop.
The various parameters are concatenated to give the algorithm string, for example:
-a v4
to use four vector registers.-a v4x2s3x5k4096e
to use four vector registers (8 vector loads per iteration), three scalar registers (15 scalar loads per iteration), outer block size 4096 bytes, and pointer-based loop termination.
Multiple sets of parameters can be separated by an _
character, for example -a v4_v1
uses a v4
algorithm for as long as possible, then switches to v1
for any remaining bytes. If there are still any remaining bytes, then an implicit _s1
is appended to deal with them.
Implementation | Speed | Our equivalent | Speed |
---|---|---|---|
Chromium scalar | 25.03 GB/s | -i neon -p crc32 -a s3k95760_s3 |
25.50 GB/s |
Chromium vector | 26.57 GB/s | -i neon -p crc32 -a v4_v1 |
34.03 GB/s |
dougallj Faster CRC32 | 77.55 GB/s | -i neon -p crc32 -a v12e_v1 |
77.69 GB/s |
fusion-inspired | N/A | -i neon_eor3 -p crc32 -a v9s3x2e_s3 |
85.05 GB/s |
To reproduce these measurements:
$ make bench autobench
$ make -C third_party chromium_neon_crc32_s3k95760_s3.dylib chromium_neon_crc32_v4_v1.dylib dougallj_neon_crc32_v12e_v1.dylib
$ ./bench -r 40 third_party/chromium_neon_crc32_s3k95760_s3.dylib third_party/chromium_neon_crc32_v4_v1.dylib third_party/dougallj_neon_crc32_v12e_v1.dylib
$ ./autobench -r 40 -i neon -p crc32 -a s3k95760_s3,v4_v1,v12e_v1 -i neon_eor3 -a v9s3x2e_s3
Rich microarchitecture details are available, though the key details for CRC32 distill down to the following table:
Primitive | Uops | Load | Scalar | Vector | Bytes | Score | Latency |
---|---|---|---|---|---|---|---|
ldr, pmull+eor, pmull2+eor |
5/8 | 1/3 or 0.5/3 | 2/4 | 16 | 25.6 | 6 | |
ldr, pmull, pmull2, eor3 |
4/8 | 1/3 or 0.5/3 | 3/4 | 16 | 21.3 | 5 | |
ldr, pmull, pmull2, eor, eor |
5/8 | 1/3 or 0.5/3 | 4/4 | 16 | 16.0 | 7 | |
ldr, crc32x |
2/8 | 1/3 or 0.5/3 | 1/1 | 8 | 8.0 | 3 |
Score is computed as Bytes / max(Uops, Load, Scalar, Vector)
, and gives the optimal bytes per cycle that we can
hope to achieve with the corresponding primitive. The compiler might fuse two consecutive ldr
instructions into a
single ldp
instruction, which reduces Load
from 1/3
to 0.5/3
, but does not decrease Uops
. The CPU can
fuse eor
into an immediately preceding pmull
or pmull2
, which decreases the demand on the vector execution
units, but such fused instructions still count as two for Uops
. Chromium's scalar implementation is a decent
application of the ldr, crc32x
primitive, with three scalar accumulators sufficient to match the latency of 3.
Chromium's vector implementation is a poor application of the ldr, pmull, pmull2, eor, eor
primitive, as the four
vector accumulators are insufficient to match the latency of 7. Our v4_v1
does better by using the
ldr, pmull+eor, pmull2+eor
primitive, but the four vector accumulators are still insufficient to match the latency
of 6. Dougallj does a good application of the ldr, pmull+eor, pmull2+eor
primitive, using 12 vector accumulators to
match the latency. The fun twist is the ldr, pmull, pmull2, eor3
primitive; it scores worse than
ldr, pmull+eor, pmull2+eor
, so if just one primitive is used, ldr, pmull, pmull2, eor3
will lose out to
ldr, pmull+eor, pmull2+eor
, but ldr, pmull, pmull2, eor3
is less demanding on Uops
, so a blend of
ldr, pmull, pmull2, eor3
with ldr, crc32x
can come out ahead. The v9s3x2e_s3
implementation blends nine copies
of ldr, pmull, pmull2, eor3
with six copies of ldr, crc32x
, which gives Uops = 48/8, Scalar = 6/1, Vector = 27/4,
Bytes = 192, and thus a score of 28.4, which puts it just ahead of ldr, pmull+eor, pmull2+eor
.
Implementation | Speed | Our equivalent | Speed |
---|---|---|---|
Chromium scalar | 22.68 GB/s | -i neon -p crc32 -a s3k95760_s3 |
23.57 GB/s |
Chromium vector | 16.91 GB/s | -i neon -p crc32 -a v4_v1 |
20.94 GB/s |
dougallj Faster CRC32 | 17.41 GB/s | -i neon -p crc32 -a v12e_v1 |
21.81 GB/s |
fusion-inspired | N/A | -i neon -p crc32 -a v3s4x2e_v2 |
35.87 GB/s |
To reproduce these measurements:
$ make bench autobench
$ make -C third_party chromium_neon_crc32_s3k95760_s3.so chromium_neon_crc32_v4_v1.so dougallj_neon_crc32_v12e_v1.so
$ ./bench -r 40 third_party/chromium_neon_crc32_s3k95760_s3.so third_party/chromium_neon_crc32_v4_v1.so third_party/dougallj_neon_crc32_v12e_v1.so
$ ./autobench -r 40 -i neon -p crc32 -a s3k95760_s3,v4_v1,v12e_v1,v3s4x2e_v2
The CPU datasheet says:
Four-wide superscalar aggressive out-of-order execution CPU
Dual full-width (128-bit wide) SIMD execution pipes
Meanwhile, microarchitecture measurements suggest that:
crc32x
/crc32cx
have latency 2, throughput 1pmull
has latency 2, throughput 1- vector
eor
has latency 2, throughput 0.5
These latency and throughput numbers are consistent with Chromium scalar being faster than Chromium vector. The optimal fusion also involves more bytes being processed by the scalar pipes (64 bytes per iteration) than the vector pipes (48 bytes per iteration).
Relevant /proc/cpuinfo
of test machine:
vendor_id : GenuineIntel
cpu family : 6
model : 85
model name : Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz
stepping : 7
microcode : 0x5003006
cpu MHz : 2199.998
cache size : 16384 KB
Implementation | Speed | Our equivalent | Speed |
---|---|---|---|
Chromium vector | 15.87 GB/s | -i sse -p crc32 -a v4_v1 |
16.48 GB/s |
crc32_4k_three_way | 15.04 GB/s | -i sse -p crc32c -a s3k4096e |
15.11 GB/s |
crc32_4k_pclmulqdq | 15.78 GB/s | -i sse -p crc32c -a v4k4096e |
15.63 GB/s |
retuned crc32_4k_pclmulqdq | N/A | -i sse -p crc32c -a v4e |
16.03 GB/s |
AVX512 crc32_4k_pclmulqdq | N/A | -i avx512 -p crc32c -a v4e |
17.36 GB/s |
crc32_4k_fusion | 25.32 GB/s | -i sse -p crc32c -a v4s3x3k4096e |
25.42 GB/s |
retuned crc32_4k_fusion | N/A | -i sse -p crc32c -a v8s3x3 |
28.55 GB/s |
AVX512 crc32_4k_fusion | N/A | -i avx512 -p crc32c -a v9s3x4e |
31.55 GB/s |
To reproduce these measurements:
$ make bench autobench
$ make -C third_party chromium_sse_crc32_v4_v1.so corsix4k_sse_crc32c_s3k4096e.so corsix4k_sse_crc32c_v4k4096e.so corsix4k_sse_crc32c_v4s3x3k4096e.so
$ ./bench -r 40 third_party/chromium_sse_crc32_v4_v1.so third_party/corsix4k_sse_crc32c_s3k4096e.so third_party/corsix4k_sse_crc32c_v4k4096e.so third_party/corsix4k_sse_crc32c_v4s3x3k4096e.so
$ ./autobench -r 40 -i sse -p crc32 -a v4_v1 -i sse -p crc32c -a s3k4096e,v4k4096e,v4e -i avx512 -a v4e -i sse -a v4s3x3k4096e,v8s3x3 -i avx512 -a v9s3x4e
Note that Cascade Lake lacks VPCLMULQDQ, so the only useful part of AVX512 is VPTERNLOGQ, hence the only very limited performance increase from AVX512 here.
Relevant /proc/cpuinfo
of test machine:
vendor_id : GenuineIntel
cpu family : 6
model : 106
model name : Intel(R) Xeon(R) CPU @ 2.60GHz
stepping : 6
microcode : 0xffffffff
cpu MHz : 2600.028
cache size : 55296 KB
Implementation | Speed | Our equivalent | Speed |
---|---|---|---|
Chromium vector | 24.08 GB/s | -i sse -p crc32 -a v4_v1 |
24.02 GB/s |
Chromium vector AVX512 | 41.00 GB/s | -i avx512_vpclmulqdq -p crc32 -a v4_v1 |
47.31 GB/s |
crc32_4k_three_way | 19.68 GB/s | -i sse -p crc32c -a s3k4096e |
19.80 GB/s |
retuned crc32_4k_three_way | N/A | -i sse -p crc32c -a s3 |
22.50 GB/s |
crc32_4k_pclmulqdq | 22.80 GB/s | -i sse -p crc32c -a v4k4096e |
22.80 GB/s |
retuned crc32_4k_pclmulqdq | N/A | -i sse -p crc32c -a v4e |
24.05 GB/s |
AVX512 crc32_4k_pclmulqdq | N/A | -i avx512_vpclmulqdq -p crc32c -a v4x2e |
47.50 GB/s |
crc32_4k_fusion | 35.12 GB/s | -i sse -p crc32c -a v4s3x3k4096e |
34.72 GB/s |
retuned crc32_4k_fusion | N/A | -i sse -p crc32c -a v7s3x3 |
42.58 GB/s |
AVX512 crc32_4k_fusion | N/A | -i avx512_vpclmulqdq -p crc32c -a v4s5x3 |
63.98 GB/s |
To reproduce these measurements:
$ make bench autobench
$ make -C third_party chromium_sse_crc32_v4_v1.so chromium_avx512_vpclmulqdq_crc32_v4_v1.so corsix4k_sse_crc32c_s3k4096e.so corsix4k_sse_crc32c_v4k4096e.so corsix4k_sse_crc32c_v4s3x3k4096e.so
$ ./bench -r 40 third_party/chromium_sse_crc32_v4_v1.so third_party/chromium_avx512_vpclmulqdq_crc32_v4_v1.so third_party/corsix4k_sse_crc32c_s3k4096e.so third_party/corsix4k_sse_crc32c_v4k4096e.so third_party/corsix4k_sse_crc32c_v4s3x3k4096e.so
$ ./autobench -r 40 -i sse -p crc32 -a v4_v1 -i avx512_vpclmulqdq -p crc32 -a v4_v1 -i sse -p crc32c -a s3k4096e,s3,v4k4096e,v4e -i avx512_vpclmulqdq -a v4e -i sse -a v4s3x3k4096e,v7s3x3 -i avx512_vpclmulqdq -a v4s5x3
Ice Lake introduces VPCLMULQDQ, but VPCLMULQDQ on 64-byte registers is 1*p05+2*p5 (versus regular PCLMULQDQ on 16-byte registers being 1*p5), so going from 16-byte to 64-byte registers only gives a ~2x speedup. Chromium vector AVX512 leaves some performance on the table due to not using VPTERNLOGQ.
Relevant /proc/cpuinfo
of test machine:
vendor_id : GenuineIntel
cpu family : 6
model : 143
model name : Intel(R) Xeon(R) Platinum 8481C CPU @ 2.70GHz
stepping : 8
microcode : 0xffffffff
cpu MHz : 2699.998
cache size : 107520 KB
Implementation | Speed | Our equivalent | Speed |
---|---|---|---|
Chromium vector | 23.42 GB/s | -i sse -p crc32 -a v4_v1 |
23.31 GB/s |
Chromium vector AVX512 | 65.20 GB/s | -i avx512_vpclmulqdq -p crc32 -a v4_v1 |
94.00 GB/s |
(with aligned inputs) | 81.68 GB/s | -i avx512_vpclmulqdq -p crc32 -a v4_v1 |
94.00 GB/s |
crc32_4k_three_way | 17.45 GB/s | -i sse -p crc32c -a s3k4096e |
17.59 GB/s |
retuned crc32_4k_three_way | N/A | -i sse -p crc32c -a s3 |
23.84 GB/s |
crc32_4k_pclmulqdq | 22.79 GB/s | -i sse -p crc32c -a v4k4096e |
22.79 GB/s |
retuned crc32_4k_pclmulqdq | N/A | -i sse -p crc32c -a v4e |
23.29 GB/s |
AVX512 crc32_4k_pclmulqdq | N/A | -i avx512_vpclmulqdq -p crc32c -a v4e |
94.84 GB/s |
crc32_4k_fusion | 33.87 GB/s | -i sse -p crc32c -a v4s3x3k4096e |
34.68 GB/s |
retuned crc32_4k_fusion | N/A | -i sse -p crc32c -a v8s3x3 |
36.98 GB/s |
AVX512 crc32_4k_fusion | N/A | -i avx512_vpclmulqdq -p crc32c -a v3s1_s3 |
97.30 GB/s |
To reproduce these measurements:
$ make bench autobench
$ make -C third_party chromium_sse_crc32_v4_v1.so chromium_avx512_vpclmulqdq_crc32_v4_v1.so corsix4k_sse_crc32c_s3k4096e.so corsix4k_sse_crc32c_v4k4096e.so corsix4k_sse_crc32c_v4s3x3k4096e.so
$ ./bench -r 40 third_party/chromium_sse_crc32_v4_v1.so third_party/chromium_avx512_vpclmulqdq_crc32_v4_v1.so third_party/corsix4k_sse_crc32c_s3k4096e.so third_party/corsix4k_sse_crc32c_v4k4096e.so third_party/corsix4k_sse_crc32c_v4s3x3k4096e.so
$ ./autobench -r 40 -i sse -p crc32 -a v4_v1 -i avx512_vpclmulqdq -p crc32 -a v4_v1 -i sse -p crc32c -a s3k4096e,s3,v4k4096e,v4e -i avx512_vpclmulqdq -a v4e -i sse -a v4s3x3k4096e,v8s3x3 -i avx512_vpclmulqdq -a v3s1_s3
Sapphire Rapids improves VPCLMULQDQ on 64-byte registers to 1*p5, meaning that AVX512 implementations can approach 4x the performance of implementations using 16-byte vectors (and slightly exceed 4x when also gaining VPTERNLOGQ). One catch is that we now observe a 20% performance penalty for unaligned 64-byte loads, which the Chromium code doesn't take care to handle.
Relevant /proc/cpuinfo
of test machine:
vendor_id : AuthenticAMD
cpu family : 23
model : 49
model name : AMD EPYC 7B12
stepping : 0
microcode : 0xffffffff
cpu MHz : 2249.998
cache size : 512 KB
Implementation | Speed | Our equivalent | Speed |
---|---|---|---|
Chromium vector | 12.89 GB/s | -i sse -p crc32 -a v4_v1 |
12.84 GB/s |
crc32_4k_three_way | 23.52 GB/s | -i sse -p crc32c -a s3k4096e |
23.08 GB/s |
crc32_4k_pclmulqdq | 12.86 GB/s | -i sse -p crc32c -a v4k4096e |
12.89 GB/s |
crc32_4k_fusion | 23.84 GB/s | -i sse -p crc32c -a v4s3x3k4096e |
22.62 GB/s |
retuned crc32_4k_fusion | N/A | -i sse -p crc32c -a v1s3x3 |
31.16 GB/s |
To reproduce these measurements:
$ make bench autobench
$ make -C third_party chromium_sse_crc32_v4_v1.so corsix4k_sse_crc32c_s3k4096e.so corsix4k_sse_crc32c_v4k4096e.so corsix4k_sse_crc32c_v4s3x3k4096e.so
$ ./bench -r 40 third_party/chromium_sse_crc32_v4_v1.so third_party/corsix4k_sse_crc32c_s3k4096e.so third_party/corsix4k_sse_crc32c_v4k4096e.so third_party/corsix4k_sse_crc32c_v4s3x3k4096e.so
$ ./autobench -r 40 -i sse -p crc32 -a v4_v1 -p crc32c -a s3k4096e,v4k4096e,v4s3x3k4096e,v1s3x3
Relevant /proc/cpuinfo
of test machine:
vendor_id : AuthenticAMD
cpu family : 25
model : 1
model name : AMD EPYC 7B13
stepping : 0
microcode : 0xffffffff
cpu MHz : 2450.000
cache size : 512 KB
Implementation | Speed | Our equivalent | Speed |
---|---|---|---|
Chromium vector | 12.82 GB/s | -i sse -p crc32 -a v4_v1 |
12.85 GB/s |
crc32_4k_three_way | 22.75 GB/s | -i sse -p crc32c -a s3k4096e |
23.55 GB/s |
crc32_4k_pclmulqdq | 12.76 GB/s | -i sse -p crc32c -a v4k4096e |
12.86 GB/s |
crc32_4k_fusion | 23.41 GB/s | -i sse -p crc32c -a v4s3x3k4096e |
22.64 GB/s |
retuned crc32_4k_fusion | N/A | -i sse -p crc32c -a v1s4x2 |
31.76 GB/s |
To reproduce these measurements:
$ make bench autobench
$ make -C third_party chromium_sse_crc32_v4_v1.so corsix4k_sse_crc32c_s3k4096e.so corsix4k_sse_crc32c_v4k4096e.so corsix4k_sse_crc32c_v4s3x3k4096e.so
$ ./bench -r 40 third_party/chromium_sse_crc32_v4_v1.so third_party/corsix4k_sse_crc32c_s3k4096e.so third_party/corsix4k_sse_crc32c_v4k4096e.so third_party/corsix4k_sse_crc32c_v4s3x3k4096e.so
$ ./autobench -r 40 -i sse -p crc32 -a v4_v1 -p crc32c -a s3k4096e,v4k4096e,v4s3x3k4096e,v1s4x2
Relevant /proc/cpuinfo
of test machine:
vendor_id : AuthenticAMD
cpu family : 25
model : 17
model name : AMD EPYC 9B14
stepping : 1
microcode : 0xffffffff
cpu MHz : 2599.998
cache size : 1024 KB
Implementation | Speed | Our equivalent | Speed |
---|---|---|---|
Chromium vector | 13.73 GB/s | -i sse -p crc32 -a v4_v1 |
13.72 GB/s |
Chromium vector AVX512 | 51.70 GB/s | -i avx512_vpclmulqdq -p crc32 -a v4_v1 |
54.00 GB/s |
crc32_4k_three_way | 26.27 GB/s | -i sse -p crc32c -a s3k4096e |
26.56 GB/s |
retuned crc32_4k_three_way | N/A | -i sse -p crc32c -a s3 |
27.32 GB/s |
crc32_4k_pclmulqdq | 13.73 GB/s | -i sse -p crc32c -a v4k4096e |
13.74 GB/s |
AVX512 crc32_4k_pclmulqdq | N/A | -i avx512_vpclmulqdq -p crc32c -a v3x2 |
54.77 GB/s |
crc32_4k_fusion | 28.47 GB/s | -i sse -p crc32c -a v4s3x3k4096e |
28.56 GB/s |
retuned crc32_4k_fusion | N/A | -i sse -p crc32c -a v1s3x2 |
36.20 GB/s |
AVX512 crc32_4k_fusion | N/A | -i avx512_vpclmulqdq -p crc32c -a v3s2x4 |
71.95 GB/s |
To reproduce these measurements:
$ make bench autobench
$ make -C third_party chromium_sse_crc32_v4_v1.so chromium_avx512_vpclmulqdq_crc32_v4_v1.so corsix4k_sse_crc32c_s3k4096e.so corsix4k_sse_crc32c_v4k4096e.so corsix4k_sse_crc32c_v4s3x3k4096e.so
$ ./bench -r 40 third_party/chromium_sse_crc32_v4_v1.so third_party/chromium_avx512_vpclmulqdq_crc32_v4_v1.so third_party/corsix4k_sse_crc32c_s3k4096e.so third_party/corsix4k_sse_crc32c_v4k4096e.so third_party/corsix4k_sse_crc32c_v4s3x3k4096e.so
$ ./autobench -r 40 -i sse -p crc32 -a v4_v1 -i avx512_vpclmulqdq -p crc32 -a v4_v1 -i sse -p crc32c -a s3k4096e,s3,v4k4096e -i avx512_vpclmulqdq -a v3x2 -i sse -a v4s3x3k4096e,v1s3x2 -i avx512_vpclmulqdq -a v3s2x4
Background mathematics:
CRC implementations: