Misc Updates #57

Open · wants to merge 14 commits into main

Conversation

Dr-Emann (Collaborator) commented Oct 10, 2023

  • Update to rust edition 2018, remove unneeded extern crates
  • Remove trait bounds from types. This is unlikely to actually impact anyone, but could reduce having to propagate bounds to any structs which contain these types.
  • Port benchmarking to criterion, for more thorough benchmarking, and the ability to run benchmarking on stable
  • Follow some clippy suggestions to mark functions as #[must_use]
  • Update to maintained versions of (dev) dependencies

Dr-Emann (Collaborator, Author) commented Oct 10, 2023

Here are the benchmark results from my machines. It looks like jetscii is somewhat behind memchr (and sometimes std) in most cases, unfortunately.

  • At least on my machines, it looks like memchr's memmem always beats our substring search.
  • It looks like our fallback implementation is pretty weak: it gets trounced by memchr, and often even by std. This is probably at least partly because of the required dynamic dispatch per character when using a static object.
    • We might be able to do something like jetscii!{ static XML5: AsciiChars = b"<>&'\""; }, and allow the macro to use a monomorphized type somehow?
    • We might be able to have the fallback function do the searching, which would amortize the dynamic function call overhead (see the sketch after this list).
  • For small-to-medium numbers of characters, even with the SSE implementation, memchr may be an alternative if you're going to be searching the whole string anyway, rather than once for the first character in a set.
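
To make the per-character dispatch cost concrete, here is a minimal, runnable sketch of the batching idea; the function shapes are hypothetical, not jetscii's actual API:

```rust
// Illustrative only: compare one virtual call per byte against one virtual
// call per search. Names and signatures are made up for this sketch.

// What the fallback effectively pays today: a `dyn Fn(u8) -> bool` call per byte.
fn find_per_byte(haystack: &[u8], pred: &dyn Fn(u8) -> bool) -> Option<usize> {
    haystack.iter().position(|&b| pred(b))
}

// The "let the fallback do the searching" idea: one virtual call, amortized
// over the whole haystack.
fn find_batched(
    haystack: &[u8],
    search: &dyn Fn(&[u8]) -> Option<usize>,
) -> Option<usize> {
    search(haystack)
}

fn main() {
    let hay = b"hello <world>";
    let in_set = |b: u8| matches!(b, b'<' | b'>' | b'&' | b'\'' | b'"');
    assert_eq!(find_per_byte(hay, &in_set), Some(6));
    assert_eq!(
        find_batched(hay, &|h: &[u8]| h.iter().position(|&b| in_set(b))),
        Some(6)
    );
}
```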
Windows x64 Benchmark Results

Environment details:

Windows 10, Intel(R) Core(TM) i5-6600K CPU @ 3.50GHz
MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, EM64T, AES, AVX, AVX2, FMA3, TSX

find_last_space/ascii_chars
                        time:   [2.3670 ms 2.4117 ms 2.4535 ms]
                        thrpt:  [1.9901 GiB/s 2.0247 GiB/s 2.0629 GiB/s]
find_last_space/stdlib_find_string
                        time:   [16.520 ms 16.725 ms 16.907 ms]
                        thrpt:  [295.73 MiB/s 298.95 MiB/s 302.66 MiB/s]
find_last_space/stdlib_find_char
                        time:   [1.7436 ms 1.7464 ms 1.7493 ms]
                        thrpt:  [2.7914 GiB/s 2.7959 GiB/s 2.8004 GiB/s]
find_last_space/stdlib_find_char_set
                        time:   [17.826 ms 18.047 ms 18.238 ms]
                        thrpt:  [274.15 MiB/s 277.05 MiB/s 280.49 MiB/s]
find_last_space/stdlib_find_closure
                        time:   [17.683 ms 17.946 ms 18.181 ms]
                        thrpt:  [275.01 MiB/s 278.62 MiB/s 282.75 MiB/s]
find_last_space/stdlib_iter_position
                        time:   [13.672 ms 13.694 ms 13.720 ms]
                        thrpt:  [364.43 MiB/s 365.13 MiB/s 365.71 MiB/s]
find_last_space/memchr  time:   [480.79 µs 485.71 µs 491.34 µs]
                        thrpt:  [9.9376 GiB/s 10.053 GiB/s 10.156 GiB/s]

find_xml_3/ascii_chars  time:   [2.5123 ms 2.5158 ms 2.5195 ms]
                        thrpt:  [1.9380 GiB/s 1.9409 GiB/s 1.9436 GiB/s]
find_xml_3/stdlib_find_char_set
                        time:   [18.783 ms 19.131 ms 19.463 ms]
                        thrpt:  [256.90 MiB/s 261.35 MiB/s 266.20 MiB/s]
find_xml_3/stdlib_find_closure
                        time:   [20.092 ms 20.114 ms 20.137 ms]
                        thrpt:  [248.30 MiB/s 248.59 MiB/s 248.85 MiB/s]
find_xml_3/stdlib_iter_position
                        time:   [6.7072 ms 6.7163 ms 6.7257 ms]
                        thrpt:  [743.42 MiB/s 744.45 MiB/s 745.46 MiB/s]
find_xml_3/memchr       time:   [672.72 µs 688.70 µs 703.63 µs]
                        thrpt:  [6.9394 GiB/s 7.0899 GiB/s 7.2583 GiB/s]

find_xml_5/ascii_chars  time:   [2.4397 ms 2.4723 ms 2.5012 ms]
                        thrpt:  [1.9522 GiB/s 1.9750 GiB/s 2.0014 GiB/s]
find_xml_5/stdlib_find_char_set
                        time:   [17.258 ms 17.834 ms 18.395 ms]
                        thrpt:  [271.81 MiB/s 280.36 MiB/s 289.72 MiB/s]
find_xml_5/stdlib_find_closure
                        time:   [20.058 ms 20.079 ms 20.101 ms]
                        thrpt:  [248.74 MiB/s 249.02 MiB/s 249.28 MiB/s]
find_xml_5/stdlib_iter_position
                        time:   [6.5170 ms 6.6021 ms 6.6773 ms]
                        thrpt:  [748.80 MiB/s 757.34 MiB/s 767.22 MiB/s]
find_xml_5/memchr       time:   [1.2204 ms 1.2601 ms 1.3022 ms]
                        thrpt:  [3.7498 GiB/s 3.8749 GiB/s 4.0009 GiB/s]

find_big_16/ascii_chars time:   [2.5150 ms 2.5193 ms 2.5242 ms]
                        thrpt:  [1.9344 GiB/s 1.9381 GiB/s 1.9415 GiB/s]
find_big_16/stdlib_find_char_set
                        time:   [20.098 ms 20.123 ms 20.150 ms]
                        thrpt:  [248.14 MiB/s 248.47 MiB/s 248.78 MiB/s]
find_big_16/stdlib_find_closure
                        time:   [14.302 ms 14.907 ms 15.536 ms]
                        thrpt:  [321.84 MiB/s 335.42 MiB/s 349.61 MiB/s]
find_big_16/stdlib_iter_position
                        time:   [14.219 ms 14.408 ms 14.609 ms]
                        thrpt:  [342.25 MiB/s 347.02 MiB/s 351.64 MiB/s]
find_big_16/memchr      time:   [3.5348 ms 3.6534 ms 3.7668 ms]
                        thrpt:  [1.2963 GiB/s 1.3365 GiB/s 1.3814 GiB/s]

find_big_16_early_return/ascii_chars
                        time:   [21.674 ns 21.716 ns 21.767 ns]
                        thrpt:  [43.813 MiB/s 43.916 MiB/s 44.001 MiB/s]
find_big_16_early_return/stdlib_find_char_set
                        time:   [7.3062 ns 7.5579 ns 7.7857 ns]
                        thrpt:  [122.49 MiB/s 126.18 MiB/s 130.53 MiB/s]
find_big_16_early_return/stdlib_find_closure
                        time:   [9.2522 ns 9.2635 ns 9.2758 ns]
                        thrpt:  [102.81 MiB/s 102.95 MiB/s 103.08 MiB/s]
find_big_16_early_return/stdlib_iter_position
                        time:   [5.4156 ns 5.4234 ns 5.4331 ns]
                        thrpt:  [175.53 MiB/s 175.84 MiB/s 176.10 MiB/s]
find_big_16_early_return/memchr
                        time:   [3.5441 ms 3.5672 ms 3.5910 ms]
                        thrpt:  [278.48   B/s 280.33   B/s 282.16   B/s]

find_substring/substring
                        time:   [2.0862 ms 2.0987 ms 2.1088 ms]
                        thrpt:  [2.3154 GiB/s 2.3266 GiB/s 2.3405 GiB/s]
find_substring/stdlib_find_string
                        time:   [2.6564 ms 2.6819 ms 2.7030 ms]
                        thrpt:  [1.8064 GiB/s 1.8207 GiB/s 1.8381 GiB/s]
find_substring/memchr   time:   [767.40 µs 781.51 µs 792.49 µs]
                        thrpt:  [6.1613 GiB/s 6.2479 GiB/s 6.3628 GiB/s]

macOS M1 Pro Benchmarks
find_last_space/ascii_chars
                        time:   [4.9585 ms 4.9618 ms 4.9654 ms]
                        thrpt:  [1007.0 MiB/s 1007.7 MiB/s 1008.4 MiB/s]
find_last_space/stdlib_find_string
                        time:   [3.2210 ms 3.2276 ms 3.2361 ms]
                        thrpt:  [1.5089 GiB/s 1.5128 GiB/s 1.5159 GiB/s]
find_last_space/stdlib_find_char
                        time:   [244.10 µs 244.25 µs 244.42 µs]
                        thrpt:  [19.977 GiB/s 19.991 GiB/s 20.004 GiB/s]
find_last_space/stdlib_find_char_set
                        time:   [3.3083 ms 3.3103 ms 3.3125 ms]
                        thrpt:  [1.4740 GiB/s 1.4750 GiB/s 1.4759 GiB/s]
find_last_space/stdlib_find_closure
                        time:   [3.3053 ms 3.3072 ms 3.3093 ms]
                        thrpt:  [1.4755 GiB/s 1.4764 GiB/s 1.4773 GiB/s]
find_last_space/stdlib_iter_position
                        time:   [1.6432 ms 1.6458 ms 1.6483 ms]
                        thrpt:  [2.9623 GiB/s 2.9668 GiB/s 2.9715 GiB/s]
find_last_space/memchr  time:   [62.832 µs 62.871 µs 62.910 µs]
                        thrpt:  [77.616 GiB/s 77.664 GiB/s 77.713 GiB/s]

find_xml_3/ascii_chars  time:   [4.9489 ms 4.9521 ms 4.9556 ms]
                        thrpt:  [1009.0 MiB/s 1009.7 MiB/s 1010.3 MiB/s]
find_xml_3/stdlib_find_char_set
                        time:   [3.5181 ms 3.5205 ms 3.5231 ms]
                        thrpt:  [1.3859 GiB/s 1.3870 GiB/s 1.3879 GiB/s]
find_xml_3/stdlib_find_closure
                        time:   [4.7493 ms 4.7526 ms 4.7561 ms]
                        thrpt:  [1.0266 GiB/s 1.0274 GiB/s 1.0281 GiB/s]
find_xml_3/stdlib_iter_position
                        time:   [2.4937 ms 2.4952 ms 2.4969 ms]
                        thrpt:  [1.9556 GiB/s 1.9569 GiB/s 1.9581 GiB/s]
find_xml_3/memchr       time:   [157.27 µs 157.51 µs 157.72 µs]
                        thrpt:  [30.958 GiB/s 31.001 GiB/s 31.048 GiB/s]

find_xml_5/ascii_chars  time:   [4.9523 ms 4.9568 ms 4.9621 ms]
                        thrpt:  [1007.6 MiB/s 1008.7 MiB/s 1009.6 MiB/s]
find_xml_5/stdlib_find_char_set
                        time:   [3.5223 ms 3.5243 ms 3.5265 ms]
                        thrpt:  [1.3846 GiB/s 1.3855 GiB/s 1.3862 GiB/s]
find_xml_5/stdlib_find_closure
                        time:   [4.4779 ms 4.4810 ms 4.4844 ms]
                        thrpt:  [1.0888 GiB/s 1.0897 GiB/s 1.0904 GiB/s]
find_xml_5/stdlib_iter_position
                        time:   [2.4932 ms 2.4950 ms 2.4970 ms]
                        thrpt:  [1.9555 GiB/s 1.9570 GiB/s 1.9585 GiB/s]
find_xml_5/memchr       time:   [275.68 µs 276.01 µs 276.34 µs]
                        thrpt:  [17.670 GiB/s 17.691 GiB/s 17.712 GiB/s]

find_big_16/ascii_chars time:   [4.9420 ms 4.9464 ms 4.9507 ms]
                        thrpt:  [1010.0 MiB/s 1010.8 MiB/s 1011.7 MiB/s]
find_big_16/stdlib_find_char_set
                        time:   [3.3028 ms 3.3075 ms 3.3131 ms]
                        thrpt:  [1.4738 GiB/s 1.4763 GiB/s 1.4784 GiB/s]
find_big_16/stdlib_find_closure
                        time:   [3.3341 ms 3.3364 ms 3.3388 ms]
                        thrpt:  [1.4625 GiB/s 1.4635 GiB/s 1.4645 GiB/s]
find_big_16/stdlib_iter_position
                        time:   [2.2261 ms 2.2276 ms 2.2291 ms]
                        thrpt:  [2.1905 GiB/s 2.1919 GiB/s 2.1935 GiB/s]
find_big_16/memchr      time:   [628.92 µs 629.29 µs 629.68 µs]
                        thrpt:  [7.7545 GiB/s 7.7592 GiB/s 7.7638 GiB/s]

find_big_16_early_return/ascii_chars
                        time:   [961.88 ps 963.12 ps 964.20 ps]
                        thrpt:  [989.08 MiB/s 990.19 MiB/s 991.47 MiB/s]
find_big_16_early_return/stdlib_find_char_set
                        time:   [947.76 ps 950.89 ps 953.64 ps]
                        thrpt:  [1000.0 MiB/s 1002.9 MiB/s 1006.2 MiB/s]
find_big_16_early_return/stdlib_find_closure
                        time:   [1.2614 ns 1.2633 ns 1.2656 ns]
                        thrpt:  [753.55 MiB/s 754.92 MiB/s 756.07 MiB/s]
find_big_16_early_return/stdlib_iter_position
                        time:   [680.30 ps 680.87 ps 681.37 ps]
                        thrpt:  [1.3668 GiB/s 1.3678 GiB/s 1.3690 GiB/s]
find_big_16_early_return/memchr
                        time:   [565.92 µs 566.40 µs 566.84 µs]
                        thrpt:  [1.7228 KiB/s 1.7241 KiB/s 1.7256 KiB/s]

find_substring/substring
                        time:   [13.200 ms 13.211 ms 13.224 ms]
                        thrpt:  [378.11 MiB/s 378.46 MiB/s 378.78 MiB/s]
find_substring/stdlib_find_string
                        time:   [497.01 µs 497.94 µs 498.79 µs]
                        thrpt:  [9.7893 GiB/s 9.8060 GiB/s 9.8243 GiB/s]
find_substring/memchr   time:   [175.29 µs 175.48 µs 175.66 µs]
                        thrpt:  [27.797 GiB/s 27.826 GiB/s 27.855 GiB/s]

shepmaster (Owner) commented Oct 13, 2023

Hey, this is wonderful, thank you! All of the code changes look fine and I'd be happy to merge them. That being said...

It looks like jetscii is somewhat behind memchr (and sometimes std) in most cases, unfortunately.

This is quite surprising! I ran the current set of benchmarks on my Windows machine running Ubuntu inside of WSL:

test MB/s
bench::xml_delim_5_ascii_chars 13515
bench::xml_delim_5_stdlib_find_char_closure 1412
bench::xml_delim_5_stdlib_find_char_set 1812
bench::xml_delim_5_stdlib_iterator_position 4215

Comparing to the criterion benchmarks:

name MB/s
find_xml_5/ascii_chars 6765
find_xml_5/stdlib_find_closure 1438
find_xml_5/stdlib_find_char_set 1844
find_xml_5/stdlib_iter_position 4360

Something seems quite suspicious as the speed of Jetscii is almost exactly half. Was the old benchmark flawed? Is criterion doing something different?

shepmaster (Owner) commented

Hmm, CI is indicating that these can't be const yet, as the closure may need to be dropped. To my knowledge, there's no way of specifying that we only want closures that will not be dropped.

Dr-Emann (Collaborator, Author) commented

Interesting! In the case where we know we have SSE 4.2, we don't use the passed fallback at all, so it has to be dropped in the function.

We can probably do some trickery so the macro result is const (probably with a #[doc(hidden)] somewhere), but the `new` fn won't be.
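
A rough sketch of what that trickery could look like; every name here is a hypothetical stand-in, not jetscii's real types:

```rust
// Hypothetical sketch: the macro output lives in a `static` built by a hidden
// const constructor, while the public closure-taking `new` stays non-const
// (so the unused fallback can still be dropped there).
#[allow(dead_code)]
pub struct Searcher {
    needle: [u8; 16],
    len: usize,
}

impl Searcher {
    #[doc(hidden)]
    pub const fn __const_new(needle: [u8; 16], len: usize) -> Self {
        Searcher { needle, len }
    }
}

// Roughly what `jetscii! { static XML5: ... = b"<>&'\""; }` might expand to:
pub static XML5: Searcher =
    Searcher::__const_new(*b"<>&'\"\0\0\0\0\0\0\0\0\0\0\0", 5);
```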

Dr-Emann (Collaborator, Author) commented

cargo semver-checks reports:

     Parsing jetscii v0.5.3 (current)
     Parsing jetscii v0.5.3 (baseline, cached)
    Checking jetscii v0.5.3 -> v0.5.3 (no change)
   Completed [   0.007s] 51 checks; 50 passed, 1 failed, 0 unnecessary

--- failure inherent_method_must_use_added: inherent method #[must_use] added ---

Description:
An inherent method is now #[must_use]. Downstream crates that did not use its return value will get a compiler lint.
        ref: https://doc.rust-lang.org/reference/attributes/diagnostics.html#the-must_use-attribute
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.24.0/src/lints/inherent_method_must_use_added.ron

Failed in:
  method jetscii::ByteSubstring::new in /Users/zach/Development/tmp/jetscii/src/lib.rs:321
  method jetscii::ByteSubstring::find in /Users/zach/Development/tmp/jetscii/src/lib.rs:343
  method jetscii::Substring::new in /Users/zach/Development/tmp/jetscii/src/lib.rs:359
  method jetscii::Substring::find in /Users/zach/Development/tmp/jetscii/src/lib.rs:372
       Final [   0.007s] semver requires new minor version: 0 major and 1 minor checks failed

I personally think that's fine; I don't mind giving extra warnings in a minor version bump if the previous use was clearly useless.

@Dr-Emann force-pushed the updates branch 3 times, most recently from 319143a to b3468b6 on October 14, 2023 at 02:17
Dr-Emann (Collaborator, Author) commented

Something seems quite suspicious as the speed of Jetscii is almost exactly half. Was the old benchmark flawed? Is criterion doing something different?

I agree, and I can reproduce it on my x64 machine. It's really confusing: I don't see anything wrong with either benchmark, and I see no difference (within 0.2%) on my M1 Mac or if I force the fallback impl on the x64 machine. But I do see the 2x difference if I force the SSE implementation or let the runtime switch pick the SIMD implementation, so it's somehow specific to the SSE implementation.

Dr-Emann (Collaborator, Author) commented Oct 14, 2023

AHAH! I figured it out: see e8ace35.

Basically, I just needed to add some #[inline]s. Criterion compiles benchmarks as a separate crate, and some tiny functions not being eligible for cross-crate inlining made a huge impact. The criterion results were actually more accurate to what a user would have seen from using the library.
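
For context on why the #[inline]s matter: a non-generic function's body isn't available for inlining into other crates (absent LTO), so even one-line accessors become real calls across a crate boundary. A minimal illustration with a hypothetical trait, not the actual PackedCompareControl:

```rust
// In the library crate. Without #[inline], a separate benchmark crate pays a
// real function call for each of these one-line methods.
pub trait Control {
    fn needle(&self) -> u128;
    fn needle_len(&self) -> i32;
}

pub struct Bytes {
    needle: u128,
    len: i32,
}

impl Control for Bytes {
    #[inline] // makes the body available for cross-crate inlining
    fn needle(&self) -> u128 {
        self.needle
    }

    #[inline]
    fn needle_len(&self) -> i32 {
        self.len
    }
}
```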

Dr-Emann (Collaborator, Author) commented

I've updated the benchmark results comment with the new numbers.

dralley commented Oct 15, 2023

I can confirm that the inlining improves the performance in practice.

R5 3600

[dalley@localhost quick-xml]$ critcmp jetscii jetscii-fixed --filter escape
group                                       jetscii                                jetscii-fixed
-----                                       -------                                -------------
escape_text/escaped_chars_long              1.34    401.6±2.72ns        ? ?/sec    1.00    299.7±1.00ns        ? ?/sec
escape_text/escaped_chars_short             1.12    317.9±3.96ns        ? ?/sec    1.00    284.1±3.72ns        ? ?/sec
escape_text/no_chars_to_escape_long         2.02    191.4±0.51ns        ? ?/sec    1.00     94.8±0.02ns        ? ?/sec
escape_text/no_chars_to_escape_short        1.25     10.5±0.04ns        ? ?/sec    1.00      8.4±0.07ns        ? ?/sec

i7-8665U

[dalley@thinkpad quick-xml]$ critcmp jetscii jetscii-fixed --filter escape
group                                       jetscii                                jetscii-fixed
-----                                       -------                                -------------
escape_text/escaped_chars_long              1.44   559.0±55.57ns        ? ?/sec    1.00   389.5±20.32ns        ? ?/sec
escape_text/escaped_chars_short             1.01   383.4±12.31ns        ? ?/sec    1.00    379.2±8.76ns        ? ?/sec
escape_text/no_chars_to_escape_long         1.92   246.0±16.76ns        ? ?/sec    1.00    127.8±1.74ns        ? ?/sec
escape_text/no_chars_to_escape_short        1.32     12.7±0.22ns        ? ?/sec    1.00      9.6±0.73ns        ? ?/sec

shepmaster (Owner) commented

Ok, I think I have sorted out CI. I merged my branch and yours and added a little tweak to set the target, as you had it. That combined CI run was green.

I think that means that you should be able to rebase on top of my changes, make the same tweak, and then we will be good for another review!

Dr-Emann and others added 8 commits October 17, 2023 21:33
Add a test that looks for the first item in a long haystack
The memmap crate is unmaintained; use the maintained memmap2 crate instead
Structs don't need the bounds, only the implementations do
Mostly just adding #[must_use]
This speeds up the criterion benchmarks by almost 2x

I believe this is needed because e.g. Bytes::find is inlined, and calls `find`
generically, which will call PackedCompareControl methods. So the code calling
the methods will be inlined into the calling crate, but the implementations of
PackedCompareControl are not accessible to the code in the calling crate,
so they will end up as actual function calls. However, these functions are
_super_ simple, and inlining them helps a LOT, so adding `#[inline]` to these
functions, and making their implementation available to calling crates, has a
huge effect.

This was only seen when moving to criterion because previously, the nightly
benchmarks were implemented in the library crate itself, and so these functions
were already eligible for inlining. Criterion results were actually more
accurate to what callers of the crate would actually see!
Per suggestion from @BurntSushi [here](tafia/quick-xml#664 (comment))

On my M1, it appears to be slower but competitive with memchr up to memchr3,
then starts coming out ahead from 5-16 characters
Dr-Emann (Collaborator, Author) commented

I'm thinking of reverting the changes that make the constructor const, since we might eventually want to use e.g. memchr's memmem searcher, which wouldn't be const, and I'm not sure we want to commit the constructors to being const.

We may not want to be stuck with const-constructible implementations
Dr-Emann (Collaborator, Author) commented

The teddy results are pretty promising: on my machine, it seems to beat jetscii's SIMD implementation in everything but the "found a result on the first byte" test. I wonder if we can't just find a crossover point and do `if len < SMALL { fallback } else { teddy }`.
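
A sketch of that crossover, with a made-up threshold and stand-in searchers:

```rust
// Hypothetical cutoff: the real value of SMALL would come from benchmarks.
const SMALL: usize = 32;

fn fallback_find(haystack: &[u8]) -> Option<usize> {
    haystack.iter().position(|&b| matches!(b, b'<' | b'>' | b'&'))
}

fn teddy_find(haystack: &[u8]) -> Option<usize> {
    // Stand-in for the real teddy searcher.
    fallback_find(haystack)
}

fn find(haystack: &[u8]) -> Option<usize> {
    if haystack.len() < SMALL {
        fallback_find(haystack) // low startup cost wins on tiny haystacks
    } else {
        teddy_find(haystack) // high throughput wins on long haystacks
    }
}

fn main() {
    assert_eq!(find(b"a<b"), Some(1));
}
```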

BurntSushi commented

@Dr-Emann Can you share your benchmark results?

Dr-Emann (Collaborator, Author) commented Oct 18, 2023

Sure:

Skylake i5

Including only ascii_chars, "teddy" and memchr

find_last_space/ascii_chars
                        time:   [1.2851 ms 1.4119 ms 1.5502 ms]
                        thrpt:  [3.1499 GiB/s 3.4582 GiB/s 3.7995 GiB/s]
find_last_space/teddy   time:   [648.84 µs 685.98 µs 727.01 µs]
                        thrpt:  [6.7163 GiB/s 7.1180 GiB/s 7.5254 GiB/s]
find_last_space/memchr  time:   [182.70 µs 185.98 µs 188.93 µs]
                        thrpt:  [25.845 GiB/s 26.254 GiB/s 26.725 GiB/s]

find_xml_3/ascii_chars  time:   [1.5516 ms 1.6435 ms 1.7471 ms]
                        thrpt:  [2.7948 GiB/s 2.9709 GiB/s 3.1469 GiB/s]
find_xml_3/teddy        time:   [627.14 µs 645.32 µs 663.75 µs]
                        thrpt:  [7.3564 GiB/s 7.5665 GiB/s 7.7858 GiB/s]
find_xml_3/memchr       time:   [492.12 µs 513.30 µs 537.36 µs]
                        thrpt:  [9.0866 GiB/s 9.5126 GiB/s 9.9220 GiB/s]

find_xml_5/ascii_chars  time:   [1.0851 ms 1.1917 ms 1.3254 ms]
                        thrpt:  [3.6841 GiB/s 4.0975 GiB/s 4.4997 GiB/s]
find_xml_5/teddy        time:   [268.58 µs 271.36 µs 274.06 µs]
                        thrpt:  [17.817 GiB/s 17.994 GiB/s 18.180 GiB/s]
find_xml_5/memchr       time:   [735.38 µs 794.74 µs 862.71 µs]
                        thrpt:  [5.6598 GiB/s 6.1439 GiB/s 6.6399 GiB/s]

find_big_16/ascii_chars time:   [1.3726 ms 1.4775 ms 1.6049 ms]
                        thrpt:  [3.0425 GiB/s 3.3048 GiB/s 3.5573 GiB/s]
find_big_16/teddy       time:   [272.21 µs 275.01 µs 278.07 µs]
                        thrpt:  [17.560 GiB/s 17.755 GiB/s 17.937 GiB/s]
find_big_16/memchr      time:   [2.6971 ms 2.8265 ms 2.9630 ms]
                        thrpt:  [1.6479 GiB/s 1.7275 GiB/s 1.8104 GiB/s]

find_big_16_early_return/ascii_chars
                        time:   [16.542 ns 17.138 ns 17.735 ns]
                        thrpt:  [53.773 MiB/s 55.647 MiB/s 57.652 MiB/s]
find_big_16_early_return/teddy
                        time:   [83.147 ns 85.570 ns 87.869 ns]
                        thrpt:  [10.853 MiB/s 11.145 MiB/s 11.470 MiB/s]
find_big_16_early_return/memchr
                        time:   [3.0589 ms 3.1347 ms 3.2110 ms]
                        thrpt:  [311.43   B/s 319.01   B/s 326.92   B/s]

Somehow teddy got a lot faster going from 3 patterns to 5?

M1 Mac
find_last_space/ascii_chars
                        time:   [4.8982 ms 4.9039 ms 4.9105 ms]
                        thrpt:  [1018.2 MiB/s 1019.6 MiB/s 1020.8 MiB/s]
find_last_space/teddy   time:   [213.33 µs 218.76 µs 225.05 µs]
                        thrpt:  [21.696 GiB/s 22.320 GiB/s 22.888 GiB/s]
find_last_space/memchr  time:   [62.353 µs 62.567 µs 62.814 µs]
                        thrpt:  [77.735 GiB/s 78.041 GiB/s 78.309 GiB/s]

find_xml_3/ascii_chars  time:   [5.1492 ms 5.1767 ms 5.2203 ms]
                        thrpt:  [957.80 MiB/s 965.86 MiB/s 971.02 MiB/s]
find_xml_3/teddy        time:   [214.30 µs 214.78 µs 215.23 µs]
                        thrpt:  [22.687 GiB/s 22.734 GiB/s 22.785 GiB/s]
find_xml_3/memchr       time:   [156.83 µs 157.77 µs 158.84 µs]
                        thrpt:  [30.741 GiB/s 30.949 GiB/s 31.134 GiB/s]

find_xml_5/ascii_chars  time:   [4.8970 ms 4.9074 ms 4.9205 ms]
                        thrpt:  [1016.2 MiB/s 1018.9 MiB/s 1021.0 MiB/s]
find_xml_5/teddy        time:   [205.09 µs 205.62 µs 206.19 µs]
                        thrpt:  [23.681 GiB/s 23.747 GiB/s 23.808 GiB/s]
find_xml_5/memchr       time:   [267.96 µs 268.15 µs 268.38 µs]
                        thrpt:  [18.194 GiB/s 18.209 GiB/s 18.222 GiB/s]

find_big_16/ascii_chars time:   [4.8841 ms 4.8865 ms 4.8892 ms]
                        thrpt:  [1022.7 MiB/s 1023.2 MiB/s 1023.7 MiB/s]
find_big_16/teddy       time:   [204.06 µs 204.21 µs 204.41 µs]
                        thrpt:  [23.888 GiB/s 23.911 GiB/s 23.928 GiB/s]
find_big_16/memchr      time:   [648.66 µs 648.89 µs 649.21 µs]
                        thrpt:  [7.5212 GiB/s 7.5248 GiB/s 7.5276 GiB/s]

find_big_16_early_return/ascii_chars
                        time:   [940.90 ps 942.13 ps 943.39 ps]
                        thrpt:  [1010.9 MiB/s 1012.3 MiB/s 1013.6 MiB/s]
find_big_16_early_return/teddy
                        time:   [9.7040 ns 9.7161 ns 9.7293 ns]
                        thrpt:  [98.021 MiB/s 98.154 MiB/s 98.277 MiB/s]
find_big_16_early_return/memchr
                        time:   [590.33 µs 591.29 µs 592.35 µs]
                        thrpt:  [1.6486 KiB/s 1.6516 KiB/s 1.6543 KiB/s]

No big jump in teddy performance here, but it still becomes the fastest somewhere between 3 and 5 characters, at least for this case.

BurntSushi commented

Somehow teddy got a lot faster going from 3 patterns to 5?

Interesting. If I get a chance tomorrow, I'll port your benchmark into aho-corasick's rebar benchmark suite and see if I can do some analysis for you.

BurntSushi added a commit to BurntSushi/aho-corasick that referenced this pull request Oct 20, 2023
There was some discussion about how to compare jetscii with Teddy and
some interesting benchmark results[1]. I decided to import the
benchmarks and see what things look like here.

[1]: shepmaster/jetscii#57

BurntSushi commented

All righty, here we go.

I started by checking out this PR and running the benchmarks as written on my i9-12900K x86-64 CPU and my M2 mac mini aarch64 CPU. This should give us a comparison point with which to ground ourselves. On x86-64 (i9-12900K):

$ critcmp base -g '([^/]+/)(?:memchr|ascii|teddy).*' -f 'find_(big_16|last_space|xml)'
group                        base/ascii_chars                       base/memchr                                base/teddy
-----                        ----------------                       -----------                                ----------
find_last_space/             6.59   390.4±10.64µs    12.5 GB/sec    1.00     59.3±0.85µs    82.4 GB/sec        1.44     85.4±1.60µs    57.2 GB/sec
find_xml_3/                  5.17    381.6±5.27µs    12.8 GB/sec    1.00     73.8±1.60µs    66.1 GB/sec        1.08     79.6±0.82µs    61.4 GB/sec
find_xml_5/                  4.88    386.6±6.72µs    12.6 GB/sec    1.84    145.9±8.94µs    33.5 GB/sec        1.00     79.2±0.84µs    61.7 GB/sec
find_big_16/                 4.84    382.9±6.51µs    12.8 GB/sec    5.27    416.5±4.98µs    11.7 GB/sec        1.00     79.1±0.67µs    61.8 GB/sec
find_big_16_early_return/    1.00      2.6±0.11ns   369.9 MB/sec    141566.54   365.0±8.07µs     2.7 KB/sec    4.94     12.7±0.06ns    74.9 MB/sec

And on aarch64 (M2 mac mini):

$ critcmp base -g '([^/]+/)(?:memchr|ascii|teddy).*' -f 'find_(big_16|last_space|xml)'
group                        base/ascii_chars                       base/memchr                                base/teddy
-----                        ----------------                       -----------                                ----------
find_last_space/             77.06     4.5±0.00ms  1112.1 MB/sec    1.00     58.3±0.17µs    83.7 GB/sec        3.21    187.5±0.54µs    26.0 GB/sec
find_xml_3/                  31.93     4.5±0.00ms  1112.1 MB/sec    1.00    140.8±0.39µs    34.7 GB/sec        1.33    187.5±0.72µs    26.0 GB/sec
find_xml_5/                  23.98     4.5±0.00ms  1112.2 MB/sec    1.32    246.5±0.58µs    19.8 GB/sec        1.00    187.4±0.56µs    26.0 GB/sec
find_big_16/                 23.98     4.5±0.02ms  1111.7 MB/sec    3.19    597.5±1.59µs     8.2 GB/sec        1.00    187.6±0.89µs    26.0 GB/sec
find_big_16_early_return/    1.00      0.9±0.01ns  1052.6 MB/sec    598135.90   541.9±1.84µs     1845 B/sec    7.79      7.1±0.02ns   135.1 MB/sec

I then ported these benchmarks into aho-corasick's rebar benchmark suite. I didn't bother with porting the memchr benchmarks, since it's a little tricky to write that generically. From the root of aho-corasick's repository:

$ rebar build
$ rebar measure -f '^jetscii/' -e rust/aho-corasick/packed -e jetscii -t # for testing
$ rebar measure -f '^jetscii/' -e rust/aho-corasick/packed -e jetscii | tee tmp/results.csv
$ rebar cmp tmp/results.csv -f repeateda
benchmark                          rust/aho-corasick/packed/leftmost-first  rust/jetscii/ascii-chars/prebuilt
---------                          ---------------------------------------  ---------------------------------
jetscii/space-repeateda            53.9 GB/s (1.00x)                        12.9 GB/s (4.17x)
jetscii/xmldelim3-repeateda        56.9 GB/s (1.00x)                        12.9 GB/s (4.40x)
jetscii/xmldelim5-repeateda        58.6 GB/s (1.00x)                        12.9 GB/s (4.54x)
jetscii/big16-repeateda            59.0 GB/s (1.00x)                        12.9 GB/s (4.57x)
jetscii/big16earlyshort-repeateda  127.2 MB/s (1.00x)                       73.4 MB/s (1.73x)
jetscii/big16earlylong-repeateda   529.8 MB/s (1.42x)                       752.9 MB/s (1.00x)

And on aarch64 (M2 mac mini):

$ rebar cmp tmp/results.csv -f repeateda
benchmark                          rust/aho-corasick/packed/leftmost-first  rust/jetscii/ascii-chars/prebuilt
---------                          ---------------------------------------  ---------------------------------
jetscii/space-repeateda            26.1 GB/s (1.00x)                        1666.7 MB/s (16.03x)
jetscii/xmldelim3-repeateda        26.1 GB/s (1.00x)                        1666.7 MB/s (16.03x)
jetscii/xmldelim5-repeateda        26.1 GB/s (1.00x)                        1666.7 MB/s (16.03x)
jetscii/big16-repeateda            26.1 GB/s (1.00x)                        1666.7 MB/s (16.03x)
jetscii/big16earlyshort-repeateda  1907.3 MB/s (1.00x)                      1907.3 MB/s (1.00x)
jetscii/big16earlylong-repeateda   14.0 GB/s (1.00x)                        14.0 GB/s (1.00x)

So the first thing that jumps out at me is that Teddy has pretty consistent timings for space, xmldelim3, xmldelim5 and big16 in both rebar and Criterion. In other words, I'm not able to reproduce your dip in find_xml_3/xmldelim3.

The other interesting thing is the discrepancy in timings for the "early return" benchmark in Criterion versus rebar. I did have to tweak them somewhat, since the benchmarks I have look for all counts instead of just the first one. (rebar doesn't require this, and I could make it only do one search, but it didn't seem worth it.) So I split it into two: one benchmark whose haystack is Pa and another whose haystack is format!("P{}", "a".repeat(14)). In the former case, Teddy is a little faster and in the latter case, jetscii is a little faster. This is definitely a "latency sensitive" benchmark where the timings are essentially a reflection of the overhead of a search call.

It is indeed somewhat common to try to optimize for the latency case by doing case analysis on the length of the haystack. memchr does this for its memmem search routine, for example. The catch is that as you add case analysis, the overhead of your search routine, and thus performance on latency sensitive workloads, tends to get worse. So in effect, there is a balancing point one might want to achieve. Ideally, Teddy would just do this automatically, and it does kind of try. Namely, for very short haystacks, it uses Rabin-Karp. I haven't spent a ton of time optimizing that case though.

One of the unfortunate things about latency sensitive workloads is that you are somewhat hosed already. You tend to burn a lot of time starting and stopping the search routine very frequently. This is why if you take just about any optimized search routine with high throughput and execute a search that has a high match count, that throughput will drop precipitously and you might not do much better (perhaps even worse) than the naive approach. The naive approach tends to do very well in latency sensitive workloads.

I don't have any good guesses as to the reason for the discrepancy in the "early return" benchmarks between Criterion and rebar. It's worth pointing out that we're dealing with low-single-digit nanosecond timings here, so even a small amount of noise could explain this discrepancy.

Another thing I did here was take a step back and look at the benchmark itself. At least the XML delimiter benchmarks look like they ought to be searching XML data. Instead, the benchmarks (except for the new "early return" one you added) are essentially all best cases for throughput: the haystack consists of the same byte repeated with no match, and finally followed by a single matching byte at the end. But what happens if we search a real XML document?

I picked a fairly small XML document describing some mental health stuff and ran the same benchmarks on it for x86-64:

$ rebar cmp tmp/results.csv -f mentalhealth
benchmark                       rust/aho-corasick/packed/leftmost-first  rust/jetscii/ascii-chars/prebuilt
---------                       ---------------------------------------  ---------------------------------
jetscii/space-mentalhealth      1201.2 MB/s (1.00x)                      1206.6 MB/s (1.00x)
jetscii/xmldelim3-mentalhealth  2.3 GB/s (1.00x)                         2.1 GB/s (1.07x)
jetscii/xmldelim5-mentalhealth  1928.2 MB/s (1.02x)                      1960.7 MB/s (1.00x)
jetscii/big16-mentalhealth      5.2 GB/s (1.00x)                         4.5 GB/s (1.16x)

And aarch64 (M2 mac mini):

$ rebar cmp tmp/results.csv -f mentalhealth
benchmark                       rust/aho-corasick/packed/leftmost-first  rust/jetscii/ascii-chars/prebuilt
---------                       ---------------------------------------  ---------------------------------
jetscii/space-mentalhealth      728.9 MB/s (1.55x)                       1129.7 MB/s (1.00x)
jetscii/xmldelim3-mentalhealth  1425.9 MB/s (1.01x)                      1446.2 MB/s (1.00x)
jetscii/xmldelim5-mentalhealth  1298.8 MB/s (1.11x)                      1436.0 MB/s (1.00x)
jetscii/big16-mentalhealth      3.5 GB/s (1.00x)                         1453.9 MB/s (2.49x)

Things here are quite a bit more competitive. My guess as to why this is, is because the benchmarks shift more towards latency sensitive workloads versus the pure throughput benchmarks in this PR. Essentially, as you move more and more towards latency sensitive workloads, differences between search routines tend to shrink, especially when comparing something that has very high throughput (Teddy) versus something that is less so (jetscii). Indeed, as I understand it, jetscii is based on the pcmpestri intrinsic from SSE4.2, and it's somewhat of a tortured technique because of its (documented) extremely high latency (18 cycles as written in the pcmpestri docs). This is why you won't find it used in Hyperscan at all. Here, it does look decentish for looking for a small set of bytes, but it is absolutely terrible for substring search. From the memchr repo (which already had jetscii in its rebar benchmarks) on x86-64:

$ rebar cmp benchmarks/record/x86_64/2023-08-25.csv -e jetscii/memmem/prebuilt -e rust/memchr/memmem/prebuilt --intersection
benchmark                                                   rust/jetscii/memmem/prebuilt  rust/memchr/memmem/prebuilt
---------                                                   ----------------------------  ---------------------------
memmem/byterank/binary                                      2.9 GB/s (1.50x)              4.4 GB/s (1.00x)
memmem/code/rust-library-never-fn-strength                  6.9 GB/s (7.78x)              53.4 GB/s (1.00x)
memmem/code/rust-library-never-fn-strength-paren            6.8 GB/s (7.87x)              53.9 GB/s (1.00x)
memmem/code/rust-library-never-fn-quux                      6.9 GB/s (8.22x)              56.4 GB/s (1.00x)
memmem/code/rust-library-rare-fn-from-str                   7.0 GB/s (7.60x)              53.3 GB/s (1.00x)
memmem/code/rust-library-common-fn-is-empty                 6.9 GB/s (7.66x)              52.6 GB/s (1.00x)
memmem/code/rust-library-common-fn                          5.7 GB/s (5.00x)              28.4 GB/s (1.00x)
memmem/code/rust-library-common-paren                       2.3 GB/s (2.12x)              4.8 GB/s (1.00x)
memmem/code/rust-library-common-let                         4.6 GB/s (4.30x)              19.8 GB/s (1.00x)
memmem/pathological/md5-huge-no-hash                        5.8 GB/s (7.15x)              41.6 GB/s (1.00x)
memmem/pathological/md5-huge-last-hash                      6.1 GB/s (8.07x)              49.3 GB/s (1.00x)
memmem/pathological/rare-repeated-huge-tricky               7.7 GB/s (8.15x)              63.2 GB/s (1.00x)
memmem/pathological/rare-repeated-huge-match                1351.7 MB/s (1.34x)           1811.6 MB/s (1.00x)
memmem/pathological/rare-repeated-small-tricky              7.2 GB/s (3.61x)              25.9 GB/s (1.00x)
memmem/pathological/rare-repeated-small-match               1305.9 MB/s (1.40x)           1821.8 MB/s (1.00x)
memmem/pathological/defeat-simple-vector-alphabet           4.9 GB/s (1.00x)              4.1 GB/s (1.22x)
memmem/pathological/defeat-simple-vector-freq-alphabet      7.8 GB/s (2.48x)              19.2 GB/s (1.00x)
memmem/pathological/defeat-simple-vector-repeated-alphabet  137.3 MB/s (9.03x)            1239.8 MB/s (1.00x)
memmem/subtitles/common/huge-en-that                        4.6 GB/s (8.18x)              37.4 GB/s (1.00x)
memmem/subtitles/common/huge-en-you                         3.5 GB/s (4.42x)              15.2 GB/s (1.00x)
memmem/subtitles/common/huge-en-one-space                   590.8 MB/s (2.31x)            1365.2 MB/s (1.00x)
memmem/subtitles/common/huge-ru-that                        3.7 GB/s (10.09x)             37.2 GB/s (1.00x)
memmem/subtitles/common/huge-ru-not                         1859.8 MB/s (8.28x)           15.0 GB/s (1.00x)
memmem/subtitles/common/huge-ru-one-space                   954.4 MB/s (2.80x)            2.6 GB/s (1.00x)
memmem/subtitles/common/huge-zh-that                        5.9 GB/s (6.58x)              38.6 GB/s (1.00x)
memmem/subtitles/common/huge-zh-do-not                      3.6 GB/s (5.36x)              19.4 GB/s (1.00x)
memmem/subtitles/common/huge-zh-one-space                   1887.2 MB/s (2.54x)           4.7 GB/s (1.00x)
memmem/subtitles/never/huge-en-john-watson                  7.7 GB/s (6.64x)              51.1 GB/s (1.00x)
memmem/subtitles/never/huge-en-all-common-bytes             6.3 GB/s (8.44x)              52.8 GB/s (1.00x)
memmem/subtitles/never/huge-en-some-rare-bytes              7.7 GB/s (8.27x)              63.5 GB/s (1.00x)
memmem/subtitles/never/huge-en-two-space                    4.0 GB/s (15.79x)             63.6 GB/s (1.00x)
memmem/subtitles/never/teeny-en-john-watson                 1161.0 MB/s (1.53x)           1780.2 MB/s (1.00x)
memmem/subtitles/never/teeny-en-all-common-bytes            1161.0 MB/s (1.53x)           1780.2 MB/s (1.00x)
memmem/subtitles/never/teeny-en-some-rare-bytes             1112.6 MB/s (1.60x)           1780.2 MB/s (1.00x)
memmem/subtitles/never/teeny-en-two-space                   1161.0 MB/s (1.53x)           1780.2 MB/s (1.00x)
memmem/subtitles/never/huge-ru-john-watson                  2.6 GB/s (24.44x)             63.7 GB/s (1.00x)
memmem/subtitles/never/teeny-ru-john-watson                 1602.2 MB/s (1.56x)           2.4 GB/s (1.00x)
memmem/subtitles/never/huge-zh-john-watson                  6.6 GB/s (9.10x)              60.1 GB/s (1.00x)
memmem/subtitles/never/teeny-zh-john-watson                 1285.4 MB/s (1.53x)           1970.9 MB/s (1.00x)
memmem/subtitles/rare/huge-en-sherlock-holmes               7.5 GB/s (8.26x)              62.4 GB/s (1.00x)
memmem/subtitles/rare/huge-en-sherlock                      7.5 GB/s (8.04x)              60.6 GB/s (1.00x)
memmem/subtitles/rare/huge-en-medium-needle                 6.3 GB/s (8.83x)              55.2 GB/s (1.00x)
memmem/subtitles/rare/huge-en-long-needle                   7.0 GB/s (6.38x)              44.5 GB/s (1.00x)
memmem/subtitles/rare/huge-en-huge-needle                   7.6 GB/s (6.11x)              46.4 GB/s (1.00x)
memmem/subtitles/rare/teeny-en-sherlock-holmes              953.7 MB/s (1.65x)            1570.8 MB/s (1.00x)
memmem/subtitles/rare/teeny-en-sherlock                     920.8 MB/s (1.38x)            1271.6 MB/s (1.00x)
memmem/subtitles/rare/huge-ru-sherlock-holmes               2.6 GB/s (24.19x)             63.4 GB/s (1.00x)
memmem/subtitles/rare/huge-ru-sherlock                      2.6 GB/s (23.83x)             61.9 GB/s (1.00x)
memmem/subtitles/rare/teeny-ru-sherlock-holmes              1381.2 MB/s (1.53x)           2.1 GB/s (1.00x)
memmem/subtitles/rare/teeny-ru-sherlock                     1381.2 MB/s (1.16x)           1602.2 MB/s (1.00x)
memmem/subtitles/rare/huge-zh-sherlock-holmes               5.6 GB/s (9.93x)              55.6 GB/s (1.00x)
memmem/subtitles/rare/huge-zh-sherlock                      5.6 GB/s (6.97x)              38.9 GB/s (1.00x)
memmem/subtitles/rare/teeny-zh-sherlock-holmes              1055.9 MB/s (1.00x)           1055.9 MB/s (1.00x)
memmem/subtitles/rare/teeny-zh-sherlock                     1055.9 MB/s (1.08x)           1137.1 MB/s (1.00x)

The only case that memchr::memmem does worse on is a pathological benchmark that I specifically constructed to defeat its heuristics. And even then, it does decently compared to pcmpestri. Same deal on aarch64 (which doesn't have anything like pcmpestri and thus I believe forces jetscii into naive substring search):


$ rebar cmp benchmarks/record/aarch64/2023-08-27.csv -e jetscii/memmem/prebuilt -e rust/memchr/memmem/prebuilt --intersection
benchmark                                                   rust/jetscii/memmem/prebuilt  rust/memchr/memmem/prebuilt
---------                                                   ----------------------------  ---------------------------
memmem/byterank/binary                                      335.2 MB/s (9.55x)            3.1 GB/s (1.00x)
memmem/code/rust-library-never-fn-strength                  401.0 MB/s (77.75x)           30.4 GB/s (1.00x)
memmem/code/rust-library-never-fn-strength-paren            401.0 MB/s (76.23x)           29.9 GB/s (1.00x)
memmem/code/rust-library-never-fn-quux                      401.0 MB/s (77.24x)           30.2 GB/s (1.00x)
memmem/code/rust-library-rare-fn-from-str                   633.8 MB/s (47.66x)           29.5 GB/s (1.00x)
memmem/code/rust-library-common-fn-is-empty                 401.0 MB/s (75.50x)           29.6 GB/s (1.00x)
memmem/code/rust-library-common-fn                          401.0 MB/s (47.33x)           18.5 GB/s (1.00x)
memmem/code/rust-library-common-paren                       393.9 MB/s (8.18x)            3.1 GB/s (1.00x)
memmem/code/rust-library-common-let                         386.2 MB/s (34.63x)           13.1 GB/s (1.00x)
memmem/pathological/md5-huge-no-hash                        662.1 MB/s (39.62x)           25.6 GB/s (1.00x)
memmem/pathological/md5-huge-last-hash                      644.1 MB/s (40.73x)           25.6 GB/s (1.00x)
memmem/pathological/rare-repeated-huge-tricky               418.4 MB/s (76.20x)           31.1 GB/s (1.00x)
memmem/pathological/rare-repeated-huge-match                1450.4 MB/s (1.37x)           1980.7 MB/s (1.00x)
memmem/pathological/rare-repeated-small-tricky              416.9 MB/s (54.52x)           22.2 GB/s (1.00x)
memmem/pathological/rare-repeated-small-match               1431.2 MB/s (1.33x)           1909.3 MB/s (1.00x)
memmem/pathological/defeat-simple-vector-alphabet           359.3 MB/s (8.59x)            3.0 GB/s (1.00x)
memmem/pathological/defeat-simple-vector-freq-alphabet      660.3 MB/s (23.52x)           15.2 GB/s (1.00x)
memmem/pathological/defeat-simple-vector-repeated-alphabet  173.8 MB/s (4.80x)            835.1 MB/s (1.00x)
memmem/subtitles/common/huge-en-that                        356.7 MB/s (45.56x)           15.9 GB/s (1.00x)
memmem/subtitles/common/huge-en-you                         392.6 MB/s (20.99x)           8.0 GB/s (1.00x)
memmem/subtitles/common/huge-en-one-space                   273.3 MB/s (2.63x)            717.6 MB/s (1.00x)
memmem/subtitles/common/huge-ru-that                        317.9 MB/s (60.57x)           18.8 GB/s (1.00x)
memmem/subtitles/common/huge-ru-not                         245.8 MB/s (41.39x)           9.9 GB/s (1.00x)
memmem/subtitles/common/huge-ru-one-space                   332.4 MB/s (2.96x)            984.3 MB/s (1.00x)
memmem/subtitles/common/huge-zh-that                        403.5 MB/s (49.09x)           19.3 GB/s (1.00x)
memmem/subtitles/common/huge-zh-do-not                      358.9 MB/s (31.78x)           11.1 GB/s (1.00x)
memmem/subtitles/common/huge-zh-one-space                   382.4 MB/s (6.90x)            2.6 GB/s (1.00x)
memmem/subtitles/never/huge-en-john-watson                  417.8 MB/s (75.68x)           30.9 GB/s (1.00x)
memmem/subtitles/never/huge-en-all-common-bytes             382.3 MB/s (61.10x)           22.8 GB/s (1.00x)
memmem/subtitles/never/huge-en-some-rare-bytes              417.8 MB/s (75.68x)           30.9 GB/s (1.00x)
memmem/subtitles/never/huge-en-two-space                    304.7 MB/s (113.48x)          33.8 GB/s (1.00x)
memmem/subtitles/never/teeny-en-john-watson                 635.8 MB/s (42.00x)           26.1 GB/s (1.00x)
memmem/subtitles/never/teeny-en-all-common-bytes            635.8 MB/s (42.00x)           26.1 GB/s (1.00x)
memmem/subtitles/never/teeny-en-some-rare-bytes             635.8 MB/s (42.00x)           26.1 GB/s (1.00x)
memmem/subtitles/never/teeny-en-two-space                   321.7 MB/s (83.00x)           26.1 GB/s (1.00x)
memmem/subtitles/never/huge-ru-john-watson                  644.9 MB/s (48.17x)           30.3 GB/s (1.00x)
memmem/subtitles/never/teeny-ru-john-watson                 953.7 MB/s (42.00x)           39.1 GB/s (1.00x)
memmem/subtitles/never/huge-zh-john-watson                  390.0 MB/s (76.61x)           29.2 GB/s (1.00x)
memmem/subtitles/never/teeny-zh-john-watson                 703.9 MB/s (42.00x)           28.9 GB/s (1.00x)
memmem/subtitles/rare/huge-en-sherlock-holmes               414.8 MB/s (74.88x)           30.3 GB/s (1.00x)
memmem/subtitles/rare/huge-en-sherlock                      414.8 MB/s (75.52x)           30.6 GB/s (1.00x)
memmem/subtitles/rare/huge-en-medium-needle                 637.5 MB/s (45.49x)           28.3 GB/s (1.00x)
memmem/subtitles/rare/huge-en-long-needle                   651.7 MB/s (51.53x)           32.8 GB/s (1.00x)
memmem/subtitles/rare/huge-en-huge-needle                   636.1 MB/s (52.91x)           32.9 GB/s (1.00x)
memmem/subtitles/rare/teeny-en-sherlock-holmes              635.8 MB/s (42.00x)           26.1 GB/s (1.00x)
memmem/subtitles/rare/teeny-en-sherlock                     635.8 MB/s (42.00x)           26.1 GB/s (1.00x)
memmem/subtitles/rare/huge-ru-sherlock-holmes               644.9 MB/s (48.17x)           30.3 GB/s (1.00x)
memmem/subtitles/rare/huge-ru-sherlock                      254.3 MB/s (121.56x)          30.2 GB/s (1.00x)
memmem/subtitles/rare/teeny-ru-sherlock-holmes              953.7 MB/s (42.00x)           39.1 GB/s (1.00x)
memmem/subtitles/rare/teeny-ru-sherlock                     953.7 MB/s (42.00x)           39.1 GB/s (1.00x)
memmem/subtitles/rare/huge-zh-sherlock-holmes               640.8 MB/s (46.13x)           28.9 GB/s (1.00x)
memmem/subtitles/rare/huge-zh-sherlock                      358.9 MB/s (84.32x)           29.6 GB/s (1.00x)
memmem/subtitles/rare/teeny-zh-sherlock-holmes              721.1 MB/s (41.00x)           28.9 GB/s (1.00x)
memmem/subtitles/rare/teeny-zh-sherlock                     703.9 MB/s (42.00x)           28.9 GB/s (1.00x)

So at least with respect to substring search, it's hard to imagine any general circumstance in which you'd want to use jetscii over memchr::memmem. For searching for sets of bytes greater than 3, the benchmark results are a bit less clear with some minor open questions. I would say that Teddy's throughput appears to be definitively better than jetscii though. But if you're searching XML for delimiters, then you're in "latency sensitive" territory probably due to how frequent the delimiters are likely to occur. (Unless you commonly search XML documents with huge data relative to the markup.) In that case, my suspicion is that your best bet is to write a bespoke SIMD algorithm. If you don't want to go down that path, then I would pick a sample of representative XML documents and bake-off Teddy versus jetscii versus memchr versus something more naive.

shepmaster (Owner) commented

Thanks @BurntSushi, that is an awesome writeup! I'll try to respond to points that caught my eye or I think I can be useful on...

find_big_16_early_return

The differences of 369.9 MB/sec vs 2.7 KB/sec vs 74.9 MB/sec really feels like it has to be some kind of testing error just based on how huge a difference they are. 😉

At least the XML delimiter benchmarks look like they ought to be searching XML data

So my pet project since Rust 0.12 has been to write an XML parser and related machinery. Sometime in the last 15 years, I read a post (I want to say from Daniel Lemire, but I can't find it now) that talked about using PCMPSTRx for XML (and Wikipedia backs up my memory that those were even made to help XML). That's what spawned Jetscii (and cupid, peresil). My goal is all about making it go faster.

Amusingly, in my current rewrite, I use memchr[3]. On the XML file you posted, I parsed 1058794 tokens and wrote them back out to /dev/null in 78.7ms (roughly 90 MB/sec). My go-to testing files are Wikipedia dumps, and I have a 218M example that does 33428805 tokens in 1.7s (roughly 130 MB/sec). xmllint (which isn't a one-to-one comparison) takes 2.1s.

but it is absolutely terrible for substring search.

Yep, this was never really a goal for the project; it was mostly a matter of someone (maybe me?) saying "oh, you could use it for both cases!", and I added the code.

on aarch64 (which doesn't have anything like pcmpestri

Right, which is part of the reason I went with memchr in my current rewrite — my main development is on an M1 now, so any actual speed benefit from Jetscii won't help me anymore. Also, my brain doesn't really think in SIMD intrinsics, so I can't look at what aarch64 has available and construct something at all similar.

hard to imagine any general circumstance in which you'd want to use jetscii over memchr::memmem.

Totally makes sense to me.

your best bet is to write a bespoke SIMD algorithm

I've had some back-of-mind idea to perform some sort of en-masse lookup table... thing.

Specifically, in my new implementation, I have a fixed-size buffer. One thing I could see doing is searching the entire buffer for every & / < / > / etc. and then stuffing those results into a bitmask. I could then do some bit twiddling to quickly find the next relevant special character. I haven't done any benchmarking of this, nor do I have a great idea for an x86_64 and aarch64 SIMD way to find all those bytes in the first place. I'm certainly not hoping that the really smart people reading this get nerd-sniped into helping me solve that problem 😇.
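
For what it's worth, here is a scalar sketch of the bitmask idea (no SIMD; a real version would fill the mask with vector compares, and the byte set here is just an example):

```rust
// Mark each "special" byte in (up to) a 64-byte chunk as a bit in a u64,
// then walk the matches with trailing_zeros.
fn special_mask(chunk: &[u8]) -> u64 {
    let mut mask = 0u64;
    for (i, &b) in chunk.iter().take(64).enumerate() {
        if matches!(b, b'&' | b'<' | b'>') {
            mask |= 1 << i;
        }
    }
    mask
}

fn main() {
    let buf = b"text <tag a='v'>&amp; more";
    let mut mask = special_mask(buf);
    while mask != 0 {
        let idx = mask.trailing_zeros() as usize;
        println!("special byte {:?} at offset {}", buf[idx] as char, idx);
        mask &= mask - 1; // clear the lowest set bit
    }
}
```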

shepmaster (Owner) commented

@Dr-Emann What do you see as our next steps forward?

Dr-Emann (Collaborator, Author) commented

I think this PR is still a good first step: just some modernization and benchmarking updates.

I think the next step would be replacing the substring search with memchr::memmem, or maybe deprecating it in favor of memmem entirely.
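
For reference, a minimal sketch of what delegating to memchr could look like (memchr 2.x memmem API; jetscii's wrapper types omitted):

```rust
use memchr::memmem;

fn main() {
    // A reusable searcher, analogous to jetscii's Substring/ByteSubstring.
    let finder = memmem::Finder::new("needle");
    assert_eq!(finder.find(b"hay needle stack"), Some(4));
    assert_eq!(finder.find(b"no match here"), None);
}
```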

After that, I'd like to make a breaking change: switch the fallback from fn(u8) -> bool to fn(&[u8]) -> Option<usize>, avoiding the per-character dynamic call overhead in the fallback case, which would at least make things a little more competitive on aarch64. I'd also like to select the algorithm at construction time rather than at search time, and maybe add a special case that uses memchr for sets of 1, 2, or 3 chars.

I don't really have any experience making SIMD algorithms (yet?), so I don't have much input on the bespoke SIMD algorithm part.
