Misc Updates #57
Conversation
Here are the benchmark results for my machines.
Windows x64 Benchmark Results
Environment details: Windows 10, Intel(R) Core(TM) i5-6600K CPU @ 3.50GHz
macOS M1 Pro Benchmarks
Hey, this is wonderful, thank you! All of the code changes look fine and I'd be happy to merge them. That being said...
This is quite surprising! I ran the current set of benchmarks on my Windows machine running Ubuntu inside of WSL:
Comparing to the criterion benchmarks:
Something seems quite suspicious, as the speed of Jetscii is almost exactly half. Was the old benchmark flawed? Is criterion doing something different?
Hmm, CI is indicating that these can't be const yet, as the closure may need to be dropped. To my knowledge, there's no way of specifying that we only want closures that will not be dropped.
Interesting! In the case where we know we have SSE 4.2, we don't use the passed fallback at all, so it has to be dropped in the function. Probably can just do some trickery so the macro result is const (probably with a
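For context, here is a distilled sketch of the error CI is hitting; the names are invented and this is not jetscii's actual API. A generic value that is only conditionally stored has to be dropped inside the function, and a `const fn` cannot evaluate destructors of generic types:

```rust
// Minimal illustration (hypothetical names). This is rejected with:
// error[E0493]: destructor of `F` cannot be evaluated at compile-time
// because on the `fast` branch, `fallback` goes out of scope and
// would need to be dropped inside the const fn.
const fn pick<F: Fn(u8) -> bool>(fast: bool, fallback: F) -> Option<F> {
    if fast {
        None // `fallback` is dropped here: not allowed in const fn
    } else {
        Some(fallback)
    }
}
```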
I personally think that's fine, I don't mind giving extra warnings in a minor version bump if the previous use was clearly useless.
Force-pushed from 319143a to b3468b6
I agree, and I can reproduce it on my x64 machine. It's really confusing: I don't see anything wrong with either benchmark, and I don't see any difference (within 0.2%) on my M1 Mac, or if I force the fallback impl on the x64 machine. But I do see the 2x difference if I force the SSE implementation or let the runtime switch pick the SIMD implementation, so it's somehow specific to the SSE implementation.
AHAH! I figured it out: see e8ace35. Basically, I just needed to add some `#[inline]` attributes.
Updated the benchmark results comment with the new results.
I can confirm that the inlining improves the performance in practice.
R5 3600
i7-8665U
Ok, I think I have sorted out CI. I merged my branch and yours and added a little tweak to set the target, as you had it. That combined CI run was green. I think that means that you should be able to rebase on top of my changes, make the same tweak, and then we will be good for another review!
Remove unneeded `extern crate`s
Also, add some benchmarks against memchr
Add a test that looks for the first item in a long haystack
The memmap crate is unmaintained; use the maintained memmap2 crate instead
Structs don't need the bounds, only the implementations
Mostly just adding #[must_use]
This speeds up the criterion benchmarks by almost 2x. I believe this is needed because e.g. Bytes::find is inlined, and calls `find` generically, which will call PackedCompareControl methods. So the code calling the methods will be inlined into the calling crate, but the implementations of PackedCompareControl are not accessible to the code in the calling crate, so they will end up as actual function calls. However, these functions are _super_ simple and inlining them helps a LOT, so adding `#[inline]` to these functions, making their implementations available to calling crates, has a huge effect.

This was only seen when moving to criterion because previously, nightly benchmarks were implemented in the library crate itself, so these functions were already eligible for inlining. The criterion results were actually more accurate to what callers of the crate would actually see!
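A sketch of the shape of the change (the trait name is jetscii's, but the method names and types here are guesses for illustration, not the crate's real signatures):

```rust
// The trait methods are trivial accessors; without #[inline], their
// bodies aren't available across the crate boundary, so a generic
// `find` inlined into the caller still pays a real function call
// for every method invocation.
pub trait PackedCompareControl {
    fn needle(&self) -> u128;
    fn needle_len(&self) -> i32;
}

pub struct Bytes {
    needle: u128,
    len: i32,
}

impl PackedCompareControl for Bytes {
    // #[inline] emits the body into the crate metadata so callers
    // in other crates can inline these one-liners.
    #[inline]
    fn needle(&self) -> u128 {
        self.needle
    }

    #[inline]
    fn needle_len(&self) -> i32 {
        self.len
    }
}
```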
Per suggestion from @BurntSushi [here](tafia/quick-xml#664 (comment)). On my M1, it appears to be slower than, but competitive with, memchr up to memchr3, then starts being the fastest from 5 to 16 characters.
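For reference, here is a sketch of how Teddy can be driven through aho-corasick's packed API, which is presumably roughly what the "teddy" benchmark does; the delimiter patterns here are just examples:

```rust
use aho_corasick::packed::{Config, MatchKind};

fn main() {
    // Build a packed searcher over single-byte XML delimiters; the
    // packed searcher uses Teddy when the CPU supports it.
    let searcher = Config::new()
        .match_kind(MatchKind::LeftmostFirst)
        .builder()
        .extend(["<", ">", "&"])
        .build()
        .unwrap();
    let m = searcher.find("some text <tag>").unwrap();
    assert_eq!(m.start(), 10);
}
```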
I'm thinking of reverting the changes to make the constructor const, since we might want to end up e.g. using memchr's memmem searcher, which wouldn't be const, and I don't know if we want to make the constructors const.
We may not want to be stuck with const-constructible implementations
The teddy results are pretty promising: on my machine, it seems to beat jetscii's simd implementation in everything but the "found a result on the first byte" test. I wonder if we can't just find a point and do
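Something like this is what I'd imagine for the cutoff (a sketch only; `THRESHOLD` and both function names are made up, and the real crossover point would have to come out of these benchmarks):

```rust
// Hypothetical length-based dispatch: a scalar scan for short
// haystacks (no SIMD startup cost), Teddy/SIMD for long ones
// (where throughput wins).
const THRESHOLD: usize = 32; // made-up number; tune via benchmarks

fn find_any(haystack: &[u8], needles: &[u8]) -> Option<usize> {
    if haystack.len() < THRESHOLD {
        haystack.iter().position(|b| needles.contains(b))
    } else {
        find_any_simd(haystack, needles)
    }
}

// Stand-in for the SIMD path; same contract as the scalar loop.
fn find_any_simd(haystack: &[u8], needles: &[u8]) -> Option<usize> {
    haystack.iter().position(|b| needles.contains(b))
}
```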
@Dr-Emann Can you share your benchmark results?
Sure:
Skylake i5
Including only ascii_chars, "teddy" and memchr
Somehow teddy got a lot faster going from 3 patterns to 5?
M1 Mac
No big jump in teddy performance, but it does still become the fastest somewhere between 3 and 5 characters, for this case at least.
Interesting. If I get a chance tomorrow, I'll port your benchmark into aho-corasick's rebar benchmark suite and see if I can do some analysis for you.
There was some discussion about how to compare jetscii with Teddy and some interesting benchmark results[1]. I decided to import the benchmarks and see what things look like here. [1]: shepmaster/jetscii#57
All righty, here we go. I started by checking out this PR and running the benchmarks as written on my i9-12900K x86-64 CPU and my M2 mac mini aarch64 CPU. This should give us a comparison point with which to ground ourselves. On x86-64:
And on aarch64:
I then ported these benchmarks into aho-corasick's rebar benchmark suite. I didn't bother with porting the
And on aarch64:
So the first thing that jumps out at me is that Teddy has pretty consistent timings for

The other interesting thing is the discrepancy in timings for the "early return" benchmark in Criterion versus rebar. I did have to tweak them somewhat, since the benchmarks I have look for all counts instead of just the first one. (rebar doesn't require this, and I could make it only do one search, but it didn't seem worth it.) So I split it into two: one benchmark whose haystack is

It is indeed somewhat common to try to optimize for the latency case by doing case analysis on the length of the haystack. One of the unfortunate things about latency sensitive workloads is that you are somewhat hosed already. You tend to burn a lot of time starting and stopping the search routine very frequently. This is why if you take just about any optimized search routine with high throughput and execute a search that has a high match count, that throughput will drop precipitously and you might not do much better (perhaps even worse) than the naive approach. The naive approach tends to do very well in latency sensitive workloads.

I don't have any good guesses as to the reason for the discrepancy in the "early return" benchmarks between Criterion and rebar. It's worth pointing out that we're dealing with low-single-digit nanosecond timings here, so even a small amount of noise could explain this discrepancy.

Another thing I did here was take a step back and look at the benchmark itself. At least the XML delimiter benchmarks look like they ought to be searching XML data. Instead, the benchmarks (except for the new "early return" one you added) are essentially all best cases for throughput: the haystack consists of the same byte repeated with no match, finally followed by a single matching byte at the end. But what happens if we search a real XML document? I picked a fairly small XML document describing some mental health stuff and ran the same benchmarks on it for
And on aarch64:
Things here are quite a bit more competitive. My guess as to why this is, is because the benchmarks shift more towards latency sensitive workloads versus the pure throughput benchmarks in this PR. Essentially, as you move more and more towards latency sensitive workloads, differences between search routines tend to shrink, especially when comparing something that has very high throughput (Teddy) versus something that is less so (jetscii). Indeed, as I understand it, jetscii is based on the pcmpestri intrinsic from SSE4.2, and it's somewhat of a tortured technique because of its (documented) extremely high latency (18 cycles as written in the
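For readers unfamiliar with the intrinsic, this is roughly the core operation jetscii builds on (a sketch using the std::arch intrinsics, not jetscii's actual code):

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Find the index of the first byte in a 16-byte chunk that equals
/// any of up to 16 needle bytes; returns 16 when nothing matches.
/// Every call pays pcmpestri's high documented latency, which is the
/// cost discussed above for latency-sensitive workloads.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse4.2")]
unsafe fn first_match(needles: __m128i, needle_len: i32, chunk: &[u8; 16]) -> i32 {
    let hay = _mm_loadu_si128(chunk.as_ptr().cast());
    // _SIDD_CMP_EQUAL_ANY: compare every haystack byte against every
    // needle byte, returning the index of the first equal byte.
    _mm_cmpestri::<{ _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY }>(needles, needle_len, hay, 16)
}
```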
The only case that
So at least with respect to substring search, it's hard to imagine any general circumstance in which you'd want to use jetscii over memchr::memmem.
Thanks @BurntSushi, that is an awesome writeup! I'll try to respond to the points that caught my eye or where I think I can be useful...
The differences of 369.9 MB/sec vs 2.7 KB/sec vs 74.9 MB/sec really feels like it has to be some kind of testing error just based on how huge a difference they are. 😉
So my pet project since Rust 0.12 has been to write an XML parser and related machinery. Sometime in the last 15 years, I read a post (I want to say from Daniel Lemire, but I can't find it now) that talked about using PCMPSTRx for XML (and Wikipedia backs up my memory that those were even made to help XML). That's what spawned Jetscii (and cupid, peresil). My goal is all about making it go faster. Amusingly, in my current rewrite, I use
Yep, this was never really a goal for the project, it was mostly a matter of someone (maybe me?) saying "oh you could use it for both cases!" and I added the code.
Right, which is part of the reason I went with
Totally makes sense to me.
I've had some back-of-mind idea to perform some sort of en-masse lookup table... thing. Specifically, in my new implementation, I have a fixed size buffer. One thing I could see doing is searching the entire buffer for every
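If I understand the idea, a scalar version would be the classic 256-entry membership table (a sketch; both function names are invented):

```rust
// Build once per needle set: table[b] is true iff byte b is a needle.
fn build_table(needles: &[u8]) -> [bool; 256] {
    let mut table = [false; 256];
    for &b in needles {
        table[b as usize] = true;
    }
    table
}

// One table load per haystack byte, no per-byte needle comparisons.
fn find_any(haystack: &[u8], table: &[bool; 256]) -> Option<usize> {
    haystack.iter().position(|&b| table[b as usize])
}
```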
@Dr-Emann What do you see as our next steps forward?
I think this PR is still a good first step: just some modernization and benchmarking updates. I think next would be replacing the substring search with memchr::memmem, maybe deprecating it in favor of memmem entirely. I think what I would like to do next would be a breaking change: to change the fallback implementation to

I don't really have any experience at all in making SIMD algorithms (yet?), so I don't have much input on the bespoke SIMD algorithm part.
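For reference, the memchr::memmem API that would replace the substring searcher looks like this (usage sketch):

```rust
use memchr::memmem;

fn main() {
    // One-off search:
    assert_eq!(memmem::find(b"sub_string searching", b"string"), Some(4));

    // Or build a Finder once and reuse it across many haystacks,
    // which is the shape a jetscii-style constructor would want:
    let finder = memmem::Finder::new("string");
    assert_eq!(finder.find(b"sub_string searching"), Some(4));
}
```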