-
Notifications
You must be signed in to change notification settings - Fork 93
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Async implementation of the AC automaton with futures crate (fully optional) #132
Conversation
Yes, please go open an issue. I haven't reviewed this PR yet, but I'm very unlikely to accept it. Unless you're submitting trivial PRs, you should absolutely open an issue first to avoid doing a bunch of work that has little chance of being accepted. |
You shouldn't need an entirely new implementation. This crate exposes the internal automatons in the dfa and nfa submodules. Adding async on top of those in a new crate will be a fair bit of work, but certainly less than starting from scratch. |
Thanks, I've opened the related issue. Let's continue the conversation there, however if you decide to not accept the PR don't worry about the work being lost, as we will probably end up using my fork with this implementation for our project, since we require this feature (and it works well). I do see a point where it might be considered out of the scope of this project. |
Cargo.toml
Outdated
@@ -20,6 +20,7 @@ name = "aho_corasick" | |||
[features] | |||
default = ["std", "perf-literal"] | |||
std = ["memchr?/std"] | |||
async = ["futures", "pin-project-lite"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tip for the future: things like this should be dep:futures
and dep:pin-project-lite
. Otherwise, you've actually created two new implicit features here: futures
and pin-project-lite
.
Anchored::No, | ||
*this.sid, | ||
*byte, | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is almost certainly a performance bug. You're calling next_state
on here on a Arc<dyn AcAutomaton>
. That means the function call isn't going to get inlined and things are likely to slow down considerably. It is perhaps good enough for you, but this would be a showstopper for me.
(I'm leaving these comments just as a review. I'm still not going to accept this PR, but I figure the feedback may be useful.)
I think there are basically two ways to fix this.
The first way is to implement async searching in the same way that the std::io::Read
searching is implemented. My guess is that you didn't go that route because it looked over-complicated, but it is done that way precisely to avoid dynamic dispatch inside the absolute most critical part of the code. If I were going to accept a PR like this, then this is the approach I would insist upon.
The second way is to write your async abstraction directly on top of one of the Aho-Corasick implementations. The nfa::contiguous::NFA
is almost certainly what you want. It's quick to build, very compact and has good search performance. You can get access to all of its low level methods via the Automaton
trait without needing dynamic dispatch because you only support that one implementation.
The second way is almost certainly a smaller diff between what you have here than the first way would be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feedback greatly appreciated. Will dig into the first way
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fine, but I do just want to make sure my point about the contiguous NFA is heard: it is almost certainly the case that you can just build stuff around the contiguous NFA and ignore the non-contiguous NFA and the DFA. The contiguous NFA is used in almost every circumstance anyway, is quite robust and quite fast. So by tailoring logic to be specific to one only Aho-Corasick implementation, you potentially reap big gains in code simplicity without giving up too much.
Off the top of my head, there are really only two ways that this would be inappropriate:
- Your pattern lists are absolutely enormous. So big that they can't fit in a contiguous NFA. Likely in the millions of patterns. In this case, you'd have to use a non-contiguous NFA.
- Your pattern lists are tiny. So tiny that building a DFA is not that expensive in an absolute sense. This is likely only true for less than 500 or 1000 patterns or so. In this case, you might want to use a DFA instead of a contiguous NFA for the best possible search performance.
If you fall outside those constraints, then it's likely you can very blissfully ignore everything except for the contiguous NFA.
This PR aims to add support to AhoCorasick for async handling using
AsyncRead
andAsyncWrite
traits from the standard futures crate.I wasn't sure whether to open an issue first, figured we could discuss the matter over this PR. I would be very grateful if you could spare some time reviewing it.
Motivation
Performing mass replacements in an async context is something we require, and seems very useful in general.
It seems at least one other person wanted it - #95, and I could not find any other existing alternative. Your replies on that issue did feel like you were opposed to the idea, which somewhat pushed me away from the idea of a contribution when our own needs arose for it, and instead went on and made my own simple and naive AC implementation with async support that we needed (covering the minimal features), and of course benchmarks were far from ideal (performing anywhere from 2 to 10 times worse than this this polished crate in a CPU-bound context). Despite of course the bottleneck of async processing being often the waiting time, I did want to improve performance in the CPU-bound case.
At this point, I thought it would be very silly to try and reimplement the optimizations that the current crate did.
So I've ported the async code we used to make use of the optimized automaton implemented in the current library, and thought that MAYBE you would not be opposed to the contribution hidden behind an entirely optional feature so it would do no harm.
Usage & async interface
If
async
feature is explicitly enabled, 3 new public methods will become available to AhoCorasick struct (documented with examples) :async_reader
: Produces an AhoCorasickAsyncReader which implements AsyncRead trait, wrapping user-provided AsyncRead source, and yielding chunks with replaced bytes when polling from itasync_writer
: Produces an AhoCorasickAsyncWriter which implements AsyncWrite trait, wrapping user-provided AsyncWrite sink, and similarily writing to it will write replaced chunks to the original sink.try_async_stream_replace_all
: A mostly standalone helper function which is somewhat trivial to implement using eitherasync_reader
orasync_writer
, however I though it would be convenient to have as an async alternative to the existingtry_stream_replace_all
method.Tests
A test module has been added to the async package, testing all the interface methods for some cases of replacements, partial matches, and various specific cases regarding chunks boundaries. All these tests run with different buffer sizes to confirm no issues with boundaries, and use custom Reader/Writers which are sometimes responding Poll::Pending for the first couple of polls, allowing to properly test repeated polls.
I'm open to additional ideas of tests that could be relevant