-
Notifications
You must be signed in to change notification settings - Fork 21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
expose more primitive operations #13
Comments
All of this absolutely makes sense. I actually thought there was an issue for this already, as I'm sure someone else wanted it.
Yeah, I'd probably just set up parallel structures that allow you to do whatever you want with bytes. This would allow folk to use it for binary files where you need to find For my own education, will you actually be searching for non-ASCII characters? If so, are you just going to search for the first byte of a multiple-byte UTF8 character and then do a sanity check to make sure it was the right one? |
@shepmaster
With that said...
Hope that makes sense. Happy to answer more questions! |
@BurntSushi what kind of API would you like to see? Strings are nice in that they have the let needle = some_macro_name!(needle_byte_1, needle_byte_2, ...);
needle.position(haystack); But it would probably feel more natural to have |
@shepmaster Hmm. The absolute simplest API I can think of is: I suspect what you're going to say is that this adds some kind of overhead to every search since the bytes in struct PcmpSearcher {
// some packed representation?
}
impl PcmpSearcher {
// Create a new PcmpNeedles with the bytes given.
// This panics if needles.len() > 16.
// Duplicate bytes are ignored. (Seems suspicious.)
fn new(needles: &[u8]) -> Self { ... }
fn position(&self, haystack: &[u8]) -> Option<usize> { ... }
} (Naming isn't my specialty. :P) You could even impl Other thoughts:
|
@shepmaster Thanks for working with me on this btw. I actually think this might help |
The macro is to allow for fallback cases where the intrinsic operations aren't available. Technically, I could make a few small changes and release Jetscii right now and it would work on stable Rust. It just isn't worth it because the fallback would be the only supported case! Right now, when you call something like: part_number.split(ascii_chars!('-', ':')).collect(); That macro expands into the packed structures needed for the intrinsic as well as an optimized closure that checks for those two bytes "manually" ( This all corresponds to If you compile with the right features (which the docs don't), then there are methods on It's possible that you'd prefer to have your own fallback code for when the intrinsics aren't available. If so, then you'd be able to use the raw-bytes equivalent of I did think about the one-shot function, and it's something I'd be open to if enough people wanted it, but I look at it as a bit of antipattern - you probably don't want to call it in a tight loop, much like The |
Absolutely! Having people use the library is the only way it will get better! And the faster Rust is, the more likely I am to get paid to write it for a living! 😇 |
Hmm, right, so then I definitely agree that the one-shot case should just be completely thrown out the window. Encouraging folks to write slow code is bad (and is one reason why, e.g., I want to remove the free function With that said... you're right about the fallback. I definitely want to be able to control, up front, which type of prefix searching is used. Sometimes it's just I think the fallback code is nice from a usability perspective, but I do also believe there should be a way to execute pcmpstri with as few assumptions as possible. I think that's what I'd like most. Now, if for some reason it's impossible to know whether a fallback is needed up front, then that could be a problem! (I feel like there shouldn't be, but my understanding of the pcmpstri instruction is extremely limited.) With all that said, if you feel strongly about tying it to a fallback in the public API, then I can cope and write my own fallback routine. :-) (For example, I'd use |
I don't think I am doing a great job explaining, but hopefully code will help. Grab the bytes branch. You should be able to use extern crate jetscii;
use jetscii::bytes::Bytes;
fn main() {
let words = include_bytes!("/tmp/words.gz");
let mut needle = Bytes::new();
needle.push(0x80);
println!("{:?}", needle.position(words));
} Make sure you compile Jetscii with The big thing is that
The fallback code I'm talking about makes it so a consumer of the library doesn't need to know about those concerns. You want to care about that, and this is the current way I expose it. |
@shepmaster Code is good, thanks for that. And it has made me realize that I had a major brain fart. For whatever reason (I blame lack of sleep), I was thinking that the fallback was used when a caller added more than 16 bytes to the needle, but now of course I realize that's silly. I forgot about the pcmp instructions not being supported everywhere! So with that said... Yes, I would use |
Hmm. That makes sense, but could be a bit tricky. Right now, the fallback takes a In fact, a lot of my existing usage has centered around a set of characters fixed at compile time. For example, XML parsing can benefit from looking for the first of In your case, you'd rather a fallback that operates more like I can certainly add something like that, but it starts to feel a bit circular. What would you actually be using for your fallback? I'd assume something like |
Yes, absolutely. :-) The @huonw - I wonder if you might have any special knowledge in this space, since I know you've been working on simd stuff. (I've looked at it, but can't quite grok how to use it for substring/byte matching!) |
One thing I notice is that your performance numbers for
I'm not even sure why you'd want to switch! |
@shepmaster Hehe, well that's To be clear, when there's a single leading byte, I won't use jetscii---I'll just use memchr. But if there's more than that.... :-) |
Here are the latest benchmarks in |
@shepmaster Anything I can do to help move this along? :-) |
@BurntSushi Let's try to get this started again! Maybe one thing that would help me would be if you could point (or hand-wave) to where ever some jetscii code would fit in. I do see there's some SIMD stuff in regex already, so don't feel bad telling me that I'm just too late to catch the train with the cool kids 😊 |
@shepmaster The SIMD stuff did relieve some of the pressure, but I bet |
Thank you! The code is pretty easy to understand; very nice! I was on a plane and didn't have the benchmark suite downloaded, but the test suite takes 94% of the previous time with the Jetscii stuff added; an encouraging sign. |
@BurntSushi care to give this an appraisal?
|
poke @BurntSushi |
@shepmaster Can you show the code you used to generate those benchmarks? It looks like the improvement to |
Oh, yes, that was silly of me to omit. Here's my regex branch. You'll see that it's basically a direct clone of the existing types. I also have changes to jetscii, which basically boil down to copy-paste-remove checks.
Nope! That's one of many reasons why your expertise is requested 😜. |
As I understand it, the public API currently requires that haystacks be
&str
and that all needles be ASCII. Might you consider exposing a way to run a search on a&[u8]
? And possibly also exposing a way to use arbitrary bytes instead of limiting it to ASCII? (I feel less strongly about the latter point than the former point.)The specific reason why I'd want this is for use in
regex
. In particular, the internal matching engines can work on either&str
or&[u8]
, so the prefix literal scanning operates on&[u8]
so that it can be used with any of the matching engines. Since all inputs originate with&str
(currently), I could unsafely transmute to&[u8]
before calling out to jetscii, but I don't think that assumption will always hold in the future and the use of unsafe there seems a little unfortunate.Other ideas? Thoughts?
The text was updated successfully, but these errors were encountered: