API to get size of compiled regex #943

Open
bentheiii opened this issue Jan 6, 2023 · 6 comments

Comments

@bentheiii

As an extension of the RegexBuilder::size_limit function, it might be useful to know the actual size taken by a compiled regular expression. A possible use for this is for when you want to process a number of untrusted regular expressions, and be assured that collectively they don't exceed a limit.

I propose that the size of the compiled program (the same one used when checking against the size_limit of the builder) be stored inside a regular expression struct, and be accessible through a new function Regex::approximate_size.

@BurntSushi
Member

> A possible use for this is for when you want to process a number of untrusted regular expressions, and be assured that collectively they don't exceed a limit.

Could you elaborate on this? If you're processing untrusted regexes, then wouldn't RegexBuilder::size_limit prevent ones that are too big from compiling in the first place?

@bentheiii
Author

@BurntSushi I'm referring to a case where an untrusted user submits a number of patterns individually, and you want to make sure they do not exceed a "budget" for all of them.

For example, if you wanted to set a total limit of 10,000 bytes and allow users to submit up to 100 patterns, you'd need to set the per-pattern limit to 100 bytes, which is not ideal.
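To make the arithmetic above concrete, here is a minimal sketch contrasting an even per-pattern split with a shared budget. It does not use the regex crate: `SharedBudget`, `try_admit`, and the `compiled_size` argument are hypothetical names, and `compiled_size` stands in for the value that the proposed (not yet existing) size API would report.

```rust
/// Even split: every pattern gets the same fixed share of the total budget.
/// With 10,000 bytes and 100 patterns this is only 100 bytes per pattern.
fn per_pattern_limit(total_budget: usize, max_patterns: usize) -> usize {
    total_budget / max_patterns
}

/// Shared budget: patterns draw from one pool, so a few large patterns
/// are acceptable as long as the total stays under the limit.
struct SharedBudget {
    remaining: usize,
}

impl SharedBudget {
    fn new(total: usize) -> Self {
        SharedBudget { remaining: total }
    }

    /// `compiled_size` is an assumption: it represents the size that a
    /// hypothetical size-reporting API would return for a compiled pattern.
    /// Returns true and deducts the size if the pattern fits in the budget.
    fn try_admit(&mut self, compiled_size: usize) -> bool {
        if compiled_size <= self.remaining {
            self.remaining -= compiled_size;
            true
        } else {
            false
        }
    }
}
```

With a shared budget, a single 4,000-byte pattern can be admitted alongside many small ones, which the fixed 100-byte split would reject.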

@BurntSushi
Member

I can probably make this happen once #656 lands. Although, the precise relationship between the size limit and the bytes reported by this function may be tricky to get right. It probably does make sense from a certain perspective, but the problem is in how the limits are enforced. The limit tends to be enforced on a per-internal-NFA-graph basis, whereas the natural thing to do for an "approximate memory usage" function would be to report the sum of everything using heap memory. Then there is the mutable scratch space to think about as well.

Now we could just say, "no the approximate memory usage function should be specifically scoped to precisely what RegexBuilder::size_limit is applied to." But then it might become quite a misleading function where folks might just assume it reports the total memory usage. Maybe a better name could help eliminate this confusion. I'm not sure.

@bentheiii
Author

bentheiii commented Jan 6, 2023

Perhaps a clearer name would be built_size or compiled_size?

@fuchsnj

fuchsnj commented Jan 26, 2024

This API would have been very helpful for me. In my case, I had a sample of a few thousand regex patterns that would be used, and I needed to determine what size limit should be used. To do that, I wanted to know how large all of these sample patterns were to see how close to the limit they were. I did it with a binary search over the builder's size limit (checking whether each pattern compiled or not), but that obviously wasn't ideal.
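The binary search described above might look roughly like the following sketch. It is written against a generic predicate so it stays self-contained; the function name `smallest_passing_limit` is hypothetical. With the regex crate, the predicate would presumably be something like `|limit| regex::RegexBuilder::new(pattern).size_limit(limit).build().is_ok()`. The sketch assumes compilation is monotone in the limit (if a pattern compiles at one limit, it compiles at every larger one), and the result is only an estimate of the pattern's size.

```rust
/// Returns the smallest limit in `0..=max_limit` for which `compiles(limit)`
/// is true, or `None` if the predicate fails even at `max_limit`.
/// Assumes the predicate is monotone: once true, it stays true for larger limits.
fn smallest_passing_limit<F>(max_limit: usize, compiles: F) -> Option<usize>
where
    F: Fn(usize) -> bool,
{
    if !compiles(max_limit) {
        return None;
    }
    let (mut lo, mut hi) = (0usize, max_limit);
    // Invariant: `compiles(hi)` is true, and every limit below `lo` fails.
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        if compiles(mid) {
            hi = mid;
        } else {
            lo = mid + 1;
        }
    }
    Some(hi)
}
```

Each pattern costs about log2(max_limit) compilation attempts, which is why this gets slow across thousands of patterns.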

@BurntSushi
Member

@fuchsnj I don't think it will ever be possible for an API like the one you want to exist in such a granular way. The size limit really is just an approximation. And the binary search thing that you've done is exactly what I do in such cases as well. (Although I tend to just set a very high size limit if I'm trying thousands of patterns. You're likely in for a pretty bad time in that case no matter what you do.)

3 participants