WIP: use ragel options to minimize states and use gotos #2950
It's nice to see someone looking at this! I played with the various options a few years ago and couldn't get better output. Looks like this choice of options ends up with 50k more lines of code to compile. I'm curious how this choice impacts the size of the binary, compilation time, and runtime parsing performance.
Thanks, @jtarchie! I kicked off CI and have an HTML5 benchmark on my dev machine that I'll run a comparison with.
There are file differences:
The object file shows the size difference that would affect the binary. This may not give the speed increase needed. Based on the changelog, ragel originally made this faster, so it might have already hit the speed sweet spot. It might not be the bottleneck it once was.
Is this something I can still help explore?
@jtarchie Thanks for asking. I definitely want to understand the tradeoffs between size, compilation speed (less of an issue these days), and runtime performance. It's on my list of things to explore next month when I have some free time to work on OSS.
871ba5b to 2746fe1
Rebased.
2746fe1 to 3f212f5
After discussion, working on the requested benchmark to report speed (time) and memory.
@stevecheckoway Do you know why Craig used gperf for foreign attributes but ragel for these character references? They seem very similar in purpose.
@flavorjones No, I'm afraid I don't know for certain. If I had to guess, I'd say it was likely because attributes and tag names are completely known to the parser by the time the lookup occurs, whereas that isn't the case for character references. For example, given the input In contrast, We can't do something like find everything of the form
Edit: I'm not entirely sure why we have gumbo_normalize_svg_tagname(). It doesn't appear to be used for anything and its functionality seems to be completely subsumed by
@stevecheckoway See #3402 for removal of
**What problem is this PR intended to solve?** It's unused. See discussion at #2950
This does not make things faster. It is such a specific use of ragel that, after discussion with @flavorjones, it should either be replaced with a hash-map lookup or be completely rewritten. It looks like old code that no one ever touched again. :-) There still might be a way to use
That's an interesting idea. You'd have to benchmark it to see how well it performs. From a quick scroll through the list of named character references, some are as long as 25 characters. An algorithm like "when we see an &, perform hash lookups on the next 3 through 25 characters" seems unlikely to be a performance win over the state machine approach. The state machine only has to perform a table lookup for each character, and only for as long as the input is still a prefix of a valid named character reference. But that's just a guess, and anyway it doesn't mean a more clever algorithm couldn't do better.
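To make the comparison above concrete, here is a minimal sketch of the two strategies. It is not gumbo's code: the table `kRefs` is a hypothetical nine-entry subset of the named character references, the function names are made up for illustration, and a linear search stands in for the hash lookup. The second function narrows a sorted range one character at a time, which only approximates what a generated state machine does.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical subset of the named character references, sorted bytewise.
   The real list is far larger, with names up to 25 characters long. */
static const char *const kRefs[] = {
  "amp", "amp;", "gt", "gt;", "lt", "lt;", "not", "not;", "notin;"
};
enum { kNumRefs = sizeof kRefs / sizeof kRefs[0], kMinLen = 2, kMaxLen = 6 };

/* Approach 1: after '&', probe every candidate length, longest first.
   The inner linear search is a stand-in for one hash lookup. */
static size_t match_by_probing(const char *s, size_t *lookups) {
  size_t avail = strlen(s);
  for (size_t len = kMaxLen; len >= kMinLen; --len) {
    if (len > avail) continue;          /* not enough input for this length */
    ++*lookups;
    for (size_t i = 0; i < kNumRefs; ++i)
      if (strlen(kRefs[i]) == len && memcmp(kRefs[i], s, len) == 0)
        return len;                     /* longest match found */
  }
  return 0;
}

/* Approach 2: consume one character at a time, keeping the range [lo, hi)
   of sorted entries that still share the prefix read so far. Stop as soon
   as the range is empty, i.e. the input is no longer a valid prefix. */
static size_t match_by_scanning(const char *s, size_t *steps) {
  size_t lo = 0, hi = kNumRefs, best = 0;
  for (size_t pos = 0; s[pos] != '\0' && lo < hi; ++pos) {
    ++*steps;
    /* Every entry in [lo, hi) has length >= pos, so index pos is in bounds. */
    while (lo < hi && (unsigned char)kRefs[lo][pos] < (unsigned char)s[pos]) ++lo;
    size_t end = lo;
    while (end < hi && kRefs[end][pos] == s[pos]) ++end;
    hi = end;
    /* The shortest entry sorts first, so an exact match sits at lo. */
    if (lo < hi && kRefs[lo][pos + 1] == '\0') best = pos + 1;
  }
  return best;                          /* length of the longest match, or 0 */
}
```

Both functions return the length of the longest reference that prefixes the text following `&`, so for `notit;` each reports 3 (the legacy `not`). The difference is in the work done: the probing version always starts from the maximum length, while the scanning version stops at the first character that cannot extend any reference. Which one is faster on real input would still need the benchmark mentioned above.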
**What problem is this PR intended to solve?**

No problem. Experimenting with the features of `ragel` for faster parsing tables for `gumbo`. I'd like to see if the tests passed through CI.

**Have you included adequate test coverage?**

This shouldn't affect the functionality. This relies on `ragel` and the C compiler for better optimizations in parsing.

**Does this change affect the behavior of either the C or the Java implementations?**

As far as I know, just the C implementation.