Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: use ragel options to minimize states and use gotos #2950

Closed
wants to merge 1 commit into from

Conversation

jtarchie
Copy link

@jtarchie jtarchie commented Aug 10, 2023

What problem is this PR intended to solve?

No problem.

Experimenting with the features of ragel for faster parsing tables for gumbo.
I'd like to see if the tests passed through CI.

Have you included adequate test coverage?

This shouldn't affect the functionality.
This relies on ragel and the C compiler for better optimizations in parsing.

Does this change affect the behavior of either the C or the Java implementations?

As far as I know, just the C implementation.

@stevecheckoway
Copy link
Contributor

It's nice to see someone looking at this!

I played with the various options a few years ago and couldn't get better output. Looks like this choice of options ends up with 50k more lines of code to compile.

I'm curious how this choice impacts the size of the binary, compilation time, and runtime parsing performance.

@flavorjones
Copy link
Member

Thanks, @jtarchie! I kicked off CI and have an HTML5 benchmark on my dev machine that I'll run a comparison with.

@jtarchie
Copy link
Author

jtarchie commented Aug 11, 2023

There are file differences:

  • char_ref.c goes from 692K -> 1.4M in size
  • the object file goes from 298K -> 870K
  • with -O3, the object file is 367K (taking 12s to compile over the 0.5s without any -O flag above)

The object file shows the size difference that would affect the binary.

This may not give the speed increase needed, based on the changelog ragel originally made this faster. This might have already hit the speed sweet spot. It might not be the bottleneck it once was.

@jtarchie
Copy link
Author

=Is this something I can help explore still?

@flavorjones
Copy link
Member

@jtarchie Thanks for asking. I definitely want to understand the tradeoffs between size, compilation speed (less of an issue these days), and runtime performance. It's on my list of things to explore next month when I have some free time to work on OSS.

@flavorjones
Copy link
Member

Rebased.

@jtarchie
Copy link
Author

jtarchie commented Jan 2, 2025

After discussion, working on the requested benchmark to report speed (time) and memory.

@flavorjones
Copy link
Member

@stevecheckoway Do you know why Craig used gperf for foreign attributes but ragel for these character references? They seem very similar in purpose.

@stevecheckoway
Copy link
Contributor

stevecheckoway commented Jan 3, 2025

@flavorjones No, I'm afraid I don't know for certain.

If I had to guess, I'd say it was likely because attributes and tag names are completely known to the parser by the time the lookup occurs whereas that isn't the case for character references. For example, given the input <feImage href="mdn_logo_only_color.png" /> (taken from here). By the time the parser calls gumbo_normalize_svg_tagname() adjust_svg_tag(), we know the tag name is feimage. Gperf lets us look that up in the hash table with exactly one probe.

In contrast, match_named_char_ref() is called by the tokenizer and needs to match the strings character by character. So rather than a hash look up, it's a state machine.

We can't do something like find everything of the form &blah; because some entities don't need semicolons.

Edit: I'm not entirely sure why we have gumbo_normalize_svg_tagname(). It doesn't appear to be used for anything and its functionality seems to be completely subsumed by adjust_svg_tag().

@flavorjones
Copy link
Member

@stevecheckoway See #3402 for removal of normalize_svg_tagname

flavorjones added a commit that referenced this pull request Jan 3, 2025
**What problem is this PR intended to solve?**

It's unused. See discussion at #2950
@jtarchie
Copy link
Author

jtarchie commented Jan 15, 2025

This does not make things faster. It is such a specific use case of ragel, after discussion with @flavorjones, should either be replaced with hash-map lookup or completely rewritten. Looks like it is old code no one ever touched again. :-)

There still might be a way to use gperf here with a state machine-like thing. We know the size of the different keys, iterate from the min length to the max length of possible keys, if the string doesn't exist in the gperf, move along.

@jtarchie jtarchie closed this Jan 15, 2025
@stevecheckoway
Copy link
Contributor

That's an interesting idea. You'd have to benchmark it to see how well it performs.

From a quick scroll through the list of named character references, some are as long as 25 characters. An algorithm like when we see an &, perform hash look ups on the next 3 through 25 characters seems like it's unlikely to be a performance win over the state machine approach. The state machine only has to perform table lookups for each character so long as it's a prefix of a valid named character reference.

But that's just a guess and anyway it doesn't mean a more clever algorithm couldn't do better.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants