-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[pull] main from sparklemotion:main #3
base: main
Are you sure you want to change the base?
Conversation
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please visit https://cla.developers.google.com/ to sign. Once you've signed (or fixed any issues), please reply here with What to do if you already signed the CLAIndividual signers
Corporate signers
ℹ️ Googlers: Go here for more info. |
15 similar comments
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please visit https://cla.developers.google.com/ to sign. Once you've signed (or fixed any issues), please reply here with What to do if you already signed the CLAIndividual signers
Corporate signers
ℹ️ Googlers: Go here for more info. |
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please visit https://cla.developers.google.com/ to sign. Once you've signed (or fixed any issues), please reply here with What to do if you already signed the CLAIndividual signers
Corporate signers
ℹ️ Googlers: Go here for more info. |
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please visit https://cla.developers.google.com/ to sign. Once you've signed (or fixed any issues), please reply here with What to do if you already signed the CLAIndividual signers
Corporate signers
ℹ️ Googlers: Go here for more info. |
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please visit https://cla.developers.google.com/ to sign. Once you've signed (or fixed any issues), please reply here with What to do if you already signed the CLAIndividual signers
Corporate signers
ℹ️ Googlers: Go here for more info. |
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please visit https://cla.developers.google.com/ to sign. Once you've signed (or fixed any issues), please reply here with What to do if you already signed the CLAIndividual signers
Corporate signers
ℹ️ Googlers: Go here for more info. |
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please visit https://cla.developers.google.com/ to sign. Once you've signed (or fixed any issues), please reply here with What to do if you already signed the CLAIndividual signers
Corporate signers
ℹ️ Googlers: Go here for more info. |
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please visit https://cla.developers.google.com/ to sign. Once you've signed (or fixed any issues), please reply here with What to do if you already signed the CLAIndividual signers
Corporate signers
ℹ️ Googlers: Go here for more info. |
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please visit https://cla.developers.google.com/ to sign. Once you've signed (or fixed any issues), please reply here with What to do if you already signed the CLAIndividual signers
Corporate signers
ℹ️ Googlers: Go here for more info. |
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please visit https://cla.developers.google.com/ to sign. Once you've signed (or fixed any issues), please reply here with What to do if you already signed the CLAIndividual signers
Corporate signers
ℹ️ Googlers: Go here for more info. |
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please visit https://cla.developers.google.com/ to sign. Once you've signed (or fixed any issues), please reply here with What to do if you already signed the CLAIndividual signers
Corporate signers
ℹ️ Googlers: Go here for more info. |
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please visit https://cla.developers.google.com/ to sign. Once you've signed (or fixed any issues), please reply here with What to do if you already signed the CLAIndividual signers
Corporate signers
ℹ️ Googlers: Go here for more info. |
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please visit https://cla.developers.google.com/ to sign. Once you've signed (or fixed any issues), please reply here with What to do if you already signed the CLAIndividual signers
Corporate signers
ℹ️ Googlers: Go here for more info. |
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please visit https://cla.developers.google.com/ to sign. Once you've signed (or fixed any issues), please reply here with What to do if you already signed the CLAIndividual signers
Corporate signers
ℹ️ Googlers: Go here for more info. |
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please visit https://cla.developers.google.com/ to sign. Once you've signed (or fixed any issues), please reply here with What to do if you already signed the CLAIndividual signers
Corporate signers
ℹ️ Googlers: Go here for more info. |
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please visit https://cla.developers.google.com/ to sign. Once you've signed (or fixed any issues), please reply here with What to do if you already signed the CLAIndividual signers
Corporate signers
ℹ️ Googlers: Go here for more info. |
also: - improve exception messages - and make the exception messages consistent between Java and C impls - and introduce some assert_pattern testing to simplify complex tests
and some other small nonfunctional changes
which will allow us to use libxml2's default SAX handlers if we wish, and simplifies some things.
On CRuby, this fixes the fact that the parser was registering errors when encountering general (non-predefined) entities. Now these entities are resolved properly and converted into `#characters` callbacks. Fixes #1926. On JRuby, the SAX parser now respects the `#replace_entities` attribute, which was previously ignored AND defaulted incorrectly to `true`. The default now matches CRuby -- `false` -- and the parser behavior matches CRuby with respect to entities. Fixes #614. This commit also includes some granular tests of how the sax parser handles different entities under different circumstances, which should be clarifying for user reports like #1284 and #1500 that expect predefined entities and character references to be treated like parsed entities (which they aren't).
The behavior here is relatively complex, being a function of entity type and `#replace_entities` value, but there are sufficient tests and both Java and C impls behave identically. Related to the problem described at #1926.
Previously, the Java impl callback passed an empty string for URI.
and add test coverage for HTML4 SAX parser exception handling
I'm probably doing something wrong, but I'm timing out on trying to fix it.
Should have been part of 46c1e10
Update docs for: - XML::SAX - XML::SAX::Document - XML::SAX::ParserContext - XML::SAX::PushParser - HTML4::SAX::PushParser
and jruby-head is still failing, skip rdoc/psych entirely
**What problem is this PR intended to solve?** Now that both docker/ruby and setup-ruby have 3.4 releases, let's update CI.
Introduce code that is an optimized version of libxml2's xmlNewNsProp to avoid traversing the properties linked list to append an attribute every time we add one.
[skip ci]
If there are more than 16 attributes, shift from doing strcmp to using a hashmap for duplicate detection. The number 16 was chosen based on the benchmark in #2568 I've introduced @tidwall's hashmap.c (MIT licensed and the copyright appropriately copied in the LICENSE-DEPENDENCIES file) to have something self-contained within the libgumbo codebase, rather than using libxml2's xmlHash or ruby's st.c.
to prevent it from being GCed before we parse it.
## What problem is this PR intended to solve? Addressing quadratic performance of HTML5 attribute parsing as described in #2568 and rubys/nokogumbo#144. Closes #2568 ### Some benchmarks <details><summary>Here is the benchmark script.</summary> ```ruby #!/usr/bin/env ruby require "bundler/inline" gemfile do source "https://rubygems.org" gem "nokogiri", path: "." gem "benchmark-ips" end def html_with_attributes(count) html = "<div" count.times do |j| html << " fake-attr-#{j}" end html << ">" end html = {} Benchmark.ips do |x| x.warmup = 0 [ 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384 ].each do |attribute_count| html[attribute_count] = html_with_attributes(attribute_count) x.report "#{attribute_count.to_s.rjust(7)} attributes" do Nokogiri::HTML5(html[attribute_count], max_attributes: 100_000) end end end ``` </details> before: ``` Calculating ------------------------------------- 1 attributes 242.976k (± 9.3%) i/s (4.12 μs/i) - 936.599k in 4.838811s 2 attributes 219.456k (± 9.4%) i/s (4.56 μs/i) - 837.644k in 4.858317s 4 attributes 191.173k (± 9.3%) i/s (5.23 μs/i) - 715.392k in 4.882336s 8 attributes 144.813k (±12.2%) i/s (6.91 μs/i) - 516.194k in 4.912577s 16 attributes 83.396k (±17.2%) i/s (11.99 μs/i) - 238.564k in 4.958331s 32 attributes 55.800k (± 9.9%) i/s (17.92 μs/i) - 190.871k in 4.970063s 64 attributes 27.381k (±14.7%) i/s (36.52 μs/i) - 91.867k in 4.981936s 128 attributes 12.707k (± 9.4%) i/s (78.70 μs/i) - 49.906k in 4.990254s 256 attributes 4.723k (± 8.1%) i/s (211.74 μs/i) - 19.657k in 4.995888s 512 attributes 1.417k (±10.6%) i/s (705.61 μs/i) - 6.339k in 4.998084s 1024 attributes 382.078 (±11.5%) i/s (2.62 ms/i) - 1.811k in 5.000356s 2048 attributes 119.825 (±10.8%) i/s (8.35 ms/i) - 582.000 in 5.003555s 4096 attributes 29.034 (± 6.9%) i/s (34.44 ms/i) - 144.000 in 5.003602s 8192 attributes 6.492 (± 0.0%) i/s (154.04 ms/i) - 33.000 in 5.111381s 16384 attributes 1.478 (± 0.0%) i/s (676.68 ms/i) - 8.000 in 5.417232s ``` after: ``` Calculating ------------------------------------- 1 attributes 229.084k (±10.4%) i/s (4.37 μs/i) - 868.071k in 4.846385s 2 attributes 213.788k (± 9.9%) i/s (4.68 μs/i) - 804.683k in 4.859819s 4 attributes 181.602k (±10.7%) i/s (5.51 μs/i) - 652.740k in 4.886700s 8 attributes 137.811k (± 9.9%) i/s (7.26 μs/i) - 500.690k in 4.911176s 16 attributes 95.462k (± 9.9%) i/s (10.48 μs/i) - 338.045k in 4.938699s 32 attributes 61.388k (± 9.5%) i/s (16.29 μs/i) - 216.911k in 4.961587s 64 attributes 34.368k (±10.2%) i/s (29.10 μs/i) - 121.985k in 4.976219s 128 attributes 17.315k (±10.9%) i/s (57.75 μs/i) - 60.716k in 4.987542s 256 attributes 9.011k (± 9.2%) i/s (110.98 μs/i) - 32.443k in 4.992878s 512 attributes 4.552k (±11.8%) i/s (219.69 μs/i) - 16.522k in 4.995834s 1024 attributes 2.255k (±13.6%) i/s (443.38 μs/i) - 8.303k in 4.997861s 2048 attributes 1.091k (±17.6%) i/s (916.91 μs/i) - 4.132k in 4.998710s 4096 attributes 544.840 (±22.0%) i/s (1.84 ms/i) - 2.101k in 5.006943s 8192 attributes 282.789 (±22.3%) i/s (3.54 ms/i) - 1.089k in 5.000604s 16384 attributes 134.991 (±27.4%) i/s (7.41 ms/i) - 542.000 in 5.001360s ``` In graphical form: ![image](https://github.com/user-attachments/assets/c07ac065-0069-43ce-85e7-ed1cb4a86542) Just the "after", showing linear performance now: ![image](https://github.com/user-attachments/assets/3051bcda-1b55-4270-8f14-f24eba0fb327) ### Analysis Analyzing the parsing of a tag with 16000 attributes (using `samply`), we see two hotspots: ![image](https://github.com/user-attachments/assets/3a2409e6-c394-4a73-a1b2-3373654e760d) 1. the `strlen`/`memcmp` checks in libgumbo's `tokenizer.c:finish_attribute_name` which checks for duplicates in the attribute list: https://github.com/sparklemotion/nokogiri/blob/729c96c05f692d69e4d1d3be6d4b9524bdbe9322/gumbo-parser/src/tokenizer.c#L804-L820 2. libxml2's `xmlNewPropInternal` traversing the singly-linked properties list to append each new property (to maintain parse order) This PR addresses (1) by introducing a hashmap and using it if there are 16 or more attributes. This PR addresses (2) by inlining most of `xmlNewNsProp` but keeping a pointer to the last property to avoid re-traversing the linked list each time. After those changes, the flamegraph looks much better, with no obvious hotspots: ![image](https://github.com/user-attachments/assets/36859e83-b35b-4489-af24-7f8c2f52935c) ### hashmap Keeping in mind we may still someday want to publish our libgumbo fork as a separate library, I opted to bring in a new self-contained hashmap library from https://github.com/tidwall/hashmap.c I'm using [xxHash3](https://github.com/Cyan4973/xxHash) which seems to be pretty fast, but I'm open to other opinions on what hash function to use. ## Have you included adequate test coverage? Yes. ## Does this change affect the behavior of either the C or the Java implementations? HTML5 is CRuby-only.
**What problem is this PR intended to solve?** I saw a failure in CI at https://github.com/sparklemotion/nokogiri/actions/runs/12529532790/job/34945362057?pr=3393 ``` 1) Error: Nokogiri::HTML4::SAX::ParserContext::constructor::encoding::.io#test_0001_supports passing encoding name: NotImplementedError: method `read' called on terminated object (0x0000005520809790) test/html4/sax/test_parser_context.rb:60:in `parse_with' test/html4/sax/test_parser_context.rb:60:in `block (5 levels) in <module:SAX>' ``` So now the class keeps a reference to the input object to prevent it from being GCed before we parse it. **Have you included adequate test coverage?** Yes. **Does this change affect the behavior of either the C or the Java implementations?** N/A
[skip ci]
now that rubyinstaller has been released
**What problem is this PR intended to solve?** bump windows test images from "head" to "3.4" now that rubyinstaller has been released
As of Sanitize 7.0.0, tests are passing again with the latest Nokogiri.
**What problem is this PR intended to solve?** In #3348, downstream CI tests for Sanitize were temporarily disabled because the changes in that PR (intentionally) caused some of Sanitize's tests to fail. As of Sanitize 7.0.0, tests are passing again with the latest Nokogiri and it's safe to re-enable the downstream tests. 🎉
which updates to Ruby 3.4.1
It's unused.
**What problem is this PR intended to solve?** It's unused. See discussion at #2950
Updates the requirements on [ruby_memcheck](https://github.com/Shopify/ruby_memcheck) to permit the latest version. - [Release notes](https://github.com/Shopify/ruby_memcheck/releases) - [Commits](Shopify/ruby_memcheck@3.0.0...3.0.1) --- updated-dependencies: - dependency-name: ruby_memcheck dependency-type: direct:development ... Signed-off-by: dependabot[bot] <[email protected]>
- drop adding the flat namespace linker flag which rcd injects for us - detect the need for darwin linker hack based on linker flags, not ruby version
**What problem is this PR intended to solve?** Bump rake-compiler-dock to 1.8.0, see https://github.com/rake-compiler/rake-compiler-dock/releases/tag/v1.8.0
Updates the requirements on [rake-compiler](https://github.com/luislavena/rake-compiler) to permit the latest version. - [Release notes](https://github.com/luislavena/rake-compiler/releases) - [Changelog](https://github.com/rake-compiler/rake-compiler/blob/master/History.md) - [Commits](rake-compiler/rake-compiler@v1.2.8...v1.2.9) --- updated-dependencies: - dependency-name: rake-compiler dependency-type: direct:development ... Signed-off-by: dependabot[bot] <[email protected]>
Updates the requirements on [rubyzip](https://github.com/rubyzip/rubyzip) to permit the latest version. - [Release notes](https://github.com/rubyzip/rubyzip/releases) - [Changelog](https://github.com/rubyzip/rubyzip/blob/master/Changelog.md) - [Commits](rubyzip/rubyzip@v2.3.2...v2.4.1) --- updated-dependencies: - dependency-name: rubyzip dependency-type: direct:development ... Signed-off-by: dependabot[bot] <[email protected]>
**What problem is this PR intended to solve?** Recently we've been getting upstream CI failures for libxml2 head on ubuntu: https://github.com/sparklemotion/nokogiri/actions/runs/12705536326/job/35416744559 ``` configure:15841: checking for liblzma configure:15848: $PKG_CONFIG --exists --print-errors "liblzma" Package liblzma was not found in the pkg-config search path. Perhaps you should add the directory containing `liblzma.pc' to the PKG_CONFIG_PATH environment variable Package 'liblzma', required by 'virtual:world', not found configure:15851: $? = 1 configure:15865: $PKG_CONFIG --exists --print-errors "liblzma" *** ../../../../ext/nokogiri/extconf.rb failed *** ```
See Commits and Changes for more details.
Created by pull[bot]
Can you help keep this open source service alive? 💖 Please sponsor : )