Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[pull] main from sparklemotion:main #3

Open
wants to merge 1,754 commits into
base: main
Choose a base branch
from
Open

Conversation

pull[bot]
Copy link

@pull pull bot commented Oct 22, 2021

See Commits and Changes for more details.


Created by pull[bot]

Can you help keep this open source service alive? 💖 Please sponsor : )

@pull pull bot added the ⤵️ pull label Oct 22, 2021
@google-cla
Copy link

google-cla bot commented Oct 22, 2021

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.


What to do if you already signed the CLA

Individual signers
Corporate signers

ℹ️ Googlers: Go here for more info.

15 similar comments
@google-cla
Copy link

google-cla bot commented Oct 23, 2021

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.


What to do if you already signed the CLA

Individual signers
Corporate signers

ℹ️ Googlers: Go here for more info.

@google-cla
Copy link

google-cla bot commented Oct 23, 2021

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.


What to do if you already signed the CLA

Individual signers
Corporate signers

ℹ️ Googlers: Go here for more info.

@google-cla
Copy link

google-cla bot commented Oct 23, 2021

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.


What to do if you already signed the CLA

Individual signers
Corporate signers

ℹ️ Googlers: Go here for more info.

@google-cla
Copy link

google-cla bot commented Oct 26, 2021

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.


What to do if you already signed the CLA

Individual signers
Corporate signers

ℹ️ Googlers: Go here for more info.

@google-cla
Copy link

google-cla bot commented Oct 27, 2021

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.


What to do if you already signed the CLA

Individual signers
Corporate signers

ℹ️ Googlers: Go here for more info.

@google-cla
Copy link

google-cla bot commented Oct 28, 2021

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.


What to do if you already signed the CLA

Individual signers
Corporate signers

ℹ️ Googlers: Go here for more info.

@google-cla
Copy link

google-cla bot commented Nov 3, 2021

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.


What to do if you already signed the CLA

Individual signers
Corporate signers

ℹ️ Googlers: Go here for more info.

@google-cla
Copy link

google-cla bot commented Nov 3, 2021

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.


What to do if you already signed the CLA

Individual signers
Corporate signers

ℹ️ Googlers: Go here for more info.

@google-cla
Copy link

google-cla bot commented Nov 6, 2021

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.


What to do if you already signed the CLA

Individual signers
Corporate signers

ℹ️ Googlers: Go here for more info.

@google-cla
Copy link

google-cla bot commented Nov 6, 2021

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.


What to do if you already signed the CLA

Individual signers
Corporate signers

ℹ️ Googlers: Go here for more info.

@google-cla
Copy link

google-cla bot commented Nov 12, 2021

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.


What to do if you already signed the CLA

Individual signers
Corporate signers

ℹ️ Googlers: Go here for more info.

@google-cla
Copy link

google-cla bot commented Nov 15, 2021

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.


What to do if you already signed the CLA

Individual signers
Corporate signers

ℹ️ Googlers: Go here for more info.

@google-cla
Copy link

google-cla bot commented Nov 15, 2021

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.


What to do if you already signed the CLA

Individual signers
Corporate signers

ℹ️ Googlers: Go here for more info.

@google-cla
Copy link

google-cla bot commented Nov 15, 2021

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.


What to do if you already signed the CLA

Individual signers
Corporate signers

ℹ️ Googlers: Go here for more info.

@google-cla
Copy link

google-cla bot commented Nov 15, 2021

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.


What to do if you already signed the CLA

Individual signers
Corporate signers

ℹ️ Googlers: Go here for more info.

also:
- improve exception messages
- and make the exception messages consistent between Java and C impls
- and introduce some assert_pattern testing to simplify complex tests
and some other small nonfunctional changes
which will allow us to use libxml2's default SAX handlers if we wish,
and simplifies some things.
On CRuby, this fixes the fact that the parser was registering errors
when encountering general (non-predefined) entities. Now these
entities are resolved properly and converted into `#characters`
callbacks. Fixes #1926.

On JRuby, the SAX parser now respects the `#replace_entities`
attribute, which was previously ignored AND defaulted incorrectly to
`true`. The default now matches CRuby -- `false` -- and the parser
behavior matches CRuby with respect to entities. Fixes #614.

This commit also includes some granular tests of how the sax parser
handles different entities under different circumstances, which should
be clarifying for user reports like #1284 and #1500 that expect
predefined entities and character references to be treated like parsed
entities (which they aren't).
The behavior here is relatively complex, being a function of entity
type and `#replace_entities` value, but there are sufficient tests and
both Java and C impls behave identically.

Related to the problem described at #1926.
Previously, the Java impl callback passed an empty string for URI.
and add test coverage for HTML4 SAX parser exception handling
I'm probably doing something wrong, but I'm timing out on trying to
fix it.
Update docs for:

- XML::SAX
- XML::SAX::Document
- XML::SAX::ParserContext
- XML::SAX::PushParser
- HTML4::SAX::PushParser
flavorjones and others added 30 commits December 27, 2024 08:58
and jruby-head is still failing, skip rdoc/psych entirely
**What problem is this PR intended to solve?**

Now that both docker/ruby and setup-ruby have 3.4 releases, let's update
CI.
Introduce code that is an optimized version of libxml2's xmlNewNsProp
to avoid traversing the properties linked list to append an attribute
every time we add one.
If there are more than 16 attributes, shift from doing strcmp to using
a hashmap for duplicate detection. The number 16 was chosen based on
the benchmark in #2568

I've introduced @tidwall's hashmap.c (MIT licensed and the copyright
appropriately copied in the LICENSE-DEPENDENCIES file) to have
something self-contained within the libgumbo codebase, rather than
using libxml2's xmlHash or ruby's st.c.
to prevent it from being GCed before we parse it.
## What problem is this PR intended to solve?

Addressing quadratic performance of HTML5 attribute parsing as described
in #2568 and rubys/nokogumbo#144.

Closes #2568


### Some benchmarks

<details><summary>Here is the benchmark script.</summary>

```ruby
#!/usr/bin/env ruby

require "bundler/inline"

gemfile do
  source "https://rubygems.org"
  gem "nokogiri", path: "."
  gem "benchmark-ips"
end

def html_with_attributes(count)
  html = "<div"
  count.times do |j|
    html << " fake-attr-#{j}"
  end
  html << ">"
end

html = {}
Benchmark.ips do |x|
  x.warmup = 0

  [ 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384 ].each do |attribute_count|
    html[attribute_count] = html_with_attributes(attribute_count)

    x.report "#{attribute_count.to_s.rjust(7)} attributes" do
      Nokogiri::HTML5(html[attribute_count], max_attributes: 100_000)
    end
  end
end
```

</details>

before:

```
Calculating -------------------------------------
        1 attributes    242.976k (± 9.3%) i/s    (4.12 μs/i) -    936.599k in   4.838811s
        2 attributes    219.456k (± 9.4%) i/s    (4.56 μs/i) -    837.644k in   4.858317s
        4 attributes    191.173k (± 9.3%) i/s    (5.23 μs/i) -    715.392k in   4.882336s
        8 attributes    144.813k (±12.2%) i/s    (6.91 μs/i) -    516.194k in   4.912577s
       16 attributes     83.396k (±17.2%) i/s   (11.99 μs/i) -    238.564k in   4.958331s
       32 attributes     55.800k (± 9.9%) i/s   (17.92 μs/i) -    190.871k in   4.970063s
       64 attributes     27.381k (±14.7%) i/s   (36.52 μs/i) -     91.867k in   4.981936s
      128 attributes     12.707k (± 9.4%) i/s   (78.70 μs/i) -     49.906k in   4.990254s
      256 attributes      4.723k (± 8.1%) i/s  (211.74 μs/i) -     19.657k in   4.995888s
      512 attributes      1.417k (±10.6%) i/s  (705.61 μs/i) -      6.339k in   4.998084s
     1024 attributes    382.078 (±11.5%) i/s    (2.62 ms/i) -      1.811k in   5.000356s
     2048 attributes    119.825 (±10.8%) i/s    (8.35 ms/i) -    582.000 in   5.003555s
     4096 attributes     29.034 (± 6.9%) i/s   (34.44 ms/i) -    144.000 in   5.003602s
     8192 attributes      6.492 (± 0.0%) i/s  (154.04 ms/i) -     33.000 in   5.111381s
    16384 attributes      1.478 (± 0.0%) i/s  (676.68 ms/i) -      8.000 in   5.417232s
```

after:

```
Calculating -------------------------------------
        1 attributes    229.084k (±10.4%) i/s    (4.37 μs/i) -    868.071k in   4.846385s
        2 attributes    213.788k (± 9.9%) i/s    (4.68 μs/i) -    804.683k in   4.859819s
        4 attributes    181.602k (±10.7%) i/s    (5.51 μs/i) -    652.740k in   4.886700s
        8 attributes    137.811k (± 9.9%) i/s    (7.26 μs/i) -    500.690k in   4.911176s
       16 attributes     95.462k (± 9.9%) i/s   (10.48 μs/i) -    338.045k in   4.938699s
       32 attributes     61.388k (± 9.5%) i/s   (16.29 μs/i) -    216.911k in   4.961587s
       64 attributes     34.368k (±10.2%) i/s   (29.10 μs/i) -    121.985k in   4.976219s
      128 attributes     17.315k (±10.9%) i/s   (57.75 μs/i) -     60.716k in   4.987542s
      256 attributes      9.011k (± 9.2%) i/s  (110.98 μs/i) -     32.443k in   4.992878s
      512 attributes      4.552k (±11.8%) i/s  (219.69 μs/i) -     16.522k in   4.995834s
     1024 attributes      2.255k (±13.6%) i/s  (443.38 μs/i) -      8.303k in   4.997861s
     2048 attributes      1.091k (±17.6%) i/s  (916.91 μs/i) -      4.132k in   4.998710s
     4096 attributes    544.840 (±22.0%) i/s    (1.84 ms/i) -      2.101k in   5.006943s
     8192 attributes    282.789 (±22.3%) i/s    (3.54 ms/i) -      1.089k in   5.000604s
    16384 attributes    134.991 (±27.4%) i/s    (7.41 ms/i) -    542.000 in   5.001360s
```

In graphical form:


![image](https://github.com/user-attachments/assets/c07ac065-0069-43ce-85e7-ed1cb4a86542)

Just the "after", showing linear performance now:


![image](https://github.com/user-attachments/assets/3051bcda-1b55-4270-8f14-f24eba0fb327)


### Analysis

Analyzing the parsing of a tag with 16000 attributes (using `samply`),
we see two hotspots:


![image](https://github.com/user-attachments/assets/3a2409e6-c394-4a73-a1b2-3373654e760d)

1. the `strlen`/`memcmp` checks in libgumbo's
`tokenizer.c:finish_attribute_name` which checks for duplicates in the
attribute list:
https://github.com/sparklemotion/nokogiri/blob/729c96c05f692d69e4d1d3be6d4b9524bdbe9322/gumbo-parser/src/tokenizer.c#L804-L820
2. libxml2's `xmlNewPropInternal` traversing the singly-linked
properties list to append each new property (to maintain parse order)

This PR addresses (1) by introducing a hashmap and using it if there are
16 or more attributes.

This PR addresses (2) by inlining most of `xmlNewNsProp` but keeping a
pointer to the last property to avoid re-traversing the linked list each
time.

After those changes, the flamegraph looks much better, with no obvious
hotspots:


![image](https://github.com/user-attachments/assets/36859e83-b35b-4489-af24-7f8c2f52935c)


### hashmap

Keeping in mind we may still someday want to publish our libgumbo fork
as a separate library, I opted to bring in a new self-contained hashmap
library from https://github.com/tidwall/hashmap.c

I'm using [xxHash3](https://github.com/Cyan4973/xxHash) which seems to
be pretty fast, but I'm open to other opinions on what hash function to
use.


## Have you included adequate test coverage?

Yes.


## Does this change affect the behavior of either the C or the Java
implementations?

HTML5 is CRuby-only.
**What problem is this PR intended to solve?**

I saw a failure in CI at
https://github.com/sparklemotion/nokogiri/actions/runs/12529532790/job/34945362057?pr=3393

```
  1) Error:
Nokogiri::HTML4::SAX::ParserContext::constructor::encoding::.io#test_0001_supports passing encoding name:
NotImplementedError: method `read' called on terminated object (0x0000005520809790)
    test/html4/sax/test_parser_context.rb:60:in `parse_with'
    test/html4/sax/test_parser_context.rb:60:in `block (5 levels) in <module:SAX>'
```

So now the class keeps a reference to the input object to prevent it
from being GCed before we parse it.


**Have you included adequate test coverage?**

Yes.


**Does this change affect the behavior of either the C or the Java
implementations?**

N/A
now that rubyinstaller has been released
**What problem is this PR intended to solve?**

bump windows test images from "head" to "3.4" now that rubyinstaller has
been released
As of Sanitize 7.0.0, tests are passing again with the latest Nokogiri.
**What problem is this PR intended to solve?**

In #3348, downstream CI tests for Sanitize were temporarily disabled
because the changes in that PR (intentionally) caused some of Sanitize's
tests to fail.

As of Sanitize 7.0.0, tests are passing again with the latest Nokogiri
and it's safe to re-enable the downstream tests. 🎉
which updates to Ruby 3.4.1
**What problem is this PR intended to solve?**

It's unused. See discussion at #2950
Updates the requirements on [ruby_memcheck](https://github.com/Shopify/ruby_memcheck) to permit the latest version.
- [Release notes](https://github.com/Shopify/ruby_memcheck/releases)
- [Commits](Shopify/ruby_memcheck@3.0.0...3.0.1)

---
updated-dependencies:
- dependency-name: ruby_memcheck
  dependency-type: direct:development
...

Signed-off-by: dependabot[bot] <[email protected]>
- drop adding the flat namespace linker flag which rcd injects for us
- detect the need for darwin linker hack based on linker flags, not ruby version
**What problem is this PR intended to solve?**

Bump rake-compiler-dock to 1.8.0, see
https://github.com/rake-compiler/rake-compiler-dock/releases/tag/v1.8.0
Updates the requirements on [rake-compiler](https://github.com/luislavena/rake-compiler) to permit the latest version.
- [Release notes](https://github.com/luislavena/rake-compiler/releases)
- [Changelog](https://github.com/rake-compiler/rake-compiler/blob/master/History.md)
- [Commits](rake-compiler/rake-compiler@v1.2.8...v1.2.9)

---
updated-dependencies:
- dependency-name: rake-compiler
  dependency-type: direct:development
...

Signed-off-by: dependabot[bot] <[email protected]>
Updates the requirements on [rubyzip](https://github.com/rubyzip/rubyzip) to permit the latest version.
- [Release notes](https://github.com/rubyzip/rubyzip/releases)
- [Changelog](https://github.com/rubyzip/rubyzip/blob/master/Changelog.md)
- [Commits](rubyzip/rubyzip@v2.3.2...v2.4.1)

---
updated-dependencies:
- dependency-name: rubyzip
  dependency-type: direct:development
...

Signed-off-by: dependabot[bot] <[email protected]>
**What problem is this PR intended to solve?**

Recently we've been getting upstream CI failures for libxml2 head on
ubuntu:
https://github.com/sparklemotion/nokogiri/actions/runs/12705536326/job/35416744559

```
configure:15841: checking for liblzma
configure:15848: $PKG_CONFIG --exists --print-errors "liblzma"
Package liblzma was not found in the pkg-config search path.
Perhaps you should add the directory containing `liblzma.pc'
to the PKG_CONFIG_PATH environment variable
Package 'liblzma', required by 'virtual:world', not found
configure:15851: $? = 1
configure:15865: $PKG_CONFIG --exists --print-errors "liblzma"
*** ../../../../ext/nokogiri/extconf.rb failed ***
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants