Enable the `huge_tree` option for the `lxml` parser #3365

seberm · 2024-11-18T10:20:03Z

This PR should resolve the problem with the LXML limit on the size of text nodes it can handle:

Polarion XML fails to generate due to xmlSAX2Characters: huge text node #3363

For more information about the `huge_tree` option, see:

https://lxml.de/api/lxml.etree.XMLParser-class.html

Especially this LXML FAQ section about security concerns:

https://lxml.de/FAQ.html#is-lxml-vulnerable-to-xml-bombs

Is lxml vulnerable to XML bombs?

This has nothing to do with lxml itself, only with the parser of libxml2. Since libxml2 version 2.7, the parser imposes hard security limits on input documents to prevent DoS attacks with forged input data. Since lxml 2.2.1, you can disable these limits with the huge_tree parser option if you need to parse really large, trusted documents. All lxml versions will leave these restrictions enabled by default.

Note that libxml2 versions of the 2.6 series do not restrict their parser and are therefore vulnerable to DoS attacks.
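As a rough illustration of the option described in the FAQ excerpt, the sketch below builds a document whose single text node is larger than 10 MB and parses it with `huge_tree=True`. The document contents and the helper name `build_document` are made up for this example. Whether the default parser rejects the input depends on the libxml2 build, so that outcome is only reported, not assumed.

```python
from lxml import etree

CHUNKS = 10 * 1024  # each chunk decodes to 1024 characters


def build_document() -> bytes:
    """Return XML with one text node of roughly 10 MiB (hypothetical data)."""
    # The entity reference makes the parser assemble the text node piecewise.
    chunk = "a" * 1023 + "&amp;"
    return ("<out>" + chunk * CHUNKS + "</out>").encode("utf-8")


document = build_document()

# Default limits: may raise XMLSyntaxError ("huge text node") on such input.
try:
    etree.fromstring(document)
    default_outcome = "parsed"
except etree.XMLSyntaxError as error:
    default_outcome = f"rejected: {error}"

# Limits lifted: only appropriate for trusted documents.
parser = etree.XMLParser(huge_tree=True)
root = etree.fromstring(document, parser)

print("default parser:", default_outcome)
print("huge_tree parser text length:", len(root.text))
```

The parser object is passed explicitly, so the relaxed limits apply only to the calls that opt in.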

TODOs:

Run the huge-output test without lxml (Jinja only) to check whether lxml (schema validation or pretty-printing) is really the bottleneck
Enable the "huge output" test in the CI? -> yes

Pull Request Checklist:

seberm · 2024-11-18T10:32:35Z

Hello @thrix, @happz, @psss,
Do you have any objections or concerns about enabling the huge_tree option, especially from the TestingFarm perspective?

I think we should be perfectly fine with enabling it.

Thanks!

seberm · 2024-11-18T11:53:01Z

@KwisatzHaderach Are you able to somehow test this change and let me know if it resolves your problem?

Processing huge files with lxml can consume a lot of resources (especially CPU). We can always add an option or condition to bypass lxml entirely and use only Jinja2. lxml is used only to validate the XML against the schema for non-custom JUnit flavors and to prettify the XML output.
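A minimal sketch of such a bypass, assuming lxml is an optional dependency: the function name `finalize_junit` and the `use_lxml` flag are hypothetical and not taken from the tmt code base, and schema validation is left out for brevity.

```python
try:
    from lxml import etree  # optional dependency
except ImportError:
    etree = None


def finalize_junit(xml_text: str, use_lxml: bool = True) -> str:
    """Return the rendered XML, prettified by lxml when available and requested."""
    if not use_lxml or etree is None:
        # Bypass: hand back the template output untouched.
        return xml_text

    # huge_tree=True lifts the parser limits; remove_blank_text lets
    # pretty_print re-indent the tree.
    parser = etree.XMLParser(huge_tree=True, remove_blank_text=True)
    root = etree.fromstring(xml_text.encode("utf-8"), parser)
    return etree.tostring(root, pretty_print=True, encoding="unicode")


rendered = '<testsuite name="demo"><testcase name="t1"/></testsuite>'
print(finalize_junit(rendered, use_lxml=False))
print(finalize_junit(rendered))
```

With the flag off, or with lxml not installed, the output is exactly what the template produced, which matches the "Jinja only" scenario discussed here.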

tests/report/junit/data/main.fmf

thrix · 2024-11-19T13:20:54Z

> Hello @thrix, @happz, @psss,
> Do you have any objections or concerns about enabling the huge_tree option, especially from the TestingFarm perspective?
>
> I think we should be perfectly fine with enabling it.
>
> Thanks!

@seberm Fine with me. I assume the XML is not something a user can easily inject, and anyone who wants to DoS us has plenty of other options directly from the tests anyway.

KwisatzHaderach

I didn't get a chance to test it yet, but I'm approving since it looks good and we need this sooner rather than later.

seberm · 2024-11-19T22:33:50Z

So I gave it another round of tests. Let's take the following test as an example:

require:
  - python
result: fail
test: python -c "print((('a' * 1023) + '\n') * 1024 * 10)"

As you can see from the results below, enabling the huge_tree=True option makes sense, and I think it will help solve the #3363 issue.

1) LXML completely bypassed (uninstalled), using just Jinja to generate the JUnit

Takes around 43s on my machine to generate, but it works.

$ python -m tmt -r ~/tmt-learning/huge-out run -a report --how junit --file x    43.01s user 2.08s system 71% cpu 1:03.41 total

2) LXML enabled (schema validation + pretty print), `huge_tree=False`

Takes around 44s on my machine and crashes with the huge text node error.

$ python -m tmt -r ~/tmt-learning/huge-out run -a report --how junit --file x    44.30s user 2.47s system 69% cpu 1:06.87 total

3) LXML enabled (schema validation + pretty print), `huge_tree=True`

Also takes around 44s on my machine and works without error.

$ python -m tmt -r ~/tmt-learning/huge-out run -a report --how junit --file x    44.22s user 2.20s system 71% cpu 1:04.93 total
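For a sense of scale, the arithmetic behind the test command used in these three runs can be sketched as follows. The value of `XML_MAX_TEXT_LENGTH` is an assumption about the libxml2 default for a single text node and is not stated in this thread.

```python
# Assumed libxml2 default limit for one text node (not stated in this thread).
XML_MAX_TEXT_LENGTH = 10_000_000


def test_output_size(line_width: int = 1023, lines: int = 1024 * 10) -> int:
    """Size of the payload built by the test command, without print's final newline."""
    return len((("a" * line_width) + "\n") * lines)


size = test_output_size()
print(size)                        # 10485760
print(size > XML_MAX_TEXT_LENGTH)  # True
```

Under that assumption the output exceeds the limit by a little under 5%, which is consistent with run 2 failing and run 3 passing.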

(another?) problem I was facing

If you change the test to the following command (a similar amount of data as in the previous test, but without line endings):

test: python -c "print('a' * (10 * 1024 * 1024 + 1))"

tmt keeps "working" on generating the report and seems like it will never finish. The behavior is the same with or without lxml installed.

I would appreciate it if someone could help me profile this test in tmt, so we can decide whether this is actually a problem that needs to be solved. Perhaps not? Any ideas?
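One possible starting point for such profiling, using only the standard library: the helper `profile_call` and the stand-in function `render_stand_in` are hypothetical, and the real report-generation callable from tmt would be passed in its place.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile and return (result, text report of the top entries)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the slowest call chains come first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


def render_stand_in(text):
    """Hypothetical placeholder for the code path that renders the report."""
    return "".join(character for character in text)


result, report = profile_call(render_stand_in, "a" * 100_000)
print(report)
```

The same data can be collected for a whole command with `python -m cProfile -o tmt.prof -m tmt ...` and inspected afterwards with `pstats`.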

seberm · 2024-11-20T15:59:16Z

@psss @happz Do you think it's a good idea to enable the test mentioned above (test: python -c "print((('a' * 1023) + '\n') * 1024 * 10)") in the CI? I am not sure if it's a good idea to slow down the CI in this way. Thanks.

martinhoyer

If I understand correctly, we want a large depth (above 256) to require huge_tree.

We played with it locally with:

DEPTH=257
OUTPUT_FILE="deep.xml"

# Start the XML file with a root tag
echo "<root>" > "$OUTPUT_FILE"

# Generate nested tags
for i in $(seq 1 $DEPTH); do
    printf "%0.s " >> "$OUTPUT_FILE" # Indent for readability (optional)
    echo "<tag$i>" >> "$OUTPUT_FILE"
done

# Close the nested tags in reverse order
for i in $(seq $DEPTH -1 1); do
    printf "%0.s " >> "$OUTPUT_FILE" # Indent for readability (optional)
    echo "</tag$i>" >> "$OUTPUT_FILE"
done

# Close the root tag
echo "</root>" >> "$OUTPUT_FILE"
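The same kind of file can also be produced from Python, which may be handier inside a test. This is only a sketch: the function name `nested_xml` is made up, and the indentation from the shell version is dropped.

```python
def nested_xml(depth: int) -> str:
    """Return an XML document with `depth` nested tags inside a root element."""
    opening = [f"<tag{i}>" for i in range(1, depth + 1)]
    # Close the nested tags in reverse order.
    closing = [f"</tag{i}>" for i in range(depth, 0, -1)]
    return "\n".join(["<root>", *opening, *closing, "</root>"]) + "\n"


document = nested_xml(257)

with open("deep.xml", "w", encoding="utf-8") as output_file:
    output_file.write(document)
```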

And then I thought we could catch the exception and only use huge_tree when it's necessary?

import logging

from lxml import etree

log = logging.getLogger(__name__)


def parse_with_huge_tree(file_path):
    """Parse with the default limits first; retry with huge_tree=True on failure."""
    try:
        return etree.parse(file_path)
    except etree.LxmlError as error:
        # Either a parser limit was hit or the XML is malformed.
        log.warning("Default parse failed (%s); retrying with huge_tree=True.", error)

    # Lifting the limits is only safe for trusted input. A second failure
    # propagates to the caller instead of being silently swallowed.
    parser = etree.XMLParser(huge_tree=True)
    return etree.parse(file_path, parser)


# Usage
parse_with_huge_tree('./deep.xml')

Make sense?

(and maybe print a warning that it's quite large and we are being forced to use huge_tree, security vulnerability, et cetera)

seberm added the plugin | junit and plugin | reportportal labels Nov 18, 2024

seberm added this to the 1.40 milestone Nov 18, 2024

seberm requested review from psss, lukaszachy, happz, thrix and janhavlin as code owners November 18, 2024 10:20

seberm added the ci | full test label Nov 18, 2024

seberm linked an issue Nov 18, 2024 that may be closed by this pull request

Polarion XML fails to generate due to xmlSAX2Characters: huge text node #3363

Open

thrix reviewed Nov 19, 2024

KwisatzHaderach approved these changes Nov 19, 2024

seberm added 3 commits November 21, 2024 14:35

Enable the huge_tree option for lxml parser

221bb96

Add disabled test which generates huge output into JUnit file

1baa35e

Enable and change the test command to a working one

5317bc7

seberm force-pushed the feature/enable-huge-tree-for-lxml branch from 2a7c422 to 5317bc7 Compare November 21, 2024 14:42

seberm requested a review from thrix November 21, 2024 14:42

seberm added 2 commits November 21, 2024 17:22

Improve the test the way it executes the execute step only once

4850932

Use rlRun_LOG instead of teeing the output

41204e8

psss changed the title ~~Enable the huge_tree option for lxml parser~~ Enable the huge_tree option for the lxml parser Nov 21, 2024

martinhoyer reviewed Nov 21, 2024
