Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable the huge_tree option for the lxml parser #3365

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

seberm
Copy link
Collaborator

@seberm seberm commented Nov 18, 2024

This PR should resolve the problem with the LXML limit on the size of text nodes it can handle:

For more info about a huge_tree option, please see:

Especially this LXML FAQ section about security concerns:

Is lxml vulnerable to XML bombs?

This has nothing to do with lxml itself, only with the parser of libxml2. Since libxml2 version 2.7, the parser imposes hard security limits on input documents to prevent DoS attacks with forged input data. Since lxml 2.2.1, you can disable these limits with the huge_tree parser option if you need to parse really large, trusted documents. All lxml versions will leave these restrictions enabled by default.

Note that libxml2 versions of the 2.6 series do not restrict their parser and are therefore vulnerable to DoS attacks.

TODOs:

  • Try to test the test with huge output without lxml (with just jinja) to check if bottleneck is really lxml (schema or pretty print)
  • Enable the "huge output" test in the CI? -> yes

Pull Request Checklist:

  • implement the feature
  • write the documentation
  • extend the test coverage
  • update the specification
  • adjust plugin docstring
  • modify the json schema
  • mention the version
  • include a release note

@seberm seberm added plugin | junit The junit report plugin plugin | reportportal The reportportal report plugin labels Nov 18, 2024
@seberm seberm added this to the 1.40 milestone Nov 18, 2024
@seberm seberm added the ci | full test Pull request is ready for the full test execution label Nov 18, 2024
@seberm
Copy link
Collaborator Author

seberm commented Nov 18, 2024

Hello @thrix, @happz, @psss ,
Do you have any objections or concerns about enabling the huge_tree option, especially from the TestingFarm perspective?

I think we should be perfectly fine with enabling it.

Thanks!

@seberm
Copy link
Collaborator Author

seberm commented Nov 18, 2024

@KwisatzHaderach Are you able to somehow test this change and let me know if it resolves your problem?

The processing of huge files using LXML can take a lot of resources (esp. CPU). We can always add an option/condition to completely bypass the LXML processing and use just Jinja2. The LXML is used to just check the XML schema for non-custom JUnit flavors and for prettifying the XML output.

@thrix
Copy link
Collaborator

thrix commented Nov 19, 2024

Hello @thrix, @happz, @psss , Do you have any objections or concerns about enabling the huge_tree option, especially from the TestingFarm perspective?

I think we should be perfectly fine with enabling it.

Thanks!

@seberm fine with me, the XML is not something user can easily inject I assume, if he wants to DOS us he has a lot of other options directly from the tests anyway.

Copy link
Collaborator

@KwisatzHaderach KwisatzHaderach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

didn't get a chance to test it yet, but approving since it looks good and we need this sooner rather than later

@seberm
Copy link
Collaborator Author

seberm commented Nov 19, 2024

So I gave it an another round of tests. Let's take following test as an example:

require:
  - python
result: fail
test: python -c "print((('a' * 1023) + '\n') * 1024 * 10)"

As you can see from the results below, enabling the huge_tree=True option makes sense, and I think it will help solve the #3363 issue.

1) LXML completely bypassed (uninstalled), using just Jinja to generate the JUnit

Takes around 43s on my machine to generate, but it works.

$ python -m tmt -r ~/tmt-learning/huge-out run -a report --how junit --file x    43.01s user 2.08s system 71% cpu 1:03.41 total

2) LXML enabled (schema validation + pretty print), huge_tree=False

Takes around 44s on my machine, it crashes on huge text node error.

$ python -m tmt -r ~/tmt-learning/huge-out run -a report --how junit --file x    44.30s user 2.47s system 69% cpu 1:06.87 total

3) LXML enabled (schema validation + pretty print), huge_tree=True

Takes also around 44s on my machine, it works without error.

$ python -m tmt -r ~/tmt-learning/huge-out run -a report --how junit --file x    44.22s user 2.20s system 71% cpu 1:04.93 total

(another?) problem I was facing

If you change the test to the following command (similar amount of data as in previous test but no line ending):

test: python -c "print('a' * (10 * 1024 * 1024 + 1))"

The tmt is still "working" on generating the report and it seems like it will never finish. This behavior is the same with or without the LXML installed.

I would appreciate it if someone could help me to profile this test in the tmt, so we could decide if this is actually a problem that needs to be solved. Perhaps not? Any ideas?

@seberm
Copy link
Collaborator Author

seberm commented Nov 20, 2024

@psss @happz Do you think it's a good idea to enable the test mentioned above (test: python -c "print((('a' * 1023) + '\n') * 1024 * 10)") in the CI? I am not sure if it's a good idea to slow down the CI in this way. Thanks.

@seberm seberm force-pushed the feature/enable-huge-tree-for-lxml branch from 2a7c422 to 5317bc7 Compare November 21, 2024 14:42
@seberm seberm requested a review from thrix November 21, 2024 14:42
@psss psss changed the title Enable the huge_tree option for lxml parser Enable the huge_tree option for the lxml parser Nov 21, 2024
Copy link
Collaborator

@martinhoyer martinhoyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand it correctly, we want large depth (above 256) to require huge_tree.

We played with it locally with:

DEPTH=257
OUTPUT_FILE="deep.xml"

# Start the XML file with a root tag
echo "<root>" > "$OUTPUT_FILE"

# Generate nested tags
for i in $(seq 1 $DEPTH); do
    printf "%0.s " >> "$OUTPUT_FILE" # Indent for readability (optional)
    echo "<tag$i>" >> "$OUTPUT_FILE"
done

# Close the nested tags in reverse order
for i in $(seq $DEPTH -1 1); do
    printf "%0.s " >> "$OUTPUT_FILE" # Indent for readability (optional)
    echo "</tag$i>" >> "$OUTPUT_FILE"
done

# Close the root tag
echo "</root>" >> "$OUTPUT_FILE"

And then I thought we could catch the exception and only use huge_tree, when it's necessary?

from lxml import etree

def parse_with_huge_tree(file_path):
    try:
        tree = etree.parse(file_path)
        print("XML parsed successfully (default settings).")
        return tree
    except etree.LxmlError as e:
        print(f"LxmlError encountered: {e}")
        print("Retrying with huge_tree=True...")
        try:
            parser = etree.XMLParser(huge_tree=True)
            tree = etree.parse(file_path, parser)
            print("XML parsed successfully with huge_tree=True.")
            return tree
        except etree.LxmlError as retry_error:
            print(f"Retry failed with huge_tree=True: {retry_error}")
        except Exception as retry_exception:
            print(f"Unexpected error during retry: {retry_exception}")
    except Exception as e:
        print(f"Unexpected error: {e}")

# Usage
parse_with_huge_tree('./deep.xml')

Make sense?

(and maybe print a warning that it's quite large and we are being forced to use huge_tree, security vulnerability, et cetera)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ci | full test Pull request is ready for the full test execution plugin | junit The junit report plugin plugin | reportportal The reportportal report plugin
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Polarion XML fails to generate due to xmlSAX2Characters: huge text node
4 participants