Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CDATA scanner not scanning content properly #125

Open
akshay-kr opened this issue Oct 28, 2024 · 15 comments
Open

CDATA scanner not scanning content properly #125

akshay-kr opened this issue Oct 28, 2024 · 15 comments

Comments

@akshay-kr
Copy link

It seems the CDATA scanning is behaving different after below commit,
49a31c0

For example if it has to scan <div></div> it just scans till <div and then append CDATA closing <div]]. Rest of the contents are not scanned.

I guess it is due to addition of below condition,

else if (c == '>') {
    // don't add the ]] to the buffer
    return false;
}

49a31c0#diff-c0ef5d78bafd2e2d032b106fd0ca50086966f97030bd6b6f4403d8f5ebe39f87R2490

I am using version 3.11.1

Any help here will be really appreciated.

@rbri
Copy link
Member

rbri commented Oct 29, 2024

Hi @akshay-kr,

correct, the cdata handling was changed to be in sync with the spec. You can read a bit about the reasoning for the changes in #102.

If you think there is an error in the parser please provide a small html sample. You can open this samples in your browser and then have a look at the DOM tree with the developer tools. If neko generates a different one i will try to fix it.

And curious to learn more about your use case....

@akshay-kr
Copy link
Author

akshay-kr commented Oct 29, 2024

Hi @rbri
Thanks for replying.

So basically if I have this as an input <![CDATA[<div></div>]]> then after CDATA scanning I get the output as <![CDATA[<div]]>]]&gt;.

The part after the first > inside CDATA section is ignored.

We allow users on our app to enter code snippets and we store these code snippets after sanitising it with Antisamy plugin. Now for example if user enters <div></div> as a code snippet only <div is stored after sanitisation.

It used to work properly before this committ.

Please do let me know if this information works for you else I will try to see if I can provide additional information.

@rbri
Copy link
Member

rbri commented Oct 29, 2024

So basically if I have this as an input ]]> then after CDATA scanning I get the output as ]]>.

I still think this is correct. Please play with the test-cdata-close-early-empty-tag.html file in your browser. The browser parses the file exactly the same way.
image
image

We allow users on our app to enter code snippets and we store these code snippets after sanitising it with Antisamy plugin. Now for example if user enters

as a code snippet only <div is stored after sanitisation.

Please have a look at the paper attached to #102 for more details.

@rbri
Copy link
Member

rbri commented Nov 1, 2024

@akshay-kr do you still think i have to fix something here?

@akshay-kr
Copy link
Author

@rbri I am now understanding what you are trying to explain.
Actually we are using Antisamy plugin to parse XML with content inside CDATA tag which used to work before. But it seems now this library is moving towards only html parsing so the same won't be supported anymore.

For example like this xml,

<xt:c-code xt:name="code" xt:version="1" xt:id="15ae0cc7-ded7-4a74-97b8-d66238d3c177"><xt:parameter xt:name="language">html</xt:parameter><xt:text-body><![CDATA[<div></div>]]></xt:text-body></xt:c-code>

We are parsing this XML and specifically the content inside CDATA and then storing it. Later when viewing we extract the content inside CDATA and render it on the web page.

@rbri
Copy link
Member

rbri commented Nov 1, 2024

Actually we are using Antisamy plugin to parse XML with content inside CDATA tag which used to work before

Maybe we have to discuss this with the Antisamy peoples.

@akshay-kr
Copy link
Author

Maybe we have to discuss this with the Antisamy peoples.

Sure. Raised an issue on Antisamy repo for the same nahsra/antisamy#531

@rbri
Copy link
Member

rbri commented Nov 4, 2024

@akshay-kr had another look at this (from the HtmlUnit point of view).

HtmlUnit uses neko to parse html and XHtml documents - and in case of XHtml the current behaviour of the parser is wrong.

Current plan: add another flag to disable the behaviour - but i have to think a bit about that.

@akshay-kr
Copy link
Author

Current plan: add another flag to disable the behaviour - but i have to think a bit about that.

Yup a flag to disable the behaviour will work. Or if there is a way user can pass what sort of document they are trying to parse then we can have separate logic for HTML and XHTML.
Though I agree a short solve will be to disable the behaviour by a flag.

@rbri
Copy link
Member

rbri commented Nov 4, 2024

have added the feature but i fear there is more to do if you like to use it from antisamy

@akshay-kr
Copy link
Author

akshay-kr commented Nov 4, 2024

@rbri Thanks a lot for adding this feature flag. Is it possible to back port this to 3.11.x branch as well? I am happy to create a PR for that.
We just started upgrading to 3.11.1 version and upgrading directly to 4.5.0 will be time consuming I guess since many things have changed and we need to validate if anything is breaking or not.

@akshay-kr
Copy link
Author

@rbri Created a PR for 3_11_3 branch #126. Please review.

@rbri
Copy link
Member

rbri commented Nov 4, 2024

@akshay-kr for my points about such fix releases please have a look at my comment in issur #123

And thanks for that pr.

@akshay-kr
Copy link
Author

@rbri I checked the comments on #123. I will check internally regarding sponsoring thing and will inform you.

@rbri
Copy link
Member

rbri commented Nov 26, 2024

@akshay-kr any news here?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants