CDATA scanner not scanning content properly #125

akshay-kr · 2024-10-28T21:08:19Z

It seems the CDATA scanning is behaving different after below commit,
49a31c0

For example if it has to scan <div></div> it just scans till <div and then append CDATA closing <div]]. Rest of the contents are not scanned.

I guess it is due to addition of below condition,

else if (c == '>') {
    // don't add the ]] to the buffer
    return false;
}

49a31c0#diff-c0ef5d78bafd2e2d032b106fd0ca50086966f97030bd6b6f4403d8f5ebe39f87R2490

I am using version 3.11.1

Any help here will be really appreciated.

The text was updated successfully, but these errors were encountered:

rbri · 2024-10-29T06:33:56Z

Hi @akshay-kr,

correct, the cdata handling was changed to be in sync with the spec. You can read a bit about the reasoning for the changes in #102.

If you think there is an error in the parser please provide a small html sample. You can open this samples in your browser and then have a look at the DOM tree with the developer tools. If neko generates a different one i will try to fix it.

And curious to learn more about your use case....

akshay-kr · 2024-10-29T07:30:29Z

Hi @rbri
Thanks for replying.

So basically if I have this as an input <![CDATA[<div></div>]]> then after CDATA scanning I get the output as <![CDATA[<div]]>]]>.

The part after the first > inside CDATA section is ignored.

We allow users on our app to enter code snippets and we store these code snippets after sanitising it with Antisamy plugin. Now for example if user enters <div></div> as a code snippet only <div is stored after sanitisation.

It used to work properly before this committ.

Please do let me know if this information works for you else I will try to see if I can provide additional information.

rbri · 2024-10-29T12:28:01Z

So basically if I have this as an input ]]> then after CDATA scanning I get the output as ]]>.

I still think this is correct. Please play with the test-cdata-close-early-empty-tag.html file in your browser. The browser parses the file exactly the same way.

We allow users on our app to enter code snippets and we store these code snippets after sanitising it with Antisamy plugin. Now for example if user enters
as a code snippet only <div is stored after sanitisation.

Please have a look at the paper attached to #102 for more details.

rbri · 2024-11-01T05:50:36Z

@akshay-kr do you still think i have to fix something here?

akshay-kr · 2024-11-01T07:25:54Z

@rbri I am now understanding what you are trying to explain.
Actually we are using Antisamy plugin to parse XML with content inside CDATA tag which used to work before. But it seems now this library is moving towards only html parsing so the same won't be supported anymore.

For example like this xml,

<xt:c-code xt:name="code" xt:version="1" xt:id="15ae0cc7-ded7-4a74-97b8-d66238d3c177"><xt:parameter xt:name="language">html</xt:parameter><xt:text-body><![CDATA[<div></div>]]></xt:text-body></xt:c-code>

We are parsing this XML and specifically the content inside CDATA and then storing it. Later when viewing we extract the content inside CDATA and render it on the web page.

rbri · 2024-11-01T15:54:02Z

Actually we are using Antisamy plugin to parse XML with content inside CDATA tag which used to work before

Maybe we have to discuss this with the Antisamy peoples.

akshay-kr · 2024-11-04T06:22:19Z

Maybe we have to discuss this with the Antisamy peoples.

Sure. Raised an issue on Antisamy repo for the same nahsra/antisamy#531

rbri · 2024-11-04T10:38:52Z

@akshay-kr had another look at this (from the HtmlUnit point of view).

HtmlUnit uses neko to parse html and XHtml documents - and in case of XHtml the current behaviour of the parser is wrong.

Current plan: add another flag to disable the behaviour - but i have to think a bit about that.

akshay-kr · 2024-11-04T18:46:33Z

Current plan: add another flag to disable the behaviour - but i have to think a bit about that.

Yup a flag to disable the behaviour will work. Or if there is a way user can pass what sort of document they are trying to parse then we can have separate logic for HTML and XHTML.
Though I agree a short solve will be to disable the behaviour by a flag.

rbri · 2024-11-04T19:25:03Z

have added the feature but i fear there is more to do if you like to use it from antisamy

akshay-kr · 2024-11-04T20:25:33Z

@rbri Thanks a lot for adding this feature flag. Is it possible to back port this to 3.11.x branch as well? I am happy to create a PR for that.
We just started upgrading to 3.11.1 version and upgrading directly to 4.5.0 will be time consuming I guess since many things have changed and we need to validate if anything is breaking or not.

akshay-kr · 2024-11-04T22:04:01Z

@rbri Created a PR for 3_11_3 branch #126. Please review.

rbri · 2024-11-04T22:10:42Z

@akshay-kr for my points about such fix releases please have a look at my comment in issur #123

And thanks for that pr.

akshay-kr · 2024-11-04T22:50:02Z

@rbri I checked the comments on #123. I will check internally regarding sponsoring thing and will inform you.

rbri · 2024-11-26T07:20:57Z

@akshay-kr any news here?

akshay-kr mentioned this issue Nov 4, 2024

CDATA scanning in XML not behaving properly nahsra/antisamy#531

Open

akshay-kr mentioned this issue Nov 4, 2024

Add CDATA early closing feature flag #126

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CDATA scanner not scanning content properly #125

CDATA scanner not scanning content properly #125

akshay-kr commented Oct 28, 2024

rbri commented Oct 29, 2024

akshay-kr commented Oct 29, 2024 •

edited

Loading

rbri commented Oct 29, 2024

rbri commented Nov 1, 2024

akshay-kr commented Nov 1, 2024

rbri commented Nov 1, 2024

akshay-kr commented Nov 4, 2024

rbri commented Nov 4, 2024

akshay-kr commented Nov 4, 2024

rbri commented Nov 4, 2024

akshay-kr commented Nov 4, 2024 •

edited

Loading

akshay-kr commented Nov 4, 2024

rbri commented Nov 4, 2024

akshay-kr commented Nov 4, 2024

rbri commented Nov 26, 2024

CDATA scanner not scanning content properly #125

CDATA scanner not scanning content properly #125

Comments

akshay-kr commented Oct 28, 2024

rbri commented Oct 29, 2024

akshay-kr commented Oct 29, 2024 • edited Loading

rbri commented Oct 29, 2024

rbri commented Nov 1, 2024

akshay-kr commented Nov 1, 2024

rbri commented Nov 1, 2024

akshay-kr commented Nov 4, 2024

rbri commented Nov 4, 2024

akshay-kr commented Nov 4, 2024

rbri commented Nov 4, 2024

akshay-kr commented Nov 4, 2024 • edited Loading

akshay-kr commented Nov 4, 2024

rbri commented Nov 4, 2024

akshay-kr commented Nov 4, 2024

rbri commented Nov 26, 2024

akshay-kr commented Oct 29, 2024 •

edited

Loading

akshay-kr commented Nov 4, 2024 •

edited

Loading