-
Notifications
You must be signed in to change notification settings - Fork 92
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Selfclosing Break Line Tag <br /> tag in html content is converted into <br> open tag. #484
Comments
@spassarop @rbri - Can you look into this? |
Wrote a short test (also one for neko (HtmlUnit/htmlunit-neko@1d19ec2)) but can't see why anything should have changed. |
@BloodDrag0n do you think the output was different with other versions? |
I have been using Antisamy-1.6.4 with nekohtml-1.9.22, it does not had any change like this in Where this behavior was found. In 1.6.4 - self closing tags like is converted into seperare open and close tags but in 1.7.5 - is converted into . This doesn't caused any parsing issue for us, but some of our integration products demands either selfclosing ( |
So the removal of XHTML usage has caused this right, is there a way to achieve this in the latest version? |
@rbri , This behavior is also seen for |
We are running into this issue too. We do use XHTML and when we updated antisamy to fix a cve, we ran into this. Any chance of bring back XHTML support? Our current workaround is to create a custom antisamy jar with the code for XHTML added back in. |
@spassarop - Can we fix this without adding XHTML support back? That's probably an @rbri question. And even if we can, what would you think about adding XHTML support back, but have it OFF by default, so groups like @tw-mcummings could enable it if they need it? Does that make any sense to you? |
I checked the behavior from XHTML back in 2022. It was deprecated as a project decision because at the time it was related to a vulnerability and also it did not make much sense to keep supporting that as it is not HTML by the spec, it enforces XML structure. Checking old emails I found this comments:
That was inside the custom HTML serializer, the only one left as the XHTML one was in process of being deleted. What I added to the code was if (rawName == null || !HTMLdtd.isOnlyOpening(rawName) || HTMLdtd.isOptionalClosing(rawName)) {
if (_indenting && !state.preserveSpace && state.afterElement)
_printer.breakLine();
// Must leave CData section first (Illegal in HTML, but still)
if (state.inCData)
_printer.printText("]]>");
_printer.printText("</");
_printer.printText(state.rawName);
_printer.printText('>');
} To make it a bit more compatible and adapted the tests later. Those are @rbri marked commits on the picture. However, checking the XHTML serializer again, it shared a line with // HTML serializer
if (state.empty) _printer.printText('>');
// instant closing tag logic assuming no content inside the tags
// ... // XHTML serializer
if (state.empty && isAllowedEmptyTag(rawName) && !requiresClosingTag(rawName)) {
_printer.printText(" />");
} else {
if (state.empty) _printer.printText('>');
// instant closing tag logic assuming no content inside the tags
//... That is a big difference. What I did then, was copy the code and behavior to use the XHTML serializer fragment on the HTML serializer code. Then ran the tests to see if the ones associated with the vulnerabilities failed. They did not fail, which is good. The ones that failed did it due to the However, there are some considerations: Any comments? I mean, is looks like a good workaround that solves the initial issue, but with those small details. |
@spassarop can you please make a PR out of your changes - that will it make easier to follow and to test |
Done. I was about to push adapted tests but I found out DOM parser is ignoring my change, it only affects SAX. I've been debugging a while and could not find out a way to make it work when invoking the serializer on So I only added a modified version of the test @rbri wrote to reflect that. At the moment, my change gives a partial solution and people would need to use SAX if they want the closing tag behavior as it is requested. If we leave it that way, some tests would need to be adapted to have the SAX and DOM version closing the tags differently. Even though, I don't understand why the DOM parser behaves like that if it did not before with the custom logic I added... maybe a serializer dependency changed on the last two years? FYI: Deprecating DOM parser was on the table for a while. This might be a new reason to do so. |
@spassarop - I'll let you and @rbri figure out the best way to handle all this as you two are the experts on this stuff. |
i think we have to rename 'issue484()' to 'testGithubIssue484()' and i'm still a bit confused - issue484() passes for sax and dom - @spassarop i can't see the problem with the dom handling |
It’s not that the test fails because I just copied it. The thing is, for
DOM it does not auto close single tags like <br>, SAX does with what I
added. But when they used XHTML in the past they both behaved the same,
adding /> and not just >. The discrepancy is what I see as a problem.
…On Thu, 15 Aug 2024 at 13:09 RBRi ***@***.***> wrote:
i think we have to rename 'issue484()' to 'testGithubIssue484()'
and i'm still a bit confused - issue484() passes for sax and dom -
@spassarop <https://github.com/spassarop> i can't see the problem with
the dom handling
—
Reply to this email directly, view it on GitHub
<#484 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AHL3BMIRDLIJQZCQBZ6S6TLZRTHDXAVCNFSM6AAAAABK6LKR32VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEOJRGYYTMMJRGI>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
@spassarop oh i was blind... sorry for the confusion |
The trick is in class HTMLSerializer - they use an optimization when serializing dom nodes.
Instead of calling startElement/endElement they call this serializeElement method; therefor your patch for the endElement does not work (for the dom serialization). You can copy the code of the method to our ASHTMLSerializer and massage the code a bit
This fixes at least the current problem. |
We really have to think about using our own independent serializer - i guess it might be possible to carve out the serializing code from xerces using some hours. But i have no idea what does that mean regarding copyright etc. |
I applied your suggestions to the current PR, everything seems consistent and working now. Regarding having an AntiSamy serializer, of course it would be the best option in terms of controlling the output shape. The cons are the one you say about legal stuff, and that other issues may arise with time as it happens with Neko and we may not be aware. Someone should be checking periodically if the reused code is affected and make fixes fit. Trade offs I guess... |
@BloodDrag0n @tw-mcummings - Could you build and test this branch to determine if the new behavior fits your needs? |
I am using Antisamy to sanitize HTML contents. During the parsing of the HTML Data, the
(self-closing) tag is converted into
(open tag). Is there any specific reason behind this behavior change? Is there any way to retain the
tags?
Eg Input Html Data:
<p>this is para data</p><br/><p>this is para data 2</p>
Eg Output Html Data:
<p>this is para data</p><br><p>this is para data 2</p>
HtmlUnit-Neko version using - 3.11.2
Antisamy version using - 1.7.5
The text was updated successfully, but these errors were encountered: