Here's a corner case I ran into when trying to parse the chemical-mime project's XML:
irb(main):002:0> Ox::VERSION
=> "2.14.5"
irb(main):003:0> Ox::load_file('/home/okeeblow/Works/DistorteD/CHECKING-YOU-OUT/mime/packages/third-party/chemical-mime/chemical-mime-database.xml.in')
(irb):3:in `load_file': invalid format, dectype not terminated at line 1606, column 18 [parse.c:339] (Ox::ParseError)
from (irb):3:in `<main>'
from ./bin/repl:11:in `<main>'
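Line 1606 is the end of the file, so the parser ran away searching for an end symbol that never occurred. The trigger is an XML comment inside a <!DOCTYPE declaration when that comment contains one of the characters ", ', [, or <. In my case it was the less-than symbol on line 81: https://github.com/dleidert/chemical-mime/blob/4fd66e3b3b7d922555d1e25587908b036805c45b/src/chemical-mime-database.xml.in#L81
This does seem to be valid XML syntax according to my reading of the spec: "[comments] may appear within the document type declaration at places allowed by the grammar" https://www.w3.org/TR/REC-xml/#sec-comments
The cause within Ox is the handling of those four special characters in parse.c's and sax.c's read_delimited (both versions) as called by read_doctype: ox/ext/ox/parse.c, lines 341 to 352 in 337b082, and ox/ext/ox/sax.c, lines 572 to 583 in 5cc4bff.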
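To illustrate the runaway, here is a small pure-Ruby model of a delimiter-matching scan that treats ", ', [ and < as openers and recurses until it finds the matching closer. This is only a sketch of the behavior described above, not Ox's actual code; the names CLOSERS and scan_delimited are made up for the example.

```ruby
# Hypothetical model of a delimiter-matching scan (not Ox's implementation).
# Each opener starts a nested search for its matching closer.
CLOSERS = { '"' => '"', "'" => "'", '[' => ']', '<' => '>' }.freeze

# Scans str starting at pos until close_char is found.
# Returns the index just past the closer, or nil if the input ends first.
def scan_delimited(str, pos, close_char)
  quoted = close_char == '"' || close_char == "'"
  while pos < str.length
    c = str[pos]
    pos += 1
    return pos if c == close_char
    next if quoted # inside a quoted literal nothing else is special

    if CLOSERS.key?(c)
      # A comment is not recognized here, so a stray opener inside
      # "<!-- ... -->" starts a nested search of its own.
      pos = scan_delimited(str, pos, CLOSERS[c])
      return nil if pos.nil?
    end
  end
  nil # ran off the end of the input without finding the closer
end

plain = " hey [\n<!-- this is a comment -->\n] l>"
with_lt = " hey [\n<!-- this is a comment containing a < character -->\n] l>"

p scan_delimited(plain, 0, '>') == plain.length # true: every opener found its closer
p scan_delimited(with_lt, 0, '>')               # nil: the scan never terminates
```

In the second string the extra < consumes the comment's closing >, so the comment's own opening < then takes the final >, and the [ that opened the internal subset never finds its ].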
I narrowed it down to some very basic test cases. The simplest case of comment-inside-doctype without special characters does parse successfully but leaves the comment as plain text in the Ox::DocType's value:
irb(main):001:0> Ox::load("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<!DOCTYPE hey [\n<!-- this is a comment -->\n] l>")
=>
#<Ox::Document:0x0000558c20da4380
@attributes={:version=>"1.0", :encoding=>"UTF-8"},
@nodes=[#<Ox::DocType:0x0000558c20da4268 @value="hey [\n<!-- this is a comment -->\n] l">]>
The other four all fail:
irb(main):002:0> Ox::load("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<!DOCTYPE hey [\n<!-- this is a comment containing a < character -->\n] l>")
(irb):2:in `load': invalid format, dectype not terminated at line 4, column 8 [parse.c:339] (Ox::ParseError)
irb(main):003:0> Ox::load("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<!DOCTYPE hey [\n<!-- this is a comment containing a \" character -->\n] l>")
(irb):3:in `load': invalid format, dectype not terminated at line 4, column 10 [parse.c:339] (Ox::ParseError)
irb(main):004:0> Ox::load("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<!DOCTYPE hey [\n<!-- this is a comment containing a ' character -->\n] l>")
(irb):4:in `load': invalid format, dectype not terminated at line 4, column 10 [parse.c:339] (Ox::ParseError)
irb(main):005:0> Ox::load("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<!DOCTYPE hey [\n<!-- this is a comment containing a [ character -->\n] l>")
(irb):5:in `load': invalid format, dectype not terminated at line 4, column 8 [parse.c:339] (Ox::ParseError)
I attempted to patch this myself but got a little lost trying to understand the interaction between the PInfo's position pointer and text value, and then very lost with the SAX parser's buffer checkpoint/checkback handling :p
For posterity here is as far as I got:
diff --git a/ext/ox/parse.c b/ext/ox/parse.c
index e9e28b3..0aa10ed 100644
--- a/ext/ox/parse.c
+++ b/ext/ox/parse.c
@@ -330,6 +330,11 @@ read_delimited(PInfo pi, char end) {
}
} else {
while (1) {
+ if (0 == strncmp("<!--", pi->s, 4)) {
+ pi->s += 4;
+ read_comment(pi);
+ break;
+ }
c = *pi->s++;
if (end == c) {
return;
This fixed only the non-SAX parser. It still left the comment text in the DocType's value, and it appended the new Comment to the top-level Document instead of to the DocType:
irb(main):002:0> Ox::load("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<!DOCTYPE hey [\n<!-- this is a comment containing a < character -->\n] l>")
=>
#<Ox::Document:0x0000563eec630d90
@attributes={:version=>"1.0", :encoding=>"UTF-8"},
@nodes=
[#<Ox::Comment:0x0000563eec630c28 @value="this is a comment containing a < character">,
#<Ox::DocType:0x0000563eec630bd8 @value="hey [\n<!-- this is a comment containing a < character">]>
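For comparison, here is a pure-Ruby sketch of a comment-aware scan: it consumes a whole comment before looking at individual characters, collects the comment text in a separate list, and then keeps scanning for the end character. In the diff above, the break leaves the while (1) loop right after the comment, so that scan stops early; the sketch uses next to continue instead. None of this is Ox code, and DOCTYPE_CLOSERS and scan_doctype are hypothetical names.

```ruby
# Hypothetical comment-aware scan (an illustration, not a patch for Ox).
DOCTYPE_CLOSERS = { '"' => '"', "'" => "'", '[' => ']', '<' => '>' }.freeze

# Scans str from pos until close_char is found, skipping XML comments.
# Comment text is appended to comments. Returns the index just past the
# closer, or nil if the input (or a comment) is never terminated.
def scan_doctype(str, pos, close_char, comments = [])
  quoted = close_char == '"' || close_char == "'"
  while pos < str.length
    # Consume an entire comment first so that its contents are never
    # interpreted as delimiters. Quoted literals are left alone.
    if !quoted && str[pos, 4] == '<!--'
      stop = str.index('-->', pos + 4)
      return nil if stop.nil?

      comments << str[(pos + 4)...stop].strip
      pos = stop + 3
      next # keep scanning for the closer after the comment
    end

    c = str[pos]
    pos += 1
    return pos if c == close_char
    next if quoted

    if DOCTYPE_CLOSERS.key?(c)
      pos = scan_doctype(str, pos, DOCTYPE_CLOSERS[c], comments)
      return nil if pos.nil?
    end
  end
  nil
end

body = " hey [\n<!-- this is a comment containing a < character -->\n] l>"
found = []
p scan_doctype(body, 0, '>', found) == body.length # true
p found # ["this is a comment containing a < character"]
```

Keeping the comments in their own list is one way to avoid both leftover problems: the comment text can be left out of the doctype value, and the caller decides where the comment nodes get attached.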
For my purposes I don't actually need the contents of any of these doctype comments, just for the rest of the file to parse successfully with my SAX parser. Feel free to consider this one the lowest of low-priority though :)
okeeblow added a commit to okeeblow/DistorteD that referenced this issue on Sep 18, 2021:
Preparation for importing the Chemical MIME project's `shared-mime-info` package pending ohler55/ox#280 (totally no rush lol)
Minor re-wording to reflect the fact that not all primary types are IANA-approved.