Parse error for comments containing special characters inside a DOCTYPE declaration #280

okeeblow · 2021-09-02T01:27:15Z

Here's a corner case I ran into when trying to parse the chemical-mime project's XML:

irb(main):002:0> Ox::VERSION
=> "2.14.5"
irb(main):003:0> Ox::load_file('/home/okeeblow/Works/DistorteD/CHECKING-YOU-OUT/mime/packages/third-party/chemical-mime/chemical-mime-database.xml.in')
(irb):3:in `load_file': invalid format, dectype not terminated at line 1606, column 18 [parse.c:339] (Ox::ParseError)
        from (irb):3:in `<main>'
        from ./bin/repl:11:in `<main>'

That line 1606 is the end of the file, so the parser ran away searching for an end-symbol that never occurred. The trigger is an XML comment inside a <!DOCTYPE declaration when that comment contains one of the characters ", ', [, or <. In my case it was the less-than symbol seen right here on line 81: https://github.com/dleidert/chemical-mime/blob/4fd66e3b3b7d922555d1e25587908b036805c45b/src/chemical-mime-database.xml.in#L81

This does seems to be valid XML syntax according to my reading of the spec: "[comments] may appear within the document type declaration at places allowed by the grammar" https://www.w3.org/TR/REC-xml/#sec-comments

The cause within Ox is the handling of those four special characters in parse.c's and sax.c'sread_delimited (both versions) as called by read_doctype:

ox/ext/ox/parse.c

Lines 341 to 352 in 337b082

    
              case '"': 
        
           read_delimited(pi, c); 
        
           break; 
        
              case '\'': 
        
           read_delimited(pi, c); 
        
           break; 
        
              case '[': 
        
           read_delimited(pi, ']'); 
        
           break; 
        
              case '<': 
        
           read_delimited(pi, '>'); 
        
           break;

ox/ext/ox/sax.c

Lines 572 to 583 in 5cc4bff

    
              case '"': 
        
           c = read_delimited(dr, c); 
        
           break; 
        
              case '\'': 
        
           c = read_delimited(dr, c); 
        
           break; 
        
              case '[': 
        
           c = read_delimited(dr, ']'); 
        
           break; 
        
              case '<': 
        
           c = read_delimited(dr, '>'); 
        
           break;

I narrowed it down to some very basic test cases. The simplest case of comment-inside-doctype without special characters does parse successfully but leaves the comment as plain text in the Ox::DocType's value:

irb(main):001:0> Ox::load("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<!DOCTYPE hey [\n<!-- this is a comment -->\n] l>")
=> 
#<Ox::Document:0x0000558c20da4380
 @attributes={:version=>"1.0", :encoding=>"UTF-8"},
 @nodes=[#<Ox::DocType:0x0000558c20da4268 @value="hey [\n<!-- this is a comment -->\n] l">]>

The other four all fail:

irb(main):002:0> Ox::load("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<!DOCTYPE hey [\n<!-- this is a comment containing a < character -->\n] l>")
(irb):2:in `load': invalid format, dectype not terminated at line 4, column 8 [parse.c:339] (Ox::ParseError)

irb(main):003:0> Ox::load("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<!DOCTYPE hey [\n<!-- this is a comment containing a \" character -->\n] l>")
(irb):3:in `load': invalid format, dectype not terminated at line 4, column 10 [parse.c:339] (Ox::ParseError)

irb(main):004:0> Ox::load("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<!DOCTYPE hey [\n<!-- this is a comment containing a ' character -->\n] l>")
(irb):4:in `load': invalid format, dectype not terminated at line 4, column 10 [parse.c:339] (Ox::ParseError)

irb(main):005:0> Ox::load("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<!DOCTYPE hey [\n<!-- this is a comment containing a [ character -->\n] l>")
(irb):5:in `load': invalid format, dectype not terminated at line 4, column 8 [parse.c:339] (Ox::ParseError)

I attempted to patch this myself but got a little lost trying to understand the interaction between the PInfo's position pointer and text value, and then very lost with the SAX parser's buffer checkpoint/checkback handling :p

For posterity here is as far as I got:

diff --git a/ext/ox/parse.c b/ext/ox/parse.c
index e9e28b3..0aa10ed 100644
--- a/ext/ox/parse.c
+++ b/ext/ox/parse.c
@@ -330,6 +330,11 @@ read_delimited(PInfo pi, char end) {
        }
     } else {
        while (1) {
+      if (0 == strncmp("<!--", pi->s, 4)) {
+    pi->s += 4;
+    read_comment(pi);
+    break;
+      }
            c = *pi->s++;
            if (end == c) {
                return;

Which fixed only the non-SAX parsing, still resulted in the comment text getting added to the DocType's value, and appended the new Comment to the top-level Document instead of to the DocType:

irb(main):002:0> Ox::load("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<!DOCTYPE hey [\n<!-- this is a comment containing a < character -->\n] l>")
=>                                                                                       
#<Ox::Document:0x0000563eec630d90                                                                                                 
 @attributes={:version=>"1.0", :encoding=>"UTF-8"},                                                                                                    
 @nodes=                                                                                                                          
  [#<Ox::Comment:0x0000563eec630c28 @value="this is a comment containing a < character">,                                         
   #<Ox::DocType:0x0000563eec630bd8 @value="hey [\n<!-- this is a comment containing a < character">]>

For my purposes I don't actually need the contents of any of these doctype comments, just for the rest of the file to parse successfully with my SAX parser. Feel free to consider this one the lowest of low-priority though :)

The text was updated successfully, but these errors were encountered:

ohler55 · 2021-09-02T02:01:15Z

I'll take a look and get it fixed. It certainly is an edge case. I appreciate the detective work too.

Preparation for importing the Chemical MIME project's `shared-mime-info` package pending ohler55/ox#280 (totally no rush lol) Minor re-wording to reflect the fact that not all primary types are IANA-approved.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Parse error for comments containing special characters inside a DOCTYPE declaration #280

Parse error for comments containing special characters inside a DOCTYPE declaration #280

okeeblow commented Sep 2, 2021

ohler55 commented Sep 2, 2021

Parse error for comments containing special characters inside a DOCTYPE declaration #280

Parse error for comments containing special characters inside a DOCTYPE declaration #280

Comments

okeeblow commented Sep 2, 2021

ohler55 commented Sep 2, 2021