-
Notifications
You must be signed in to change notification settings - Fork 236
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New parser implementation #690
base: master
Are you sure you want to change the base?
Conversation
Obviously this will take some time to review fully, I'll wait until it's ready unless there's something in particular you'd like me to look at. FYI, I'm not sure if you've seen your email but there was a party that was very interested in resolving #285 |
Main question now -- do you agree that DTD parsing can be moved to it's own crate, or you prefer to keep it as a module inside quick-xml? If use the separate crate, I'll extract it from this PR and publish separately and then add a dependency. Yeah, I've seen the email, just not found time to answer yet... but I will definitely find. |
The only reason I see for splitting functionality into a separate crate would be if it has significant utility as a library independent from the original library, or perhaps if it makes a significant difference in compile times. It's not enough code for compile time to be a major concern and contains no additional dependencies, and I struggle to see a DTD library being useful independent of an XML library. I suppose if it could be easily used with a different XML library, then that might be a sufficient reason? But otherwise I would just keep it as a module. |
//! A parser for encoding detection using BOM and heuristics. | ||
|
||
/// A result of feeding data into [`BomParser`]. | ||
#[derive(Copy, Clone, Debug, Eq, PartialEq)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI, one of the sticking points on the encoding PR is/was that if you wrap the input stream of bytes in an decoding wrapper, then the BOM will always be translated to UTF-8 before the XML parser gets a chance to take a look at it.
It's a thoroughly fixable issue, but it means you have to either
A) poke a hole in the decoder to get a look at raw bytes and set the encoding manually from the outside, or
B) move the encoding detection heuristics to the inner decoding wrapper, or
C) read a couple of bytes, determine the encoding, set the encoding on the decoder from outside, and then replay the bytes previously read
All 3 required altering encoding-rs-io before moving forwards, or forking.
Replying to myself here: you mentioned both of these in your comment but I forgot to properly follow up on that. You said you actually did notice a compile time difference - how much? I saw a few crates such as https://crates.io/crates/dtd-parser which do DTD parsing, but I didn't look into how good or "correct" they are. Honestly, I don't care that much either way whether it is separate crate or module in quick-xml. I have a general preference for doing the simple thing initially unless there's an compelling reason not to, though. |
New parser decouples reading bytes from parsing
XmlSource will be removed soon, so cleaning up in advance
failures (30): syntax::cdata::unclosed03::async_tokio syntax::cdata::unclosed04::async_tokio syntax::cdata::unclosed06::async_tokio syntax::cdata::unclosed07::async_tokio syntax::cdata::unclosed09::async_tokio syntax::cdata::unclosed10::async_tokio syntax::cdata::unclosed12::async_tokio syntax::cdata::unclosed13::async_tokio syntax::cdata::unclosed15::async_tokio syntax::cdata::unclosed16::async_tokio syntax::cdata::unclosed18::async_tokio syntax::cdata::unclosed19::async_tokio syntax::comment::unclosed03::async_tokio syntax::comment::unclosed04::async_tokio syntax::doctype::unclosed03::async_tokio syntax::doctype::unclosed04::async_tokio syntax::doctype::unclosed06::async_tokio syntax::doctype::unclosed07::async_tokio syntax::doctype::unclosed09::async_tokio syntax::doctype::unclosed10::async_tokio syntax::doctype::unclosed12::async_tokio syntax::doctype::unclosed13::async_tokio syntax::doctype::unclosed15::async_tokio syntax::doctype::unclosed16::async_tokio syntax::doctype::unclosed18::async_tokio syntax::doctype::unclosed19::async_tokio syntax::unclosed_bang1::async_tokio syntax::unclosed_bang2::async_tokio syntax::unclosed_bang3::async_tokio syntax::unclosed_bang4::async_tokio
failures (60): syntax::cdata::unclosed03::async_tokio syntax::cdata::unclosed03::buffered syntax::cdata::unclosed04::async_tokio syntax::cdata::unclosed04::buffered syntax::cdata::unclosed06::async_tokio syntax::cdata::unclosed06::buffered syntax::cdata::unclosed07::async_tokio syntax::cdata::unclosed07::buffered syntax::cdata::unclosed09::async_tokio syntax::cdata::unclosed09::buffered syntax::cdata::unclosed10::async_tokio syntax::cdata::unclosed10::buffered syntax::cdata::unclosed12::async_tokio syntax::cdata::unclosed12::buffered syntax::cdata::unclosed13::async_tokio syntax::cdata::unclosed13::buffered syntax::cdata::unclosed15::async_tokio syntax::cdata::unclosed15::buffered syntax::cdata::unclosed16::async_tokio syntax::cdata::unclosed16::buffered syntax::cdata::unclosed18::async_tokio syntax::cdata::unclosed18::buffered syntax::cdata::unclosed19::async_tokio syntax::cdata::unclosed19::buffered syntax::comment::unclosed03::async_tokio syntax::comment::unclosed03::buffered syntax::comment::unclosed04::async_tokio syntax::comment::unclosed04::buffered syntax::doctype::unclosed03::async_tokio syntax::doctype::unclosed03::buffered syntax::doctype::unclosed04::async_tokio syntax::doctype::unclosed04::buffered syntax::doctype::unclosed06::async_tokio syntax::doctype::unclosed06::buffered syntax::doctype::unclosed07::async_tokio syntax::doctype::unclosed07::buffered syntax::doctype::unclosed09::async_tokio syntax::doctype::unclosed09::buffered syntax::doctype::unclosed10::async_tokio syntax::doctype::unclosed10::buffered syntax::doctype::unclosed12::async_tokio syntax::doctype::unclosed12::buffered syntax::doctype::unclosed13::async_tokio syntax::doctype::unclosed13::buffered syntax::doctype::unclosed15::async_tokio syntax::doctype::unclosed15::buffered syntax::doctype::unclosed16::async_tokio syntax::doctype::unclosed16::buffered syntax::doctype::unclosed18::async_tokio syntax::doctype::unclosed18::buffered syntax::doctype::unclosed19::async_tokio syntax::doctype::unclosed19::buffered syntax::unclosed_bang1::async_tokio syntax::unclosed_bang1::buffered syntax::unclosed_bang2::async_tokio syntax::unclosed_bang2::buffered syntax::unclosed_bang3::async_tokio syntax::unclosed_bang3::buffered syntax::unclosed_bang4::async_tokio syntax::unclosed_bang4::buffered
failures (90): syntax::cdata::unclosed03::async_tokio syntax::cdata::unclosed03::borrowed syntax::cdata::unclosed03::buffered syntax::cdata::unclosed04::async_tokio syntax::cdata::unclosed04::borrowed syntax::cdata::unclosed04::buffered syntax::cdata::unclosed06::async_tokio syntax::cdata::unclosed06::borrowed syntax::cdata::unclosed06::buffered syntax::cdata::unclosed07::async_tokio syntax::cdata::unclosed07::borrowed syntax::cdata::unclosed07::buffered syntax::cdata::unclosed09::async_tokio syntax::cdata::unclosed09::borrowed syntax::cdata::unclosed09::buffered syntax::cdata::unclosed10::async_tokio syntax::cdata::unclosed10::borrowed syntax::cdata::unclosed10::buffered syntax::cdata::unclosed12::async_tokio syntax::cdata::unclosed12::borrowed syntax::cdata::unclosed12::buffered syntax::cdata::unclosed13::async_tokio syntax::cdata::unclosed13::borrowed syntax::cdata::unclosed13::buffered syntax::cdata::unclosed15::async_tokio syntax::cdata::unclosed15::borrowed syntax::cdata::unclosed15::buffered syntax::cdata::unclosed16::async_tokio syntax::cdata::unclosed16::borrowed syntax::cdata::unclosed16::buffered syntax::cdata::unclosed18::async_tokio syntax::cdata::unclosed18::borrowed syntax::cdata::unclosed18::buffered syntax::cdata::unclosed19::async_tokio syntax::cdata::unclosed19::borrowed syntax::cdata::unclosed19::buffered syntax::comment::unclosed03::async_tokio syntax::comment::unclosed03::borrowed syntax::comment::unclosed03::buffered syntax::comment::unclosed04::async_tokio syntax::comment::unclosed04::borrowed syntax::comment::unclosed04::buffered syntax::doctype::unclosed03::async_tokio syntax::doctype::unclosed03::borrowed syntax::doctype::unclosed03::buffered syntax::doctype::unclosed04::async_tokio syntax::doctype::unclosed04::borrowed syntax::doctype::unclosed04::buffered syntax::doctype::unclosed06::async_tokio syntax::doctype::unclosed06::borrowed syntax::doctype::unclosed06::buffered syntax::doctype::unclosed07::async_tokio syntax::doctype::unclosed07::borrowed syntax::doctype::unclosed07::buffered syntax::doctype::unclosed09::async_tokio syntax::doctype::unclosed09::borrowed syntax::doctype::unclosed09::buffered syntax::doctype::unclosed10::async_tokio syntax::doctype::unclosed10::borrowed syntax::doctype::unclosed10::buffered syntax::doctype::unclosed12::async_tokio syntax::doctype::unclosed12::borrowed syntax::doctype::unclosed12::buffered syntax::doctype::unclosed13::async_tokio syntax::doctype::unclosed13::borrowed syntax::doctype::unclosed13::buffered syntax::doctype::unclosed15::async_tokio syntax::doctype::unclosed15::borrowed syntax::doctype::unclosed15::buffered syntax::doctype::unclosed16::async_tokio syntax::doctype::unclosed16::borrowed syntax::doctype::unclosed16::buffered syntax::doctype::unclosed18::async_tokio syntax::doctype::unclosed18::borrowed syntax::doctype::unclosed18::buffered syntax::doctype::unclosed19::async_tokio syntax::doctype::unclosed19::borrowed syntax::doctype::unclosed19::buffered syntax::unclosed_bang1::async_tokio syntax::unclosed_bang1::borrowed syntax::unclosed_bang1::buffered syntax::unclosed_bang2::async_tokio syntax::unclosed_bang2::borrowed syntax::unclosed_bang2::buffered syntax::unclosed_bang3::async_tokio syntax::unclosed_bang3::borrowed syntax::unclosed_bang3::buffered syntax::unclosed_bang4::async_tokio syntax::unclosed_bang4::borrowed syntax::unclosed_bang4::buffered
…oved unused arms in match)
(review with with whitespace ignored mode)
(review with with whitespace ignored mode)
It also will be used in a new BytesPI event that will be introduced in a separate PR
I was not clear enough. The point is that if create a new module inside quick-xml then I need to wait until the whole quick-xml will be compiled before I can see the difference about changes that I've made in a new module. quick-xml now takes about 10 seconds to compile on my machine. The new crate takes only 1-2 seconds. Therefore, it is simply more convenient to develop a very autonomous part of the parser in a separate crate. I've seen https://crates.io/crates/dtd-parser and while it uses high-performant
Ok. Then I think I keep the current approach and implement DTD parser in it's own crate for now. Then it will be simpler to enhance it |
The draft of the new parser implementation, based on approach used in quickest XML parser implementation according to our tests -- maybe_xml.
This PR is created to get early feedback. The new parser comes from my attempt to implement correct parsing of DTD. When I wrote DTD parser I found that compiling quick-xml takes a little more time than I would like, so I implemented it in the separate crate. this has advantages and disadvantages:
The new
quick-dtd
crate currently included inquick-xml
source tree just for easy prototyping. In the final PR it should stay completely independent.Things that need to finish:
quick-dtd
as a separate crate or implement DTD parsing in adtd
module of quick-xml. Of course, I plan to give access of all maintainers of quick-xml to the new crate so they does not lost control over this crate. Is it better to keep DTD parsing in quick-xml? What do you think?memchr
where possible, but maybe that would unnecessary. Need to check benchmarks