Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New parser implementation #690

Draft
wants to merge 22 commits into
base: master
Choose a base branch
from
Draft

New parser implementation #690

wants to merge 22 commits into from

Conversation

Mingun
Copy link
Collaborator

@Mingun Mingun commented Nov 29, 2023

The draft of the new parser implementation, based on approach used in quickest XML parser implementation according to our tests -- maybe_xml.

This PR is created to get early feedback. The new parser comes from my attempt to implement correct parsing of DTD. When I wrote DTD parser I found that compiling quick-xml takes a little more time than I would like, so I implemented it in the separate crate. this has advantages and disadvantages:

  • 👍 There are no crates for correct DTD parsing so we could occupy the free niche;
  • 👍 More fast development;
  • 👎 Potentialy a slight performance regression because compiler could not be able to inline functions ccross crate boundaries, but enabling LTO should fix that problem;
  • 👎 Potentialy more hard to maintain two crates instead of one

The new quick-dtd crate currently included in quick-xml source tree just for easy prototyping. In the final PR it should stay completely independent.

Things that need to finish:

  • Decide if we want to make quick-dtd as a separate crate or implement DTD parsing in a dtd module of quick-xml. Of course, I plan to give access of all maintainers of quick-xml to the new crate so they does not lost control over this crate. Is it better to keep DTD parsing in quick-xml? What do you think?
  • Make tests pass. I think, that we should not follow current tests assumptions in all cases, some tests should be adjusted instead
  • Run benchmarks. I expect that performance should improve, but did not yet check that. It should be possible even in the current state. The failed tests are all about errors reporting (note that debug printing in the last commit slows down the parser A LOT, especially asynchronous).
  • Use memchr where possible, but maybe that would unnecessary. Need to check benchmarks
  • Check if this fixes some bugs from bugtracker

@Mingun Mingun added enhancement optimization Issues related to reducing time needed to parse XML or to memory consumption labels Nov 29, 2023
@Mingun Mingun requested a review from dralley November 29, 2023 20:46
Changelog.md Outdated Show resolved Hide resolved
Changelog.md Show resolved Hide resolved
@dralley
Copy link
Collaborator

dralley commented Nov 29, 2023

Obviously this will take some time to review fully, I'll wait until it's ready unless there's something in particular you'd like me to look at.

FYI, I'm not sure if you've seen your email but there was a party that was very interested in resolving #285

@Mingun Mingun changed the title New parser implementetion New parser implementation Nov 29, 2023
@Mingun
Copy link
Collaborator Author

Mingun commented Nov 29, 2023

Main question now -- do you agree that DTD parsing can be moved to it's own crate, or you prefer to keep it as a module inside quick-xml? If use the separate crate, I'll extract it from this PR and publish separately and then add a dependency.

Yeah, I've seen the email, just not found time to answer yet... but I will definitely find.

@dralley
Copy link
Collaborator

dralley commented Nov 29, 2023

do you agree that DTD parsing can be moved to it's own crate, or you prefer to keep it as a module inside quick-xml?

The only reason I see for splitting functionality into a separate crate would be if it has significant utility as a library independent from the original library, or perhaps if it makes a significant difference in compile times.

It's not enough code for compile time to be a major concern and contains no additional dependencies, and I struggle to see a DTD library being useful independent of an XML library.

I suppose if it could be easily used with a different XML library, then that might be a sufficient reason? But otherwise I would just keep it as a module.

//! A parser for encoding detection using BOM and heuristics.

/// A result of feeding data into [`BomParser`].
#[derive(Copy, Clone, Debug, Eq, PartialEq)]
Copy link
Collaborator

@dralley dralley Dec 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, one of the sticking points on the encoding PR is/was that if you wrap the input stream of bytes in an decoding wrapper, then the BOM will always be translated to UTF-8 before the XML parser gets a chance to take a look at it.

It's a thoroughly fixable issue, but it means you have to either

A) poke a hole in the decoder to get a look at raw bytes and set the encoding manually from the outside, or

B) move the encoding detection heuristics to the inner decoding wrapper, or

C) read a couple of bytes, determine the encoding, set the encoding on the decoder from outside, and then replay the bytes previously read

All 3 required altering encoding-rs-io before moving forwards, or forking.

@dralley
Copy link
Collaborator

dralley commented Dec 2, 2023

It's not enough code for compile time to be a major concern and contains no additional dependencies, and I struggle to see a DTD library being useful independent of an XML library.

Replying to myself here: you mentioned both of these in your comment but I forgot to properly follow up on that.

You said you actually did notice a compile time difference - how much?

I saw a few crates such as https://crates.io/crates/dtd-parser which do DTD parsing, but I didn't look into how good or "correct" they are.

Honestly, I don't care that much either way whether it is separate crate or module in quick-xml. I have a general preference for doing the simple thing initially unless there's an compelling reason not to, though.

Mingun and others added 18 commits December 2, 2023 23:05
New parser decouples reading bytes from parsing
XmlSource will be removed soon, so cleaning up in advance
failures (30):
    syntax::cdata::unclosed03::async_tokio
    syntax::cdata::unclosed04::async_tokio
    syntax::cdata::unclosed06::async_tokio
    syntax::cdata::unclosed07::async_tokio
    syntax::cdata::unclosed09::async_tokio
    syntax::cdata::unclosed10::async_tokio
    syntax::cdata::unclosed12::async_tokio
    syntax::cdata::unclosed13::async_tokio
    syntax::cdata::unclosed15::async_tokio
    syntax::cdata::unclosed16::async_tokio
    syntax::cdata::unclosed18::async_tokio
    syntax::cdata::unclosed19::async_tokio
    syntax::comment::unclosed03::async_tokio
    syntax::comment::unclosed04::async_tokio
    syntax::doctype::unclosed03::async_tokio
    syntax::doctype::unclosed04::async_tokio
    syntax::doctype::unclosed06::async_tokio
    syntax::doctype::unclosed07::async_tokio
    syntax::doctype::unclosed09::async_tokio
    syntax::doctype::unclosed10::async_tokio
    syntax::doctype::unclosed12::async_tokio
    syntax::doctype::unclosed13::async_tokio
    syntax::doctype::unclosed15::async_tokio
    syntax::doctype::unclosed16::async_tokio
    syntax::doctype::unclosed18::async_tokio
    syntax::doctype::unclosed19::async_tokio
    syntax::unclosed_bang1::async_tokio
    syntax::unclosed_bang2::async_tokio
    syntax::unclosed_bang3::async_tokio
    syntax::unclosed_bang4::async_tokio
failures (60):
    syntax::cdata::unclosed03::async_tokio
    syntax::cdata::unclosed03::buffered
    syntax::cdata::unclosed04::async_tokio
    syntax::cdata::unclosed04::buffered
    syntax::cdata::unclosed06::async_tokio
    syntax::cdata::unclosed06::buffered
    syntax::cdata::unclosed07::async_tokio
    syntax::cdata::unclosed07::buffered
    syntax::cdata::unclosed09::async_tokio
    syntax::cdata::unclosed09::buffered
    syntax::cdata::unclosed10::async_tokio
    syntax::cdata::unclosed10::buffered
    syntax::cdata::unclosed12::async_tokio
    syntax::cdata::unclosed12::buffered
    syntax::cdata::unclosed13::async_tokio
    syntax::cdata::unclosed13::buffered
    syntax::cdata::unclosed15::async_tokio
    syntax::cdata::unclosed15::buffered
    syntax::cdata::unclosed16::async_tokio
    syntax::cdata::unclosed16::buffered
    syntax::cdata::unclosed18::async_tokio
    syntax::cdata::unclosed18::buffered
    syntax::cdata::unclosed19::async_tokio
    syntax::cdata::unclosed19::buffered
    syntax::comment::unclosed03::async_tokio
    syntax::comment::unclosed03::buffered
    syntax::comment::unclosed04::async_tokio
    syntax::comment::unclosed04::buffered
    syntax::doctype::unclosed03::async_tokio
    syntax::doctype::unclosed03::buffered
    syntax::doctype::unclosed04::async_tokio
    syntax::doctype::unclosed04::buffered
    syntax::doctype::unclosed06::async_tokio
    syntax::doctype::unclosed06::buffered
    syntax::doctype::unclosed07::async_tokio
    syntax::doctype::unclosed07::buffered
    syntax::doctype::unclosed09::async_tokio
    syntax::doctype::unclosed09::buffered
    syntax::doctype::unclosed10::async_tokio
    syntax::doctype::unclosed10::buffered
    syntax::doctype::unclosed12::async_tokio
    syntax::doctype::unclosed12::buffered
    syntax::doctype::unclosed13::async_tokio
    syntax::doctype::unclosed13::buffered
    syntax::doctype::unclosed15::async_tokio
    syntax::doctype::unclosed15::buffered
    syntax::doctype::unclosed16::async_tokio
    syntax::doctype::unclosed16::buffered
    syntax::doctype::unclosed18::async_tokio
    syntax::doctype::unclosed18::buffered
    syntax::doctype::unclosed19::async_tokio
    syntax::doctype::unclosed19::buffered
    syntax::unclosed_bang1::async_tokio
    syntax::unclosed_bang1::buffered
    syntax::unclosed_bang2::async_tokio
    syntax::unclosed_bang2::buffered
    syntax::unclosed_bang3::async_tokio
    syntax::unclosed_bang3::buffered
    syntax::unclosed_bang4::async_tokio
    syntax::unclosed_bang4::buffered
failures (90):
    syntax::cdata::unclosed03::async_tokio
    syntax::cdata::unclosed03::borrowed
    syntax::cdata::unclosed03::buffered
    syntax::cdata::unclosed04::async_tokio
    syntax::cdata::unclosed04::borrowed
    syntax::cdata::unclosed04::buffered
    syntax::cdata::unclosed06::async_tokio
    syntax::cdata::unclosed06::borrowed
    syntax::cdata::unclosed06::buffered
    syntax::cdata::unclosed07::async_tokio
    syntax::cdata::unclosed07::borrowed
    syntax::cdata::unclosed07::buffered
    syntax::cdata::unclosed09::async_tokio
    syntax::cdata::unclosed09::borrowed
    syntax::cdata::unclosed09::buffered
    syntax::cdata::unclosed10::async_tokio
    syntax::cdata::unclosed10::borrowed
    syntax::cdata::unclosed10::buffered
    syntax::cdata::unclosed12::async_tokio
    syntax::cdata::unclosed12::borrowed
    syntax::cdata::unclosed12::buffered
    syntax::cdata::unclosed13::async_tokio
    syntax::cdata::unclosed13::borrowed
    syntax::cdata::unclosed13::buffered
    syntax::cdata::unclosed15::async_tokio
    syntax::cdata::unclosed15::borrowed
    syntax::cdata::unclosed15::buffered
    syntax::cdata::unclosed16::async_tokio
    syntax::cdata::unclosed16::borrowed
    syntax::cdata::unclosed16::buffered
    syntax::cdata::unclosed18::async_tokio
    syntax::cdata::unclosed18::borrowed
    syntax::cdata::unclosed18::buffered
    syntax::cdata::unclosed19::async_tokio
    syntax::cdata::unclosed19::borrowed
    syntax::cdata::unclosed19::buffered
    syntax::comment::unclosed03::async_tokio
    syntax::comment::unclosed03::borrowed
    syntax::comment::unclosed03::buffered
    syntax::comment::unclosed04::async_tokio
    syntax::comment::unclosed04::borrowed
    syntax::comment::unclosed04::buffered
    syntax::doctype::unclosed03::async_tokio
    syntax::doctype::unclosed03::borrowed
    syntax::doctype::unclosed03::buffered
    syntax::doctype::unclosed04::async_tokio
    syntax::doctype::unclosed04::borrowed
    syntax::doctype::unclosed04::buffered
    syntax::doctype::unclosed06::async_tokio
    syntax::doctype::unclosed06::borrowed
    syntax::doctype::unclosed06::buffered
    syntax::doctype::unclosed07::async_tokio
    syntax::doctype::unclosed07::borrowed
    syntax::doctype::unclosed07::buffered
    syntax::doctype::unclosed09::async_tokio
    syntax::doctype::unclosed09::borrowed
    syntax::doctype::unclosed09::buffered
    syntax::doctype::unclosed10::async_tokio
    syntax::doctype::unclosed10::borrowed
    syntax::doctype::unclosed10::buffered
    syntax::doctype::unclosed12::async_tokio
    syntax::doctype::unclosed12::borrowed
    syntax::doctype::unclosed12::buffered
    syntax::doctype::unclosed13::async_tokio
    syntax::doctype::unclosed13::borrowed
    syntax::doctype::unclosed13::buffered
    syntax::doctype::unclosed15::async_tokio
    syntax::doctype::unclosed15::borrowed
    syntax::doctype::unclosed15::buffered
    syntax::doctype::unclosed16::async_tokio
    syntax::doctype::unclosed16::borrowed
    syntax::doctype::unclosed16::buffered
    syntax::doctype::unclosed18::async_tokio
    syntax::doctype::unclosed18::borrowed
    syntax::doctype::unclosed18::buffered
    syntax::doctype::unclosed19::async_tokio
    syntax::doctype::unclosed19::borrowed
    syntax::doctype::unclosed19::buffered
    syntax::unclosed_bang1::async_tokio
    syntax::unclosed_bang1::borrowed
    syntax::unclosed_bang1::buffered
    syntax::unclosed_bang2::async_tokio
    syntax::unclosed_bang2::borrowed
    syntax::unclosed_bang2::buffered
    syntax::unclosed_bang3::async_tokio
    syntax::unclosed_bang3::borrowed
    syntax::unclosed_bang3::buffered
    syntax::unclosed_bang4::async_tokio
    syntax::unclosed_bang4::borrowed
    syntax::unclosed_bang4::buffered
(review with with whitespace ignored mode)
(review with with whitespace ignored mode)
@Mingun
Copy link
Collaborator Author

Mingun commented Dec 2, 2023

You said you actually did notice a compile time difference - how much?

I was not clear enough. The point is that if create a new module inside quick-xml then I need to wait until the whole quick-xml will be compiled before I can see the difference about changes that I've made in a new module. quick-xml now takes about 10 seconds to compile on my machine. The new crate takes only 1-2 seconds. Therefore, it is simply more convenient to develop a very autonomous part of the parser in a separate crate.

I've seen https://crates.io/crates/dtd-parser and while it uses high-performant nom library I decided to not use it because:

  • it is not allocation-free. Read data is stored as Strings which is not right for us if we want to be fast
  • I think, it has too many dependencies for a such simple task. quick-xml right now has only one required dependency without transitive dependencies and I think this is great. I would not like to increase this number too much

Honestly, I don't care that much either way whether it is separate crate or module in quick-xml. I have a general preference for doing the simple thing initially unless there's an compelling reason not to, though.

Ok. Then I think I keep the current approach and implement DTD parser in it's own crate for now. Then it will be simpler to enhance it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement optimization Issues related to reducing time needed to parse XML or to memory consumption
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants