
[work-in-progress] Decoding BufReader implementation #441

Draft

dralley wants to merge 8 commits into master from decoding-reader

Conversation

@dralley (Collaborator) commented Jul 25, 2022

Very, very work-in-progress. Doesn't compile yet. Only posting it for transparency.

@Mingun: comment marked as outdated.
@dralley: comment marked as outdated.
@dralley: comment marked as outdated.
@Mingun: comment marked as outdated.
@dralley: comment marked as outdated.

@codecov-commenter commented Aug 14, 2022

Codecov Report

Merging #441 (10038ca) into master (90be3ee) will increase coverage by 0.69%.
The diff coverage is 83.63%.

@@            Coverage Diff             @@
##           master     #441      +/-   ##
==========================================
+ Coverage   52.12%   52.81%   +0.69%     
==========================================
  Files          28       28              
  Lines       13480    13392      -88     
==========================================
+ Hits         7026     7073      +47     
+ Misses       6454     6319     -135     
Flag Coverage Δ
unittests 52.81% <83.63%> (+0.69%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
benches/microbenches.rs 0.00% <0.00%> (ø)
examples/custom_entities.rs 0.00% <0.00%> (ø)
examples/nested_readers.rs 0.00% <0.00%> (ø)
examples/read_buffered.rs 0.00% <0.00%> (ø)
examples/read_texts.rs 0.00% <0.00%> (ø)
src/errors.rs 2.08% <0.00%> (+0.01%) ⬆️
src/reader/buffered_reader.rs 69.89% <0.00%> (-6.25%) ⬇️
src/writer.rs 48.30% <0.00%> (-0.28%) ⬇️
src/de/var.rs 75.00% <11.11%> (-10.11%) ⬇️
src/de/escape.rs 19.84% <40.00%> (-1.21%) ⬇️
... and 26 more


@dralley dralley force-pushed the decoding-reader branch 3 times, most recently from e618b63 to 10038ca Compare August 14, 2022 05:39
@Mingun: comment marked as outdated.
@Mingun: comment marked as outdated.
@dralley dralley force-pushed the decoding-reader branch 5 times, most recently from 547162f to 1a69f9b Compare August 15, 2022 04:24
@Mingun (Collaborator) left a comment:

I like the idea of creating a separate test file for encodings, so commit 2 is fine by me. Let's hold the others until you finish your work. I'm not sure that we should remove the StartText event, because it can still be useful even if it will not contain the BOM. Text before the declaration is not considered part of the XML model, so it should be treated differently; for example, unescaping that text has no meaning.

I suggest first adding tests for all possible encodings. Because the set of encodings is finite and not that big, we can create a bunch of documents in each encoding and test their parsing. The tests that I would like to see:

  • detecting the encoding (I found that we do not do this correctly; I plan to look at it before the release. I have already made some investigations and improvements, but have not yet backed them with tests)
  • reading events
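For reference, encoding detection for XML is specified by Appendix F of the XML standard: sniff the byte-order mark first and, failing that, the byte pattern of the `<?xml` declaration. A minimal sketch of that logic in plain Rust (the names `DetectedEncoding` and `detect_encoding` are illustrative, not quick-xml's actual API):

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum DetectedEncoding {
    Utf8, // UTF-8 BOM, or no BOM at all (the XML default)
    Utf16Le,
    Utf16Be,
}

/// Inspect the first bytes of a document and guess its encoding.
fn detect_encoding(start: &[u8]) -> DetectedEncoding {
    match start {
        // Explicit byte-order marks take priority.
        [0xEF, 0xBB, 0xBF, ..] => DetectedEncoding::Utf8,
        [0xFF, 0xFE, ..] => DetectedEncoding::Utf16Le,
        [0xFE, 0xFF, ..] => DetectedEncoding::Utf16Be,
        // No BOM: an XML document starts with '<' (0x3C), so in UTF-16
        // the position of the zero byte reveals the byte order.
        [b'<', 0x00, ..] => DetectedEncoding::Utf16Le,
        [0x00, b'<', ..] => DetectedEncoding::Utf16Be,
        _ => DetectedEncoding::Utf8,
    }
}

fn main() {
    assert_eq!(detect_encoding(b"<?xml version=\"1.0\"?>"), DetectedEncoding::Utf8);
    assert_eq!(detect_encoding(&[0xFF, 0xFE, b'<', 0x00]), DetectedEncoding::Utf16Le);
    println!("ok");
}
```

This covers only the byte-level sniffing step; a full implementation would additionally honor the `encoding=` pseudo-attribute of the declaration once it can be decoded.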

The XML documents should look like:

<?xml version="1.0" encoding="<encoding>"?>
<root attribute1="value1"
      attribute2="value2"
      attribute3='value3'
>
  <?instruction?>
  <!--Comment-->
  text
  <ns:element ns:attribute="value1" xmlns:ns="namespace"/>
  <![CDATA[cdata]]>
</root>

Because almost all supported encodings are single-byte encodings, we could create XML files containing all possible characters (except perhaps those explicitly forbidden by the XML standard). National characters should appear in all possible places: tag names, namespace names, attribute names and values, text and CDATA content, comments, and processing instructions.

Only after that, I believe, should we make any improvements to encoding support.

tests/encodings.rs (outdated, resolved)
@dralley: comment marked as outdated.
@dralley: comment marked as outdated.

@dralley dralley force-pushed the decoding-reader branch 2 times, most recently from 61022d8 to 769309f Compare August 15, 2022 21:51
@dralley: comment marked as outdated.
@Mingun: comment marked as outdated.

@dralley dralley force-pushed the decoding-reader branch 2 times, most recently from fd55c87 to 4a61af5 Compare August 29, 2022 01:30
@dralley (Collaborator, Author) commented Aug 29, 2022

The basic functionality is pretty much working. TODO:

  • encoding detection regression for UTF-16 (failing tests)
  • async support in Utf8BytesReader
  • UTF-8 validation inside Utf8BytesReader when the encoding feature is turned off
  • some other cleanup

Async support is the real pain unfortunately.
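The UTF-8 validation item in the TODO list above essentially has to cope with multibyte characters split across read chunks. Below is a sketch of the chunk-validation core, using `std::str::from_utf8`'s `valid_up_to()`/`error_len()` to distinguish hard errors from incomplete tails. The function name and signature are illustrative, not the PR's actual `Utf8BytesReader` internals:

```rust
/// Validate one chunk of a byte stream as UTF-8, allowing a multibyte
/// character to be split across chunk boundaries. `carry` holds the
/// incomplete tail (at most 3 bytes) between calls.
fn validate_chunk(carry: &mut Vec<u8>, chunk: &[u8], eof: bool) -> Result<(), String> {
    carry.extend_from_slice(chunk);
    // Utf8Error is Copy and borrows nothing, so it can be inspected
    // after the borrow of `carry` has ended.
    let err = match std::str::from_utf8(carry) {
        Ok(_) => {
            carry.clear();
            return Ok(());
        }
        Err(e) => e,
    };
    match err.error_len() {
        // Some byte sequence can never become valid UTF-8, no matter
        // how much more input arrives: a hard error.
        Some(_) => Err(format!("invalid UTF-8 at byte {}", err.valid_up_to())),
        // The trailing bytes *could* be the start of a multibyte
        // character; stash them and wait for the next chunk...
        None if !eof => {
            let tail = carry.split_off(err.valid_up_to());
            *carry = tail;
            Ok(())
        }
        // ...unless the stream already ended.
        None => Err("truncated UTF-8 sequence at end of input".into()),
    }
}

fn main() {
    let mut carry = Vec::new();
    // 'é' is 0xC3 0xA9 in UTF-8; split across two chunks it must not error.
    assert!(validate_chunk(&mut carry, &[0xC3], false).is_ok());
    assert!(validate_chunk(&mut carry, &[0xA9], false).is_ok());
    assert!(carry.is_empty());
    // 0xFF can never start a valid UTF-8 sequence.
    assert!(validate_chunk(&mut carry, &[0xFF], false).is_err());
}
```

A real reader wrapper would only hand the validated prefix to the parser; the point here is the boundary handling that a naive per-chunk `from_utf8` check gets wrong.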

@dralley dralley force-pushed the decoding-reader branch 2 times, most recently from 2ba3aa9 to 247d709 Compare August 29, 2022 02:17
impl<R> Reader<R> {
/// Consumes `Reader` returning the underlying reader
///
/// Can be used to compute line and column of a parsing error position
@dralley (Collaborator, Author) commented:

I'm thinking we really need a better solution than this. Either by tracking newlines during parsing or by automatically implementing something like this for &str and perhaps File.

@dralley (Collaborator, Author) added:

I'm not sure about the overhead of the former approach, might be too expensive.
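For scale, the "compute it afterwards" alternative costs nothing during parsing: given the original input and the byte offset at which the error was reported, line and column can be recovered with a single scan. A hedged sketch (illustrative, not the PR's code):

```rust
/// Returns the 1-based (line, column) of `offset` within `input`,
/// counting only b'\n' as a line terminator.
fn line_and_column(input: &[u8], offset: usize) -> (usize, usize) {
    let offset = offset.min(input.len());
    let upto = &input[..offset];
    // One line per newline seen so far, plus the current line.
    let line = upto.iter().filter(|&&b| b == b'\n').count() + 1;
    let column = match upto.iter().rposition(|&b| b == b'\n') {
        Some(nl) => offset - nl, // columns restart right after the newline
        None => offset + 1,      // still on the first line
    };
    (line, column)
}

fn main() {
    let xml = b"<root>\n  <child>\n</root>";
    assert_eq!(line_and_column(xml, 9), (2, 3)); // points at '<child>'
}
```

The trade-off is exactly the one discussed above: this only works while the consumed input is still reachable, which is why it composes naturally with `&str` sources (and possibly `File` via a re-read) but not with arbitrary streaming readers.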

@Mingun (Collaborator) left a comment:

Left some preliminary comments

Cargo.toml (outdated)

@@ -39,7 +39,7 @@ name = "macrobenches"
 harness = false
 
 [features]
-default = []
+default = ["encoding"]
Collaborator comment:
Yes, I think we could make this default.

Changelog.md (resolved)
@@ -1612,7 +1588,7 @@ mod test {
     /// character should be stripped for consistency
     #[$test]
     $($async)? fn bom_from_reader() {
-        let mut reader = Reader::from_reader("\u{feff}\u{feff}".as_bytes());
+        let mut reader = Reader::from_str("\u{feff}\u{feff}");
Collaborator comment:

This test intentionally uses from_reader (as its name says). By changing that you make it the same as the test just below it. You should either drop this change or delete the entire test.

@dralley (Collaborator, Author) replied:

Then it needs to be removed, because this variant of the method cannot be implemented for SliceReader.

@dralley (Collaborator, Author) commented Sep 5, 2022

Status update: I'm still working on this, but making PRs against https://github.com/BurntSushi/encoding_rs_io/

@dralley (Collaborator, Author) commented Sep 14, 2022

And also hsivonen/encoding_rs#88

I'm not sure how long I'm going to be blocked on that, so if there is a portion of the work you think could be split off and merged, I'd appreciate that. Otherwise I'll probably work on the attribute normalization PR.

@dralley (Collaborator, Author) commented Sep 23, 2022

So, as a practical matter I now think we need to strip the BOM in all situations, no matter what.

Not because there's anything invalid about it at a purely technical level, but because buried deep down on page 157 out of 1054 of the Unicode standard it says that for UTF-16 the BOM "should not be considered part of the text". It actually says nothing about that for UTF-8 and in fact implies the opposite, but that line appears to have been the basis for encoding_rs to remove the BOM from the stream for both UTF-16 and UTF-8.

Realistically I don't think we should cut against the grain of what encoding_rs is doing, it's probably a sane decision for consistency purposes anyway, and we need to get the same output in both the encoding and not-encoding cases - so that means the BOM should never be present.
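Concretely, "strip the BOM in all situations" reduces to something like the following on an already-decoded stream. Note that only the first U+FEFF is a BOM; a second one is ordinary content, which is exactly what the `bom_from_reader` test quoted earlier exercises with its doubled `\u{feff}` input. A hedged sketch, not the PR's actual code:

```rust
/// Remove a leading byte-order mark from decoded text, if present.
/// Any U+FEFF after the first character is content, not a BOM.
fn strip_bom(input: &str) -> &str {
    input.strip_prefix('\u{feff}').unwrap_or(input)
}

fn main() {
    // Doubled BOM: only the first one is stripped.
    assert_eq!(strip_bom("\u{feff}\u{feff}<root/>"), "\u{feff}<root/>");
    // No BOM: input passes through untouched.
    assert_eq!(strip_bom("<root/>"), "<root/>");
}
```

Doing this unconditionally keeps the `encoding` and no-`encoding` builds byte-for-byte consistent, since encoding_rs already drops the BOM during decoding.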

@Mingun (Collaborator) commented Sep 23, 2022

Do what you think is right; just explain the decisions in comments in the code.

@dralley (Collaborator, Author) commented Oct 26, 2022

@Mingun I'm just going to rip out any changes to the serde code and do that absolutely last, so I don't need to keep redoing it

It's been a month without any further progress on getting BurntSushi/encoding_rs_io#11 merged, so I'm just going to vendor the code internally for now. It's not that much code and he was very skeptical about merging async support anyway, which we're kind of obligated to support now.

Commit messages:

  • This code wouldn't have worked on any other encoding anyway.
  • When the source of the bytes isn't UTF-8 (or isn't known to be), the bytes need to be decoded first, or at least validated as such. Wrap 'Read'ers with Utf8BytesReader to ensure this happens. Defer the validating portion for now.
  • This is inside the check! macro, meaning it gets implemented per reader type, but from_reader() doesn't apply to SliceReader.
  • As quick-xml will pre-decode everything, it is unnecessary.
Labels: encoding (issues related to support of various encodings of XML documents), enhancement

3 participants