-
Notifications
You must be signed in to change notification settings - Fork 236
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow attributes in the Event::End
and fix .error_position()
#780
Conversation
Codecov ReportAttention: Patch coverage is
❗ Your organization needs to install the Codecov GitHub app to enable full functionality. Additional details and impacted files@@ Coverage Diff @@
## master #780 +/- ##
==========================================
- Coverage 61.81% 60.01% -1.81%
==========================================
Files 41 41
Lines 16798 16048 -750
==========================================
- Hits 10384 9631 -753
- Misses 6414 6417 +3
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
src/reader/mod.rs
Outdated
// do. This is still invalid tag but it also has no practical meaning | ||
// to parse it in a such way. End tag `</tag attr=">" >`, however, can | ||
// be produced by some real-world parsers, for example, by the parser | ||
// of the Macromedia Flash player. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't allow this by default, we should put it behind an opt-in. Even non-validating parsers check for well-formedness, and this is malformed XML.
expat for example blows up
ExpatError: not well-formed (invalid token): line 1, column 11
as does Lxml and libxml2
XMLSyntaxError: expected '>', line 1, column 12
parserError: xmlParseDoc() failed
I'm OK with making this possible to support, but our default behavior should probably be throwing an IllFormedError
just like we do for mismatched tags (and like other parsers do).
ExpatError: mismatched tag: line 1, column 7
Also line 358 should be rewritten - XML isn't "produced" by a parser. You can just say that such documents exist in the wild and are tolerated by some parsers such as the one used by Adobe Flash.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to evolve library in this direction:
- rename
Reader
toRawReader
(andEvent
toRawEvent
) - make
RawReader
as liberal as possible. It should not make any validation checks - add a
validate
method toRawEvent
which will perform all well-formedless and validity checks (I already grab list of all checks from the specification, 35 VC + 16 WFC references, actually most of them about DTD). All validation must be run manually by callingRawEvent::validate
. Only one check cannot be performed in this way: check that start tag name matches end tag name - move existing checks (for example,
IllFormedError::DoubleHyphenInComment
) to the newvalidate
method - introduce a new
Reader
which will perform validation checks when flag in config is set (and process many other things, like dealing with encodings and expanding entity references as shown in thecustom_entities
example in Rework handling general entity references (&entity;
) #766). Add necessary config flags
I haven't thought through all the details yet and don't want to focus on them yet. There will be a separate PR where all this will be considered. It should be pretty soon (this year).
So, given that we are not currently performing most validity checks and I am planning a generic consistent solution for them, do you agree that it is not worth introducing them now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also line 358 should be rewritten - XML isn't "produced" by a parser. You can just say that such documents exist in the wild and are tolerated by some parsers such as the one used by Adobe Flash.
Any help are appreciated 😃 . I feel that my English is no good to writing simple-to-read explanations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That all sounds good to me, but we probably shouldn't document it as a "new feature" given that we do want this to eventually fail for users of the Reader
API.
This is not generic code and it is more clear when we construct them with explicit state
If test will fail due to another syntax error we get a nice diff
Check position for zero actually does not check anything because this is start position before parse. We move position by 1 by using Text event failures (36): syntax::tag::unclosed3::ns_reader::async_tokio syntax::tag::unclosed3::ns_reader::borrowed syntax::tag::unclosed3::ns_reader::buffered syntax::tag::unclosed3::reader::async_tokio syntax::tag::unclosed3::reader::borrowed syntax::tag::unclosed3::reader::buffered syntax::tag::unclosed4::ns_reader::async_tokio syntax::tag::unclosed4::ns_reader::borrowed syntax::tag::unclosed4::ns_reader::buffered syntax::tag::unclosed4::reader::async_tokio syntax::tag::unclosed4::reader::borrowed syntax::tag::unclosed4::reader::buffered syntax::tag::unclosed5::ns_reader::async_tokio syntax::tag::unclosed5::ns_reader::borrowed syntax::tag::unclosed5::ns_reader::buffered syntax::tag::unclosed5::reader::async_tokio syntax::tag::unclosed5::reader::borrowed syntax::tag::unclosed5::reader::buffered syntax::tag::unclosed6::ns_reader::async_tokio syntax::tag::unclosed6::ns_reader::borrowed syntax::tag::unclosed6::ns_reader::buffered syntax::tag::unclosed6::reader::async_tokio syntax::tag::unclosed6::reader::borrowed syntax::tag::unclosed6::reader::buffered syntax::tag::unclosed7::ns_reader::async_tokio syntax::tag::unclosed7::ns_reader::borrowed syntax::tag::unclosed7::ns_reader::buffered syntax::tag::unclosed7::reader::async_tokio syntax::tag::unclosed7::reader::borrowed syntax::tag::unclosed7::reader::buffered syntax::tag::unclosed8::ns_reader::async_tokio syntax::tag::unclosed8::ns_reader::borrowed syntax::tag::unclosed8::ns_reader::buffered syntax::tag::unclosed8::reader::async_tokio syntax::tag::unclosed8::reader::borrowed syntax::tag::unclosed8::reader::buffered
…open or self-closed tag
…here error position is reported `.error_position()` is intended to report position where error start occurring, `.buffer_position()` is intended to show to what position reader read data tests/issues.rs leave untouched because this is the code from real problems in GH issues, it should remain as close to issue as possible
src/reader/mod.rs
Outdated
// optional validate step which user should call manually | ||
// - if we just look for `>` we will parse `</tag attr=">" >` as end tag | ||
// `</tag attr=">` and text `" >` which probably no one existing parser | ||
// do. This is still invalid tag but it also has no practical meaning |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do -> does
This is malformed XML, however it is tolerated by some parsers
(e.g. the one used by Adobe Flash) and such documents do exist in the wild.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed!
Task CIFuzz was failed by unknown reasons -- I asked google/oss-fuzz#12168 but no response yet... It not related to this changes, so merge |
This PR allows us to have attributes in the
Event::End
(but they are not accessible viaBytesEnd
directly). As explained in #776 (comment) we anyway can produce an event which.name()
does not valid, so it is better ar least keep it useful for some scenarios. When validation API would be added (I'm working on it) we can validate all events and report an error.Closes #776
When changing parser I noticed, that in one arm we do not set
last_error_offset
. I checked if we really cover that lines -- yes, but we actually does not check anything.last_error_offset
was checked for0
, but this is its default value, so was unclear is that code running or not. I improved check and fix the bug that was found due to that.