-
Notifications
You must be signed in to change notification settings - Fork 236
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow �
values
#496
base: master
Are you sure you want to change the base?
Allow �
values
#496
Conversation
Do not return error on parsing preperly escaped zero values. Add simple test to check it.
Codecov Report
@@ Coverage Diff @@
## master #496 +/- ##
==========================================
- Coverage 61.70% 61.67% -0.04%
==========================================
Files 33 33
Lines 15343 15337 -6
==========================================
- Hits 9468 9459 -9
- Misses 5875 5878 +3
Flags with carried forward coverage won't be shown. Click here to find out more.
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
I'll close this PR and will vendor a fork for internal needs then. |
Do your internal needs require handling a lot of documents with that issue? Can you speak about the source & purpose of said documents? The "API concerns" I have to imagine are related to C-strings (with null termination) and not strictly relevant to Rust. However, there would be nothing preventing Rust from reading such documents, even though we shouldn't allow them to be output. (That is not to say that |
There are lots of huge CPU hardware-related documents (dozens of gigabytes), and there are properly escaped nulls. Minor example of the same kind is Apple's keyboard layout XML. In both cases I have to read those XMLs exclusively by Rust, and I never write XML, so in my practice I would never face an error on attempt to write NULL. On the other hand, I may put read-write for NULL under a feature flag to allow this use case when a user (like me or my engineers) sure they needs that. |
Yes, I actually does not understand, why XML forbid character references to some characters in the first place, and I'm glad to allow them. This ability under a feature flag would be welcome. @sashka, can you add a properly documented feature flag (say, |
Let's open an issue and collect a bit more information before we merge anything - for instance, what do other XML libraries do. It looks like nokogiri may have at one point just stripped these characters out instead of raising an error. However, attempting to write XML with those characters will raise a fatal error. On the other hand it looks like both Python's ElementTree and libxml2 raise a hard error even while parsing
I believe Python's XML parsing is based on Expat, so between that and libxml2, that covers 2 of the most-used XML libraries. Are the NUL characters in these documents "load-bearing"? Would they be unsafe to strip out? |
The following behaviors are possible:
Each behavior could be applied to a raw or an escaped form of characters. I think, that behavior 2 is an anti-pattern and we should not allow it. I cannot imagine a situation where you would need to silently change the data that you operate. If you for whatever reason need that, then you can done that manually in The behaviors 1 and 3 could be chosen either via feature flag, or via a run-time option. I'll agree with any of the options or even a combination of them. I didn't have situations where I needed a dynamic choice of this option, so the feature flag should be enough in the first stage and easier to implement. The choice between support only escaped form, or both raw and escaped form also could be leaved as implementation details in the first stage. Ah, and I suspect, that currently we forbid only escaped form of There is also a technical question for a feature flag: it should allow restricted chars, forbidden without them, or it should restrict chars, allowed without them? In both cases the flag is additive is some sense, but I think that the allowing behavior is better. Of course, we could have both flags, but does not allow to enable them simultaneously, but I think that is overkill for now. Any thoughts? |
Actually, you've already open it -- #368. |
Surprisingly, the only restricted character which is actually restricted is the escaped NULL: fn test_restricted_chars_unescape() {
assert_eq!(unescape("\u{0}").unwrap(), "\u{0}");
// assert_eq!(unescape("�").unwrap(), "\u{0}"); // <-- this assert will fail with version 0.26.0
assert_eq!(unescape("\u{1}").unwrap(), "\u{1}");
assert_eq!(unescape("").unwrap(), "\u{1}");
assert_eq!(unescape("\u{2}").unwrap(), "\u{2}");
assert_eq!(unescape("").unwrap(), "\u{2}");
assert_eq!(unescape("\u{3}").unwrap(), "\u{3}");
assert_eq!(unescape("").unwrap(), "\u{3}");
assert_eq!(unescape("\u{4}").unwrap(), "\u{4}");
assert_eq!(unescape("").unwrap(), "\u{4}");
assert_eq!(unescape("\u{5}").unwrap(), "\u{5}");
assert_eq!(unescape("").unwrap(), "\u{5}");
assert_eq!(unescape("\u{6}").unwrap(), "\u{6}");
assert_eq!(unescape("").unwrap(), "\u{6}");
assert_eq!(unescape("\u{7}").unwrap(), "\u{7}");
assert_eq!(unescape("").unwrap(), "\u{7}");
assert_eq!(unescape("\u{8}").unwrap(), "\u{8}");
assert_eq!(unescape("").unwrap(), "\u{8}");
assert_eq!(unescape("\u{b}").unwrap(), "\u{b}");
assert_eq!(unescape("").unwrap(), "\u{b}");
assert_eq!(unescape("\u{c}").unwrap(), "\u{c}");
assert_eq!(unescape("").unwrap(), "\u{c}");
assert_eq!(unescape("\u{e}").unwrap(), "\u{e}");
assert_eq!(unescape("").unwrap(), "\u{e}");
assert_eq!(unescape("\u{f}").unwrap(), "\u{f}");
assert_eq!(unescape("").unwrap(), "\u{f}");
assert_eq!(unescape("\u{10}").unwrap(), "\u{10}");
assert_eq!(unescape("").unwrap(), "\u{10}");
assert_eq!(unescape("\u{11}").unwrap(), "\u{11}");
assert_eq!(unescape("").unwrap(), "\u{11}");
assert_eq!(unescape("\u{12}").unwrap(), "\u{12}");
assert_eq!(unescape("").unwrap(), "\u{12}");
assert_eq!(unescape("\u{13}").unwrap(), "\u{13}");
assert_eq!(unescape("").unwrap(), "\u{13}");
assert_eq!(unescape("\u{14}").unwrap(), "\u{14}");
assert_eq!(unescape("").unwrap(), "\u{14}");
assert_eq!(unescape("\u{15}").unwrap(), "\u{15}");
assert_eq!(unescape("").unwrap(), "\u{15}");
assert_eq!(unescape("\u{16}").unwrap(), "\u{16}");
assert_eq!(unescape("").unwrap(), "\u{16}");
assert_eq!(unescape("\u{17}").unwrap(), "\u{17}");
assert_eq!(unescape("").unwrap(), "\u{17}");
assert_eq!(unescape("\u{18}").unwrap(), "\u{18}");
assert_eq!(unescape("").unwrap(), "\u{18}");
assert_eq!(unescape("\u{19}").unwrap(), "\u{19}");
assert_eq!(unescape("").unwrap(), "\u{19}");
assert_eq!(unescape("\u{1a}").unwrap(), "\u{1a}");
assert_eq!(unescape("").unwrap(), "\u{1a}");
assert_eq!(unescape("\u{1b}").unwrap(), "\u{1b}");
assert_eq!(unescape("").unwrap(), "\u{1b}");
assert_eq!(unescape("\u{1c}").unwrap(), "\u{1c}");
assert_eq!(unescape("").unwrap(), "\u{1c}");
assert_eq!(unescape("\u{1d}").unwrap(), "\u{1d}");
assert_eq!(unescape("").unwrap(), "\u{1d}");
assert_eq!(unescape("\u{1e}").unwrap(), "\u{1e}");
assert_eq!(unescape("").unwrap(), "\u{1e}");
assert_eq!(unescape("\u{1f}").unwrap(), "\u{1f}");
assert_eq!(unescape("").unwrap(), "\u{1f}");
assert_eq!(unescape("\u{7f}").unwrap(), "\u{7f}");
assert_eq!(unescape("").unwrap(), "\u{7f}");
assert_eq!(unescape("\u{80}").unwrap(), "\u{80}");
assert_eq!(unescape("€").unwrap(), "\u{80}");
assert_eq!(unescape("\u{81}").unwrap(), "\u{81}");
assert_eq!(unescape("").unwrap(), "\u{81}");
assert_eq!(unescape("\u{82}").unwrap(), "\u{82}");
assert_eq!(unescape("‚").unwrap(), "\u{82}");
assert_eq!(unescape("\u{83}").unwrap(), "\u{83}");
assert_eq!(unescape("ƒ").unwrap(), "\u{83}");
assert_eq!(unescape("\u{84}").unwrap(), "\u{84}");
assert_eq!(unescape("„").unwrap(), "\u{84}");
assert_eq!(unescape("\u{86}").unwrap(), "\u{86}");
assert_eq!(unescape("†").unwrap(), "\u{86}");
assert_eq!(unescape("\u{87}").unwrap(), "\u{87}");
assert_eq!(unescape("‡").unwrap(), "\u{87}");
assert_eq!(unescape("\u{88}").unwrap(), "\u{88}");
assert_eq!(unescape("ˆ").unwrap(), "\u{88}");
assert_eq!(unescape("\u{89}").unwrap(), "\u{89}");
assert_eq!(unescape("‰").unwrap(), "\u{89}");
assert_eq!(unescape("\u{8a}").unwrap(), "\u{8a}");
assert_eq!(unescape("Š").unwrap(), "\u{8a}");
assert_eq!(unescape("\u{8b}").unwrap(), "\u{8b}");
assert_eq!(unescape("‹").unwrap(), "\u{8b}");
assert_eq!(unescape("\u{8c}").unwrap(), "\u{8c}");
assert_eq!(unescape("Œ").unwrap(), "\u{8c}");
assert_eq!(unescape("\u{8d}").unwrap(), "\u{8d}");
assert_eq!(unescape("").unwrap(), "\u{8d}");
assert_eq!(unescape("\u{8e}").unwrap(), "\u{8e}");
assert_eq!(unescape("Ž").unwrap(), "\u{8e}");
assert_eq!(unescape("\u{8f}").unwrap(), "\u{8f}");
assert_eq!(unescape("").unwrap(), "\u{8f}");
assert_eq!(unescape("\u{90}").unwrap(), "\u{90}");
assert_eq!(unescape("").unwrap(), "\u{90}");
assert_eq!(unescape("\u{91}").unwrap(), "\u{91}");
assert_eq!(unescape("‘").unwrap(), "\u{91}");
assert_eq!(unescape("\u{92}").unwrap(), "\u{92}");
assert_eq!(unescape("’").unwrap(), "\u{92}");
assert_eq!(unescape("\u{93}").unwrap(), "\u{93}");
assert_eq!(unescape("“").unwrap(), "\u{93}");
assert_eq!(unescape("\u{94}").unwrap(), "\u{94}");
assert_eq!(unescape("”").unwrap(), "\u{94}");
assert_eq!(unescape("\u{95}").unwrap(), "\u{95}");
assert_eq!(unescape("•").unwrap(), "\u{95}");
assert_eq!(unescape("\u{96}").unwrap(), "\u{96}");
assert_eq!(unescape("–").unwrap(), "\u{96}");
assert_eq!(unescape("\u{97}").unwrap(), "\u{97}");
assert_eq!(unescape("—").unwrap(), "\u{97}");
assert_eq!(unescape("\u{98}").unwrap(), "\u{98}");
assert_eq!(unescape("˜").unwrap(), "\u{98}");
assert_eq!(unescape("\u{99}").unwrap(), "\u{99}");
assert_eq!(unescape("™").unwrap(), "\u{99}");
assert_eq!(unescape("\u{9a}").unwrap(), "\u{9a}");
assert_eq!(unescape("š").unwrap(), "\u{9a}");
assert_eq!(unescape("\u{9b}").unwrap(), "\u{9b}");
assert_eq!(unescape("›").unwrap(), "\u{9b}");
assert_eq!(unescape("\u{9c}").unwrap(), "\u{9c}");
assert_eq!(unescape("œ").unwrap(), "\u{9c}");
assert_eq!(unescape("\u{9d}").unwrap(), "\u{9d}");
assert_eq!(unescape("").unwrap(), "\u{9d}");
assert_eq!(unescape("\u{9e}").unwrap(), "\u{9e}");
assert_eq!(unescape("ž").unwrap(), "\u{9e}");
assert_eq!(unescape("\u{9f}").unwrap(), "\u{9f}");
assert_eq!(unescape("Ÿ").unwrap(), "\u{9f}");
} |
It appears that C#'s dotnet XML library has a flag for this: https://learn.microsoft.com/en-us/dotnet/api/system.xml.xmlreadersettings.checkcharacters?view=net-6.0 Same for writing: https://learn.microsoft.com/en-us/dotnet/api/system.xml.xmlwritersettings.checkcharacters?view=net-6.0
I'm not surprised given I filed #368. It's something we needed to improve anyways. Do you have any avenue to report the invalid XML? Obviously that's not an immediate solution, I am just curious. |
I'll keep all the work for
No, vendors don't care about general XML validity as soon as their XML files are not for public distribution, and their own tools process their own files perfectly. |
Agree that we shouldn't skip over invalid data. Seems like a terrible approach.
An initial conservative approach could be to just provide an advanced version of the various escape / unescape functions which would allow you to configure the behavior either way at runtime. I think I would prefer that approach over a feature flag. Agreed that we're still missing the checks for these characters in the raw form and that we ought to have them. Reading through the MS docs again, it appears that |
Does that sound right @sashka ? |
@dralley, yes, it sounds right for me. I'm working on proper reading mechanics now. |
@sashka Do you plan to continue with this? |
Do not return error on parsing preperly escaped zero values, which are perfectly valid. Add simple test to check it.