Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add normalize_line_end for unescape and test #807

Open
wants to merge 18 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 6 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 5 additions & 0 deletions Changelog.md
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,11 @@

### New Features

- [#806]: Added `normalize_line_end` that normalizes line end
from original XML data during unescape process.
yorkz1994 marked this conversation as resolved.
Show resolved Hide resolved

[#806]: https://github.com/tafia/quick-xml/issues/806

### Bug Fixes

### Misc Changes
Expand Down
34 changes: 31 additions & 3 deletions src/escape.rs
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
//! Manage xml character escapes

use memchr::memchr2_iter;
use memchr::{memchr2_iter, memchr_iter};
use std::borrow::Cow;
use std::num::ParseIntError;
use std::ops::Range;
Expand Down Expand Up @@ -266,7 +266,7 @@ where
unescaped = Some(String::with_capacity(raw.len()));
}
let unescaped = unescaped.as_mut().expect("initialized");
unescaped.push_str(&raw[last_end..start]);
unescaped.push_str(&normalize_line_end(&raw[last_end..start]));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Escaping may be on the hot path, so it is not fine that we can allocate new owned strings just to get a reference and put it to already allocated string. Would it possible instead use already managed Cow and pass it to the normalize_line_end? Then upgrading Cow from borrowed to owned could be performed either in unescape, or in normalize_line_end and that upgrade always will be performed only once.

Also, do we really need two loops? memchr can search for three bytes at once and we search for three bytes: &, ; and \r. So it could be possible to perform unescaping and line end normalization in one loop.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know how to do it. What if there is nothing to unescape? You still need to call normalize_line_end for such input in the end, then what is the correct function parameter for normalize_line_end so that it can handle both situations?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, don't understand what the trouble? unescape will always be called when we request data via API methods. If cycling search of &, ;, or \r will not find anything, than it is really nothing to change in the data (I assume that cycles for unescaping and normalization merged in one).

Even if did not merge loops, normalize_line_end can accept &mut Cow<str> and change it instead of creating an own Cow by taking reference to &str and returning its own Cow<str>

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am talking about the not merge loops, as you point out that there are more line end situations to consider, it is beyond my knowledge to do SIMD stuff myself to merge the loops.

Current function signature is: fn normalize_line_end(input: &str) -> Cow<str>
Change to what? How to call it?

Maybe your requirement is beyond my knowledge. In that case, I cannot help. We can close this PR.

Copy link
Collaborator

@Mingun Mingun Sep 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, read my &mut Cow<str> as &mut Option<String>. So the whole code will look like:

fn unescape_with(...) {
  let mut unescaped = None;
  for ... {
    ...
    if need to unescape {
      if unescaped.is_none() {
        unescaped = Some(String::with_capacity(raw.len()));
      }
      let unescaped = unescaped.as_mut().expect("initialized");
    }
    ...
    normalize_line_end(&mut unescaped, raw.len());
    ...
  }
}

fn normalize_line_end(input: &mut Option<String>, raw_len: usize) {
  if need to normalize {
    if input.is_none() {
       input = Some(String::with_capacity(raw_len))
    }
    let normalized = input.as_mut().expect("initialized");
    ...
  }
}

it is beyond my knowledge to do SIMD stuff myself to merge the loops.

Actually, SIMD and merging loops are two separate conception, it is not necessary to explicitly apply SIMD techniques if you not aware of them, it's totally fine. I only tried to be helpful if you aware of it.

Maybe your requirement is beyond my knowledge. In that case, I cannot help. We can close this PR.

The requirement in only is not to make things worse. First of all we should not make it impossible to correctly process input (that is why we should at least process \r\u0085 because if we turn it into \n\u0085, the user will see two newlines in data instead of one, because specification allows \n, \r\u0085 and \u0085 be served as (one) newline characters). The second, we should to try that without much impact to performance, so if we can avoid unnecessary allocations we should do that.

If you're having a hard time figuring out how to get there, that's fine, it could be so. Thanks for opening PR anyway. I will consider it later when I return to active phase on the project (currently working on other project) and see how to improve it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for showing the code.
Yes, it works if you have something to unescape. But what if there is nothing to unescape from the original input?
Then you still have to call normlize_line_end to normalize the original input. How to you put original input raw str as parameter to your current unescape function?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it works if you have something to unescape. But what if there is nothing to unescape from the original input?

Unescape function decides where it is need to unescape something or not. It itself called always (well, because user requests unescaped and normalized data by demand, it will be called by demand, but that "demand" means "unesсape [and normalize] if there is anything")

How to you put original input raw str as parameter to your current unescape function?

Cannot understand what confuses you here. Calls to unescape will not be changed and it is already called whenever necessary.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pub fn unescape_with<'input, 'entity, F>(
    raw: &'input str,
    mut resolve_entity: F,
) -> Result<Cow<'input, str>, EscapeError>
where
    // the lifetime of the output comes from a capture or is `'static`
    F: FnMut(&str) -> Option<&'entity str>,
{

Nothing to escape means there is no & and ; in raw input. It will not go inside while below. So normalize will not happen to input here.

while let Some(start) = iter.by_ref().find(|p| bytes[*p] == b'&') {
        match iter.next() {
            Some(end) if bytes[end] == b';' => {
                // append valid data
                if unescaped.is_none() {
                    unescaped = Some(String::with_capacity(raw.len()));
                }
                let unescaped = unescaped.as_mut().expect("initialized");
                unescaped.push_str(&normalize_line_end(&raw[last_end..start]));

                // search for character correctness
                let pat = &raw[start + 1..end];
                if let Some(entity) = pat.strip_prefix('#') {
                    let codepoint = parse_number(entity).map_err(EscapeError::InvalidCharRef)?;
                    unescaped.push_str(codepoint.encode_utf8(&mut [0u8; 4]));
                } else if let Some(value) = resolve_entity(pat) {
                    unescaped.push_str(value);
                } else {
                    return Err(EscapeError::UnrecognizedEntity(
                        start + 1..end,
                        pat.to_string(),
                    ));
                }

                last_end = end + 1;
            }
            _ => return Err(EscapeError::UnterminatedEntity(start..raw.len())),
        }
    }

Will not go inside here either because nothing has been unescaped

if let Some(mut unescaped) = unescaped {    
    if let Some(raw) = raw.get(last_end..) {
        unescaped.push_str(&normalize_line_end(raw));
    }
    Ok(Cow::Owned(unescaped))
} else {

Must run normalize here, so how to call your changed function fn normalize_line_end(input: &mut Option<String>, raw_len: usize) here with raw as parameter?

        Ok(normalize_line_end(raw))
    }
}


// search for character correctness
let pat = &raw[start + 1..end];
Expand All @@ -290,14 +290,42 @@ where

if let Some(mut unescaped) = unescaped {
if let Some(raw) = raw.get(last_end..) {
unescaped.push_str(raw);
unescaped.push_str(&normalize_line_end(raw));
}
Ok(Cow::Owned(unescaped))
} else {
Ok(Cow::Borrowed(raw))
}
}

/// Normalize the line end, replace \r or \r\n with \n.
#[inline]
fn normalize_line_end(input: &str) -> Cow<str> {
let bytes = input.as_bytes();
let mut normalized = None;
let mut start = 0;
let iter = memchr_iter(b'\r', bytes);
for i in iter {
if normalized.is_none() {
normalized = Some(String::with_capacity(input.len()))
}
let normalized = normalized.as_mut().expect("initialized");
normalized.push_str(&input[start..i]);
normalized.push('\n');
start = i + 1;
if matches!(bytes.get(start), Some(&c) if c == b'\n') {
yorkz1994 marked this conversation as resolved.
Show resolved Hide resolved
// \n right after \r, \r\n case, skip \n because we have already replaced \r with \n
start += 1;
}
}
if let Some(mut normalized) = normalized {
normalized.push_str(&input[start..]);
Cow::Owned(normalized)
} else {
input.into()
}
}

/// Resolves predefined XML entities or all HTML5 entities depending on the feature
/// [`escape-html`](https://docs.rs/quick-xml/latest/quick_xml/#escape-html).
///
Expand Down
21 changes: 21 additions & 0 deletions tests/escape.rs
Original file line number Diff line number Diff line change
Expand Up @@ -75,6 +75,27 @@ fn unescape() {
);
}

#[test]
fn unescape_line_end() {
let unchanged = escape::unescape("test\n");
// assert_eq does not check that Cow is borrowed, but we explicitly use Cow
// because it influences diff
// TODO: use assert_matches! when stabilized and other features will bump MSRV
assert_eq!(unchanged, Ok(Cow::Borrowed("test\n")));
assert!(matches!(unchanged, Ok(Cow::Borrowed(_))));

assert_eq!(
escape::unescape("&lt;&amp;test&apos;\r&quot;\r\n&gt;\r\n\r"),
Ok("<&test'\n\"\n>\n\n".into())
);
assert_eq!(escape::unescape("&#x30;\r\r\n"), Ok("0\n\n".into()));
assert_eq!(escape::unescape("\r&#48;\n\r\r"), Ok("\n0\n\n\n".into()));
assert_eq!(
escape::unescape("\r\n&foo;\n"),
Err(EscapeError::UnrecognizedEntity(3..6, "foo".into()))
);
}

/// XML allows any number of leading zeroes. That is not explicitly mentioned
/// in the specification, but enforced by the conformance test suite
/// (https://www.w3.org/XML/Test/)
Expand Down