Add normalize_line_end for unescape and test #807
Escaping may be on the hot path, so it is not acceptable to allocate new owned strings just to take a reference and append it to an already allocated string. Would it be possible to use the already managed `Cow` instead and pass it to `normalize_line_end`? Then the upgrade of the `Cow` from borrowed to owned could be performed either in `unescape` or in `normalize_line_end`, and that upgrade would always be performed only once.

Also, do we really need two loops? `memchr` can search for three bytes at once, and we search for exactly three bytes: `&`, `;`, and `\r`. So it should be possible to perform unescaping and line end normalization in one loop.

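As an illustration of the single-loop idea above, here is a minimal sketch that is not taken from the PR. The function name `unescape_and_normalize` and the small fixed entity table are assumptions made for the example, and a plain std iterator search stands in for the `memchr` crate so that the snippet has no dependencies. It scans for `&` and `\r` in one loop and allocates only when the first change is found.

```rust
use std::borrow::Cow;

/// Hypothetical helper (not part of the PR): resolves a few predefined
/// entities and rewrites `\r\n` and lone `\r` to `\n` in a single loop.
fn unescape_and_normalize(raw: &str) -> Cow<'_, str> {
    let bytes = raw.as_bytes();
    let mut out: Option<String> = None; // stays `None` while nothing changed
    let mut copied = 0; // start of the part of `raw` not yet copied to `out`
    let mut pos = 0; // where the next search starts

    // std stand-in for a `memchr`-style multi-byte search
    while let Some(off) = bytes[pos..].iter().position(|&b| b == b'&' || b == b'\r') {
        let start = pos + off;
        let (replacement, next) = if bytes[start] == b'\r' {
            // `\r\n` is one line end, so also skip the `\n`
            let next = if bytes.get(start + 1) == Some(&b'\n') { start + 2 } else { start + 1 };
            ("\n", next)
        } else {
            // `&`: find the closing `;`, otherwise leave the text untouched
            let end = match raw[start..].find(';') {
                Some(semi) => start + semi,
                None => { pos = start + 1; continue; }
            };
            let replacement = match &raw[start + 1..end] {
                "lt" => "<",
                "gt" => ">",
                "amp" => "&",
                "quot" => "\"",
                "apos" => "'",
                // unknown entity: this sketch keeps it as is
                _ => { pos = start + 1; continue; }
            };
            (replacement, end + 1)
        };
        // First change upgrades to an owned buffer; later changes reuse it.
        let buf = out.get_or_insert_with(|| String::with_capacity(raw.len()));
        buf.push_str(&raw[copied..start]);
        buf.push_str(replacement);
        copied = next;
        pos = next;
    }

    match out {
        Some(mut buf) => {
            buf.push_str(&raw[copied..]);
            Cow::Owned(buf)
        }
        None => Cow::Borrowed(raw), // nothing to unescape or normalize
    }
}
```

Input without `&` or `\r` comes back as `Cow::Borrowed`, so the common case does not allocate.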
I don't know how to do it. What if there is nothing to unescape? You still need to call `normalize_line_end` for such input in the end, so what is the correct function parameter for `normalize_line_end` so that it can handle both situations?

Hm, I don't understand what the trouble is. `unescape` will always be called when we request data via API methods. If the search loop for `&`, `;`, or `\r` does not find anything, then there really is nothing to change in the data (I assume here that the loops for unescaping and normalization are merged into one).

Even if the loops are not merged, `normalize_line_end` can accept `&mut Cow<str>` and change it in place, instead of creating its own `Cow` by taking a `&str` and returning its own `Cow<str>`.

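A minimal sketch of the `&mut Cow<str>` variant described above, assuming only `\r\n` and lone `\r` need to be handled. The body is an illustration, not the code from the PR. The caller keeps one `Cow` that starts as `Cow::Borrowed(raw)`, so the same call works whether or not anything was unescaped before.

```rust
use std::borrow::Cow;

/// Hypothetical in-place variant: `text` is upgraded to `Cow::Owned`
/// only if it actually contains a `\r`.
fn normalize_line_end(text: &mut Cow<'_, str>) {
    let src: &str = &**text;
    let first_cr = match src.find('\r') {
        Some(i) => i,
        None => return, // nothing to normalize, a borrowed `Cow` stays borrowed
    };

    let mut out = String::with_capacity(src.len());
    out.push_str(&src[..first_cr]);

    let mut chars = src[first_cr..].chars().peekable();
    while let Some(c) = chars.next() {
        if c == '\r' {
            // `\r\n` collapses to a single `\n`
            if chars.peek() == Some(&'\n') {
                chars.next();
            }
            out.push('\n');
        } else {
            out.push(c);
        }
    }

    *text = Cow::Owned(out);
}
```

With this shape there is a single `Cow` for the whole pipeline, so at most one borrowed-to-owned upgrade happens per stage that actually changes the data.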
I am talking about the case where the loops are not merged. As you point out, there are more line end situations to consider, and it is beyond my knowledge to do the SIMD stuff myself to merge the loops.
The current function signature is:
`fn normalize_line_end(input: &str) -> Cow<str>`
What should it change to? How would it be called?
Maybe your requirement is beyond my knowledge. In that case, I cannot help. We can close this PR.

Ah, read my `&mut Cow<str>` as `&mut Option<String>`. So the whole code will look like:

Actually, SIMD and merging the loops are two separate concepts. It is not necessary to explicitly apply SIMD techniques if you are not familiar with them, that's totally fine. I only tried to be helpful in case you were.

The only requirement is not to make things worse. First of all, we should not make it impossible to process the input correctly. That is why we should at least handle `\r\u0085`: if we turn it into `\n\u0085`, the user will see two newlines in the data instead of one, because the specification allows `\n`, `\r\u0085` and `\u0085` to each serve as (one) newline character. Second, we should try to do that without much impact on performance, so if we can avoid unnecessary allocations, we should.

If you're having a hard time figuring out how to get there, that's fine. Thanks for opening the PR anyway. I will consider it later, when I return to an active phase on the project (I am currently working on another project), and see how to improve it.

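To make the `\r\u0085` concern concrete, here is a sketch of a normalizer that treats two-character line ends as a single newline. The exact rule set is an assumption based on the XML 1.1 end-of-line handling rules (`\r\n`, `\r\u{85}`, `\u{85}`, `\u{2028}` and a lone `\r` all become `\n`). It is not the code from the PR.

```rust
use std::borrow::Cow;

/// Sketch under the assumed XML 1.1 rules: every recognized line end
/// becomes exactly one `\n`.
fn normalize_line_end(input: &str) -> Cow<'_, str> {
    let is_line_end = |c: char| matches!(c, '\r' | '\u{85}' | '\u{2028}');
    let first = match input.find(is_line_end) {
        Some(i) => i,
        None => return Cow::Borrowed(input), // no allocation if nothing to do
    };

    let mut out = String::with_capacity(input.len());
    out.push_str(&input[..first]);

    let mut chars = input[first..].chars().peekable();
    while let Some(c) = chars.next() {
        match c {
            '\r' => {
                // `\r\n` and `\r\u{85}` are ONE line end: drop the second char
                if matches!(chars.peek(), Some(&'\n') | Some(&'\u{85}')) {
                    chars.next();
                }
                out.push('\n');
            }
            '\u{85}' | '\u{2028}' => out.push('\n'),
            other => out.push(other),
        }
    }
    Cow::Owned(out)
}
```

A normalizer that only rewrites `\r` would turn `\r\u{85}` into `\n\u{85}`, which a later consumer could read as two line ends. Consuming the pair together avoids that.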
Thanks for showing the code.
Yes, it works if you have something to unescape. But what if there is nothing to unescape in the original input?
Then you still have to call `normalize_line_end` to normalize the original input. How do you pass the original raw `&str` input as a parameter to your current unescape function?

The unescape function decides whether it needs to unescape something or not. It is always called (well, since the user requests unescaped and normalized data on demand, it is called on demand, but that "demand" means "unescape [and normalize] if there is anything to do").
I cannot understand what confuses you here. Calls to `unescape` will not be changed, and it is already called whenever necessary.

Nothing to unescape means there is no `&` and `;` in the raw input. It will not go inside the `while` below, so normalization will not happen to the input here.
It will not go inside here either, because nothing has been unescaped.
Normalization must run here, so how do I call your changed function
`fn normalize_line_end(input: &mut Option<String>, raw_len: usize)`
here with raw as the parameter?
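
One possible way to cover the "nothing was unescaped" case, sketched here as an assumption and not as the reviewer's code: pass the raw input alongside the `Option<String>` buffer instead of only its length. The helper `finish` and the `raw: &str` parameter are hypothetical names introduced for this example, and only `\r\n` and lone `\r` are handled.

```rust
use std::borrow::Cow;

/// Hypothetical variant: `buf` is `None` while nothing has been rewritten.
/// The raw input is passed too, so the `None` case can still be normalized.
fn normalize_line_end(buf: &mut Option<String>, raw: &str) {
    // Use the already unescaped text if there is one, else the raw input.
    let src: &str = buf.as_deref().unwrap_or(raw);
    if !src.contains('\r') {
        return; // nothing to do, `buf` keeps its current state
    }
    // Two passes for brevity; a real implementation would do one.
    let normalized = src.replace("\r\n", "\n").replace('\r', "\n");
    *buf = Some(normalized);
}

/// Stand-in for the tail of `unescape`: `unescaped` is `None` when the
/// entity loop never found anything to replace.
fn finish<'a>(raw: &'a str, mut unescaped: Option<String>) -> Cow<'a, str> {
    normalize_line_end(&mut unescaped, raw);
    match unescaped {
        Some(s) => Cow::Owned(s),
        None => Cow::Borrowed(raw),
    }
}
```

Here `finish(raw, None)` still normalizes `raw`, and it returns `Cow::Borrowed(raw)` when the input contains no `\r`.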