-
Notifications
You must be signed in to change notification settings - Fork 555
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Combine the conversion of UTF-8 to bytes into a single base function #22703
Closed
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Prior to this commit these were illegal. This causes embed.fnc to generate macro 'Perl_foo' #defined to be macro 'foo'. This could be used to easily convert existing macros into having long names should some become a name space pollution problem. This also documents in embed.fnc, under the 'm' flag discussion, how to use this instead of placing things in mathoms.c, or creating stub functions.
Commit e4d3d0c removed all the calls to this function.
Outdent and reflow some comments and code in preparation for them to be moved out of the loop
This is for clarity. All this very-unlikely-to-be-used code was in the middle of what is really going on, creating a distraction.
The previous version did not make sure that it wasn't reading beyond the end of the buffer in all cases, and the first pass through the input string already ruled out it having most problems. Thus we don't need the full generality here of the macro UTF8_IS_DOWNGRADEABLE_START; and this simplifies things
These were misleading. On ASCII platforms, many calls to this function won't use the per-word algorithm. That's only done for long-enough strings.
The new name, s0, is used in more other places for this meaning, and is more descriptive.
This is an internal function, designed to be an extension of utf8_to_bytes(), with a slightly different API. This function just adds it and calls it from just utf8_to_bytes. Future commits will extend this API.
This adds a third return possibility to this new function. Its sole caller still treats it as a boolean for now.
This variable should not be being changed by the function
The argument is currently unused. The macro is a public facing API that calls this function with the correct argument
This makes the next commit smaller
This causes this function to be able to both overwrite the input, and to instead create new memory. It changes bytes_from_utf8() to use this new capability instead of being a near duplication of the core code of this function. Prior to this commit, bytes_from_utf8() just allocated memory the size of the original string, and started copying into it. When it came to a sequence that wasn't convertible, it stopped, and freed up the copy. The new behavior has it checking first before the malloc that the string is convertible. That has the advantage that there is no malloc without being sure it will be useful; but the disadvantage that there is an extra pass through the input string, but that pass is per-word. The next commit will introduce another advantage.
Prior to this commit, the size malloced was just the same as the length of the input string, which is a worst case scenario. This commit changes so the new pass through the input (introduced in the previous commit) also calculates the needed length. The additional cost of doing this is minimal. It has advantages on a very long string with lots of sequences that are convertible.
This is a non-destructive conversion of the input into native bytes, and with any new memory required set for destruction via SAVEFREEPV. This allows the caller to not have to be concerned at all if memory was created or not. A new macro is created that calls this internal function with the correct parameter to force this behavior.
khwilliamson
force-pushed
the
utf8_to_bytes_opts
branch
from
October 26, 2024 14:23
b99cd7d
to
ee373c8
Compare
This pull request is going forward instead via #22638 |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There are two functions that do this currently,
utf8_to_bytes
andbytes_from_utf8
This combines the guts of their implementation.It also creates 3 new public macros to call the base function with a somewhat different API that is more convenient to use, and can avoid mallocs when not needed
The first three commits in this series are needed as a base for the rest, but are already in individual pull requests.