Improves fallback code in Utf8String::extractWords() #8298

Merged

Sesquipedalian commented Jul 25, 2024

Deals with this comment in the existing fallback code:

/*
* This is a sad, weak substitute for the IntlBreakIterator.
* It works well enough for European languages, but it fails badly
* for many Asian languages. To improve it will require adding more
* data to our Unicode data files and then writing code to implement
* the Unicode word break algorithm.
* See https://www.unicode.org/reports/tr29/#Word_Boundaries
*/

Specifically, this PR implements the default Unicode word break algorithm as fallback code for when the IntlBreakIterator class is unavailable. As noted in the Unicode documentation for this algorithm:

It is not possible to provide a uniform set of rules that resolves all issues across languages or that handles all ambiguous situations within a given language. The goal for the specification presented in this annex is to provide a workable default.

In other words, IntlBreakIterator remains the better choice when it is available, because it can adapt to the specific rules of a language, but the default algorithm is a decent option for most languages most of the time.
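
For illustration, here is a minimal sketch of the "prefer IntlBreakIterator, otherwise fall back" pattern described above. It is not the code from this PR: `extract_words_sketch()` is a hypothetical helper, and its fallback branch is a deliberately simple regex stand-in, not an implementation of the Unicode word break algorithm.

```php
<?php

/**
 * Hypothetical helper for illustration only; not Utf8String::extractWords().
 *
 * Uses IntlBreakIterator when the intl extension is loaded, and otherwise
 * falls back to a simple regex (an assumption standing in for the real
 * fallback, which implements the default Unicode word break rules).
 */
function extract_words_sketch(string $text, string $locale = 'en'): array
{
    if (class_exists('IntlBreakIterator')) {
        $iterator = IntlBreakIterator::createWordInstance($locale);

        if ($iterator !== null) {
            $iterator->setText($text);
            $words = [];

            // Boundaries are reported as byte offsets into the UTF-8 string.
            $start = $iterator->first();

            while (($end = $iterator->next()) !== IntlBreakIterator::DONE) {
                // Statuses below WORD_NONE_LIMIT mark spaces and punctuation.
                if ($iterator->getRuleStatus() >= IntlBreakIterator::WORD_NONE_LIMIT) {
                    $words[] = substr($text, $start, $end - $start);
                }

                $start = $end;
            }

            return $words;
        }
    }

    // Simplified stand-in: runs of letters, marks, and digits, optionally
    // joined by an apostrophe or full stop (e.g. "It's", "3.14").
    preg_match_all(
        '/[\p{L}\p{M}\p{N}]+(?:[\'’.][\p{L}\p{M}\p{N}]+)*/u',
        $text,
        $matches
    );

    return $matches[0];
}

print_r(extract_words_sketch("Hello, world! It's 2024."));
```

The regex branch shows why a rule-based fallback matters: it has no notion of scripts that do not separate words with spaces, which is exactly the weakness the old comment complains about.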

Sesquipedalian added the Localization Language & internationalization label Jul 25, 2024
Sesquipedalian merged commit caf9be4 into SimpleMachines:release-3.0 Jul 25, 2024
6 checks passed
Sesquipedalian deleted the word_break branch July 25, 2024 01:09