String expressions operate on UTF-16 code units instead of characters #778

Closed
1ec5 opened this issue Aug 16, 2024 · 1 comment · Fixed by #779
Labels: bug (Something isn't working), PR is more than welcome

Comments

1ec5 (Contributor) commented Aug 16, 2024

The TypeScript reference implementations of various string expression operators operate on UTF-16 code units instead of full Unicode characters. This is most apparent when a string contains a character outside the Basic Multilingual Plane (BMP): that character is represented by a surrogate pair of two UTF-16 code units, but the expression operators split the surrogate pair in half.

Examples

This example contains two symbol layers that display the length and name, respectively, of a point feature in a GeoJSON source. The latter symbol layer is filtered to only features that contain “市镇” (meaning “town”) at the zero-based index of 2.
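The linked example is not reproduced here, but its two layers might look roughly like the sketch below. The layer IDs, the source name `points`, the `name` property, and the use of `index-of` in the filter are assumptions made for illustration; the original example may express the filter differently.

```typescript
// Hypothetical layer definitions; IDs, source name, and property name are assumed.
const lengthLayer = {
    id: 'name-length',
    type: 'symbol',
    source: 'points',
    layout: {
        // Displays the length of the feature's name as text.
        'text-field': ['to-string', ['length', ['get', 'name']]],
    },
};

const nameLayer = {
    id: 'name',
    type: 'symbol',
    source: 'points',
    // Keeps only features whose name contains "市镇" at zero-based index 2.
    filter: ['==', ['index-of', '市镇', ['get', 'name']], 2],
    layout: {
        'text-field': ['get', 'name'],
    },
};
```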

Two labels should appear, reading “4” and “丐𦨭市镇”. However, only one label appears, reading “5”.

[Screenshot: the number 5 all alone.]

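This is because JavaScript stores 𦨭 (U+26A2D) as two UTF-16 code units: D85A DE2D. The string is therefore 5 code units long, and “市镇” begins at code unit index 3 rather than 2, so the filter rejects the feature.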
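The mismatch can be checked in plain TypeScript, independent of GL JS. This is only a small illustration of the underlying string behavior; the variable names are made up for the example.

```typescript
const placeName = '丐𦨭市镇';

// String.prototype.length counts UTF-16 code units.
const codeUnitCount = placeName.length; // 5

// Array.from uses the string iterator, which yields whole code points.
const codePointCount = Array.from(placeName).length; // 4

// The two halves of the surrogate pair that encodes U+26A2D.
const highSurrogate = placeName.charCodeAt(1).toString(16); // 'd85a'
const lowSurrogate = placeName.charCodeAt(2).toString(16); // 'de2d'

// indexOf reports a code unit offset, one more than the code point offset of 2.
const codeUnitIndex = placeName.indexOf('市镇'); // 3
```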

For a more complex example, OpenStreetMap Americana labels place names in both the user’s preferred language and the prevailing local language. To avoid clutter, it applies some heuristics to deduplicate matching names between the two languages (ignoring diacritics when comparing against English). This requires a find-and-replace operation, but since maplibre/maplibre-gl-js#2064 and maplibre/maplibre-gl-js#2059 were both declined, the application includes some complex subexpressions that depend on length. This works fine now but will begin to return bizarre results once maplibre/maplibre-gl-js#4550 lands, even among strings that don’t contain any surrogate pairs.

Before: 则拉市镇 (Chợ Lách)
After: 则拉市镇 (则hợ Lách

Impact

Until now, the impact would’ve been minimal, because GL JS has avoided rendering any codepoint beyond U+FFFF that would require surrogate pairs. The filters and properties would’ve evaluated incorrectly for any affected feature, but that would’ve been less noticeable than the abridged label for the same feature. However, maplibre/maplibre-gl-js#4550 would implement support for rendering non-BMP characters, making this issue more noticeable.

Platform information

This issue reproduces in GL JS v4.5.2 but has probably been present ever since expressions were first implemented. The native implementation is even less intuitive, counting individual bytes: maplibre/maplibre-native#2730.

Diagnosis

The implementations of these expression operators use traditional String methods that operate on UTF-16 code units. As in maplibre/maplibre-gl-js#4550, they need to be replaced with a string iterator.

if (typeof input === 'string') {
    return input.length;

return input.slice(beginIndex, endIndex);

return haystack.indexOf(needle, fromIndex);
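One way the string iterator could replace these calls is sketched below. The helper names are hypothetical and this is not the fix that was merged in #779; it only shows code point-aware equivalents of `length`, `slice`, and `indexOf`, trading some performance for simplicity.

```typescript
// Hypothetical helpers; names and structure are assumptions, not the actual fix.

// Splits a string into code points via the string iterator.
function toCodePoints(input: string): string[] {
    return Array.from(input);
}

function codePointLength(input: string): number {
    return toCodePoints(input).length;
}

function codePointSlice(input: string, beginIndex: number, endIndex?: number): string {
    return toCodePoints(input).slice(beginIndex, endIndex).join('');
}

function codePointIndexOf(haystack: string, needle: string, fromIndex = 0): number {
    // Convert the code point offset into a code unit offset for the native search.
    const start = Math.max(0, fromIndex);
    const unitStart = toCodePoints(haystack).slice(0, start).join('').length;
    const unitIndex = haystack.indexOf(needle, unitStart);
    if (unitIndex < 0) return -1;
    // Convert the matched code unit offset back into a code point offset.
    // Caveat: a needle that is a lone surrogate can match inside a pair.
    return toCodePoints(haystack.slice(0, unitIndex)).length;
}
```

With these helpers, `codePointLength('丐𦨭市镇')` is 4 and `codePointIndexOf('丐𦨭市镇', '市镇')` is 2, matching the results the example expects.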

HarelM (Collaborator) commented Aug 17, 2024

Feel free to submit a PR to fix this, I'm pretty sure you have the right knowledge, and I'm pretty sure I don't... :-)
