The TypeScript reference implementations of various string expression operators operate on UTF-16 code units instead of full Unicode characters. This is most apparent when a string contains a character outside the Basic Multilingual Plane (BMP): that character is represented by a surrogate pair of two UTF-16 code units, but the expression operators split the surrogate pair in half.
Examples
This example contains two symbol layers that display the length and name, respectively, of a point feature in a GeoJSON source. The latter symbol layer is filtered to only features that contain “市镇” (meaning “town”) at the zero-based index of 2.
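A minimal sketch of such a style, written as a TypeScript object literal. The source and layer IDs, the coordinates, and the text offset are assumptions made for illustration and are not taken from the original reproduction; a real style would also need a glyphs URL, which is left out here.

```typescript
// Hypothetical reproduction style. IDs, coordinates and the offset are illustrative.
const style = {
  version: 8,
  // NOTE: a "glyphs" URL is required to render text; it is omitted in this sketch.
  sources: {
    point: {
      type: "geojson",
      data: {
        type: "Feature",
        properties: { name: "丐𦨭市镇" },
        geometry: { type: "Point", coordinates: [0, 0] },
      },
    },
  },
  layers: [
    {
      id: "length",
      type: "symbol",
      source: "point",
      layout: {
        // Label the feature with the length of its name.
        "text-field": ["to-string", ["length", ["get", "name"]]],
      },
    },
    {
      id: "name",
      type: "symbol",
      source: "point",
      // Keep only features whose name has "市镇" at zero-based index 2.
      filter: ["==", ["index-of", "市镇", ["get", "name"]], 2],
      layout: {
        "text-field": ["get", "name"],
        // Shift this label down so it does not collide with the length label.
        "text-offset": [0, 1.5],
      },
    },
  ],
};
```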
Two labels should appear, reading “4” and “丐𦨭市镇”. However, only one label appears, reading “5”.
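The same numbers fall out of plain string operations, independent of the style specification. This is only an illustration of the underlying JavaScript behavior, not the operators' actual code, and the variable names are made up for the example.

```typescript
const placeName = "丐𦨭市镇";

// String.prototype.length and indexOf count UTF-16 code units.
const lengthInCodeUnits = placeName.length;          // 5
const indexInCodeUnits = placeName.indexOf("市镇");   // 3, so a comparison with 2 fails

// Array.from iterates by code point, so the surrogate pair stays together.
const lengthInCodePoints = Array.from(placeName).length; // 4

// The two code units that make up 𦨭 (U+26A2D), as uppercase hex.
const surrogates = [placeName.charCodeAt(1), placeName.charCodeAt(2)]
  .map((unit) => unit.toString(16).toUpperCase()); // ["D85A", "DE2D"]

// Slicing by code unit can cut the pair in half, leaving a lone high surrogate.
const brokenSlice = placeName.slice(0, 2);
```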
This is because JavaScript stores 𦨭 (U+26A2D) as two UTF-16 code units: D85A DE2D.
For a more complex example, OpenStreetMap Americana labels place names in both the user’s preferred language and the prevailing local language. To avoid clutter, it applies some heuristics to deduplicate matching names between the two languages (ignoring diacritics when comparing against English). This requires a find-and-replace operation, but since maplibre/maplibre-gl-js#2064 and maplibre/maplibre-gl-js#2059 were both declined, the application includes some complex subexpressions that depend on length. This works fine now but will begin to return bizarre results once maplibre/maplibre-gl-js#4550 lands, even among strings that don’t contain any surrogate pairs.
则拉市镇 (Chợ Lách)
则拉市镇 (则hợ Lách
Impact
Until now, the impact would’ve been minimal, because GL JS has avoided rendering any codepoint beyond U+FFFF that would require surrogate pairs. The filters and properties would’ve evaluated incorrectly for any affected feature, but that would’ve been less noticeable than the abridged label for the same feature. However, maplibre/maplibre-gl-js#4550 would implement support for rendering non-BMP characters, making this issue more noticeable.
Platform information
This issue reproduces in GL JS v4.5.2 but has probably been present ever since expressions were first implemented. The native implementation is even less intuitive, counting individual bytes: maplibre/maplibre-native#2730.
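For comparison, here is a small sketch that counts the sample name in bytes. It assumes the bytes being counted are UTF-8, which this report does not state (maplibre/maplibre-native#2730 has the details), and utf8ByteLength is a hypothetical helper written only for this comparison.

```typescript
// Hypothetical helper: number of bytes a string would occupy when encoded as UTF-8.
function utf8ByteLength(text: string): number {
  let bytes = 0;
  // Array.from splits the string by code point rather than by UTF-16 code unit.
  for (const ch of Array.from(text)) {
    const codePoint = ch.codePointAt(0) ?? 0;
    if (codePoint < 0x80) bytes += 1;
    else if (codePoint < 0x800) bytes += 2;
    else if (codePoint < 0x10000) bytes += 3;
    else bytes += 4;
  }
  return bytes;
}

// 丐, 市 and 镇 take 3 bytes each; 𦨭 takes 4.
const byteCount = utf8ByteLength("丐𦨭市镇"); // 13
```

So the same four-character name measures 4 code points, 5 UTF-16 code units, or 13 UTF-8 bytes, depending on what is counted.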
Diagnosis
The implementations of these expression operators use traditional String methods that operate on UTF-16 code units. As in maplibre/maplibre-gl-js#4550, they need to be replaced with a string iterator.
maplibre-style-spec/src/expression/definitions/length.ts (lines 35 to 36 in 76aabe5)
maplibre-style-spec/src/expression/definitions/slice.ts (line 65 in 76aabe5)
maplibre-style-spec/src/expression/definitions/index_of.ts (line 69 in 76aabe5)
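As a rough sketch of what iterating by code point could look like, the helpers below mirror length, slice, and index-of. They are hypothetical: the names are invented here, and this is neither the code in the files referenced above nor the change proposed in maplibre/maplibre-gl-js#4550. Array.from is used because it walks a string by code point rather than by UTF-16 code unit.

```typescript
// Hypothetical code-point-aware counterparts to length, slice, and index-of.

function codePointLength(text: string): number {
  return Array.from(text).length;
}

function codePointSlice(text: string, start: number, end?: number): string {
  // Array.prototype.slice works on whole code points here, including negative indices.
  return Array.from(text).slice(start, end).join("");
}

function codePointIndexOf(haystack: string, needle: string, fromIndex = 0): number {
  const hay = Array.from(haystack);
  const target = Array.from(needle);
  for (let i = Math.max(0, fromIndex); i + target.length <= hay.length; i++) {
    let matches = true;
    for (let j = 0; j < target.length; j++) {
      if (hay[i + j] !== target[j]) {
        matches = false;
        break;
      }
    }
    if (matches) return i; // index counted in code points
  }
  return -1;
}

const sample = "丐𦨭市镇";
const sampleLength = codePointLength(sample);        // 4
const sampleIndex = codePointIndexOf(sample, "市镇"); // 2
const sampleTail = codePointSlice(sample, 1, 2);     // "𦨭", kept as one character
```

Building an array for every call keeps the sketch short but allocates on each evaluation; an implementation meant for per-feature evaluation would probably step through the string iterator directly instead.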