Render complex text, variant forms, emoji, etc. #1

1ec5 · 2024-08-19T07:03:41Z

This branch adds experimental support for rendering text in Indic and other complex scripts, variant character forms, and combining diacritics. As a side effect, some emoji sequences now appear as single glyphs, though only as silhouettes.

Text segmentation

Currently, text-field strings are segmented by UTF-16 code units. maplibre#4550 refactors various text processing classes to segment by full Unicode characters (code points) instead, so that a surrogate pair counts as one character, expanding text rendering support to the rest of the Unicode character repertoire. However, in many common situations, a single Unicode character in isolation cannot adequately represent a grapheme cluster that the user perceives as a single glyph. This branch segments strings by grapheme cluster for rendering purposes. Additionally, it refactors the glyph atlas to index glyph data by grapheme cluster strings, whereas currently it is indexed by codepoints. (Actually, the codepoints are converted to numbers and then stringified for maximum inefficiency, apparently for consistency with the glyph PBF format.)
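To make the three levels of segmentation concrete, here is a minimal sketch that splits the same string by code unit, by code point, and by grapheme cluster. It is not code from this branch; the function names are hypothetical, and the grapheme results depend on the ICU data that ships with the browser or Node.

```typescript
// Hypothetical helpers for illustration; none of these names come from GL JS.

/** Splits by UTF-16 code unit, which tears surrogate pairs apart. */
function byCodeUnit(text: string): string[] {
    return text.split('');
}

/** Splits by Unicode code point; string iteration keeps surrogate pairs intact. */
function byCodePoint(text: string): string[] {
    return Array.from(text);
}

/** Splits by grapheme cluster using the runtime's built-in segmenter. */
function byGraphemeCluster(text: string): string[] {
    const segmenter = new Intl.Segmenter(undefined, {granularity: 'grapheme'});
    return Array.from(segmenter.segment(text), (s) => s.segment);
}

const astral = '\u{20BB7}'; // a CJK ideograph outside the Basic Multilingual Plane
const accented = 'e\u0301'; // e followed by COMBINING ACUTE ACCENT
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}'; // emoji joined by ZWJs

console.log(byCodeUnit(astral).length, byCodePoint(astral).length);
console.log(byCodePoint(accented).length, byGraphemeCluster(accented).length);
console.log(byCodePoint(family).length, byGraphemeCluster(family).length);
```

Only the grapheme cluster split keeps the accented letter and the emoji sequence whole, which is why the atlas in this branch is keyed by cluster strings rather than by codepoint.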

Segmenting strings by grapheme cluster requires the Intl.Segmenter API, which is very new. Firefox was the last major holdout, adding support for this API only a few months ago. For older browsers, it will be necessary to fall back to segmenting by Unicode character. Grapheme cluster segmentation is not a panacea: maplibre/maplibre-native#778 (reply in thread) discusses some limitations around cursive scripts. Unless a workaround can be found, the mapbox-gl-rtl-text plugin will probably continue to be required for Arabic typesetting.
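The fallback described above could be wired up with a feature check along these lines. This is a sketch under assumptions, not the branch's actual polyfill; splitForRendering is a hypothetical name.

```typescript
/**
 * Returns the grapheme clusters of `text`, or its code points when
 * Intl.Segmenter is unavailable. `splitForRendering` is a hypothetical name.
 */
function splitForRendering(text: string): string[] {
    if (typeof Intl !== 'undefined' && 'Segmenter' in Intl) {
        const segmenter = new Intl.Segmenter(undefined, {granularity: 'grapheme'});
        return Array.from(segmenter.segment(text), (s) => s.segment);
    }
    // Fallback: one entry per code point. Combining marks and conjuncts
    // stay split apart, so complex scripts degrade to the old behavior.
    return Array.from(text);
}

console.log(splitForRendering('abc'));
```

Either path returns pieces that concatenate back to the input, so callers do not need to know which one ran.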

The segmenter understands emoji sequences, including sequences that include zero-width joiners. However, it is only possible to render the emoji’s silhouette for the time being, because the glyph atlas only stores the alpha channel of the glyph image. Storing each of the color channels would enable the shader to draw color emoji, as demonstrated in 1ec5/tiny-sdf#1, but it would need to be limited to detected emoji sequences to avoid largely frivolous overhead.
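One rough way to limit color rendering to emoji would be to test each grapheme cluster for a pictographic code point. The heuristic and the name looksLikeEmoji are assumptions made for illustration; the branch does not necessarily detect emoji this way.

```typescript
// Rough heuristic, assumed for illustration: a cluster counts as emoji if it
// contains any code point with the Extended_Pictographic property.
const pictographic = /\p{Extended_Pictographic}/u;

/** Returns true if the cluster probably needs a color glyph. */
function looksLikeEmoji(cluster: string): boolean {
    return pictographic.test(cluster);
}

console.log(looksLikeEmoji('\u{1F600}')); // grinning face
console.log(looksLikeEmoji('a'));
```

The property also covers some symbols that normally render as text, such as ©, so a production check would probably need to consider variation selectors and default presentation as well.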

The custom word breaking heuristics for determining line breaking opportunities have been replaced by a word segmenter. This introduces word wrapping for the first time to writing systems such as Thai and Khmer that don’t put spaces between words. It also keeps Hanzi/hanja/kanji compound words together based on the browser’s built-in compound dictionary. This obviates the server-side workaround that Mapbox introduced in mapbox/mapbox-gl-js#8255 (before the fork). If a tileset has inserted zero-width spaces between compounds as hints, as the Mapbox Streets source does, the word segmenter will continue to honor those hints as a matter of course, but this functionality now comes “for free” without any developer intervention.
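As a sketch of how a word segmenter yields line breaking opportunities, the following collects the offsets at which each word segment starts. The name wordBreakOffsets is hypothetical, and the results for Thai or Khmer depend on the dictionary bundled with the runtime.

```typescript
/**
 * Returns the UTF-16 offsets at which a word segment starts, excluding 0.
 * Each offset is a candidate line break. `wordBreakOffsets` is a hypothetical name.
 */
function wordBreakOffsets(text: string, locale?: string): number[] {
    const segmenter = new Intl.Segmenter(locale, {granularity: 'word'});
    const offsets: number[] = [];
    for (const {index} of segmenter.segment(text)) {
        // Whitespace and punctuation are reported as segments of their own,
        // so a real line breaker would still filter these candidates.
        if (index > 0) offsets.push(index);
    }
    return offsets;
}

console.log(wordBreakOffsets('Hello world')); // [5, 6]
console.log(wordBreakOffsets('\u0E41\u0E21\u0E48\u0E19\u0E49\u0E33\u0E42\u0E02\u0E07', 'th')); // Mekong River in Thai
```

A zero-width space in the source data still produces a segment boundary here, which is how existing tileset hints keep working without special handling.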

Local text rendering

As TinySDF is my hammer, everything looks like a nail. This branch completely disables the glyph PBF pipeline for remotely rendered glyphs in favor of rendering every grapheme cluster locally through TinySDF, making maplibre#4564 unnecessary. The changes obviate much of the original reason that Mapbox created a Fontstack API and defined the glyph PBF format.

The expanded use of TinySDF creates a need for more granular control over font selection beyond the single font specifier for local “ideographic” text. The developer can already set the font specifier to the name of a Web font defined in the surrounding webpage’s stylesheet. However, this option is currently global; the font choice should come instead from text-font, which is no longer used for remote glyph rendering. An event handler will need to be added to call Map.prototype.redraw once any Web fonts are done loading.

Vertical alignment has often been cited as a downside of TinySDF, but it’s actually the glyph PBFs that are to blame. Glyph PBFs don’t encode enough glyph metrics to reliably align glyphs to a common baseline, so there were hard-coded fudge factors in a few different places in the codebase to vertically shift locally rendered glyphs to match remotely rendered glyphs. These fudge factors assume the metrics of Arial Unicode MS, which is the default text-font fallback but is also an outlier for its line height, even among pan-Unicode fonts. With the removal of the glyph PBF mechanism, it becomes possible to remove these fudge factors.

In theory, it should be possible to use TinySDF only for grapheme clusters that can’t be represented by the glyph PBF format. However, that would yield extremely inconsistent visual results, because most scripts that contain nontrivial grapheme clusters also have plenty of unclustered graphemes sprinkled throughout in ordinary text. If backwards compatibility with older styles is a concern, I believe it would be better to render ~~glyphs~~ grapheme clusters locally in general and at most render Latin, Cyrillic, and Greek from glyph PBFs as an exception.

The glyph PBF mechanism has been primarily of use to Western languages. Most published fontstacks consist of one or two specially chosen Western fonts combined with a crude, pan-Unicode font as an afterthought to serve as a fallback for the rest of the world’s languages. While the glyph PBF format has the advantage of not requiring embedding and redistribution rights from the font designer, the most popular fonts for non-Western languages are generally open-source fonts that can be served up as Web fonts and rendered through TinySDF without any legal obstacles.

Prior art

maplibre#2458 similarly relies on TinySDF for all text. However, it requires the tileset or GeoJSON data to include manually placed control characters to mark grapheme cluster boundaries. I think any required server-side hinting would significantly limit the deployment of complex text rendering to end users, as Mapbox discovered early on: mapbox/DEPRECATED-mapbox-gl#4 (comment). Even though Intl.Segmenter still falls short of the gold standard in HarfBuzz/Raqm/FriBiDi, it comes with such low overhead that GL JS might as well take advantage of it rather than allow Indic text to continue to get mangled.

Odds and ends

This branch also includes some miscellaneous fixes for things I spotted along the way. Some types were misspelled. Some unit tests relied on outdated fixtures that cast incompatible data to the expected data type; TypeScript only started flagging it once I modified the types just a little more.

These changes would fix maplibre#50 and maplibre#2384. I’m posting this draft in my own fork for now while I consider how to stage these changes in more manageable chunks and discuss the backwards compatibility issues with the MapLibre maintainers.

Additionally, the following proofs of concept are based on this one:

1ec5 · 2024-08-19T07:15:01Z

To see the demonstration screenshotted above:

1. Run npm run build-dist.
2. Drop the contents of this gist into a file named index.html in the dist/ folder.
3. Run npm start.
4. Open http://0.0.0.0:9966/dist/index.html#3.03/16.25/17.32

claysmalley · 2024-08-21T02:08:21Z

I had noticed some segmenting issues in Thai as a result of 304acf5, which led me on a quest to develop a better test suite for multilingual character segmentation.

The big thing with Indic scripts is conjunct consonants (i.e. ligatures) and combining vowel characters. Each script does this a little differently, but there are a few commonalities here and there:

Scripts of the northern Indian subcontinent tend to form conjuncts by reducing the first consonant to a "half form" and blending it into the following consonant. There are several exceptions.
Scripts of the southern subcontinent and Southeast Asia tend to form conjuncts by stacking a second consonant below, if at all. There are also several exceptions.
Thai and Lao don't have conjunct consonants. Consonant clusters are implied by context; there is no "virama" diacritic in modern use.
In conjuncts, /r/ often has a dramatically different appearance from its isolated form.
One notable exception within the subcontinent is the /kṣ/ conjunct (as in Lakshmi), which is usually a special form that looks unlike its components. This is even the case in Tamil, which otherwise has very few conjuncts compared to its neighbors.
In Burmese and Khmer (and some other scripts), consonant stacking is obligatory—these scripts have no "virama" diacritic that a renderer can simply fall back to. The Unicode codepoints referred to as VIRAMA in these scripts are actually Invisible Stackers (58ecf5c).
Thai and Lao are encoded in visual order (i.e. typewriter style) instead of logical order. Like other Indic scripts, Thai and Lao have certain vowel marks that are written preceding consonants. However, these particular vowels are encoded as standalone characters that literally precede their consonant, rather than being combining characters that advance the position of their attached consonant.
Tibetan is also encoded in visual order. Conjuncts are stacked, but each consonant is encoded with a separate combining character for the stacked version, instead of reusing the same codepoints with an Invisible Stacker in between.

Here are a few test cases that I think are good indicators for poorly aligned combining characters and ligatures. Especially effective when text-letter-spacing > 0.

1

"name_en": "Bengaluru",
"name_hi": "बेंगलुरु",
"name_gu": "બેંગલુરુ",
"name_pa": "ਬੈਂਗਲੁਰੂ",
"name_bn": "বেঙ্গালুরু",
"name_or": "ବେଙ୍ଗାଲୁରୁ",
"name_te": "బెంగళూరు",
"name_kn": "ಬೆಂಗಳೂರು",
"name_ml": "ബെംഗളൂരു",
"name_ta": "பெங்களூரு",
"name_si": "බැංගලෝර්",

2

"name_en": "Lakshmeshwara",
"name_hi": "लक्ष्मेश्वर",
"name_gu": "લક્ષ્મેશ્વર",
"name_pa": "ਲਕ੍ਸ਼੍ਮੇਸ਼੍ਵਰਾ",
"name_bn": "লক্ষ্মীশ্বর",
"name_or": "ଲକ୍ଷମେଶ୍ୱର",
"name_te": "లక్ష్మేశ్వర",
"name_kn": "ಲಕ್ಷ್ಮೇಶ್ವರ",
"name_ml": "ലക്ഷ്മേശ്വര",
"name_ta": "லக்ஷ்மேஸ்வரா",
"name_si": "ලක්ෂ්මේෂ්වර",

3

"name_en": "Mandalay",
"name_my": "မန္တလေးမြို့",

4

"name_en": "Mekong River",
"name_my": "မဲခေါင်မြစ်",
"name_th": "แม่น้ำโขง",
"name_lo": "ແມ່ນ້ຳຂອງ",
"name_km": "ទន្លេមេគង្គ",

5

"name_en": "Samdrup Jongkhar District",
"name_dz": "བསམ་གྲུབ་ལྗོངས་མཁར་རྫོང་ཁག་",

6

"name_en": "Blue Heron Nest Park",
"name_hur": "sməqʷəʔelə həw̓aləm̓ew̓txʷ",

1ec5 · 2024-08-21T18:09:31Z

A live demo is available in osm-americana/openstreetmap-americana#1149. To make sure these changes are heading in the right direction, I solicited some feedback from native language speakers on the OSM Asia Telegram chat and OSM India Telegram chat.

claysmalley · 2024-08-21T20:38:50Z

> I solicited some feedback from native language speakers on the OSM Asia Telegram chat and OSM India Telegram chat.

The Bangladesh, India, Nepal and Thailand subforums might also be good places to ask.

claysmalley · 2024-08-21T23:12:05Z

> I’m of half a mind to replace mapbox-gl-rtl-text’s processBidirectionalText with something homegrown

Who needs mapbox-gl-rtl-text when you've got maplibre-gl-all-the-text? 😉

1ec5 · 2024-08-22T02:25:24Z

I put in a workaround for now, because reimplementing bidirectional text support would be a large enough task for its own proof of concept. The workaround is to replace the zero-width joiner with an arbitrarily chosen strip marker and restore it after bidi processing. I don’t think this workaround will interfere with the mapbox-gl-rtl-text plugin’s Arabic shaping. OSM only has four name:ar=* tags that contain ZWJs, three of them seemingly by mistake and the fourth seemingly in Soranî, which isn’t supported by the plugin anyways.
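The workaround can be sketched as a wrapper around whatever function performs bidi processing. The marker code point and the name withProtectedJoiners are assumptions for this example, not necessarily what the branch uses.

```typescript
const ZWJ = '\u200D';
// Arbitrary private-use code point standing in for the ZWJ. This choice is an
// assumption for the sketch, not necessarily the marker used in the branch.
const STRIP_MARKER = '\uE000';

/**
 * Hides zero-width joiners from `process`, then restores them in its output.
 * `process` stands in for the RTL plugin's bidi step.
 */
function withProtectedJoiners(text: string, process: (input: string) => string): string {
    const shielded = text.split(ZWJ).join(STRIP_MARKER);
    return process(shielded).split(STRIP_MARKER).join(ZWJ);
}

// A stand-in processor that strips ZWJs, as the thread says ICU does.
const stripsJoiners = (input: string): string => input.split(ZWJ).join('');
console.log(withProtectedJoiners('a\u200Db', stripsJoiners) === 'a\u200Db');
```

One caveat of this approach: if the input already contains the marker character, it comes back as a ZWJ, so the marker should be something that never occurs in label text.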

claysmalley · 2024-08-22T22:03:45Z

(Edit: see following comment)

~~If there isn't a way to preserve the shirorekha across gaps between letters, then the text-letter-spacing property will make Bengali, Devanagari and Gurmukhi look clunky:~~

1ec5 · 2024-08-23T11:42:55Z

> If there isn't a way to preserve the shirorekha across gaps between letters, then the text-letter-spacing property will make Bengali, Devanagari and Gurmukhi look clunky:

The real-world convention appears to be that the top bar should be broken up by syllable when introducing letter spacing. However, line-placed labels have gaps even without any letter spacing. #2 demonstrates a partial solution by shifting the baseline. Using the same information, we might even be able to artificially extend the top bar to either side of the glyph to close the gap even when the text is offset. However, the tradeoff is a very noticeable shift between hanging and alphabetic text, which may be undesirable.

The CSS specification describes a few example scenarios in which we would need to special-case the text segmentation differently for the purposes of line breaking, letter spacing, and text rendering. In some cases, we may even need to break apart and rearrange grapheme clusters to avoid a choppy appearance. I’m unsure whether this should be a high priority; it seems like native browser text layout doesn’t necessarily behave correctly either: w3c/iip#87.

Segment strings by grapheme cluster instead of by character when shaping and rendering text. Store glyphs, glyph requests, and glyph positions by grapheme cluster instead of by codepoint string. Added a simple polyfill for older versions of Firefox.

Increased the buffer around locally rendered glyphs.

Removed hard-coded fudge factors based on the baseline in Arial Unicode MS.

1ec5 · 2024-08-27T16:39:01Z

Thanks, that’s good to know about. Unfortunately, unicode-segmenter’s unpacked size is larger than that of mapbox-gl-rtl-text, which GL JS fetches lazily from a CDN due to its size, so I’m not sure the maintainers would be open to including it as a dependency for everyone. In maplibre#4541 (comment), they were open to requiring newer browsers going forward. Specifically, I introduced /…/u literals that work for 97.56% of browser users. (Browsers that don’t support them would fail to load GL JS at all with a syntax error.) However, Intl.Segmenter is much newer: as written, this branch works for only 95.49% of browser users.

For compatibility’s sake, I added a much simpler polyfill that just splits the string by words based on RegExp word boundaries and by “grapheme clusters” based on Unicode character positions. Here’s what it’ll look like in a browser without support for Intl.Segmenter. Obviously it’s far from ideal, but hopefully it’s more usable than the status quo:
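For a sense of what the word side of such a polyfill gives up, here is a minimal sketch based on the RegExp \b assertion. The name splitWordsFallback is hypothetical, and the branch's actual regular expression may differ.

```typescript
/**
 * Approximates word segmentation with the \b assertion, which only
 * recognizes ASCII letters, digits, and underscores as word characters.
 * `splitWordsFallback` is a hypothetical name.
 */
function splitWordsFallback(text: string): string[] {
    return text.split(/\b/);
}

console.log(splitWordsFallback('Hello, world!')); // ['Hello', ', ', 'world', '!']
// Thai has no ASCII word characters, so the whole label stays in one piece.
console.log(splitWordsFallback('\u0E41\u0E21\u0E48\u0E19\u0E49\u0E33\u0E42\u0E02\u0E07').length); // 1
```

In other words, labels in scripts without spaces simply do not wrap under this fallback, which matches the behavior before the word segmenter was introduced.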

ramSeraph · 2024-08-27T17:02:08Z

Makes sense. Word breaks weren’t supported by that polyfill anyway (any attempt at that would have increased its size further). And thanks for pushing this through.

claysmalley · 2024-08-27T17:06:53Z

When the language is set to Burmese, several tiles fail to load because something is undefined. I can't replicate this with any other language.

1ec5 · 2024-08-28T05:33:56Z

This error seems to occur whenever the new line breaking code encounters Zawgyi-encoded text. For example, Pakistan is tagged name:my=ပါကစ္စတန်, which contains န် (U+1014 U+103A). Transforming it to valid Unicode, “ပါကစ်စတနျ”, would resolve the issue, at least for that particular label.
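When diagnosing labels like this one, it can help to list the code points in a suspect cluster. A small debugging helper, hypothetical and not part of the branch, might look like this:

```typescript
/** Formats each code point of `text` as U+XXXX. Hypothetical debugging helper. */
function describeCodePoints(text: string): string[] {
    return Array.from(text, (ch) =>
        'U+' + ch.codePointAt(0)!.toString(16).toUpperCase().padStart(4, '0'));
}

console.log(describeCodePoints('\u1014\u103A')); // ['U+1014', 'U+103A']
```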

The broader issue is that I’ve tweaked the segmentation code to join grapheme clusters on virama characters, whereas the line breaking code’s word segmenter sometimes wants to break up the ligature. It only happens to be more common in Zawgyi-encoded text. This confuses a bit of code that tries to determine the grapheme’s advance based on the section’s formatting options.

I’ve tweaked it to gracefully fall back to the last known section. This means a grapheme after a word wrap might be formatted according to what precedes it. As with Arabic text, this PR essentially removes the ability to style one part of a grapheme cluster differently than another.

claysmalley · 2024-08-28T15:56:02Z

Is name:my=ပါကစ္စတန် really Zawgyi-encoded? That seems like the correct Unicode representation of the name of Pakistan, at least according to Burmese Wikipedia.

@bdon's Burmese Encoding QA tool is a good place to find examples of Zawgyi text on OSM.

1ec5 · 2024-08-28T16:11:10Z

That’s a wonderful tool! I was mistakenly assuming that the Zawgyi-my ICU transform would stabilize if fed Unicode-encoded text and jumping to the conclusion that it was misencoded. Anyhow, Intl.Segmenter consistently interprets an invisible stacker (such as a virama) as both a word boundary and a grapheme cluster boundary, but this PR combines the adjacent grapheme clusters. It’s unclear to me if we should therefore avoid a line break, but at least န် is fixed. I’m still seeing some errors involving ရှ် in ဘင်္ဂလားဒေ့ရှ် that I’ll need to investigate further.

claysmalley · 2024-08-28T16:39:27Z

> I was mistakenly assuming that the Zawgyi-my ICU transform would stabilize if fed Unicode-encoded text and jumping to the conclusion that it was misencoded.

Zawgyi and Unicode use the same codepoints in conflicting ways. The converter will happily let you convert any text from Zawgyi to Unicode as many times as you like, even if it's already Unicode. Automatically detecting the encoding of Burmese text is not a trivial task.

~~ရှ် is likely Zawgyi. There are only a few consonants that the vowel killer mark can appear above, and that's not one of them.~~ I was wrong; this is Unicode. I guess the rules are relaxed for foreign loanwords.

If a grapheme cluster begins with a combining diacritical mark or ends with an invisible stacker, combine it with an adjacent grapheme cluster to avoid drawing diacritics over dotted circles or placeholder diacritics where adjacent characters should be ligated instead.
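The merging rule in this commit description can be sketched as a single pass over the segmenter's output. The stacker list below is a two-entry illustrative subset chosen for this example; the branch generates its data from the Unicode character database instead, and mergeClusters is a hypothetical name.

```typescript
// Illustrative subset of invisible stackers, assumed for this sketch. The
// branch derives its list from the Unicode character database instead.
const INVISIBLE_STACKERS = new Set<number>([
    0x1039, // MYANMAR SIGN VIRAMA
    0x17D2, // KHMER SIGN COENG
]);
const startsWithMark = /^\p{M}/u;

function endsWithStacker(cluster: string): boolean {
    const last = Array.from(cluster).pop();
    return last !== undefined && INVISIBLE_STACKERS.has(last.codePointAt(0)!);
}

/** Joins a cluster to the previous one when either rule applies. */
function mergeClusters(clusters: string[]): string[] {
    const merged: string[] = [];
    for (const cluster of clusters) {
        const previous = merged[merged.length - 1];
        if (previous !== undefined && (startsWithMark.test(cluster) || endsWithStacker(previous))) {
            merged[merged.length - 1] = previous + cluster;
        } else {
            merged.push(cluster);
        }
    }
    return merged;
}

// NA + stacker followed by TA, as in the Burmese name of Mandalay.
console.log(mergeClusters(['\u1014\u1039', '\u1010']).length); // 1
```

A leading combining mark with nothing before it is left alone, since there is no base cluster to attach it to.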

Added a script that fetches the latest Unicode character database’s property file for Indic syllable categories and generates a function for combining graphemes based on it.

Replace zero-width joiners with temporary strip markers to prevent ICU from stripping them.

Preemptively swap combining marks with the characters they modify to visual order, so that the RTL plugin will swap them back to logical order.

Replaced custom word break heuristics when determining line breaks with a word segmenter. Added a simple polyfill for older versions of Firefox.

Fixed an issue where vertical CJK text was shifted upwards by one em.

Iterate over graphemes instead of words, looking for word boundaries to use as line breaking opportunities. This eliminates the possibility of word-wrapping in the middle of a grapheme cluster, which is valid in some writing systems such as Thai, but mitigates the risk of an invalid section index in Burmese, because the word segmenter considers some modifiers to be “words”.

1ec5 · 2024-09-07T18:07:01Z

Upon closer inspection, the errors were actually caused by a mismatch between the word segmenter, the built-in grapheme cluster segmenter, and the modified grapheme cluster segmenter as to င်္ and း in the same word. Since the section indices are tightly coupled to grapheme clusters, I’ve rewritten the line breaking code to iterate over grapheme clusters, looking for word boundaries, instead of the other way around. In theory, this may eliminate some valid line breaking opportunities in Thai and Khmer that split grapheme clusters, but optimal line breaking isn’t as critical as avoiding exceptions in text rendering.
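The inverted loop might look roughly like the following sketch, which walks grapheme clusters and consults the word segmenter only at cluster ends. It uses the built-in grapheme segmenter rather than the modified one in this branch, and the names Cluster and clustersWithBreaks are hypothetical.

```typescript
interface Cluster {
    text: string;
    /** True if a line may break immediately after this cluster. */
    breakAfter: boolean;
}

function clustersWithBreaks(text: string, locale?: string): Cluster[] {
    const words = new Intl.Segmenter(locale, {granularity: 'word'});
    const graphemes = new Intl.Segmenter(locale, {granularity: 'grapheme'});

    // UTF-16 offsets at which the word segmenter starts a new segment.
    const wordStarts = new Set<number>();
    for (const {index} of words.segment(text)) wordStarts.add(index);

    const clusters: Cluster[] = [];
    for (const {segment, index} of graphemes.segment(text)) {
        const end = index + segment.length;
        // Word boundaries that fall inside a cluster are never consulted,
        // so a break can only land between clusters.
        clusters.push({text: segment, breakAfter: end < text.length && wordStarts.has(end)});
    }
    return clusters;
}

console.log(clustersWithBreaks('ab cd').map((c) => c.breakAfter));
```

Because every break candidate coincides with a cluster end, any per-cluster bookkeeping such as section indices stays aligned with the clusters that are actually drawn.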

1ec5 self-assigned this Aug 19, 2024

1ec5 mentioned this pull request Aug 19, 2024

Unicode ligatures and combining characters not displaying properly osm-americana/openstreetmap-americana#827

Open

1ec5 force-pushed the complex-text-50 branch 3 times, most recently from 646d279 to 2c4bdb9 Compare August 20, 2024 01:18

This was referenced Aug 21, 2024

Render non-BMP CJKV characters locally maplibre/maplibre-gl-js#4550

Open

Preview of MapLibre text rendering overhaul osm-americana/openstreetmap-americana#1149

Draft

1ec5 force-pushed the complex-text-50 branch from 711c33e to d79ddf2 Compare August 21, 2024 17:22

1ec5 force-pushed the astral-cjk branch from 4a7b35a to 15627fa Compare August 21, 2024 17:40

1ec5 force-pushed the complex-text-50 branch from d79ddf2 to f87dc49 Compare August 21, 2024 17:46

1ec5 force-pushed the astral-cjk branch from 15627fa to acdb77d Compare August 21, 2024 17:47

1ec5 force-pushed the complex-text-50 branch from f87dc49 to 6b5c031 Compare August 21, 2024 17:48


1ec5 mentioned this pull request Aug 22, 2024

Line height too tall? osm-americana/fontstack66#4

Open

1ec5 force-pushed the complex-text-50 branch 2 times, most recently from 91e523c to be2d95e Compare August 22, 2024 03:30

1ec5 mentioned this pull request Aug 23, 2024

Vary grapheme baselines by script #2

Draft

1ec5 force-pushed the astral-cjk branch 2 times, most recently from 40ee424 to abf39ed Compare August 23, 2024 12:21

1ec5 force-pushed the complex-text-50 branch from 1eae64b to 11fc5b5 Compare August 23, 2024 12:28

1ec5 added 7 commits August 27, 2024 09:31

Render all text locally

c5da043

Render italic, oblique fonts locally

b451fd6

Fixed clipping of wide glyphs

a779668

Increased the buffer around locally rendered glyphs.

Use local baseline metric

f178331

Removed hard-coded fudge factors based on the baseline in Arial Unicode MS.

Center grapheme cluster within bounding box

967ded7

Collapse control characters

e53e7a0

1ec5 force-pushed the complex-text-50 branch from 0a42e34 to 29de7cc Compare August 27, 2024 16:32

1ec5 force-pushed the complex-text-50 branch from ebff14d to abdb9c4 Compare August 28, 2024 16:33

1ec5 added 8 commits September 7, 2024 00:26

Generate Unicode character property data at build time

9600223

Added a script that fetches the latest Unicode character database’s property file for Indic syllable categories and generates a function for combining graphemes based on it.

Join grapheme clusters on zero-width joiners

23d7dc6

Fixed ligatures in Sinhala

79f9fe2

Replace zero-width joiners with temporary strip markers to prevent ICU from stripping them.

Streamlined RTL text detection

3cb401c

Streamlined property escape regular expressions

2590574

Fixed combining right-to-left characters

7c49df9

Preemptively swap combining marks with the characters they modify to visual order, so that the RTL plugin will swap them back to logical order.

Streamlined letter spacing check

e4177a3

1ec5 force-pushed the complex-text-50 branch from abdb9c4 to 979a434 Compare September 7, 2024 07:46

1ec5 added 3 commits September 7, 2024 10:57

Break lines based on word segmentation

872625f

Replaced custom word break heuristics when determining line breaks with a word segmenter. Added a simple polyfill for older versions of Firefox.

Fixed vertical text advance

d5597ac

Fixed an issue where vertical CJK text was shifted upwards by one em.

1ec5 force-pushed the complex-text-50 branch from 979a434 to b3b7359 Compare September 7, 2024 17:59
