Render complex text, variant forms, emoji, etc. #1
To see the demonstration screenshotted above:
646d279
to
2c4bdb9
Compare
I had noticed some segmenting issues in Thai as a result of 304acf5, which led me on a quest to develop a better test suite for multilingual character segmentation. The big thing with Indic scripts is conjunct consonants (i.e. ligatures) and combining vowel characters. Each script does this a little differently, but there are a few commonalities here and there:
Here are a few test cases that I think are good indicators for poorly aligned combining characters and ligatures. Especially effective when
A live demo is available in osm-americana/openstreetmap-americana#1149. To make sure these changes are heading in the right direction, I solicited some feedback from native language speakers on the OSM Asia Telegram chat and OSM India Telegram chat.
The Bangladesh, India, Nepal and Thailand subforums might also be good places to ask.
Who needs mapbox-gl-rtl-text when you've got maplibre-gl-all-the-text? 😉
I put in a workaround for now, because reimplementing bidirectional text support would be a large enough task for its own proof of concept. The workaround is to replace the zero-width joiner with an arbitrarily chosen strip marker and restore it after bidi processing. I don’t think this workaround will interfere with the mapbox-gl-rtl-text plugin’s Arabic shaping. OSM only has four
91e523c
to
be2d95e
Compare
The real-world convention appears to be that the top bar should be broken up by syllable when introducing letter spacing. However, line-placed labels have gaps even without any letter spacing. #2 demonstrates a partial solution by shifting the baseline. Using the same information, we might even be able to artificially extend the top bar to either side of the glyph to close the gap even when the text is offset. However, the tradeoff is a very noticeable shift between hanging and alphabetic text, which may be undesirable. The CSS specification describes a few example scenarios in which we would need to special-case the text segmentation differently for the purposes of line breaking, letter spacing, and text rendering. In some cases, we may even need to break apart and rearrange grapheme clusters to avoid a choppy appearance. I’m unsure whether this should be a high priority; it seems like native browser text layout doesn’t necessarily behave correctly either: w3c/iip#87.
40ee424
to
abf39ed
Compare
Segment strings by grapheme cluster instead of by character when shaping and rendering text. Store glyphs, glyph requests, and glyph positions by grapheme cluster instead of by codepoint string. Added a simple polyfill for older versions of Firefox.
Increased the buffer around locally rendered glyphs.
Removed hard-coded fudge factors based on the baseline in Arial Unicode MS.
Thanks, that’s good to know about. Unfortunately, unicode-segmenter’s unpacked size is larger than mapbox-gl-rtl-text, which GL JS fetches from a CDN lazily due to its size, so I’m not sure the maintainers would be open to including it as a dependency for everyone. In maplibre#4541 (comment), they were open to requiring newer browsers going forward. Specifically, I introduced For compatibility’s sake, I added a much simpler polyfill that just splits the string by words based on
Makes sense. Word breaks were not supported by that polyfill anyway (any attempt at that would have increased the size of the polyfill further). And thanks for pushing this through.
This error seems to occur whenever the new line breaking code encounters Zawgyi-encoded text. For example, Pakistan is tagged The broader issue is that I’ve tweaked the segmentation code to join grapheme clusters on virama characters, whereas the line breaking code’s word segmenter sometimes wants to break up the ligature. It only happens to be more common in Zawgyi-encoded text. This confuses a bit of code that tries to determine the grapheme’s advance based on the section’s formatting options. I’ve tweaked it to gracefully fall back to the last known section. This means a grapheme after a word wrap might be formatted according to what precedes it. As with Arabic text, this PR essentially removes the ability to style one part of a grapheme cluster differently than another.
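As a rough illustration of what joining grapheme clusters on virama characters could look like, here is a minimal TypeScript sketch. The function name `mergeClusters` and the hand-picked list of four stacker characters are assumptions made for this example only; the branch generates its own list from the Unicode character database rather than hard-coding one.

```typescript
// Hypothetical, hand-picked subset of viramas and invisible stackers:
// Devanagari (U+094D), Bengali (U+09CD), Myanmar (U+1039), Khmer (U+17D2).
// A real implementation would derive this set from Unicode data.
const ENDS_WITH_STACKER = /[\u094D\u09CD\u1039\u17D2]$/u;
// Any cluster that begins with a combining mark (General_Category M*).
const STARTS_WITH_MARK = /^\p{M}/u;

// Merge a cluster into the previous one when the previous cluster ends with
// a stacker or when the current cluster begins with a combining mark.
function mergeClusters(clusters: string[]): string[] {
    const merged: string[] = [];
    let joinNext = false;
    for (const cluster of clusters) {
        if (merged.length > 0 && (joinNext || STARTS_WITH_MARK.test(cluster))) {
            merged[merged.length - 1] += cluster;
        } else {
            merged.push(cluster);
        }
        joinNext = ENDS_WITH_STACKER.test(cluster);
    }
    return merged;
}
```

One consequence of merging this way is that a merged cluster can span what a word segmenter considers a boundary, which is the kind of mismatch described above.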
@bdon's Burmese Encoding QA tool is a good place to find examples of Zawgyi text on OSM.
That’s a wonderful tool! I was mistakenly assuming that the Zawgyi-my ICU transform would stabilize if fed Unicode-encoded text and jumping to the conclusion that it was misencoded. Anyhow,
Zawgyi and Unicode use the same codepoints in conflicting ways. The converter will happily let you convert any text Zawgyi to Unicode as many times as you like, even if it's already Unicode. Automatically detecting the encoding of Burmese text is not a trivial task.
If a grapheme cluster begins with a combining diacritical mark or ends with an invisible stacker, combine it with an adjacent grapheme cluster to avoid drawing diacritics over dotted circles or placeholder diacritics where adjacent characters should be ligated instead.
Added a script that fetches the latest Unicode character database’s property file for Indic syllable categories and generates a function for combining graphemes based on it.
Replace zero-width joiners with temporary strip markers to prevent ICU from stripping them.
Preemptively swap combining marks with the characters they modify to visual order, so that the RTL plugin will swap them back to logical order.
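The zero-width joiner workaround in the commit message above can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper name `withProtectedJoiners` is hypothetical, and the private-use code point U+E000 is an arbitrary stand-in, since the marker actually chosen in the branch is not stated here. The `process` callback stands for whatever bidi step would otherwise strip the joiners.

```typescript
const ZWJ = '\u200D';
// Arbitrary private-use code point chosen for this sketch (an assumption).
// Input that already contains U+E000 would be altered on the way back.
const MARKER = '\uE000';

// Swap joiners for the marker, run the processing step, then swap them back.
function withProtectedJoiners(text: string, process: (s: string) => string): string {
    const protectedText = text.split(ZWJ).join(MARKER);
    return process(protectedText).split(MARKER).join(ZWJ);
}
```

For example, passing a callback that deletes every U+200D still returns the input with its joiners intact, because the callback never sees them.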
abdb9c4
to
979a434
Compare
Replaced custom word break heuristics when determining line breaks with a word segmenter. Added a simple polyfill for older versions of Firefox.
Fixed an issue where vertical CJK text was shifted upwards by one em.
Iterate over graphemes instead of words, looking for word boundaries to use as line breaking opportunities. This eliminates the possibility of word-wrapping in the middle of a grapheme cluster, which is valid in some writing systems such as Thai, but mitigates the risk of an invalid section index in Burmese, because the word segmenter considers some modifiers to be “words”.
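A minimal sketch of that approach, assuming the stock Intl.Segmenter for both granularities (the branch uses a modified grapheme cluster segmenter, which is not reproduced here). The function name `lineBreakOpportunities` is hypothetical.

```typescript
// Return the UTF-16 offsets at which a line break could be inserted:
// positions that start a grapheme cluster and also start a word segment.
function lineBreakOpportunities(text: string, locale?: string): number[] {
    const wordStarts = new Set<number>();
    const words = new Intl.Segmenter(locale, {granularity: 'word'});
    for (const word of words.segment(text)) {
        wordStarts.add(word.index);
    }

    const breaks: number[] = [];
    const graphemes = new Intl.Segmenter(locale, {granularity: 'grapheme'});
    for (const grapheme of graphemes.segment(text)) {
        // A word boundary that falls inside a grapheme cluster never matches
        // a grapheme's start index, so it is silently skipped.
        if (grapheme.index > 0 && wordStarts.has(grapheme.index)) {
            breaks.push(grapheme.index);
        }
    }
    return breaks;
}
```

Because only grapheme start offsets are ever emitted, every returned offset maps to a valid grapheme index, whatever the word segmenter decides to treat as a word.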
979a434
to
b3b7359
Compare
Upon closer inspection, the errors were actually caused by a mismatch between the word segmenter, the built-in grapheme cluster segmenter, and the modified grapheme cluster segmenter as to
This branch adds experimental support for rendering text in Indic and other complex scripts, variant character forms, and combining diacritics. As a side effect, some emoji sequences now appear as single glyphs, though only as silhouettes.
Text segmentation
Currently, text-field strings are segmented by UTF-16 code units. maplibre#4550 refactors various text processing classes to segment by full UTF-16 characters instead, expanding text rendering support to the rest of the Unicode character repertoire. However, in many common situations, a single Unicode character in isolation cannot adequately represent a grapheme cluster that the user perceives as a single glyph. This branch segments strings by grapheme cluster for rendering purposes. Additionally, it refactors the glyph atlas to index glyph data by grapheme cluster strings, whereas currently it is indexed by codepoints. (Actually, the codepoints are converted to numbers and then stringified for maximum inefficiency, apparently for consistency with the glyph PBF format.)
Segmenting strings by grapheme cluster requires the Intl.Segmenter API, which is very new. Firefox was the last major holdout, adding support for this API only a few months ago. For older browsers, it will be necessary to fall back to segmenting by Unicode character. Grapheme cluster segmentation is not a panacea: maplibre/maplibre-native#778 (reply in thread) discusses some limitations around cursive scripts. Unless a workaround can be found, the mapbox-gl-rtl-text plugin will probably continue to be required for Arabic typesetting.
The segmenter understands emoji sequences, including sequences that include zero-width joiners. However, it is only possible to render the emoji’s silhouette for the time being, because the glyph atlas only stores the alpha channel of the glyph image. Storing each of the color channels would enable the shader to draw color emoji, as demonstrated in 1ec5/tiny-sdf#1, but it would need to be limited to detected emoji sequences to avoid largely frivolous overhead.
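To make the segmentation and its fallback concrete, here is a minimal sketch of splitting a string by grapheme cluster with Intl.Segmenter and falling back to Unicode characters where the API is missing. The helper name `splitGraphemes` is hypothetical and this is not the code in the branch.

```typescript
// Split text into grapheme clusters when Intl.Segmenter is available;
// otherwise fall back to splitting by Unicode character (code point).
function splitGraphemes(text: string, locale?: string): string[] {
    if (typeof Intl !== 'undefined' && 'Segmenter' in Intl) {
        const segmenter = new Intl.Segmenter(locale, {granularity: 'grapheme'});
        return Array.from(segmenter.segment(text), (s) => s.segment);
    }
    // The string iterator yields whole code points, so surrogate pairs stay
    // intact, but combining marks and joiner sequences are split apart.
    return Array.from(text);
}
```

With the segmenter available, a base letter followed by a combining accent, or an emoji sequence joined by zero-width joiners, comes back as a single element. The fallback path would return each code point separately.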
The custom word breaking heuristics for determining line breaking opportunities have been replaced by a word segmenter. This introduces word wrapping for the first time to writing systems such as Thai and Khmer that don’t put spaces between words. It also keeps Hanzi/hanja/kanji compound words together based on the browser’s built-in compound dictionary. This obviates the server-side workaround that Mapbox introduced in mapbox/mapbox-gl-js#8255 (before the fork). If a tileset has inserted zero-width spaces between compounds as hints, as the Mapbox Streets source does, the word segmenter will continue to honor those hints as a matter of course, but this functionality now comes “for free” without any developer intervention.
Local text rendering
As TinySDF is my hammer, everything looks like a nail. This branch completely disables the glyph PBF pipeline for remotely rendered glyphs in favor of rendering every grapheme cluster locally through TinySDF, making maplibre#4564 unnecessary. The changes obviate much of the original reason that Mapbox created a Fontstack API and defined the glyph PBF format.
The expanded use of TinySDF creates a need for more granular control over font selection beyond the single font specifier for local “ideographic” text. The developer can already set the font specifier to the name of a Web font defined in the surrounding webpage’s stylesheet. However, this option is currently global; the font choice should come instead from text-font, which is no longer used for remote glyph rendering. An event handler will need to be added to call Map.prototype.redraw once any Web fonts are done loading.
Vertical alignment has often been cited as a downside of TinySDF, but it’s actually the glyph PBFs that are to blame. Glyph PBFs don’t encode enough glyph metrics to reliably align glyphs to a common baseline, so there were hard-coded fudge factors in a few different places in the codebase to vertically shift locally rendered glyphs to match remotely rendered glyphs. These fudge factors assume the metrics of Arial Unicode MS, which is the default text-font fallback but is also an outlier for its line height, even among pan-Unicode fonts. With the removal of the glyph PBF mechanism, it becomes possible to remove these fudge factors.
In theory, it should be possible to use TinySDF only for grapheme clusters that can’t be represented by the glyph PBF format. However, that would yield extremely inconsistent visual results, because most scripts that contain nontrivial grapheme clusters also have plenty of unclustered graphemes sprinkled throughout in ordinary text. If backwards compatibility with older styles is a concern, I believe it would be better to render grapheme clusters locally in general and at most render Latin, Cyrillic, and Greek from glyph PBFs as an exception.
The glyph PBF mechanism has been primarily of use to Western languages. Most published fontstacks consist of one or two specially chosen Western fonts combined with a crude, pan-Unicode font as an afterthought to serve as a fallback for the rest of the world’s languages. While the glyph PBF format has the advantage of not requiring embedding and redistribution rights from the font designer, the most popular fonts for non-Western languages are generally open-source fonts that can be served up as Web fonts and rendered through TinySDF without any legal obstacles.
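The redraw-on-font-load handler mentioned earlier in this section might look something like the sketch below. It is an assumption about how such a handler could be wired up, not code from the branch: the `Redrawable` and `FontSource` interfaces and the `redrawWhenFontsLoad` name are hypothetical, chosen so that the sketch does not depend on a browser or on the map class itself.

```typescript
// Minimal shapes for the two collaborators (hypothetical, for illustration).
// In a browser, a map instance and document.fonts would satisfy them.
interface Redrawable {
    redraw(): unknown;
}
interface FontSource {
    ready: Promise<unknown>;
}

// Redraw the map once the page's Web fonts have finished loading, so that
// locally rendered glyphs pick up the intended font.
function redrawWhenFontsLoad(map: Redrawable, fonts: FontSource): Promise<void> {
    return fonts.ready.then(() => {
        map.redraw();
    });
}
```

In a webpage this would be called as `redrawWhenFontsLoad(map, document.fonts)`, relying on the `ready` promise of the CSS Font Loading API.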
Prior art
maplibre#2458 similarly relies on TinySDF for all text. However, it requires the tileset or GeoJSON data to include manually placed control characters to mark grapheme cluster boundaries. I think any required server-side hinting would significantly limit the deployment of complex text rendering to end users, as Mapbox discovered early on: mapbox/DEPRECATED-mapbox-gl#4 (comment). Even though Intl.Segmenter still falls short of the gold standard in Harfbuzz/Raqm/FriBiDi, it comes with such low overhead that GL JS might as well take advantage of it rather than allow Indic text to continue to get mangled.
Odds and ends
This branch also includes some miscellaneous fixes for things I spotted along the way. Some types were misspelled. Some unit tests relied on outdated fixtures that cast incompatible data to the expected data type; TypeScript only started flagging it once I modified the types just a little more.
These changes would fix maplibre#50 and maplibre#2384. I’m posting this draft in my own fork for now while I consider how to stage these changes in more manageable chunks and discuss the backwards compatibility issues with the MapLibre maintainers.
Additionally, the following proofs of concept are based on this one: