-
-
Notifications
You must be signed in to change notification settings - Fork 707
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Render non-BMP CJKV characters locally #4550
base: main
Are you sure you want to change the base?
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #4550 +/- ##
==========================================
+ Coverage 87.74% 88.00% +0.26%
==========================================
Files 247 247
Lines 33494 33600 +106
Branches 2414 2430 +16
==========================================
+ Hits 29390 29571 +181
+ Misses 3118 3026 -92
- Partials 986 1003 +17 ☔ View full report in Codecov by Sentry. |
@@ -21,16 +21,18 @@ | |||
"properties": { | |||
"name_en": "abcde", | |||
"name_ja": "あいうえお", | |||
"name_zh": "阿衣乌唉哦", | |||
"name_ko": "아이우" | |||
"name_zh": "阿衣乌唉哦𪚥", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the only render test that cover these changes? Do we need other tests?
} | ||
|
||
prevChar = {value: char.value, premature: false}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand the permutate
flag here.
This seems like a change of logic from the original method, so I would consider adding some unit tests here to make sure we get the right results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
prevChar
is analogous to the old prevCharCode
, but it parallels what the iterators return. This is using premature
to flag a not-yet-set state, similar to done
in the iterators’ return values. I agree that some more vertical layout tests are needed to exercise these changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think unit tests can cover the logic in this method, and a render test or two can help here as well to keep parity with native and make sure this is working as expected.
You can download the html report from the CI and commit the image that you download from the CI so that the CI will pass, assuming the image is correct. |
test/integration/render/tests/text-local-ideographs/cjk/expected-mac.png
Outdated
Show resolved
Hide resolved
CC: @louwers there are expected changes in parity here maybe, just FYI. |
@@ -75,7 +75,7 @@ export type TextJustify = 'left' | 'center' | 'right'; | |||
const PUAbegin = 0xE000; | |||
const PUAend = 0xF8FF; | |||
|
|||
class SectionOptions { | |||
export class SectionOptions { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need to export those classes? Can we test the content of this file without exporting those?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new tests need to refer to these classes. Would you suggest a helper function that encapsulates the creation of a TaggedString
, but only for testing purposes?
maplibre-gl-js/src/symbol/shaping.test.ts
Lines 17 to 19 in 249b360
const tagged = new TaggedString(); | |
tagged.text = ' \t \v '; | |
tagged.sections = [new SectionOptions()]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, as far as I understand (which is not a lot), these classes were used inside this file by other exported classes or methods.
If this is the case, I think it might be better to test the exported classes/methods in the unit test in a flow that reaches these non exported classes.
If this is not possible, I would then prefer to have these classes moved to their own file and have unit test to these new files separately. In general, I like it better when I have a single file for a single class, but that might just be me...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we’ll have to split the classes into separate files. The other exported functions in this file do so much that it would be difficult to isolate the most important behavior that these tests are intended to exercise.
BTW, is this a breaking change in terms of behavior of the map/text/fonts? |
No |
'Supplementary Private Use Area-A': (char) => char >= 0xF0000 && char <= 0xFFFFF, | ||
'Supplementary Private Use Area-B': (char) => char >= 0x100000 && char <= 0x10FFFF |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wipfli, if I understand this article correctly, your solution for Devanagari text shaping uses Private Use Area codepoints but has to steer clear of image placeholders that are also assigned to this block. With the changes in this PR, you could probably use the much more spacious Supplementary Private Use Areas instead. Most CJKV fonts informally use these blocks for characters that haven’t been officially encoded yet, but I wouldn’t expect vector tiles or GeoJSON to take advantage of these characters due to interoperability issues. You’d still be constrained by SymbolBucket.MAX_GLYPHS
, but at least you could pick a different plane without worrying about overlapping with unrelated usage before you reach that limit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's correct, your pull request would be very useful for positioned glyph fonts!
808f90c
to
a440cb6
Compare
This comment was marked as resolved.
This comment was marked as resolved.
I've fixed it in the global branch, it's not that the size had tripled but only that you added 2k, which is fine, just increase the expected size so that the test will pass... |
Just to double-check, the MVT spec does support UTF-16? |
Depends what you mean by “support”, but what we’re doing here is compatible with the MVT specification. As of MVT v2.1, a feature property value is a Although this is an improvement, it isn’t quite perfect: some characters and emoji require variation selectors and other control characters, so that what the user perceives as a grapheme consists of several UTF-16 characters. This PR doesn’t address that issue, because GL JS still pervasively assumes that one codepoint equals one glyph. If someday we eliminate that assumption, we can write custom code or use a Unicode set regular expression to match those sequences. (Unicode sets are very new, introduced to Firefox about a year ago.) |
src/render/glyph_manager.ts
Outdated
entry.ranges[range] = true; | ||
return {stack, id, glyph: null}; | ||
} else { | ||
throw new Error('glyphs > 65535 not supported'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed this error at @wipfli’s request in #4550 (comment), but unless we solve #4563, this will be a breaking change and a difficult one for developers to cope with.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tricky situation indeed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#4564 provides a fallback to local glyph rendering, so it won’t be a breaking change after all. The behavior is essentially what I floated in #4550 (comment), except that server-side glyphs take precedence over client-side glyphs for non-CJKV codepoints.
if (isChar['CJK Compatibility'](char)) return true; | ||
if (isChar['CJK Compatibility Forms'](char)) return true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This still looks like we are repeating the process of uncommenting a line in unicodeBlockLookup
and then adding it as another "if" row here.
Is it possible to change this in one place instead of two?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean that we should inline the lookup functions wherever we use them instead of keeping them in a separate lookup table? We might as well eliminate that file in favor of inline regular expressions like this:
if (/* CJK Compatibility Forms */ /[\uFE30-\uFE4F]/.test(String.fromCodePoint(char))) return true;
I think the separate lookup table was created originally out of concern that these character classes would be less readable or more error-prone. I don’t have a strong preference, but then again, I have a higher tolerance for regular expressions than most developers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not about regular expression (I don't like them much, they are not readable when things get complicated), but more about using the same hard coded string in two places.
I don't mind keeping the original lookup table, but then why do we pass the same hardcoded string here again? Wht not use something like Object.keys(unicodeBlockLookup)...
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
None of the functions that use unicodeBlockLookup
should ever iterate over all the keys. Each function only selectively accesses the lookup functions relevant to the task at hand. As the table in #4560 (comment) illustrates, it’s a big Venn diagram; every function cares about a slightly different set of Unicode blocks. I tried to unify the functions somewhat, but inherently we need to list specific blocks in each function in order to have the correct behavior.
These strings are the official Unicode block names, not something we invented here, and the ISO standard basically guarantees that they won’t change. If the problem is that these strings add to the bundle size, we could CamelCase them (as in maplibre-native) and use the property syntax so that maybe the property names will get minified at compile time:
if (isChar.CJKCompatibilityForms(char)) return true;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I don't care about bundle size in this respect.
What I would like to achieve is the need to change in multiple places.
i.e. add all the needed information to the lookup table so that uncommenting a line and setting the right properties would be the only change need to support another script.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to be clear, the only reason we’d ever modify this file is that the Unicode standard itself has changed (to add another script). If this PR works as designed, then we’d have comprehensive support for the latest Unicode standard, and we shouldn’t have to manually add new scripts in general, though that is sort of what’s going on in charInSupportedScript
, which I haven’t touched.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the above suggestion is better in terms that in encapsulates all the relevant properties to a single place and later on you won't need to worry about them, hopefully, or you could create another method to have a different "query" on this object.
You can also define the type of every element so that it adheres to the relevant model.
I find it more maintainable, but it might just be me, IDK...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven’t abandoned this PR. 🙂 In 1ec5#1, I’m prototyping, among other things, code generation for anything we’re doing based on the Unicode Character Database that isn’t directly supported by ECMAScript regular expressions. It has the potential to reduce maintenance overhead for script detection, though I suspect we’ll still need to roll a little bit of our own code for some remaining typographical functionality that isn’t expressed in the Unicode standard. Once the branch is in better shape (testing, performance), I’ll pick it apart and post formal PRs to this repository.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The bespoke build script I wrote in 1ec5/maplibre-gl-js@be6d7fe is simple and effective for this purpose, but I wonder if you’d prefer to pull in @unicode/unicode-15.1.0
via regenerate as a devDependency instead. It’s a big package to pull in at install time, but unlike my script, it won’t require fetching anything from the network. Either way, it’ll have negligible impact on the build, just a big regular expression that we fortunately wouldn’t have to maintain by hand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dev dependency is better I think.
Make sure to add the generated file to git ignore and add a .g.ts file extension to the generated file to indicate that it's generated.
If this PR lands before maplibre/maplibre-style-spec#778 is fixed, there will be regressions in some styles even for some features that only involve BMP characters. |
Style spec has a version of its own, so fixing it there first will be the right approach, after that we can merge this PR only with the upgrade of style-spec to the required version. Bottom line, there's no problem syncing it properly, we just need to make sure it is solved and applied correctly. |
Generalized glyph management, shaping code to operate on UTF-16 characters instead of code units, no longer splitting surrogate pairs. Enabled local glyph rendering for additional ideographic code blocks in the Supplementary Ideographic Plane and Tertiary Ideographic Plane for improved Chinese, Japanese, Korean, and Vietnamese character coverage.
Enabled local glyph rendering in some additional CJKV code blocks, as well as in Tangut, which is thrash-prone like CJKV. Disabled letter spacing for more cursive scripts.
Removed the error that artificially prevented the glyph manager from requesting glyph PBFs for ranges beyond U+FFFF.
acdb77d
to
40ee424
Compare
@1ec5 version 20.3.1 of the style spec was merged to this repo. |
The risk I mentioned before was only if we landed this PR before maplibre/maplibre-style-spec#779, but the other way around should be safe. I don’t think anything will break if you release GL JS with just the upgrade to the style-spec dependency, so there’s no need to revert. This PR needs some additional work on render tests, plus the dependency change in #4550 (comment) that I’ll be staging shortly in a separate PR. |
Thanks for the info! I've published a new version - 4.7. |
Generalized glyph management and shaping code to operate on UTF-16 characters instead of code units, no longer splitting surrogate pairs. Enabled local glyph rendering for additional ideographic code blocks in the Supplementary Ideographic Plane (SIP) and Tertiary Ideographic Plane (TIP) for improved Chinese, Japanese, Korean, and Vietnamese (CJKV) character coverage.
Enabled local glyph rendering in some additional CJKV code blocks in the SIP and TIP, as well as in Tangut, which is thrash-prone like CJKV. Disabled letter spacing for more cursive scripts.
This is a step towards #2307, which tracks support for characters outside the Basic Multilingual Plane more generally, not just for CJKV. This PR depends on #4560.
Demonstration
那🌫️市镇
那𧯄市镇
丐🌫️市镇
丐𦨭市镇
拜🌫️社
拜𩡋社
Launch Checklist
CHANGELOG.md
under the## main
section.