Adopt GB18030-2022 #336

Merged
merged 3 commits into from
Oct 4, 2024

Conversation

annevk
Member

@annevk annevk commented Sep 18, 2024

This implements the Unicode Technical Committee recommendation around GB18030-2022 in a manner suitable for this standard, taking into account existing practice and the closeness between GBK and gb18030.

In particular, using the text file attached to https://www.unicode.org/L2/L2023/23003r-gb18030-recommendations.pdf this does the following:

  1. Merges the first set of 18 mappings, which are bidirectional, directly into index gb18030, replacing existing PUA entries. This ends up impacting GBK and gb18030.
  2. The second set of 18 mappings (from PUA to bytes) is encoded as an encoder-only table, for both GBK and gb18030.
  3. The third set of 18 mappings (from bytes to code points) is ignored, as it is already covered by index gb18030 ranges. (Presumably these mappings are included because the recommendation covers the transition from "Previous Mappings" to "Current Mappings" to "Recommended Mappings", whereas we are going directly from "Previous Mappings" to "Recommended Mappings".)
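
A minimal sketch of how the first two sets interact, assuming toy tables with one entry each. The single pair used here (bytes A6 D9, U+FE10, and the former PUA code point U+E78D) is believed to match the recommendation but should be checked against the published index; the function names are hypothetical and not taken from any implementation.

```python
from typing import Optional

# Toy tables with one entry each (assumed values, for illustration only).
# The real index is far larger and the real encoder-only table has 18 rows.
INDEX = {7182: 0xFE10}                        # pointer -> code point, used in both directions
ENCODER_ONLY = {0xE78D: bytes([0xA6, 0xD9])}  # former PUA code point -> bytes, encoder only


def bytes_to_pointer(lead: int, trail: int) -> int:
    """Two-byte pointer arithmetic on the decoder side."""
    offset = 0x40 if trail < 0x7F else 0x41
    return (lead - 0x81) * 190 + (trail - offset)


def pointer_to_bytes(pointer: int) -> bytes:
    """Inverse of bytes_to_pointer, used on the encoder side."""
    lead, trail = divmod(pointer, 190)
    offset = 0x40 if trail < 0x3F else 0x41
    return bytes([lead + 0x81, trail + offset])


def decode_pair(lead: int, trail: int) -> Optional[int]:
    """Decode two bytes through the shared index; None if unmapped."""
    return INDEX.get(bytes_to_pointer(lead, trail))


def encode_code_point(cp: int) -> Optional[bytes]:
    """Encode one code point; the encoder-only table is consulted first."""
    if cp in ENCODER_ONLY:
        return ENCODER_ONLY[cp]
    for pointer, value in INDEX.items():
        if value == cp:
            return pointer_to_bytes(pointer)
    return None
```

With these tables, both U+FE10 and the former PUA code point U+E78D encode to the same two bytes, but those bytes decode only to U+FE10. That asymmetry is what the encoder-only table is for: content that still uses the old PUA code points keeps encoding as before.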

GBK is changed as well because Chromium and WebKit already ship code that impacts GBK to some degree, and no fallout has been recorded. (At the moment they exclude the encoder-only table for GBK only; including it would make the most sense compatibility-wise.) Additionally, GBK is already positioned as a rough subset of gb18030 in this standard, with the decoder being shared completely.
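
To illustrate the "rough subset" relationship, here is a simplified sketch of a shared encoder with an `is_gbk` flag, loosely following the shape of the standard's gb18030 encoder as understood here. The tables are toy stand-ins (one index entry and only the first row of the ranges table, both assumptions), the encoder-only table and other special cases are omitted, and all names are hypothetical.

```python
from typing import Optional

# Toy stand-ins (assumptions): one two-byte index entry and only the first
# ranges row, so four-byte results are meaningful only near U+0080.
TWO_BYTE_INDEX = {7182: 0xFE10}   # pointer -> code point
RANGES = [(0, 0x0080)]            # (pointer offset, first code point of the range)


def ranges_pointer(cp: int) -> int:
    """Pointer for a code point that is only reachable via the ranges table."""
    pointer_offset, first = max((r for r in RANGES if r[1] <= cp), key=lambda r: r[1])
    return pointer_offset + cp - first


def four_bytes(pointer: int) -> bytes:
    """Four-byte form: the bytes have 126, 10, 126 and 10 possible values."""
    b1, rest = divmod(pointer, 10 * 126 * 10)
    b2, rest = divmod(rest, 10 * 126)
    b3, b4 = divmod(rest, 10)
    return bytes([b1 + 0x81, b2 + 0x30, b3 + 0x81, b4 + 0x30])


def encode(cp: int, is_gbk: bool) -> Optional[bytes]:
    """Encode one code point; None stands in for an encoder error."""
    if cp < 0x80:
        return bytes([cp])
    if is_gbk and cp == 0x20AC:
        return b"\x80"            # GBK encodes the euro sign as the single byte 0x80
    for pointer, value in TWO_BYTE_INDEX.items():
        if value == cp:
            lead, trail = divmod(pointer, 190)
            return bytes([lead + 0x81, trail + (0x40 if trail < 0x3F else 0x41)])
    if is_gbk:
        return None               # GBK has no four-byte form
    return four_bytes(ranges_pointer(cp))
```

The two-byte path is identical for both encodings, which is why an index change made for gb18030 reaches GBK too. The encoders diverge only for the euro sign and for code points that need four bytes.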

Tests: encoding/legacy-mb-schinese has some GB18030-2022 coverage already. The aim is to complete that with web-platform-tests/wpt#48239 and web-platform-tests/wpt#48240.

This supersedes #335. This fixes #27 and fixes #312.

(See WHATWG Working Mode: Changes for more details.)



annevk added a commit to annevk/WebKit that referenced this pull request Sep 18, 2024
https://bugs.webkit.org/show_bug.cgi?id=279903

Reviewed by NOBODY (OOPS!).

For GBK and gb18030 we have used the same backing table for quite a
while now. This backing table was updated to account for GB18030-2022
at some point and this impacted GBK as well.

However, the encoder side table was kept disabled for GBK, despite it
actually allowing GBK to be more compatible with its former self.

whatwg/encoding#336 now standardizes the
behavior that GBK and gb18030 are to remain aligned in these matters
and this change implements that.

The corresponding tests are from this PR:
web-platform-tests/wpt#48240

* LayoutTests/imported/w3c/web-platform-tests/encoding/legacy-mb-schinese/gbk/gbk-decoder.any.js:
* LayoutTests/imported/w3c/web-platform-tests/encoding/legacy-mb-schinese/gbk/gbk-encoder-expected.txt:
* LayoutTests/imported/w3c/web-platform-tests/encoding/legacy-mb-schinese/gbk/gbk-encoder.html:
* Source/WebCore/PAL/pal/text/TextCodecCJK.cpp:
(PAL::gb18030AsymmetricEncode):
(PAL::gbEncodeShared):
Member

@hsivonen hsivonen left a comment

Technically LGTM, but please 1) revise the informative description of index-gb18030-ranges.txt and 2) re-run the visualization generator script.

Thanks.

@xfq xfq added the i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response. label Sep 19, 2024
@annevk annevk added normative addition/proposal New features or enhancements labels Sep 19, 2024
@xfq xfq added the i18n-clreq Notifies Chinese script experts of relevant issues label Sep 20, 2024
webkit-commit-queue pushed a commit to annevk/WebKit that referenced this pull request Sep 20, 2024
https://bugs.webkit.org/show_bug.cgi?id=279903

Reviewed by Alex Christensen.

Canonical link: https://commits.webkit.org/283987@main
annevk added a commit to web-platform-tests/wpt that referenced this pull request Sep 21, 2024
moz-v2v-gh pushed a commit to mozilla/gecko-dev that referenced this pull request Sep 25, 2024
… a=testonly

Automatic update from web-platform-tests
Encoding: impact of GB18030-2022 on GBK

See whatwg/encoding#336 for details.
--

wpt-commits: 1ac8deee082ecfb5d3b6f9c56cf9d1688a2fc218
wpt-pr: 48240
gecko-dev-updater pushed a commit to marco-c/gecko-dev-wordified-and-comments-removed that referenced this pull request Sep 26, 2024
gecko-dev-updater pushed a commit to marco-c/gecko-dev-wordified that referenced this pull request Sep 26, 2024
gecko-dev-updater pushed a commit to marco-c/gecko-dev-comments-removed that referenced this pull request Sep 26, 2024
jamienicol pushed a commit to jamienicol/gecko that referenced this pull request Sep 26, 2024
@annevk
Member Author

annevk commented Oct 2, 2024

Given that there's been no further feedback, I plan on merging this tomorrow. I'll update the commit message to note that this change also more clearly documents the GB18030-2000 to GB18030-2005 change.

@annevk annevk merged commit 2c3853e into main Oct 4, 2024
2 checks passed
@annevk annevk deleted the gb18030-2022-take-2 branch October 4, 2024 08:30
lexborisov added a commit to lexbor/lexbor that referenced this pull request Oct 6, 2024
Labels
addition/proposal: New features or enhancements
i18n-clreq: Notifies Chinese script experts of relevant issues
i18n-tracker: Group bringing to attention of Internationalization, or tracked by i18n but not needing response.
normative
Development

Successfully merging this pull request may close these issues.

Reflect changes in GB 18030-2022
If gb18030 is revised, consider aligning the Encoding Standard
3 participants