Commit

content: bcp47 extensions update to extensions doc, ready for review …
…maybe
srl295 committed Sep 12, 2024
1 parent 206207f commit 1677ae1
Showing 2 changed files with 33 additions and 17 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -22,3 +22,4 @@ node_modules
*.sw?
OLD
src/mission.json
/.lycheecache
49 changes: 32 additions & 17 deletions en/posts/2024-bcp47-extensions.md
@@ -6,7 +6,7 @@ tags:
draft: true
---

This is the second post in a discussion of [BCP 47][] Language Tags, in which we will explore current and future extension tags.
This is the second post in a discussion of [BCP 47][] Language Tags, in which we will explore extensions to BCP 47, both current and future.

---

@@ -27,21 +27,19 @@ As of this writing, `u` and `t` are registered.

You can find the official docs for the `-u-` extension [here][u-extension], especially the [key/type definitions][u-key-type].

The -u- extensions provide a variety of additional dimensions to language tags. I will highlight a few here:
The `-u-` extension provides a variety of additional dimensions to language tags. I will highlight a couple here.

- `-u-sd` is the main subtag, as of this writing, that might refer to a purely linguistic difference in the referenced content. To quote from the standard:
- `-u-sd` is one that might refer to a purely linguistic difference in the tagged content. For example, `en-US-u-sd-ustx` could refer to English as spoken in Texas (Texan!). `mt-u-sd-mt43` would be Maltese as spoken in [Ħal Qormi](https://en.wikipedia.org/wiki/Qormi_dialect) and so on.

> `en-GB-u-sd-gbsct` represents the language variant “English as used in Scotland”. And both `en-u-sd-usca` and `en-US-u-sd-usca` represent “English as used in California”
Most of the other subtags affect the behavior of software that processes or produces text using [CLDR][] locale data. For example:

Most of the subtags affect how software processing or producing text using [CLDR][] locale data operates. For example:

- `en-u-tz-uslax` uses a UN LOCODE (in this case US-LAX, which, yes, refers to [Los Angeles International Airport](https://airportcodes.aero/lax)) that identifies, very compactly, the time zone `America/Los_Angeles`. This type of a subtag is useful to convey user preferneces, such as time zone, in environments (such as a browser context) where there otherwise isn't a way to convey such information.
- `en-u-tz-uslax` uses a UN/LOCODE (in this case US-LAX, which, yes, refers to [Los Angeles International Airport](https://airportcodes.aero/lax)) that identifies, very compactly, the time zone `America/Los_Angeles`. This type of subtag is useful for conveying user preferences, such as time zone, in environments (such as a browser context) where there otherwise isn't a way to convey such information.

- `en-US-u-hc-h24` specifies that you want 24-hour time, even though the preferred value for `en-US` normally indicates 12-hour time. Again, this allows for a level of user 'customization' and could be passed to an API performing date formatting.

- `cs-u-co-search` specifies the Czech language, but that a different collator is requested - one optimized for searching text, instead of sorting. This useful as an argument to an API function.
- `cs-u-co-search` specifies the Czech language, but requests a different collator: one optimized for searching text instead of sorting. This is useful as an argument to an API function.

See [the whole list][u-key-type] for the latest details.
See [the whole list][u-key-type] for the latest details. CLDR continues to add new keys periodically as the need arises.
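
To make the key/type structure of the examples above concrete, here is a minimal Python sketch that pulls the `-u-` keywords out of a tag. The function name `parse_u_keywords` is made up for illustration, and the parsing is deliberately simplified: it assumes keys are two characters long and types are longer, ignores attributes, and does no validation against the key/type list. A real application should rely on a proper locale library instead.

```python
def parse_u_keywords(tag):
    """Return the -u- keywords of a language tag as {key: type}.

    Simplified sketch: no validation, attributes are ignored,
    and the result is lowercased.
    """
    subtags = tag.lower().split("-")
    keywords = {}
    key = None
    in_u = False
    # Skip the first subtag (normally the language).
    for subtag in subtags[1:]:
        if len(subtag) == 1:
            # A singleton starts a new extension (or private use with 'x').
            if in_u or subtag == "x":
                break
            in_u = subtag == "u"
            continue
        if not in_u:
            continue
        if len(subtag) == 2:
            # Two-character subtags are keys, such as 'hc' or 'tz'.
            key = subtag
            keywords[key] = []
        elif key is not None:
            # Longer subtags belong to the type of the current key.
            keywords[key].append(subtag)
    return {k: "-".join(v) for k, v in keywords.items()}


print(parse_u_keywords("en-US-u-hc-h24"))  # {'hc': 'h24'}
print(parse_u_keywords("cs-u-co-search"))  # {'co': 'search'}
```
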

## `t`: Transform extensions

@@ -55,16 +53,26 @@ First, some examples.

- `el-Latn-t-el-Grek` This is content that is _in_ the Latin script (in fact, it's Greek written in Latin letters) but was transformed _from_ `el-Grek`, that is, from Greek in the Greek script.

- `el-Latn-t-el-m0-bgn` This is content _in_ Latin, again Greek in Latin, but transformed from Greek (the Greek script is assumed), but according to
- `el-Latn-t-el-m0-bgn` This is content _in_ the Latin script, again Greek written in Latin letters, transformed from Greek (the Greek script is assumed), but according to a specific set of rules.
So `zoí` for example would be the transformed version of `ζωή` (which is in `el-Grek`). The `m0-bgn` specifies that the transform follows the rules of the [United States Board on Geographic Names (BGN)][BGN].

- From the RFC, `und-Cyrl-t-und-latn-m0-ungegn-2007`. This is a transform from Latin to Cyrillic, using the 2007 version of the UNGEGN rules.

- `hi-t-en-h0-hybrid` Hinglish! Yes, Hindi but including English. The [spec](https://www.unicode.org/reports/tr35/#Hybrid_Locale) gives some additional examples.

- Finally, `en-t-k0-colemak` can be used to specify an English keyboard, using the Colemak layout. See also [LDML Keyboards][].
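
The examples above all share the same shape: an optional source language tag after `-t-`, followed by fields whose keys are a letter plus a digit (`m0`, `h0`, `k0`). As a rough illustration, the following Python sketch splits a tag into those two parts. The name `parse_t_extension` is hypothetical, and the code assumes well-formed input and does no validation.

```python
import re

# Field keys in the -t- extension are a letter followed by a digit, e.g. 'm0'.
_FIELD_KEY = re.compile(r"[a-z][0-9]")


def parse_t_extension(tag):
    """Return (source_tag, fields) for the -t- extension of a tag.

    Simplified sketch: everything is lowercased and nothing is validated.
    """
    subtags = tag.lower().split("-")
    source = []
    fields = {}
    key = None
    in_t = False
    # Skip the first subtag (normally the language).
    for subtag in subtags[1:]:
        if len(subtag) == 1:
            # A singleton starts a new extension (or private use with 'x').
            if in_t or subtag == "x":
                break
            in_t = subtag == "t"
            continue
        if not in_t:
            continue
        if _FIELD_KEY.fullmatch(subtag):
            key = subtag
            fields[key] = []
        elif key is None:
            # Before the first field key: part of the source language tag.
            source.append(subtag)
        else:
            fields[key].append(subtag)
    return "-".join(source), {k: "-".join(v) for k, v in fields.items()}


print(parse_t_extension("el-Latn-t-el-m0-bgn"))  # ('el', {'m0': 'bgn'})
print(parse_t_extension("en-t-k0-colemak"))      # ('', {'k0': 'colemak'})
```
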

## And more?

There's some discussion underway about a possible extension to allow dialects and other variants to be chosen according to ISO 2XXXXX.
Those are all of the registered extensions as of this writing. But perhaps more are needed.

For example, the [Kinyamulenge](https://en.wikipedia.org/wiki/Banyamulenge) language is considered a dialect of Kinyarwanda (Rwandan, `rw`) and so does not have its own [ISO 639][] code. However, it has very distinct orthography and vocabulary. How can documents written in Kinyamulenge be tagged properly, and how can the right locale data be selected?

- One option might be to register an IANA [variant tag][iana-lsr], `rw-`_something_.
- Since Kinyamulenge is spoken in the South Kivu province, perhaps the CLDR `-u-` subtag could be used: `rw-u-sd-cdsk` where `CD-SK` is the [ISO 3166-2:CD][] (subdivision) code referring to South Kivu province of the Democratic Republic of the Congo.
- There's some discussion underway about a possible extension to allow dialects and other variants to be chosen according to [ISO 21636][]. Perhaps the Glottolog code [`mule1238`][mule1238] could be used here somehow: `rw-≈-gl-mule1238` (I'm using ≈ because there's no extension letter yet!)

One thing to note about all three of these approaches is that they degrade gracefully. That is, applications which don't understand the additional subtags can follow the fallback rules and will end up with `rw` (Rwandan), which is a reasonable fallback for Kinyamulenge.
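
As a sketch of what that degradation can look like, the Python below assumes the "fallback rules" are the truncation-based lookup described in RFC 4647: drop the last subtag, and drop any single-character subtag that is left dangling at the end. The function names and the set of supported locales are made up for illustration.

```python
def fallback_chain(tag):
    """List a tag followed by its progressively truncated fallbacks."""
    subtags = tag.split("-")
    chain = []
    while subtags:
        chain.append("-".join(subtags))
        subtags.pop()
        # A singleton such as 'u' or 'x' is removed together with
        # the subtag that followed it.
        while subtags and len(subtags[-1]) == 1:
            subtags.pop()
    return chain


def lookup(tag, supported):
    """Return the first fallback of `tag` found in `supported`, or None."""
    for candidate in fallback_chain(tag):
        if candidate in supported:
            return candidate
    return None


print(fallback_chain("rw-u-sd-cdsk"))        # ['rw-u-sd-cdsk', 'rw-u-sd', 'rw']
print(lookup("rw-u-sd-cdsk", {"rw", "en"}))  # rw
```

An application that only ships `rw` data therefore still finds a usable match, without knowing anything about the `sd` key.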

## Private use and the `x` escape hatch

@@ -76,7 +84,7 @@ Thanks to the Unicode Standard, I can include text in this blog post in ᓀᐦ

Private use refers to a _closed system_ of mutual _private agreement_, just as you'd need for a predefined "secret code" cipher book, where A=1, B=2, C=3 and so on. Only people with the code book can read your message.

In this era of having thousands of "friends", a "private agreement" might mean: Everyone on this mailing list uses this system. Or, everyone accessing this website uses this system. Better yet, "private use" would be confined to your own software. That is, you might use BCP 47 language tags including private use codes or extensions, but would not send those tags to arbitrary people or processes. The private tags would be prevented from "leaking out" as much as possible.
In this era of having thousands of "friends", a "private agreement" might mean: Everyone on this mailing list uses this system. Or, everyone accessing this website uses this system. Better yet, "private use" would be confined to your own data/software. That is, you might use BCP 47 language tags including private use codes or extensions, but would not send those tags to arbitrary people or processes. The private tags would be prevented from "leaking out" as much as possible.

### Risks of private use

@@ -90,17 +98,19 @@ Being private and not registered, there are two obvious concerns and one not so

With those caveats, let's take an overview of the world of private use tags.

### Private use language, region, and script codes
### Private use language, script, and region codes

While not a part of BCP 47 itself, it should be noted here that the [ISO 639][], [ISO 15924][] and [ISO 3166][] standards themselves include "private use" codes which will never be assigned to regular values. See those standards for details, but I will give some examples of how these might be used. Note that these remain private - mentioning them here doesn't in any way remove them from the possibility of other private use!

- Unicode [CLDR][] - for internal processing - has used among others:
- `Qaag` for Zawgyi "script"
- `QO` for Outlying Oceana
- `ZZ` for "Any Region"
- Previously used `Qaai` for "Inherited", which is now `Zinh`
- `Qaag` for Zawgyi "script"
- `QO` for Outlying Oceania
- `ZZ` for "Any Region"
- CLDR previously used `Qaai` for "Inherited", which is now `Zinh`
- A large number of private use language codes are mapped to constructed languages for users of the [ConLang Code Registry (CLCR)][CLCR], such as `qaz` for Tolkien's Adûnaic language.

To take up the Kinyamulenge example above, one could arbitrarily use `qml` for Kinyamulenge, within a closed system.
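
If you want software to recognize such codes (for example, to stop them from leaking out of a closed system), a check along the following lines may help. This Python sketch uses the private use ranges listed in RFC 5646: `qaa` to `qtz` for languages, `Qaaa` to `Qabx` for scripts, and `AA`, `QM` to `QZ`, `XA` to `XZ`, and `ZZ` for regions. The function names are made up for illustration.

```python
def is_private_use_language(code):
    """True for ISO 639 private use language codes (qaa..qtz)."""
    code = code.lower()
    return len(code) == 3 and code.isascii() and code.isalpha() and "qaa" <= code <= "qtz"


def is_private_use_script(code):
    """True for ISO 15924 private use script codes (Qaaa..Qabx)."""
    code = code.lower()
    return len(code) == 4 and code.isascii() and code.isalpha() and "qaaa" <= code <= "qabx"


def is_private_use_region(code):
    """True for ISO 3166 private use region codes (AA, QM..QZ, XA..XZ, ZZ)."""
    code = code.upper()
    if len(code) != 2 or not (code.isascii() and code.isalpha()):
        return False
    return code in ("AA", "ZZ") or "QM" <= code <= "QZ" or "XA" <= code <= "XZ"


print(is_private_use_language("qml"))  # True
print(is_private_use_script("Zinh"))   # False
print(is_private_use_region("QO"))     # True
```
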

### Private use language extension tags

We mentioned `x-` briefly last time, with the example of `x-codehive`. Actually, if the tag begins with an `x-` then the entire tag is private use. You can do whatever you want! And other software will have no idea what the tag means.
@@ -109,6 +119,8 @@ But, `-x-` can also be used as a private use extension, at the end of a language

SIL's [SLDR][] repository and langtags.json file also use `-x-` to refer to languages and dialects that don't otherwise have an assigned tag. For example, `acr-x-rabinal` refers to the Rabinal dialect of the [Achi](https://en.wikipedia.org/wiki/Achi_language) language.

To return to the Kinyamulenge example, one could use `rw-x-mulenge`. This could be used today (as private use), with the benefit that software processing this tag could "fall back" to `rw` (Kinyarwanda).
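
To illustrate the difference between a fully private tag and a private use suffix, here is a small Python sketch that splits a tag at the `x` singleton. The function name `split_private_use` is hypothetical, and the code assumes a well-formed tag.

```python
def split_private_use(tag):
    """Split a tag into (public_part, private_subtags) at the 'x' singleton."""
    subtags = tag.split("-")
    for i, subtag in enumerate(subtags):
        if subtag.lower() == "x":
            # Everything after 'x' is private use.
            return "-".join(subtags[:i]), subtags[i + 1:]
    return tag, []


print(split_private_use("rw-x-mulenge"))  # ('rw', ['mulenge'])
print(split_private_use("x-codehive"))    # ('', ['codehive'])
```

An empty public part, as in the second call, means the entire tag is private use and there is nothing to fall back to.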

## Conclusion


@@ -135,3 +147,6 @@ SIL's [SLDR][] repository and langtags.json file also uses `-x-` to refer to lan
[u-key-type]: https://www.unicode.org/reports/tr35/#table-key-type-definitions
[BGN]: https://geonames.nga.mil/geonames/GNSHome/welcome.html
[LDML Keyboards]: https://cldr.unicode.org/index/keyboard-workgroup
[ISO 21636]: https://www.iso.org/standard/84965.html
[ISO 3166-2:CD]: https://en.wikipedia.org/wiki/ISO_3166-2:CD
[mule1238]: https://glottolog.org/resource/languoid/id/mule1238
