-
Notifications
You must be signed in to change notification settings - Fork 1
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
Showing
1 changed file
with
122 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,122 @@ | ||
--- | ||
title: IETF BCP 47 | ||
date: 2024-08-02 | ||
draft: true | ||
--- | ||
|
||
_(Draft post.)_ | ||
|
||
In discussions with clients as well as among the Digitally Disadvantaged Languages community, I’ve had discussions about [BCP 47][] Language Tags. I’m writing this up as a general overview, meant to benefit the globalization commuity. | ||
|
||
## What is a Language Tag? | ||
|
||
To introduce this topic generally, a _language tag_ is a short string that refers minimally to a language (`fr` for French), but can optionally include additional context. | ||
|
||
## Who uses this? | ||
|
||
Language Tags are used primarily for two different purposes: | ||
|
||
1. Identifying a document or other resource. For example, `<html lang="fr">` specifies that the HTML document is in French. | ||
|
||
2. Identifying a user’s own language (and other) preferences. In this usage, the language tags are used as a "locale code". For example, `fr-CA` specifies a French user in Canada. | ||
|
||
## Haven't I seen these before? | ||
|
||
One might ask, _Haven’t these codes been around for a long time, even before 2006? These aren't always referred to as BCP 47._ | ||
|
||
Well, yes and no. By way of analogy, Unicode UTF-8 is backwards compatible with ASCII. In other words, the string `Hello` is valid ASCII (1963) — but it’s also valid UTF-8. Similarly, `fr` and `fr-CA` are valid BCP 47, but also valid [RFC 3066][] and even [RFC 1766][] (1995). | ||
|
||
So, just as with a move to Unicode, you can move to BCP 47 by leveraging what you are already doing for language tags / locale preferences. | ||
|
||
## What is BCP 47, formally? | ||
|
||
[IETF][] is the Internet Engineering Task Force, responsible for the standards that keep the Internet running. An “RFC” is a Request For Comment — a document put out in public for review. These documents have [a wide variety of statuses][rfcstatus], from purely informational, to core standards that are widely implemented. BCP 47 is a “Best Common Practice” and actually consists of two documents: | ||
|
||
- [RFC 4647][] _Matching of Language Tags_ | ||
|
||
This defines how to take user’s preferences of a tag or list of tags, | ||
such as say `fr-CA`, and find a matching document or set of locale data which satisfies the user’s request. In our example, the document `fr` would match. | ||
|
||
- [RFC 5646][] _Tags for Identifying Languages_ | ||
|
||
This specifies the structure of the tags themselves, including private use and extensions. | ||
|
||
## OK, let’s dig into some sample codes! | ||
|
||
The following sections explore various BCP 47 codes. We will “disassemble” them, to see how they tick. | ||
|
||
### `fr` | ||
|
||
This refers to French. The _language_ part of the language tag is taken from [ISO 639][]. It needs to be the shortest form, so it is `fr` and not `fra`. | ||
|
||
Easy enough? Let’s add some more complications. | ||
|
||
### `ssy` | ||
|
||
This is the Saho language. Note that this is a three letter code, We have left "Set 1" of [ISO 639][] and are now using a code from the other sets. | ||
|
||
It’s important to note that language codes can be one or two letters. If three letters aren’t supported, software is restricted to supporting the “major, mostly national” languages (to quote ISO). | ||
|
||
### `fr-CA` | ||
|
||
French in Canada. The _region_ part of the tag _in this case_ is from [ISO 3166][]. | ||
|
||
Note that I said _region_ and not _country_. A simple example is that `AQ`, Antarctica, is not a country (as of this writing). There are more complex examples, but that is because what constitutes a "country" is complex. Calling it a "region" is more accurate, and avoids some geopolitical challenges. Plus, see the next example: | ||
|
||
### `fr-002` | ||
|
||
Okay, what is this? French… in Africa! | ||
|
||
`002` here is a region, specifically, a UN [M.49][] numerical code for “Geographic Regions.” The UN statistics division also helpfully provides a hierarchy of which regions are subregions of others, starting with `001` (the world). So for example: | ||
|
||
- `001` (World) contains | ||
- `002` (Africa) which contains | ||
- `015` (Northern Africa) which itself contains regions such as | ||
- `DZ` (Algeria), `EG`, (Egypt), … | ||
|
||
This hierarchy can be very useful when referring to, for example, a document, preference, or setting that applies to multiple regions. In some situations, `fr-002` can be used to refer to `fr-DZ`, `fr-EG`, etc. | ||
|
||
### `sr-Cyrl` and `sr-Latn` | ||
|
||
The Serbian language can be written with two different scripts: Cyrillic ("српски") and Latin ("srpski"). Notice here that the second component is _not_ a region code, but an [ISO 15924][] script code. | ||
|
||
Let's talk about chopping. Chopping is an old practice whereby locale tags are "chopped up" into hyphen separated segments. | ||
|
||
Before script codes and UN M.49 codes were part of language tags, it was common to simply assume this structure: _ll-RR_ That is, a 2-letter language, and a 2-letter region. Rather than yearning for “simpler” times, I’ll point out that this is a good example of where code that worked before will get the wrong answer for BCP 47. At worst case, if you assume (not bothering to look for the hyphen) the region is the fourth and fifth character of the code, our example codes will land you in Cyprus (`CY`) and Laos (`LA`) respectively! | ||
|
||
### `zh-Hant-HK` | ||
|
||
Now we put together a few items we've picked up along the way. This is Chinese, written in the Traditional script, in Hong Kong. The "old" way to distinguish Simplified and Traditional was to use `zh` or `zh-CN` (Simplified) versus `zh-HK`/`zh-TW` (Traditional). But the correct way to do so is to use the script codes `Hans` and `Hant`. | ||
|
||
Yes, "Chinese" is a broad term. Generally, `zh` refers to Mandarin, which has a specific code `cmn`. | ||
|
||
### `el-polyton` | ||
|
||
"polyton" here is a variant code, registered with the IANA [Language Subtag Registry][iana-lsr]. Together, this tag refers to Polytonic Greek, that is, modern Greek written with polytonic accents. | ||
|
||
### `x-codehive` | ||
|
||
This is a private use code. It could be anything! But, don’t use it for open interchange, because there's no way to know what it means or prevent someone else from colliding with your usage of it. | ||
|
||
## Conclusion | ||
|
||
That wraps up our brief introduction to BCP 47. In a future installment, we'd like to cover: | ||
|
||
- more about private use codes including `-x-` and reserved codes | ||
- extensions such as `-u-` and `-v-` | ||
- how to use BCP 47 | ||
|
||
<!-- Footnotes --> | ||
|
||
[IETF]: https://ietf.org | ||
[BCP 47]: https://www.rfc-editor.org/info/bcp47 | ||
[RFC 3066]: https://www.rfc-editor.org/info/rfc3066 | ||
[RFC 1766]: https://www.rfc-editor.org/info/rfc1766 | ||
[RFC 4647]: https://www.rfc-editor.org/info/rfc4647 | ||
[RFC 5646]: https://www.rfc-editor.org/info/rfc5646 | ||
[rfcstatus]: https://www.ietf.org/process/rfcs/#streams | ||
[ISO 639]: https://www.iso.org/iso-639-language-code | ||
[ISO 3166]: https://www.iso.org/iso-3166-country-codes.html | ||
[ISO 15924]: https://www.unicode.org/iso15924/ | ||
[M.49]: https://unstats.un.org/unsd/methodology/m49/ | ||
[iana-lsr]: https://www.iana.org/assignments/lang-subtags-templates/ |