Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dedupe location names against script-root #5966

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

robertbastian
Copy link
Member

@robertbastian robertbastian commented Jan 9, 2025

Fixes #5901

I'm not introducing script fallback (because time zone names are not meant to fall back by script), and I'm not introducing und-Xxxx locales, as those would increase data size (due to the other fields in the struct). Instead, I store a dedupe_locale in each data struct, which contains the locale that was deduplicated against. So for sr, this would be ru, because sr -> und-Cyrl -> ru, and for ru it'd be ru as well, and ru is complete. This way, we also directly know which dedupe locale to load; if we didn't store it we'd have to load a fallbacker and step through it until we hit und (and then use the previous value).

Depends on #5967

Copy link

dpulls bot commented Jan 9, 2025

🎉 All dependencies have been resolved !

@robertbastian robertbastian force-pushed the scriptfb branch 2 times, most recently from 8aa3ffd to 1e46e4f Compare January 9, 2025 15:25
Copy link

dpulls bot commented Jan 9, 2025

🎉 All dependencies have been resolved !

@Manishearth Manishearth removed their request for review January 9, 2025 17:16
@robertbastian robertbastian removed the request for review from zbraniecki January 9, 2025 18:28
@sffc
Copy link
Member

sffc commented Jan 10, 2025

Fixes #5901

Do we not do similar deduplication in the other time zone keys as well?

@sffc
Copy link
Member

sffc commented Jan 10, 2025

So the problem with dedupe_locale is that the fallback engine has no knowledge that it needs to keep both payloads around. If you pass LocaleFamily("sr") into datagen, datagen is free to remove ru, but if we redefine the key to script fallback, it will know that it needs to retain und-Cyrl. (well, unless no-ancestor mode is used; but I would not block landing that since no-ancestor is less common and a power-user tool.)

@sffc
Copy link
Member

sffc commented Jan 10, 2025

Hmm. An approach that would work with our framework would be to have two keys:

  1. location_base
  2. location_override

Every locale has an entry in each key, but location_base has one unique payload per script, and location_override contains smaller payloads, which are empty if the data exactly equals the base.

Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requesting changes because this code won't work with data slicing.

@robertbastian
Copy link
Member Author

robertbastian commented Jan 13, 2025

I do not want to change this to script fallback, because it should definitely do region fallback.

The two-key approach is pretty complex, the base keys would have to contain the intersection of all data structs that are deduped against the locale, which is hard to generate in our current framework (it's also not clear if grouping by script leads to the most deduping).

This currently has a data size improvement of <10%, I'm not sure it's worth pursuing at the moment.

@robertbastian robertbastian marked this pull request as draft January 13, 2025 11:30
@sffc
Copy link
Member

sffc commented Jan 13, 2025

I do not want to change this to script fallback, because it should definitely do region fallback.

Hm? Right now it is language fallback. Script fallback is identical except that it adds und-Script in the fallback chain.

The two-key approach is pretty complex, the base keys would have to contain the intersection of all data structs that are deduped against the locale, which is hard to generate in our current framework (it's also not clear if grouping by script leads to the most deduping).

Hm? I don't understand this, either. Generating the base key is trivial. It is equal to the language data for the likely language of the locale's script. We could do something fancier but that's not what I suggested above

This currently has a data size improvement of <10%, I'm not sure it's worth pursuing at the moment.

1'407'337 B to 1'297'285 B, or 110 kB.

It's about the raw data size improvement, yes, but also about language equity and not privileging English.

@sffc
Copy link
Member

sffc commented Jan 20, 2025

@robertbastian Can you clarify your comment about why you don't want to use script fallback?

@sffc sffc added the 2.0-breaking Changes that are breaking API changes label Jan 20, 2025
@robertbastian
Copy link
Member Author

robertbastian commented Jan 22, 2025

Hm? Right now it is language fallback. Script fallback is identical except that it adds und-Script in the fallback chain.

I guess that's right.

Generating the base key is trivial. It is equal to the language data for the likely language of the locale's script.

It is not. en contains US-specific display names (like "CT") that en-GB should not fall back to. Nevermind, that's non-location. Maybe it is trivial but this is not a priority for me right now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2.0-breaking Changes that are breaking API changes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Consider using script fallback for time zone names and maybe others
2 participants