Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Smaller crates.io package #112

Open
wants to merge 2 commits into
base: main
Choose a base branch
from
Open

Conversation

kornelski
Copy link
Contributor

@kornelski kornelski commented Nov 1, 2024

I've noticed that encoding_rs is in the top 10 crates using most bandwidth of all crates.io.

I've managed to make the package 3× smaller:

  • I've separated tests that need test_data and excluded them and the data from the tarball. All other tests still work even when run from the crates.io package.
  • I've replaced [u8; 2] tables with include_bytes!. These tables were using 18 bytes to represent two. Now the bytes aren't even loaded when their feature flags are off. I haven't changed the u16 tables, because they're trickier due to alignment.

The generate-encoding-data.py is for Python 2, so I was unable to run it. I've dumped the data this way:

    for (name, data) in [
        ("big5_level1_hanzi_bytes", &BIG5_LEVEL1_HANZI_BYTES[..]),
        ("big5_unified_ideograph_bytes", &BIG5_UNIFIED_IDEOGRAPH_BYTES),
        ("jis0208_level1_kanji_shift_jis_bytes", &JIS0208_LEVEL1_KANJI_SHIFT_JIS_BYTES),
        ("jis0208_kanji_bytes", &JIS0208_KANJI_BYTES),
        ("cp949_hangul_bytes", &CP949_HANGUL_BYTES),
        ("ksx1001_unified_hanja_bytes", &KSX1001_UNIFIED_HANJA_BYTES),
        ("ksx1001_compatibility_hanja_bytes", &KSX1001_COMPATIBILITY_HANJA_BYTES),
        ("gb2312_level1_hanzi_bytes", &GB2312_LEVEL1_HANZI_BYTES),
        ("gbk_hanzi_bytes", &GBK_HANZI_BYTES),
    ] {
        let flattened: Vec<u8> = data.iter().flat_map(|x| x).copied().collect();
        std::fs::write(format!("data/{name}.bin"), flattened);
    }

The diff for deletion is huge, but commits should be viable individually.

@hsivonen
Copy link
Owner

hsivonen commented Nov 4, 2024

I acknowledge that I have seen this, but this will take more time for me to merge, because I want to port this to the generator script. (Which I need to do in a VM due to Python 2. Rewriting the generator in Rust would be nice in theory but never important enough. I tried porting the script to Python 3 in September but gave up, because Python 3 had done disruptive-enough changes to how comparison methods work to make porting non-trivial.)

People have complained about test data being included before, so I guess it's time to remove it. It remains to be seen if Linux distros will then complain about the crates.io package not being complete.

@kornelski
Copy link
Contributor Author

I've moved test data to a separate PR to unblock that change.

@kornelski
Copy link
Contributor Author

As for the generation script, maybe you can make it write the arrays to different Rust files of a helper crate, and use the Rust code I've provided as a post-processing step? It's not elegant, but that doesn't seem that important given how rarely if ever this needs to run.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants