
Fix unstable tokenizer fingerprinting (enables map cache reuse) #7982

Open
KOKOSde wants to merge 1 commit into huggingface:main from KOKOSde:perf/stable-tokenizer-fingerprint


KOKOSde commented Feb 2, 2026

Fix unstable dataset fingerprinting when hashing PreTrainedTokenizerFast.

Some tokenizers backed by tokenizers.Tokenizer mutate runtime settings (padding/truncation) when called, which can change the serialized state and make dataset fingerprints unstable. That prevents .map(load_from_cache_file=True) from reusing cache files.
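To make the failure mode concrete, here is a minimal sketch that uses only the standard library. `FakeBackend`, `FakeFastTokenizer`, and `naive_fingerprint` are hypothetical stand-ins invented for illustration; they are not the `tokenizers`, `transformers`, or `datasets` APIs, and the real hashing path in `datasets` is more involved. The point is only that a hash computed over serialized state changes as soon as a call writes runtime settings into that state.

```python
import hashlib
import json


class FakeBackend:
    """Hypothetical stand-in for a tokenizer backend (not the real library class)."""

    def __init__(self):
        self.truncation = None  # runtime setting; None means disabled
        self.padding = None     # runtime setting; None means disabled

    def to_str(self):
        # Runtime settings are serialized next to the fixed vocabulary.
        state = {
            "vocab": {"a": 0, "b": 1},
            "truncation": self.truncation,
            "padding": self.padding,
        }
        return json.dumps(state, sort_keys=True)


class FakeFastTokenizer:
    """Hypothetical wrapper whose __call__ mutates the backend, as described above."""

    def __init__(self):
        self.backend = FakeBackend()

    def __call__(self, text, truncation=False, max_length=None):
        if truncation:
            # Side effect: the call leaves its settings behind on the backend.
            self.backend.truncation = {"max_length": max_length}
        ids = [{"a": 0, "b": 1}[ch] for ch in text]
        return ids[:max_length] if truncation else ids


def naive_fingerprint(tokenizer):
    # Hash the full serialized state, runtime settings included.
    return hashlib.sha256(tokenizer.backend.to_str().encode("utf-8")).hexdigest()


tok = FakeFastTokenizer()
before = naive_fingerprint(tok)
tok("abab", truncation=True, max_length=2)
after = naive_fingerprint(tok)
print(before == after)  # False: the call changed the serialized state
```

Because a cache lookup keys on the fingerprint, a second run that hashes the tokenizer after it has been called would look for a different cache file than the first run wrote.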

Fix: when hashing, temporarily disable backend padding/truncation so runtime settings don’t affect the fingerprint, then restore the original settings.
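The clear-then-restore approach described above can be sketched as a context manager. This is an illustration under stated assumptions, not the code in this PR: `FakeBackend` is a hypothetical stand-in that mirrors the method names of `tokenizers.Tokenizer` (`truncation`, `padding`, `enable_truncation`, `enable_padding`, `no_truncation`, `no_padding`, `to_str`), and `runtime_settings_cleared` and `stable_fingerprint` are names invented here.

```python
import contextlib
import hashlib
import json


class FakeBackend:
    """Hypothetical stand-in mirroring method names of `tokenizers.Tokenizer`."""

    def __init__(self):
        self._truncation = None
        self._padding = None

    @property
    def truncation(self):
        return self._truncation

    @property
    def padding(self):
        return self._padding

    def enable_truncation(self, max_length, stride=0,
                          strategy="longest_first", direction="right"):
        self._truncation = {"max_length": max_length, "stride": stride,
                            "strategy": strategy, "direction": direction}

    def no_truncation(self):
        self._truncation = None

    def enable_padding(self, **kwargs):
        self._padding = dict(kwargs)

    def no_padding(self):
        self._padding = None

    def to_str(self):
        state = {"vocab": {"a": 0, "b": 1},
                 "truncation": self._truncation, "padding": self._padding}
        return json.dumps(state, sort_keys=True)


@contextlib.contextmanager
def runtime_settings_cleared(backend):
    """Temporarily disable truncation/padding, restoring them on exit."""
    truncation, padding = backend.truncation, backend.padding
    backend.no_truncation()
    backend.no_padding()
    try:
        yield backend
    finally:
        # Restore whatever was configured before, even if hashing raised.
        if truncation is not None:
            backend.enable_truncation(**truncation)
        if padding is not None:
            backend.enable_padding(**padding)


def stable_fingerprint(backend):
    # Serialize with runtime settings cleared so they cannot affect the hash.
    with runtime_settings_cleared(backend) as cleared:
        return hashlib.sha256(cleared.to_str().encode("utf-8")).hexdigest()


backend = FakeBackend()
before = stable_fingerprint(backend)
backend.enable_truncation(max_length=8)  # simulate the mutation caused by a call
backend.enable_padding(length=8)
after = stable_fingerprint(backend)
print(before == after)                   # True
print(backend.truncation["max_length"])  # 8 (settings restored after hashing)
```

Restoring in a `finally` block means the caller's truncation and padding configuration survives even if serialization fails partway through.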

Includes a regression test showing Hasher.hash(tokenizer) stays stable after calling the tokenizer.

KOKOSde force-pushed the perf/stable-tokenizer-fingerprint branch 2 times, most recently from 8c1891b to 347e84a on February 5, 2026 05:29
Tokenizers backed by `tokenizers` can mutate truncation/padding state when called, which made dataset transform fingerprints unstable and prevented `.map(load_from_cache_file=True)` from reusing cached results.

This change makes tokenizer hashing stable by temporarily clearing backend truncation/padding during serialization for fingerprinting, then restoring it.

Add a regression test and a simple benchmark to demonstrate cache-hit speedups.

Fixes huggingface#3847
KOKOSde force-pushed the perf/stable-tokenizer-fingerprint branch from 347e84a to 1792715 on February 5, 2026 05:41

KOKOSde commented Feb 9, 2026

Hi! It looks like the GitHub Actions check suites for this PR are in the action_required state, so no workflows have actually run. This is usually because workflows from a fork need maintainer approval.

Could a maintainer please approve/run the workflows so CI can execute? Happy to address anything CI flags once it runs.

