Make model key assignment deterministic #5792
Conversation
Force-pushed from 2f9f698 to 225a41d.
There's a problem with the SHA1 implementation: when reading the file with a block size, it doesn't feed every block into the digest, so the resulting hash doesn't cover the whole file and is incorrect.
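For contrast, the correct chunked-read pattern feeds every block into the running digest. A minimal stdlib sketch (sha1 here, matching the algorithm under discussion):

```python
import hashlib

def sha1_file(path: str, block_size: int = 2**16) -> str:
    """Hash the entire file by feeding every block to the digest.

    The bug described above amounts to stopping after the first read(),
    which hashes only the first `block_size` bytes of the file.
    """
    h = hashlib.sha1()
    with open(path, "rb") as f:
        while chunk := f.read(block_size):
            h.update(chunk)
    return h.hexdigest()
```

Reading in fixed-size blocks keeps memory flat regardless of file size; the block size affects speed, never the digest.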
I was curious so I whipped up a benchmark of file hashing in python: https://github.com/psychedelicious/hash_bench
SHA1 with a block size is by far the fastest, but BLAKE3 is still way faster than any other correct algorithm. I think we should use this for single-file models.
CivitAI provides BLAKE3 hashes, and you can query for a model on their API using it: https://civitai.com/api/v1/model-versions/by-hash/39d4e86885924311d47475838b8e10edb7a33c780934e9fc9d38b80c27cb658e (that's DreamShaper XL something or other, used in my tests).
There are a few ways to implement BLAKE3; see my benchmark repo for the fastest method, which parallelizes hashing and uses memory mapping (I don't really understand what this means, but it makes it super fast).
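For reference, "memory mapping" just means asking the OS to expose the file as a byte buffer that is paged into memory on demand, avoiding a userspace copy per read() call. The blake3 package is a third-party dependency, but the mmap technique itself can be sketched with stdlib hashlib (sha256 as a stand-in for blake3):

```python
import hashlib
import mmap

def hash_file_mmap(path: str, algo: str = "sha256") -> str:
    """Hash a file through a memory map instead of a read() loop.

    mmap exposes the file contents as a buffer; the OS pages bytes in
    lazily, so the whole file is never copied into Python-land at once.
    """
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            h.update(mm)  # hashlib accepts any buffer-protocol object
    return h.hexdigest()
```

Note that mmap.mmap raises ValueError on empty files, so a production version would need a guard for that edge case.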
From a general DB design perspective, I'd be inclined to keep the random primary key and add the hash as another column on the table. (Which I think is what we had discussed when we originally decided to use a random key.) This would be better if we think there's any chance that the definition of the hash key will change in the future, or its uniqueness constraint will be dropped. Example situations:
Up to you if you think the added effort now is worth it for the future-proofing.
The PK can be anything, but we need to be able to reference models using a stable, deterministic identifier. There's no point in having model metadata if it isn't a stable reference to the same model. Using a cryptographic hash also means metadata is useful between users. Also, I think we should consider dropping imohash and instead using b3 to hash the diffusers weights: iterate over all files and update the hash as you go. No need to rely on the imohash library.
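The iterate-and-update approach for diffusers directories could look something like this sketch (sha256 stands in for b3, which is a third-party package; mixing the relative path into the digest is an added assumption so that renames also change the hash):

```python
import hashlib
from pathlib import Path

def hash_model_dir(root: str, algo: str = "sha256") -> str:
    """Deterministic composite hash of a model directory.

    Walks every file in sorted order and feeds it into one running
    digest, so the result is independent of filesystem iteration order.
    """
    h = hashlib.new(algo)
    for file in sorted(Path(root).rglob("*")):
        if file.is_file():
            # Include the relative path so moving/renaming a weight
            # file produces a different hash (an assumption, not spec).
            h.update(file.relative_to(root).as_posix().encode())
            h.update(file.read_bytes())
    return h.hexdigest()
```

A real implementation would stream each file in blocks rather than read_bytes(), but the sorted-walk-plus-single-digest structure is the point.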
Huh. I didn't realize that I was hashing incorrectly. The web is filled with misleading info, I guess.
Python:
Command line tool:
The only reason I chose…
Oh, just saw the blake3 benchmark results. That is very fast. Why don't we just go with that and stick with it?
We are trying to satisfy multiple use cases, ordered in decreasing level of priority.
With respect to (1), we need to adopt deterministic model identifiers rather than randomly-assigned ones. This seems obvious now, but recall that when we originally discussed the MM2 redesign in response to Spencer's RFC last summer we had a consensus that the identifiers should be random. Thoughts:
Overall, I think my preference would be to use…
The other thing we should discuss before merging this PR is how hashes are represented in the model config. The pydantic model is currently:
Civitai computes multiple hashes, and maybe we should allow for similar flexibility in the future. One approach is:
This would let us apply multiple hashes to support other model sources. The downside is that it would necessitate a database migration. Another approach would be simply to adopt a convention of prefixing the hash with its algorithm name, as in
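The prefix convention could be handled with a small parser like this sketch (the `algo:digest` form and the legacy bare-sha1 fallback are assumptions for illustration, not the PR's actual scheme):

```python
def split_prefixed_hash(value: str) -> tuple[str, str]:
    """Split 'blake3:abc123...' into ('blake3', 'abc123...').

    Bare values with no prefix are treated as legacy sha1 digests
    here -- an assumed convention, purely for the sketch.
    """
    algo, sep, digest = value.partition(":")
    if not sep:
        return ("sha1", value)  # assumed legacy default
    return (algo, digest)
```

The appeal of the prefix convention is that it needs no schema change: the algorithm travels inside the existing string column.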
Force-pushed from 4801979 to ee78412.
- When installing, model keys are now calculated from the model contents.
- .safetensors, .ckpt and other single-file models are hashed with SHA1.
- The contents of diffusers directories are hashed using imohash (faster).
- Fix up the yaml->sql db migration script to assign deterministic keys; this commit also detects and assigns the correct image encoder for IP-Adapter models.
- Some algos are slow, so it is now just called ModelHash.
- Added all hashlib algos, plus BLAKE3 and the fast (but incorrect) SHA1 algo.
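Supporting "all hashlib algos" amounts to constructing digests by name from whatever the platform's hashlib build offers, along the lines of:

```python
import hashlib

def hash_bytes(data: bytes, algo: str) -> str:
    """Hash data with any algorithm hashlib knows about on this platform.

    hashlib.algorithms_available is the set of names usable with
    hashlib.new(); algorithms_guaranteed is the portable subset.
    """
    if algo not in hashlib.algorithms_available:
        raise ValueError(f"unsupported algorithm: {algo}")
    return hashlib.new(algo, data).hexdigest()
```

BLAKE3 is not part of hashlib, which is why it needs to be wired in separately via the third-party blake3 package.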
Force-pushed from ee78412 to 48ca841.
This changes the functionality of this PR so that the updated hashing is used only for model hashes, while the key remains a UUID.
- Use a memory view for hashlib algorithms (closer to Python 3.11's `file_digest` API in hashlib)
- Remove `sha1_fast` (realized it doesn't even hash the whole file, it just does the first block)
- Add support for custom file filters
- Update docstrings
- Update tests
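The memory-view pattern mirrors the read loop of Python 3.11's hashlib.file_digest: reuse one preallocated buffer via readinto() instead of allocating a fresh bytes object per chunk. A sketch that works on older Pythons:

```python
import hashlib

def file_digest_compat(path: str, algo: str = "sha256",
                       bufsize: int = 2**18) -> str:
    """Hash a file using one reusable buffer, file_digest-style.

    readinto() fills the same bytearray each iteration, and the
    memoryview slice hands hashlib exactly the bytes that were read
    without copying them.
    """
    h = hashlib.new(algo)
    buf = bytearray(bufsize)
    view = memoryview(buf)
    with open(path, "rb") as f:
        while True:
            n = f.readinto(buf)
            if not n:
                break
            h.update(view[:n])
    return h.hexdigest()
```

On 3.11+ this is roughly `hashlib.file_digest(f, algo)`; the compat version just spells out the same loop.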
Force-pushed from 48ca841 to 9f0bf65.
I've been moving back and forth between this PR and #5846. My recent commits thus change this PR from "Make model key assignment deterministic" to "Improved model hashing".
What type of PR is this? (check all applicable)
Have you discussed this change with the InvokeAI team?
Have you updated all relevant documentation?
Description
Previously, when a model was installed, it was assigned a random database key. This caused issues: if the model was reinstalled, its key would change, and it made metadata irreproducible. This PR changes the behavior so that keys are assigned deterministically based on the model contents.
Related Tickets & Documents
See discord: https://discord.com/channels/1020123559063990373/1149513647022948483/1210441942249377792
QA Instructions, Screenshots, Recordings
Try installing and uninstalling a model using the command line: `invokeai-model-install --add <url or repoid>`, followed by `invokeai-model-install --delete <name of model>`. Both safetensors URLs and HF diffusers should get the same key each time.
Merge Plan
Can merge when approved.
Added/updated tests?
[optional] Are there any post deployment tasks we need to perform?