Update dependency bitsandbytes to v0.45.0 #8
This PR contains the following updates:
bitsandbytes: ==0.42.0 -> ==0.45.0
Warning
Some dependencies could not be looked up. Check the warning logs for more information.
Release Notes
bitsandbytes-foundation/bitsandbytes (bitsandbytes)
v0.45.0: LLM.int8() support for H100; faster 4-bit/8-bit inference
Compare Source
Highlights
H100 Support for LLM.int8()
PR #1401 brings full LLM.int8() support for NVIDIA Hopper GPUs such as the H100, H200, and H800!
As part of the compatibility enhancements, we've rebuilt much of the LLM.int8() code in order to simplify for future compatibility and maintenance. We no longer use the col32 or architecture-specific tensor layout formats while maintaining backwards compatibility. We additionally bring performance improvements targeted for inference scenarios.
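For context, a minimal way to exercise the LLM.int8() path from 🤗transformers is sketched below. The checkpoint name is only a placeholder and is not prescribed by this release; any model supported by transformers loads the same way.

```python
# Illustrative sketch only: loading a model with LLM.int8() through 🤗transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Qwen/Qwen2.5-3B-Instruct"  # placeholder checkpoint

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # LLM.int8()
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```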
Performance Improvements
This release includes broad performance improvements for a wide variety of inference scenarios. See this X thread for a detailed explanation.
The improvements were measured using the 🤗optimum-benchmark tool.
For more benchmark results, see benchmarking/README.md.
LLM.int8()
Example throughput improvement for Qwen 2.5 14B Instruct on RTX 4090.
Example throughput improvement for Qwen 2.5 3B Instruct on T4.
NF4/FP4
Example throughput improvement for Qwen 2.5 14B Instruct on RTX 4090.
Example throughput improvement for Qwen 2.5 3B Instruct on T4.
Changes
Packaging Changes
The size of our wheel has been reduced by ~43.5%, from 122.4 MB to 69.1 MB! This results in an on-disk size decrease from ~396 MB to ~224 MB.
CUDA Toolkit Versions
Breaking
🤗PEFT users wishing to merge adapters with 8-bit weights will need to upgrade to peft>=0.14.0.
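A minimal sketch of the affected workflow (merging a LoRA adapter into an 8-bit base model); the checkpoint names are placeholders, not part of this release note.

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

base = AutoModelForCausalLM.from_pretrained(
    "some-org/base-model",                                      # hypothetical base checkpoint
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # 8-bit weights
)
model = PeftModel.from_pretrained(base, "some-org/lora-adapter")  # hypothetical adapter
merged = model.merge_and_unload()  # merging into 8-bit weights requires peft>=0.14.0
```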
New
bitsandbytes.functional.int8_vectorwise_dequant(): this functionality is being integrated into 🤗PEFT and 🤗transformers (see the sketch after these items).
The bitsandbytes.functional module now has an API documentation page.
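The round-trip below is a hedged sketch of how the new dequantization helper pairs with the existing vectorwise quantizer. The exact signatures and return values (notably int8_vectorwise_quant returning per-row stats and outlier columns) are assumptions inferred from the release notes, not a documented contract.

```python
import torch
import bitsandbytes.functional as F

A = torch.randn(64, 128, dtype=torch.float16, device="cuda")

# Assumed pairing: int8_vectorwise_quant returns the int8 tensor, per-row absmax
# stats, and (possibly) outlier columns; only the first two are used here.
A_int8, row_stats, _ = F.int8_vectorwise_quant(A)

# New in 0.45.0: reverse the row-wise quantization using the same stats.
A_fp = F.int8_vectorwise_dequant(A_int8, row_stats)

print((A.float() - A_fp.float()).abs().max())  # small quantization error expected
```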
Deprecations
A number of public API functions have been marked for deprecation and will emit FutureWarning when used. These functions will become unavailable in future releases. This should have minimal impact on most end-users.
k-bit quantization
The k-bit quantization features are deprecated in favor of blockwise quantization. For all optimizers, using block_wise=False is not recommended and support will be removed in a future release.
LLM.int8() deprecations:
As part of the refactoring process, we've implemented many new 8bit operations. These operations no longer use specialized data layouts.
The following relevant functions from bitsandbytes.functional are now deprecated:
General Deprecations
Additionally, the following functions from bitsandbytes.functional are deprecated:
What's Changed
New Contributors
Full Changelog: bitsandbytes-foundation/bitsandbytes@0.44.1...0.45.0
v0.44.1
Compare Source
What's Changed
Full Changelog: bitsandbytes-foundation/bitsandbytes@0.44.0...0.44.1
v0.44.0: New AdEMAMix optimizer, Embeddings quantization, and more!
Compare Source
New optimizer: AdEMAMix
The AdEMAMix optimizer is a modification to AdamW which proposes tracking two EMAs to better leverage past gradients. This allows for faster convergence with less training data and improved resistance to forgetting.
We've implemented 8bit and paged variations: AdEMAMix, AdEMAMix8bit, PagedAdEMAMix, and PagedAdEMAMix8bit. These can be used with a similar API to existing optimizers.
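A minimal sketch, assuming the new classes live in bitsandbytes.optim alongside the existing optimizers and follow the usual constructor signature; hyperparameters shown are illustrative only.

```python
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(1024, 1024).cuda()

# The paged/8-bit variants (AdEMAMix, PagedAdEMAMix, PagedAdEMAMix8bit) are
# constructed the same way.
optimizer = bnb.optim.AdEMAMix8bit(model.parameters(), lr=1e-4)

loss = model(torch.randn(8, 1024, device="cuda")).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```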
8-bit Optimizers Update
The block size for all 8-bit optimizers has been reduced from 2048 to 256 in this release. This is a change from the original implementation proposed in the paper and improves accuracy.
CUDA Graphs support
A fix to enable CUDA Graphs capture of kernel functions was made in #1330. This allows for performance improvements with inference frameworks like vLLM. Thanks @jeejeelee!
Quantization for Embeddings
The trend of LLMs to use larger vocabularies continues. The embeddings can take up a significant portion of a quantized model's footprint. We now have an implementation of Embedding4bit and Embedding8bit, thanks to @galqiwi!
Example usage:
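The snippet below is a minimal sketch, assuming the constructor mirrors torch.nn.Embedding and that weights are quantized when the module is moved to the GPU (as with Linear4bit); it is not the verbatim example from the release.

```python
import torch
import bitsandbytes as bnb

# Start from an ordinary trained embedding...
fp16_module = torch.nn.Embedding(128, 64)

# ...and swap in the 4-bit version (Embedding8bit is used the same way).
quantized_module = bnb.nn.Embedding4bit(128, 64)
quantized_module.load_state_dict(fp16_module.state_dict())

# Weights are assumed to be quantized on the transfer to the GPU,
# mirroring the behaviour of Linear4bit.
quantized_module = quantized_module.to("cuda")

ids = torch.randint(0, 128, (1, 16), device="cuda")
out = quantized_module(ids)  # returns dequantized embedding vectors
```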
Continuous Builds
We are now building binary wheels for each change on main. These builds can be used to preview upcoming changes.
🚤 Continuous Build
What's Changed
move_to_device kwarg to the optimizer's load_state_dict by @koute in https://github.com/bitsandbytes-foundation/bitsandbytes/pull/1344
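A hedged sketch of the new kwarg; the semantics assumed here (keep the restored optimizer state where it was loaded instead of moving it to the parameters' device) are inferred from the kwarg name, and the checkpoint path is hypothetical.

```python
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(64, 64).cuda()
optimizer = bnb.optim.Adam8bit(model.parameters())

state = torch.load("optimizer_state.pt")            # hypothetical checkpoint path
optimizer.load_state_dict(state, move_to_device=False)  # assumed: skip automatic device move
```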
New Contributors
Full Changelog: bitsandbytes-foundation/bitsandbytes@0.43.3...v0.44.0
v0.43.3
Compare Source
Improvements:
Params4bit.__new__ post PR #970. It supports models exported with non-default quant_storage, such as this NF4 model with BF16 storage.
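As a hedged illustration of the scenario this fix targets (4-bit NF4 weights stored in a non-default dtype), the sketch below quantizes on load via the 🤗transformers BitsAndBytesConfig route; the checkpoint name and the exact config fields are assumptions, not part of the release note.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_quant_storage=torch.bfloat16,   # non-default quant_storage dtype
)

model = AutoModelForCausalLM.from_pretrained(
    "some-org/nf4-bf16-storage-model",       # hypothetical checkpoint
    quantization_config=bnb_config,
)
```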
v0.43.2
Compare Source
This release is quite significant, as the QLoRA bug fix has big implications for higher seqlen and batch sizes. For each sequence (i.e. batch size increase of one) we expect memory savings of:
seqlen=1024, and 4888GB for seqlen=128,00
seqlen=1024 and 1258GB for seqlen=128,00
This was due to activations being unnecessary for frozen parameters, yet the memory for them was still erroneously allocated due to the now fixed bug.
Improvements:
Bug Fixes
str2optimizer32bit (#1222, thanks @EtienneDosSantos)
v0.43.1
Compare Source
Improvements:
Bug Fixes
Internal Improvements:
v0.43.0
Compare Source
Improvements and New Features:
Bug Fixes:
Backwards Compatibility
From v0.42 to v0.43, when using 4bit quantization, models may generate slightly different outputs (approximately up to the 2nd decimal place) due to a fix in the code. For anyone interested in the details, see this comment.
Internal and Build System Enhancements:
Contributors:
This release is made possible thanks to the many active contributors that submitted PRs and many others who contributed to discussions, reviews, and testing. Your efforts greatly enhance the library's quality and user experience. It's truly inspiring to work with such a dedicated and competent group of volunteers and professionals!
We give a special thanks to @TimDettmers for managing to find a little bit of time for valuable consultations on critical topics, despite preparing for and touring the states applying for professor positions. We wish him the utmost success!
We also extend our gratitude to the broader community for your continued support, feedback, and engagement, which play a crucial role in driving the library's development forward.
Configuration
📅 Schedule: Branch creation - "after 5am on saturday" (UTC), Automerge - At any time (no schedule defined).
🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.
♻ Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.
🔕 Ignore: Close this PR and you won't be reminded about this update again.
This PR has been generated by Renovate Bot.