
Update dependency bitsandbytes to v0.45.0 #8

Open
wants to merge 1 commit into base: rhoai-2.16

Conversation

@red-hat-konflux red-hat-konflux bot commented Nov 23, 2024

This PR contains the following updates:

Package | Update | Change
bitsandbytes | minor | ==0.42.0 -> ==0.45.0

Warning

Some dependencies could not be looked up. Check the warning logs for more information.


Release Notes

bitsandbytes-foundation/bitsandbytes (bitsandbytes)

v0.45.0: LLM.int8() support for H100; faster 4-bit/8-bit inference

Compare Source

Highlights

H100 Support for LLM.int8()

PR #​1401 brings full LLM.int8() support for NVIDIA Hopper GPUs such as the H100, H200, and H800!

As part of the compatibility work, we've rebuilt much of the LLM.int8() code to simplify future compatibility and maintenance. We no longer use the col32 or other architecture-specific tensor layout formats, while maintaining backwards compatibility. We additionally bring performance improvements targeted at inference scenarios.
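
Since backwards compatibility is maintained, the module-level entry point is unchanged. As a minimal sketch (the layer sizes below are illustrative assumptions, not values from the release notes), an fp16 linear layer is typically swapped for the LLM.int8() version like this:

import torch
import bitsandbytes as bnb

# Illustrative sizes; threshold is the usual LLM.int8() outlier cutoff.
fp16_linear = torch.nn.Linear(4096, 4096, bias=False).half()

int8_linear = bnb.nn.Linear8bitLt(
    4096, 4096, bias=False,
    has_fp16_weights=False,  # keep the weights in int8 for inference
    threshold=6.0,           # outlier threshold used by LLM.int8()
)
int8_linear.load_state_dict(fp16_linear.state_dict())
int8_linear = int8_linear.to("cuda")  # weights are quantized on the move to the GPU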

Performance Improvements

This release includes broad performance improvements for a wide variety of inference scenarios. See this X thread for a detailed explanation.

The improvements were measured using the 🤗optimum-benchmark tool.

For more benchmark results, see benchmarking/README.md.

LLM.int8()
  • Turing/Ampere/Ada: The observed per-token throughput is improved by 60-85%, while latency is decreased by 40-45%.
  • H100: With our benchmarking of Llama 3.1 70B, we observed the new LLM.int8() to consistently outperform NF4 at batch size >= 8.

Example throughput improvement for Qwen 2.5 14B Instruct on RTX 4090:

  • Batch size = 1: 9.05 tokens/s => 15.44 tokens/s
  • Batch size = 8: 66.62 tokens/s => 110.95 tokens/s

Example throughput improvement for Qwen 2.5 3B Instruct on T4:

  • Batch size = 1: 3.34 tokens/s => 5.98 tokens/s
  • Batch size = 8: 24.28 tokens/s => 44.15 tokens/s
NF4/FP4
  • Turing/Ampere/Ada: With batch size of 1, per-token throughput is improved by 10-25% and per-token latency is decreased by 10-20%.
  • H100: Across all batch sizes, per-token throughput is improved by up to 28% and per-token latency is decreased by up to 22%.

Example throughput improvement for Qwen 2.5 14B Instruct on RTX 4090:

  • Batch size = 1: 31.46 tokens/s => 39.03 tokens/s
  • Batch size = 8: 110.70 tokens/s => 111.29 tokens/s

Example throughput improvement for Qwen 2.5 3B Instruct on T4:

  • Batch size = 1: 11.05 tokens/s => 13.58 tokens/s
  • Batch size = 8: 69.8 tokens/s => 76.80 tokens/s
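
For context on the NF4/FP4 path covered by these numbers, a 4-bit linear layer is typically constructed as sketched below; the sizes and compute dtype are illustrative assumptions, not values taken from the benchmarks.

import torch
import bitsandbytes as bnb

# Illustrative sizes; quant_type selects "nf4" or "fp4", compute_dtype is the matmul dtype.
fp16_linear = torch.nn.Linear(4096, 4096, bias=False).half()

nf4_linear = bnb.nn.Linear4bit(
    4096, 4096, bias=False,
    quant_type="nf4",
    compute_dtype=torch.bfloat16,
)
nf4_linear.load_state_dict(fp16_linear.state_dict())
nf4_linear = nf4_linear.to("cuda")  # 4-bit quantization happens on the device move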

Changes

Packaging Changes

The size of our wheel has been reduced by ~43.5% from 122.4 MB to 69.1 MB! This results in an on-disk size decrease from ~396MB to ~224MB.

CUDA Toolkit Versions
  • Binaries built with CUDA Toolkit 12.6.2 are now included in the PyPI distribution.
  • The CUDA 12.5.0 build has been updated to CUDA Toolkit 12.5.1.
Breaking

🤗PEFT users wishing to merge adapters with 8-bit weights will need to upgrade to peft>=0.14.0.
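
A hedged sketch of the affected workflow, merging a LoRA adapter into a base model loaded in 8-bit; the model id and adapter path are placeholders, not references to anything in this PR.

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

# Placeholder identifiers; substitute your own base model and adapter.
base = AutoModelForCausalLM.from_pretrained(
    "base-model-id",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "path/to/lora-adapter")

# With bitsandbytes 0.45.0, merging into 8-bit weights requires peft>=0.14.0.
merged = model.merge_and_unload()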

New

Deprecations

A number of public API functions have been marked for deprecation and will emit FutureWarning when used. These functions will become unavailable in future releases. This should have minimal impact on most end-users.

k-bit quantization

The k-bit quantization features are deprecated in favor of blockwise quantization. For all optimizers, using block_wise=False is not recommended and support will be removed in a future release.
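
In practice this simply means leaving the optimizers at their default blockwise behavior; a minimal sketch (the module and learning rate are arbitrary placeholders):

import torch.nn as nn
import bitsandbytes as bnb

model = nn.Linear(64, 64)  # any torch module

# Default construction uses blockwise quantization of the optimizer state.
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-3)

# Passing block_wise=False would select the deprecated non-blockwise path; avoid it.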

LLM.int8() deprecations:

As part of the refactoring process, we've implemented many new 8bit operations. These operations no longer use specialized data layouts.

The following relevant functions from bitsandbytes.functional are now deprecated (a hedged migration sketch follows the list):

  • dequant_min_max
  • dequantize_no_absmax
  • extract_outliers
  • get_special_format_str
  • get_transform_buffer
  • get_transform_func
  • mm_dequant (replacement: int8_mm_dequant)
  • igemmlt (replacement: int8_linear_matmul)
  • nvidia_transform
  • transform
  • quantize_no_absmax
  • vectorwise_dequant
  • vectorwise_quant (~replacement: int8_vectorwise_quant)
  • vectorwise_mm_dequant (~replacement: int8_mm_dequant)
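
For orientation, the sketch below strings the replacement functions together for a plain int8 matmul followed by dequantization. The argument order and return values shown are assumptions inferred from the replacement names above, not verified signatures; check bitsandbytes.functional for the authoritative API.

import torch
import bitsandbytes.functional as F

x = torch.randn(8, 4096, dtype=torch.float16, device="cuda")     # activations
w = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")  # weight

# Row-wise int8 quantization of activations and weight (assumed return order).
x_int8, x_stats, _ = F.int8_vectorwise_quant(x)
w_int8, w_stats, _ = F.int8_vectorwise_quant(w)

# int8 matmul with an int32 accumulator, then dequantization back to fp16.
acc_int32 = F.int8_linear_matmul(x_int8, w_int8)
out_fp16 = F.int8_mm_dequant(acc_int32, x_stats, w_stats)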
General Deprecations

Additionally the following functions from bitsandbytes.functional are deprecated:

  • _mul
  • arange
  • post_call
  • pre_call

What's Changed

New Contributors

Full Changelog: bitsandbytes-foundation/bitsandbytes@0.44.1...0.45.0

v0.44.1

Compare Source

What's Changed

Full Changelog: bitsandbytes-foundation/bitsandbytes@0.44.0...0.44.1

v0.44.0: New AdEMAMix optimizer, Embeddings quantization, and more!

Compare Source

New optimizer: AdEMAMix

The AdEMAMix optimizer is a modification to AdamW which proposes tracking two EMAs to better leverage past gradients. This allows for faster convergence with less training data and improved resistance to forgetting.

We've implemented 8bit and paged variations: AdEMAMix, AdEMAMix8bit, PagedAdEMAMix, and PagedAdEMAMix8bit. These can be used with a similar API to existing optimizers.

import bitsandbytes as bnb

optimizer = bnb.optim.PagedAdEMAMix8bit(
    model.parameters(),
    lr=1e-4,
    betas=(0.9, 0.999, 0.9999),  # the third beta controls the slow EMA
    alpha=5.0,                   # weight of the slow EMA in the update
    eps=1e-8,
    weight_decay=1e-2,
)

8-bit Optimizers Update

The block size for all 8-bit optimizers has been reduced from 2048 to 256 in this release. This is a change from the original implementation proposed in the paper, and the smaller block size improves accuracy.

CUDA Graphs support

A fix to enable CUDA Graphs capture of kernel functions was made in #​1330. This allows for performance improvements with inference frameworks like vLLM. Thanks @​jeejeelee!

Quantization for Embeddings

The trend of LLMs to use larger vocabularies continues. The embeddings can take up a significant portion of a quantized model's footprint. We now have an implementation of Embedding4bit and Embedding8bit thanks to @​galqiwi!

Example usage:

import torch.nn as nn

from bitsandbytes.nn import Embedding4bit

fp16_module = nn.Embedding(128, 64)
quantized_module = Embedding4bit(128, 64)

# Copy the fp16 weights into the 4-bit module ...
quantized_module.load_state_dict(fp16_module.state_dict())

# ... quantization happens when the module is moved to the GPU.
quantized_module = quantized_module.to(0)
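
Per the notes above, Embedding8bit follows the same pattern; assuming its constructor mirrors Embedding4bit:

from bitsandbytes.nn import Embedding8bit

quantized_module = Embedding8bit(128, 64)  # then load_state_dict(...) and .to(device) as above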

Continuous Builds

We are now building binary wheels for each change on main. These builds can be used to preview upcoming changes.

🚤 Continuous Build

What's Changed

New Contributors

Full Changelog: bitsandbytes-foundation/bitsandbytes@0.43.3...v0.44.0

v0.43.3

Compare Source

Improvements:
  • FSDP: Enable loading prequantized weights with bf16/fp16/fp32 quant_storage
    • Background: This update, linked to Transformers PR #32276, allows loading prequantized weights with alternative storage formats. Metadata is tracked similarly to Params4bit.__new__ post PR #970. It supports models exported with non-default quant_storage, such as this NF4 model with BF16 storage (see the loading sketch after this list).
    • Special thanks to @​winglian and @​matthewdouglas for enabling FSDP+QLoRA finetuning of Llama 3.1 405B on a single 8xH100 or 8xA100 node with as little as 256GB system RAM.
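
A hedged sketch of loading such a model through the 🤗Transformers integration; the model id is a placeholder, and bnb_4bit_quant_storage is the setting that selects the storage dtype of the packed 4-bit weights (which FSDP needs to match its flat-parameter dtype):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_storage=torch.bfloat16,  # bf16 storage, as in the linked example model
)
model = AutoModelForCausalLM.from_pretrained("model-id", quantization_config=bnb_config)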

v0.43.2

Compare Source

This release is quite significant, as the QLoRA bug fix has big implications for higher sequence lengths and batch sizes.

For each sequence (i.e. batch size increase of one) we expect memory savings of:

  • 405B: 39GB for seqlen=1024, and 4888GB for seqlen=128,000
  • 70B: 10.1GB for seqlen=1024 and 1258GB for seqlen=128,000

This is because activations are unnecessary for frozen parameters, yet memory for them was still erroneously allocated due to the now-fixed bug.

Improvements:
Bug Fixes

v0.43.1

Compare Source

Improvements:
  • Improved the serialization format for 8-bit weights; this change is fully backwards compatible. (#​1164, thanks to @​younesbelkada for the contributions and @​akx for the review).
  • Added CUDA 12.4 support to the Linux x86-64 build workflow, expanding the library's compatibility with the latest CUDA versions. (#​1171, kudos to @​matthewdouglas for this addition).
  • Docs enhancement: Improved the instructions for installing the library from source. (#​1149, special thanks to @​stevhliu for the enhancements).
Bug Fixes
  • Fix 4bit quantization with blocksize = 4096, where an illegal memory access was encountered. (#​1160, thanks @​matthewdouglas for fixing and @​YLGH for reporting)
Internal Improvements:

v0.43.0

Compare Source

Improvements and New Features:
Bug Fixes:
  • Addressed a race condition in kEstimateQuantiles, enhancing the reliability of quantile estimation in concurrent environments (@​pnunna93, #​1061).
  • Fixed various minor issues, including typos in code comments and documentation, to improve code clarity and prevent potential confusion (@​Brian Vaughan, #​1063).
Backwards Compatibility
  • After upgrading from v0.42 to v0.43, when using 4bit quantization, models may generate slightly different outputs (approximately up to the 2nd decimal place) due to a fix in the code. For anyone interested in the details, see this comment.
Internal and Build System Enhancements:
  • Implemented several enhancements to the internal and build systems, including adjustments to the CI workflows, portability improvements, and build artifact management. These changes contribute to a more robust and flexible development process, ensuring the library's ongoing quality and maintainability (@​rickardp, @​akx, @​wkpark, @​matthewdouglas; #​949, #​1053, #​1045, #​1037).
Contributors:

This release is made possible thanks to the many active contributors that submitted PRs and many others who contributed to discussions, reviews, and testing. Your efforts greatly enhance the library's quality and user experience. It's truly inspiring to work with such a dedicated and competent group of volunteers and professionals!

We give a special thanks to @​TimDettmers for managing to find a little bit of time for valuable consultations on critical topics, despite preparing for and touring the states applying for professor positions. We wish him the utmost success!

We also extend our gratitude to the broader community for your continued support, feedback, and engagement, which play a crucial role in driving the library's development forward.


Configuration

📅 Schedule: Branch creation - "after 5am on saturday" (UTC), Automerge - At any time (no schedule defined).

🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.

Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 Ignore: Close this PR and you won't be reminded about this update again.


  • If you want to rebase/retry this PR, check this box

This PR has been generated by Renovate Bot.

Signed-off-by: red-hat-konflux <126015336+red-hat-konflux[bot]@users.noreply.github.com>
@red-hat-konflux red-hat-konflux bot force-pushed the konflux/mintmaker/rhoai-2.16/bitsandbytes-0.x branch from 0b51288 to 3e9df2d on December 7, 2024, 12:56
@red-hat-konflux red-hat-konflux bot changed the title from "Update dependency bitsandbytes to v0.44.1" to "Update dependency bitsandbytes to v0.45.0" on Dec 7, 2024