
Quantization Failure with Bitsandbytes on SageMaker TGI Deployment: Compatibility Issue? #2467

Open
imadoualid opened this issue Aug 28, 2024 · 2 comments

@imadoualid

System Info

I'm trying to deploy Zephyr 7B on a SageMaker endpoint using TGI. However, quantization does not seem to be applied, and the logs state: "Bitsandbytes doesn't work with cuda graphs, deactivating them." Here are the relevant logs:
```
2024-08-28T16:31:39.278738Z  INFO text_generation_launcher: Args { model_id: "HuggingFaceH4/zephyr-7b-beta", revision: None, validation_workers: 2, sharded: None, num_shard: Some(1), quantize: Some(BitsandbytesNF4), speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: None, max_input_length: Some(2048), max_total_tokens: Some(4096), waiting_served_ratio: 0.3, max_batch_prefill_tokens: None, max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: None, cuda_graphs: None, hostname: "container-0.local", port: 8080, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/tmp"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, otlp_service_name: "text-generation-inference.router", cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: false, max_client_batch_size: 4, lora_adapters: None, disable_usage_stats: false, disable_crash_reports: false }
2024-08-28T16:31:39.278895Z  INFO hf_hub: Token file not found "/root/.cache/huggingface/token"
2024-08-28T16:31:39.466839Z  INFO text_generation_launcher: Default max_batch_prefill_tokens to 2098
2024-08-28T16:31:39.466863Z  INFO text_generation_launcher: Bitsandbytes doesn't work with cuda graphs, deactivating them
2024-08-28T16:31:39.466953Z  INFO download: text_generation_launcher: Starting check and download process for HuggingFaceH4/zephyr-7b-beta
2024-08-28T16:31:43.674203Z  INFO text_generation_launcher: Download file: model-00001-of-00008.safetensors
2024-08-28T16:31:47.383748Z  INFO text_generation_launcher: Downloaded /tmp/models--HuggingFaceH4--zephyr-7b-beta/snapshots/b70e0c9a2d9e14bd1e812d3c398e5f313e93b473/model-00001-of-00008.safetensors in 0:00:03.
2024-08-28T16:31:47.383863Z  INFO text_generation_launcher: Download: [1/8] -- ETA: 0:00:21
2024-08-28T16:31:47.384137Z  INFO text_generation_launcher: Download file: model-00002-of-00008.safetensors
2024-08-28T16:31:49.038213Z  INFO text_generation_launcher: Downloaded /tmp/models--HuggingFaceH4--zephyr-7b-beta/snapshots/b70e0c9a2d9e14bd1e812d3c398e5f313e93b473/model-00002-of-00008.safetensors in 0:00:01.
2024-08-28T16:31:49.038294Z  INFO text_generation_launcher: Download: [2/8] -- ETA: 0:00:15
2024-08-28T16:31:49.038520Z  INFO text_generation_launcher: Download file: model-00003-of-00008.safetensors
2024-08-28T16:31:50.716613Z  INFO text_generation_launcher: Downloaded /tmp/models--HuggingFaceH4--zephyr-7b-beta/snapshots/b70e0c9a2d9e14bd1e812d3c398e5f313e93b473/model-00003-of-00008.safetensors in 0:00:01.
2024-08-28T16:31:50.716708Z  INFO text_generation_launcher: Download: [3/8] -- ETA: 0:00:11.666665
2024-08-28T16:31:50.716933Z  INFO text_generation_launcher: Download file: model-00004-of-00008.safetensors
2024-08-28T16:31:52.462098Z  INFO text_generation_launcher: Downloaded /tmp/models--HuggingFaceH4--zephyr-7b-beta/snapshots/b70e0c9a2d9e14bd1e812d3c398e5f313e93b473/model-00004-of-00008.safetensors in 0:00:01.
2024-08-28T16:31:52.462199Z  INFO text_generation_launcher: Download: [4/8] -- ETA: 0:00:08
2024-08-28T16:31:52.462415Z  INFO text_generation_launcher: Download file: model-00005-of-00008.safetensors
2024-08-28T16:31:54.155127Z  INFO text_generation_launcher: Downloaded /tmp/models--HuggingFaceH4--zephyr-7b-beta/snapshots/b70e0c9a2d9e14bd1e812d3c398e5f313e93b473/model-00005-of-00008.safetensors in 0:00:01.
2024-08-28T16:31:54.155206Z  INFO text_generation_launcher: Download: [5/8] -- ETA: 0:00:06
2024-08-28T16:31:54.155483Z  INFO text_generation_launcher: Download file: model-00006-of-00008.safetensors
2024-08-28T16:31:56.041800Z  INFO text_generation_launcher: Downloaded /tmp/models--HuggingFaceH4--zephyr-7b-beta/snapshots/b70e0c9a2d9e14bd1e812d3c398e5f313e93b473/model-00006-of-00008.safetensors in 0:00:01.
2024-08-28T16:31:56.041883Z  INFO text_generation_launcher: Download: [6/8] -- ETA: 0:00:04
2024-08-28T16:31:56.042111Z  INFO text_generation_launcher: Download file: model-00007-of-00008.safetensors
2024-08-28T16:31:58.236431Z  INFO text_generation_launcher: Downloaded /tmp/models--HuggingFaceH4--zephyr-7b-beta/snapshots/b70e0c9a2d9e14bd1e812d3c398e5f313e93b473/model-00007-of-00008.safetensors in 0:00:02.
2024-08-28T16:31:58.236513Z  INFO text_generation_launcher: Download: [7/8] -- ETA: 0:00:02
2024-08-28T16:31:58.236740Z  INFO text_generation_launcher: Download file: model-00008-of-00008.safetensors
2024-08-28T16:31:59.153703Z  INFO text_generation_launcher: Downloaded /tmp/models--HuggingFaceH4--zephyr-7b-beta/snapshots/b70e0c9a2d9e14bd1e812d3c398e5f313e93b473/model-00008-of-00008.safetensors in 0:00:00.
2024-08-28T16:31:59.153782Z  INFO text_generation_launcher: Download: [8/8] -- ETA: 0
2024-08-28T16:31:59.787365Z  INFO download: text_generation_launcher: Successfully downloaded weights for HuggingFaceH4/zephyr-7b-beta
2024-08-28T16:31:59.787579Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-08-28T16:32:09.797351Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-08-28T16:32:12.038243Z  INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2024-08-28T16:32:12.100255Z  INFO shard-manager: text_generation_launcher: Shard ready in 12.312095644s rank=0
2024-08-28T16:32:12.197827Z  INFO text_generation_launcher: Starting Webserver
2024-08-28T16:32:12.211828Z  INFO text_generation_router: router/src/main.rs:228: Using the Hugging Face API
2024-08-28T16:32:12.211862Z  INFO hf_hub: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/hf-hub-0.3.2/src/lib.rs:55: Token file not found "/root/.cache/huggingface/token"
2024-08-28T16:32:12.568741Z  INFO text_generation_router: router/src/main.rs:577: Serving revision b70e0c9a2d9e14bd1e812d3c398e5f313e93b473 of model HuggingFaceH4/zephyr-7b-beta
2024-08-28T16:32:12.611766Z  INFO text_generation_router: router/src/main.rs:342: Overriding LlamaTokenizer with TemplateProcessing to follow python override defined in https://github.com/huggingface/transformers/blob/4aa17d00690b7f82c95bb2949ea57e22c35b4336/src/transformers/models/llama/tokenization_llama_fast.py#L203-L205
2024-08-28T16:32:12.611794Z  INFO text_generation_router: router/src/main.rs:357: Using config Some(Mistral)
2024-08-28T16:32:12.611799Z  WARN text_generation_router: router/src/main.rs:384: Invalid hostname, defaulting to 0.0.0.0
2024-08-28T16:32:12.680568Z  INFO text_generation_router::server: router/src/server.rs:1572: Warming up model
2024-08-28T16:32:14.286393Z  INFO text_generation_launcher: Cuda Graphs are disabled (CUDA_GRAPHS=None).
2024-08-28T16:32:14.286772Z  INFO text_generation_router::server: router/src/server.rs:1599: Using scheduler V3
2024-08-28T16:32:14.286793Z  INFO text_generation_router::server: router/src/server.rs:1651: Setting max batch total tokens to 122080
2024-08-28T16:32:14.299701Z  INFO text_generation_router::server: router/src/server.rs:1889: Connected
```

This is the code I'm using:

```python
import json
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()  # SageMaker execution role used by the endpoint

instance_type = "ml.g5.4xlarge"
number_of_gpu = 1
health_check_timeout = 300
endpoint_name = sagemaker.utils.name_from_base("zephyr-tgi-endpoint")

config = {
  'HF_MODEL_ID': "HuggingFaceH4/zephyr-7b-beta",  # or "/opt/ml/model", the path where SageMaker stores the model
  'SM_NUM_GPUS': json.dumps(number_of_gpu),       # number of GPUs used per replica
  'MAX_INPUT_LENGTH': json.dumps(2048),
  'MAX_TOTAL_TOKENS': json.dumps(4096),
  'QUANTIZE': "bitsandbytes-nf4",
}

llm_image = "763104351884.dkr.ecr.eu-west-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.3.0-tgi2.2.0-gpu-py310-cu121-ubuntu22.04-v2.0"

llm_model = HuggingFaceModel(
  role=role,
  image_uri=llm_image,
  env=config,
)

llm = llm_model.deploy(
  endpoint_name=endpoint_name,
  initial_instance_count=1,
  instance_type=instance_type,
  container_startup_health_check_timeout=health_check_timeout,  # 5 minutes to be able to load the model
)
```
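Once the endpoint is up, I sanity-check it roughly like this (a minimal sketch: the prompt and generation parameters are illustrative, and the response format assumes the usual TGI `[{"generated_text": ...}]` payload):

```python
# Smoke test against the deployed endpoint via the predictor returned by deploy().
response = llm.predict({
    "inputs": "<|system|>\nYou are a helpful assistant.</s>\n<|user|>\nSay hello.</s>\n<|assistant|>\n",
    "parameters": {
        "max_new_tokens": 64,
        "temperature": 0.7,
    },
})
print(response[0]["generated_text"])
```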

I'm wondering if the issue is related to the TGI image or if quantization with Bitsandbytes is not yet supported. Could you provide guidance on this?

### Information

- [ ] Docker
- [ ] The CLI directly

### Tasks

- [ ] An officially supported command
- [ ] My own modifications

### Reproduction

```python
import json
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()  # SageMaker execution role used by the endpoint

instance_type = "ml.g5.4xlarge"
number_of_gpu = 1
health_check_timeout = 300
endpoint_name = sagemaker.utils.name_from_base("zephyr-tgi-endpoint")

config = {
  'HF_MODEL_ID': "HuggingFaceH4/zephyr-7b-beta",  # or "/opt/ml/model", the path where SageMaker stores the model
  'SM_NUM_GPUS': json.dumps(number_of_gpu),       # number of GPUs used per replica
  'MAX_INPUT_LENGTH': json.dumps(2048),
  'MAX_TOTAL_TOKENS': json.dumps(4096),
  'QUANTIZE': "bitsandbytes-nf4",
}

llm_image = "763104351884.dkr.ecr.eu-west-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.3.0-tgi2.2.0-gpu-py310-cu121-ubuntu22.04-v2.0"

llm_model = HuggingFaceModel(
  role=role,
  image_uri=llm_image,
  env=config,
)

llm = llm_model.deploy(
  endpoint_name=endpoint_name,
  initial_instance_count=1,
  instance_type=instance_type,
  container_startup_health_check_timeout=health_check_timeout,  # 5 minutes to be able to load the model
)
```

### Expected behavior

When deploying the Zephyr 7B model on a SageMaker endpoint using TGI, I would expect the quantization specified in the deployment settings (bitsandbytes-nf4) to be applied successfully. The model should initialize and run with the reduced memory consumption that quantization typically offers, without compatibility issues or errors regarding CUDA graphs.
@ErikKaum
Member

ErikKaum commented Sep 3, 2024

Hi @imadoualid 👋

Thanks for reporting this. Let me get back to you on the specific question: I'm not 100% sure whether CUDA graphs and BNB have some compatibility issues.

With regards to quantisation, one note is that BNB does come with a known performance hit. We have a small section on quantisation in our docs which might help guide you in the right direction: https://huggingface.co/docs/text-generation-inference/en/conceptual/quantization#quantization-with-bitsandbytes-eetq--fp8

@ErikKaum
Member

ErikKaum commented Sep 3, 2024

Yes, it seems that BNB is not compatible with cuda graphs:

```rust
tracing::warn!("Bitsandbytes doesn't work with cuda graphs, deactivating them");
```

The reason is that for cuda graphs to work, all memory must be allocated beforehand, and there seems to be some runtime memory allocation in BNB.
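To illustrate the constraint (this is not TGI's actual code, just a minimal PyTorch sketch): graph capture records a fixed kernel sequence against pre-allocated buffers, so an op that allocates new memory at replay time breaks the contract.

```python
import torch

# Static input buffer: CUDA graphs require all memory to exist before capture.
static_in = torch.zeros(4, device="cuda")

# Warm-up on a side stream, per the PyTorch CUDA graphs recipe.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    static_out = static_in * 2
torch.cuda.current_stream().wait_stream(s)

# Capture: the kernel sequence is recorded against the static buffers.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_out = static_in * 2

# Replay: only buffer *contents* may change, never the allocations.
static_in.copy_(torch.arange(4.0, device="cuda"))
g.replay()
print(static_out)  # tensor([0., 2., 4., 6.], device='cuda:0')
```

A layer that allocates fresh intermediate tensors inside its forward pass can't be captured and replayed this way, which is why the launcher disables graphs for BNB.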

Would some other quantisation method solve your problem?
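For example, switching to EETQ (or a pre-quantized AWQ/GPTQ checkpoint, as covered in the docs linked above) is just a change to the `QUANTIZE` environment variable. A sketch against your config, with `"eetq"` as one possible value:

```python
# Same deployment config as before, with the quantisation backend swapped.
# "eetq" quantizes on the fly; "awq" or "gptq" would require a
# pre-quantized checkpoint instead of the base zephyr-7b-beta weights.
config = {
  'HF_MODEL_ID': "HuggingFaceH4/zephyr-7b-beta",
  'SM_NUM_GPUS': json.dumps(1),
  'MAX_INPUT_LENGTH': json.dumps(2048),
  'MAX_TOTAL_TOKENS': json.dumps(4096),
  'QUANTIZE': "eetq",  # instead of "bitsandbytes-nf4"
}
```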
