
Quantization Failure with Bitsandbytes on SageMaker TGI Deployment: Compatibility Issue? #2467

Open
imadoualid opened this issue Aug 28, 2024 · 2 comments

@imadoualid

System Info

I'm trying to deploy Zephyr 7B on a SageMaker endpoint using TGI. However, quantization does not seem to be applied, and the logs state: "Bitsandbytes doesn't work with cuda graphs, deactivating them." Here are the relevant logs:
```
2024-08-28T16:31:39.278738Z  INFO text_generation_launcher: Args { model_id: "HuggingFaceH4/zephyr-7b-beta", revision: None, validation_workers: 2, sharded: None, num_shard: Some(1), quantize: Some(BitsandbytesNF4), speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: None, max_input_length: Some(2048), max_total_tokens: Some(4096), waiting_served_ratio: 0.3, max_batch_prefill_tokens: None, max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: None, cuda_graphs: None, hostname: "container-0.local", port: 8080, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/tmp"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, otlp_service_name: "text-generation-inference.router", cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: false, max_client_batch_size: 4, lora_adapters: None, disable_usage_stats: false, disable_crash_reports: false }
2024-08-28T16:31:39.278895Z  INFO hf_hub: Token file not found "/root/.cache/huggingface/token"
2024-08-28T16:31:39.466839Z  INFO text_generation_launcher: Default max_batch_prefill_tokens to 2098
2024-08-28T16:31:39.466863Z  INFO text_generation_launcher: Bitsandbytes doesn't work with cuda graphs, deactivating them
2024-08-28T16:31:39.466953Z  INFO download: text_generation_launcher: Starting check and download process for HuggingFaceH4/zephyr-7b-beta
2024-08-28T16:31:43.674203Z  INFO text_generation_launcher: Download file: model-00001-of-00008.safetensors
2024-08-28T16:31:47.383748Z  INFO text_generation_launcher: Downloaded /tmp/models--HuggingFaceH4--zephyr-7b-beta/snapshots/b70e0c9a2d9e14bd1e812d3c398e5f313e93b473/model-00001-of-00008.safetensors in 0:00:03.
2024-08-28T16:31:47.383863Z  INFO text_generation_launcher: Download: [1/8] -- ETA: 0:00:21
2024-08-28T16:31:47.384137Z  INFO text_generation_launcher: Download file: model-00002-of-00008.safetensors
2024-08-28T16:31:49.038213Z  INFO text_generation_launcher: Downloaded /tmp/models--HuggingFaceH4--zephyr-7b-beta/snapshots/b70e0c9a2d9e14bd1e812d3c398e5f313e93b473/model-00002-of-00008.safetensors in 0:00:01.
2024-08-28T16:31:49.038294Z  INFO text_generation_launcher: Download: [2/8] -- ETA: 0:00:15
2024-08-28T16:31:49.038520Z  INFO text_generation_launcher: Download file: model-00003-of-00008.safetensors
2024-08-28T16:31:50.716613Z  INFO text_generation_launcher: Downloaded /tmp/models--HuggingFaceH4--zephyr-7b-beta/snapshots/b70e0c9a2d9e14bd1e812d3c398e5f313e93b473/model-00003-of-00008.safetensors in 0:00:01.
2024-08-28T16:31:50.716708Z  INFO text_generation_launcher: Download: [3/8] -- ETA: 0:00:11.666665
2024-08-28T16:31:50.716933Z  INFO text_generation_launcher: Download file: model-00004-of-00008.safetensors
2024-08-28T16:31:52.462098Z  INFO text_generation_launcher: Downloaded /tmp/models--HuggingFaceH4--zephyr-7b-beta/snapshots/b70e0c9a2d9e14bd1e812d3c398e5f313e93b473/model-00004-of-00008.safetensors in 0:00:01.
2024-08-28T16:31:52.462199Z  INFO text_generation_launcher: Download: [4/8] -- ETA: 0:00:08
2024-08-28T16:31:52.462415Z  INFO text_generation_launcher: Download file: model-00005-of-00008.safetensors
2024-08-28T16:31:54.155127Z  INFO text_generation_launcher: Downloaded /tmp/models--HuggingFaceH4--zephyr-7b-beta/snapshots/b70e0c9a2d9e14bd1e812d3c398e5f313e93b473/model-00005-of-00008.safetensors in 0:00:01.
2024-08-28T16:31:54.155206Z  INFO text_generation_launcher: Download: [5/8] -- ETA: 0:00:06
2024-08-28T16:31:54.155483Z  INFO text_generation_launcher: Download file: model-00006-of-00008.safetensors
2024-08-28T16:31:56.041800Z  INFO text_generation_launcher: Downloaded /tmp/models--HuggingFaceH4--zephyr-7b-beta/snapshots/b70e0c9a2d9e14bd1e812d3c398e5f313e93b473/model-00006-of-00008.safetensors in 0:00:01.
2024-08-28T16:31:56.041883Z  INFO text_generation_launcher: Download: [6/8] -- ETA: 0:00:04
2024-08-28T16:31:56.042111Z  INFO text_generation_launcher: Download file: model-00007-of-00008.safetensors
2024-08-28T16:31:58.236431Z  INFO text_generation_launcher: Downloaded /tmp/models--HuggingFaceH4--zephyr-7b-beta/snapshots/b70e0c9a2d9e14bd1e812d3c398e5f313e93b473/model-00007-of-00008.safetensors in 0:00:02.
2024-08-28T16:31:58.236513Z  INFO text_generation_launcher: Download: [7/8] -- ETA: 0:00:02
2024-08-28T16:31:58.236740Z  INFO text_generation_launcher: Download file: model-00008-of-00008.safetensors
2024-08-28T16:31:59.153703Z  INFO text_generation_launcher: Downloaded /tmp/models--HuggingFaceH4--zephyr-7b-beta/snapshots/b70e0c9a2d9e14bd1e812d3c398e5f313e93b473/model-00008-of-00008.safetensors in 0:00:00.
2024-08-28T16:31:59.153782Z  INFO text_generation_launcher: Download: [8/8] -- ETA: 0
2024-08-28T16:31:59.787365Z  INFO download: text_generation_launcher: Successfully downloaded weights for HuggingFaceH4/zephyr-7b-beta
2024-08-28T16:31:59.787579Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-08-28T16:32:09.797351Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-08-28T16:32:12.038243Z  INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2024-08-28T16:32:12.100255Z  INFO shard-manager: text_generation_launcher: Shard ready in 12.312095644s rank=0
2024-08-28T16:32:12.197827Z  INFO text_generation_launcher: Starting Webserver
2024-08-28T16:32:12.211828Z  INFO text_generation_router: router/src/main.rs:228: Using the Hugging Face API
2024-08-28T16:32:12.211862Z  INFO hf_hub: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/hf-hub-0.3.2/src/lib.rs:55: Token file not found "/root/.cache/huggingface/token"
2024-08-28T16:32:12.568741Z  INFO text_generation_router: router/src/main.rs:577: Serving revision b70e0c9a2d9e14bd1e812d3c398e5f313e93b473 of model HuggingFaceH4/zephyr-7b-beta
2024-08-28T16:32:12.611766Z  INFO text_generation_router: router/src/main.rs:342: Overriding LlamaTokenizer with TemplateProcessing to follow python override defined in https://github.com/huggingface/transformers/blob/4aa17d00690b7f82c95bb2949ea57e22c35b4336/src/transformers/models/llama/tokenization_llama_fast.py#L203-L205
2024-08-28T16:32:12.611794Z  INFO text_generation_router: router/src/main.rs:357: Using config Some(Mistral)
2024-08-28T16:32:12.611799Z  WARN text_generation_router: router/src/main.rs:384: Invalid hostname, defaulting to 0.0.0.0
2024-08-28T16:32:12.680568Z  INFO text_generation_router::server: router/src/server.rs:1572: Warming up model
2024-08-28T16:32:14.286393Z  INFO text_generation_launcher: Cuda Graphs are disabled (CUDA_GRAPHS=None).
2024-08-28T16:32:14.286772Z  INFO text_generation_router::server: router/src/server.rs:1599: Using scheduler V3
2024-08-28T16:32:14.286793Z  INFO text_generation_router::server: router/src/server.rs:1651: Setting max batch total tokens to 122080
2024-08-28T16:32:14.299701Z  INFO text_generation_router::server: router/src/server.rs:1889: Connected
```

This is the code I'm using:

```python
import json
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()  # SageMaker execution role used by the endpoint

instance_type = "ml.g5.4xlarge"
number_of_gpu = 1
health_check_timeout = 300
endpoint_name = sagemaker.utils.name_from_base("zephyr-tgi-endpoint")

config = {
  'HF_MODEL_ID': "HuggingFaceH4/zephyr-7b-beta",  # or "/opt/ml/model", the path where SageMaker stores the model
  'SM_NUM_GPUS': json.dumps(number_of_gpu),       # number of GPUs used per replica
  'MAX_INPUT_LENGTH': json.dumps(2048),
  'MAX_TOTAL_TOKENS': json.dumps(4096),
  'QUANTIZE': "bitsandbytes-nf4",
}

llm_image = "763104351884.dkr.ecr.eu-west-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.3.0-tgi2.2.0-gpu-py310-cu121-ubuntu22.04-v2.0"

llm_model = HuggingFaceModel(
  role=role,
  image_uri=llm_image,
  env=config,
)

llm = llm_model.deploy(
  endpoint_name=endpoint_name,
  initial_instance_count=1,
  instance_type=instance_type,
  container_startup_health_check_timeout=health_check_timeout,  # 5 minutes to be able to load the model
)
```
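Once the endpoint is up, I sanity-check it roughly like this (a minimal sketch: the prompt and generation parameters are illustrative, and the response format assumes the usual TGI `[{"generated_text": ...}]` payload):

```python
# Smoke test against the deployed endpoint via the predictor returned by deploy().
response = llm.predict({
    "inputs": "<|system|>\nYou are a helpful assistant.</s>\n<|user|>\nSay hello.</s>\n<|assistant|>\n",
    "parameters": {
        "max_new_tokens": 64,
        "temperature": 0.7,
    },
})
print(response[0]["generated_text"])
```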

I'm wondering if the issue is related to the TGI image or if quantization with Bitsandbytes is not yet supported. Could you provide guidance on this?

### Information

- [ ] Docker
- [ ] The CLI directly

### Tasks

- [ ] An officially supported command
- [ ] My own modifications

### Reproduction

```python
import json
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()  # SageMaker execution role used by the endpoint

instance_type = "ml.g5.4xlarge"
number_of_gpu = 1
health_check_timeout = 300
endpoint_name = sagemaker.utils.name_from_base("zephyr-tgi-endpoint")

config = {
  'HF_MODEL_ID': "HuggingFaceH4/zephyr-7b-beta",  # or "/opt/ml/model", the path where SageMaker stores the model
  'SM_NUM_GPUS': json.dumps(number_of_gpu),       # number of GPUs used per replica
  'MAX_INPUT_LENGTH': json.dumps(2048),
  'MAX_TOTAL_TOKENS': json.dumps(4096),
  'QUANTIZE': "bitsandbytes-nf4",
}

llm_image = "763104351884.dkr.ecr.eu-west-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.3.0-tgi2.2.0-gpu-py310-cu121-ubuntu22.04-v2.0"

llm_model = HuggingFaceModel(
  role=role,
  image_uri=llm_image,
  env=config,
)

llm = llm_model.deploy(
  endpoint_name=endpoint_name,
  initial_instance_count=1,
  instance_type=instance_type,
  container_startup_health_check_timeout=health_check_timeout,  # 5 minutes to be able to load the model
)
```

### Expected behavior

When deploying the Zephyr 7B model on a SageMaker endpoint using TGI, I would expect the quantization specified in the deployment settings (bitsandbytes-nf4) to be applied successfully. The model should initialize and run with the reduced memory consumption that quantization typically offers, without compatibility issues or errors regarding CUDA graphs.
@ErikKaum
Member

ErikKaum commented Sep 3, 2024

Hi @imadoualid 👋

Thanks for reporting this. Let me get back to you on the specific question: I'm not 100% sure whether CUDA graphs and BNB have some compatibility issues.

With regards to quantisation, one note is that BNB does come with a known performance hit. We have a small section on quantisation in our docs which might help guide you in the right direction: https://huggingface.co/docs/text-generation-inference/en/conceptual/quantization#quantization-with-bitsandbytes-eetq--fp8

@ErikKaum
Member

ErikKaum commented Sep 3, 2024

Yes, it seems that BNB is not compatible with cuda graphs:

```rust
tracing::warn!("Bitsandbytes doesn't work with cuda graphs, deactivating them");
```

The reason is that for cuda graphs to work, all memory must be allocated beforehand, and there seems to be some runtime memory allocation in BNB.
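To illustrate the constraint (this is not TGI's actual code, just a minimal PyTorch sketch): graph capture records a fixed kernel sequence against pre-allocated buffers, so an op that allocates new memory at replay time breaks the contract.

```python
import torch

# Static input buffer: CUDA graphs require all memory to exist before capture.
static_in = torch.zeros(4, device="cuda")

# Warm-up on a side stream, per the PyTorch CUDA graphs recipe.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    static_out = static_in * 2
torch.cuda.current_stream().wait_stream(s)

# Capture: the kernel sequence is recorded against the static buffers.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_out = static_in * 2

# Replay: only buffer *contents* may change, never the allocations.
static_in.copy_(torch.arange(4.0, device="cuda"))
g.replay()
print(static_out)  # tensor([0., 2., 4., 6.], device='cuda:0')
```

A layer that allocates fresh intermediate tensors inside its forward pass can't be captured and replayed this way, which is why the launcher disables graphs for BNB.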

Would some other quantisation method solve your problem?
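For example, switching to EETQ (or a pre-quantized AWQ/GPTQ checkpoint, as covered in the docs linked above) is just a change to the `QUANTIZE` environment variable. A sketch against your config, with `"eetq"` as one possible value:

```python
# Same deployment config as before, with the quantisation backend swapped.
# "eetq" quantizes on the fly; "awq" or "gptq" would require a
# pre-quantized checkpoint instead of the base zephyr-7b-beta weights.
config = {
  'HF_MODEL_ID': "HuggingFaceH4/zephyr-7b-beta",
  'SM_NUM_GPUS': json.dumps(1),
  'MAX_INPUT_LENGTH': json.dumps(2048),
  'MAX_TOTAL_TOKENS': json.dumps(4096),
  'QUANTIZE': "eetq",  # instead of "bitsandbytes-nf4"
}
```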
