### System Info

I'm trying to deploy Zephyr 7B on a SageMaker endpoint using TGI. However, I noticed that quantization is not applied, and the logs state "Bitsandbytes doesn't work with cuda graphs, deactivating them". Here are the relevant logs:

```
2024-08-28T16:31:39.278738Z  INFO text_generation_launcher: Args { model_id: "HuggingFaceH4/zephyr-7b-beta", revision: None, validation_workers: 2, sharded: None, num_shard: Some(1), quantize: Some(BitsandbytesNF4), speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: None, max_input_length: Some(2048), max_total_tokens: Some(4096), waiting_served_ratio: 0.3, max_batch_prefill_tokens: None, max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: None, cuda_graphs: None, hostname: "container-0.local", port: 8080, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/tmp"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, otlp_service_name: "text-generation-inference.router", cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: false, max_client_batch_size: 4, lora_adapters: None, disable_usage_stats: false, disable_crash_reports: false }
2024-08-28T16:31:39.278895Z  INFO hf_hub: Token file not found "/root/.cache/huggingface/token"
2024-08-28T16:31:39.466839Z  INFO text_generation_launcher: Default max_batch_prefill_tokens to 2098
2024-08-28T16:31:39.466863Z  INFO text_generation_launcher: Bitsandbytes doesn't work with cuda graphs, deactivating them
2024-08-28T16:31:39.466953Z  INFO download: text_generation_launcher: Starting check and download process for HuggingFaceH4/zephyr-7b-beta
2024-08-28T16:31:43.674203Z  INFO text_generation_launcher: Download file: model-00001-of-00008.safetensors
2024-08-28T16:31:47.383748Z  INFO text_generation_launcher: Downloaded /tmp/models--HuggingFaceH4--zephyr-7b-beta/snapshots/b70e0c9a2d9e14bd1e812d3c398e5f313e93b473/model-00001-of-00008.safetensors in 0:00:03.
2024-08-28T16:31:47.383863Z  INFO text_generation_launcher: Download: [1/8] -- ETA: 0:00:21
2024-08-28T16:31:47.384137Z  INFO text_generation_launcher: Download file: model-00002-of-00008.safetensors
2024-08-28T16:31:49.038213Z  INFO text_generation_launcher: Downloaded /tmp/models--HuggingFaceH4--zephyr-7b-beta/snapshots/b70e0c9a2d9e14bd1e812d3c398e5f313e93b473/model-00002-of-00008.safetensors in 0:00:01.
2024-08-28T16:31:49.038294Z  INFO text_generation_launcher: Download: [2/8] -- ETA: 0:00:15
2024-08-28T16:31:49.038520Z  INFO text_generation_launcher: Download file: model-00003-of-00008.safetensors
2024-08-28T16:31:50.716613Z  INFO text_generation_launcher: Downloaded /tmp/models--HuggingFaceH4--zephyr-7b-beta/snapshots/b70e0c9a2d9e14bd1e812d3c398e5f313e93b473/model-00003-of-00008.safetensors in 0:00:01.
2024-08-28T16:31:50.716708Z  INFO text_generation_launcher: Download: [3/8] -- ETA: 0:00:11.666665
2024-08-28T16:31:50.716933Z  INFO text_generation_launcher: Download file: model-00004-of-00008.safetensors
2024-08-28T16:31:52.462098Z  INFO text_generation_launcher: Downloaded /tmp/models--HuggingFaceH4--zephyr-7b-beta/snapshots/b70e0c9a2d9e14bd1e812d3c398e5f313e93b473/model-00004-of-00008.safetensors in 0:00:01.
2024-08-28T16:31:52.462199Z  INFO text_generation_launcher: Download: [4/8] -- ETA: 0:00:08
2024-08-28T16:31:52.462415Z  INFO text_generation_launcher: Download file: model-00005-of-00008.safetensors
2024-08-28T16:31:54.155127Z  INFO text_generation_launcher: Downloaded /tmp/models--HuggingFaceH4--zephyr-7b-beta/snapshots/b70e0c9a2d9e14bd1e812d3c398e5f313e93b473/model-00005-of-00008.safetensors in 0:00:01.
2024-08-28T16:31:54.155206Z  INFO text_generation_launcher: Download: [5/8] -- ETA: 0:00:06
2024-08-28T16:31:54.155483Z  INFO text_generation_launcher: Download file: model-00006-of-00008.safetensors
2024-08-28T16:31:56.041800Z  INFO text_generation_launcher: Downloaded /tmp/models--HuggingFaceH4--zephyr-7b-beta/snapshots/b70e0c9a2d9e14bd1e812d3c398e5f313e93b473/model-00006-of-00008.safetensors in 0:00:01.
2024-08-28T16:31:56.041883Z  INFO text_generation_launcher: Download: [6/8] -- ETA: 0:00:04
2024-08-28T16:31:56.042111Z  INFO text_generation_launcher: Download file: model-00007-of-00008.safetensors
2024-08-28T16:31:58.236431Z  INFO text_generation_launcher: Downloaded /tmp/models--HuggingFaceH4--zephyr-7b-beta/snapshots/b70e0c9a2d9e14bd1e812d3c398e5f313e93b473/model-00007-of-00008.safetensors in 0:00:02.
2024-08-28T16:31:58.236513Z  INFO text_generation_launcher: Download: [7/8] -- ETA: 0:00:02
2024-08-28T16:31:58.236740Z  INFO text_generation_launcher: Download file: model-00008-of-00008.safetensors
2024-08-28T16:31:59.153703Z  INFO text_generation_launcher: Downloaded /tmp/models--HuggingFaceH4--zephyr-7b-beta/snapshots/b70e0c9a2d9e14bd1e812d3c398e5f313e93b473/model-00008-of-00008.safetensors in 0:00:00.
2024-08-28T16:31:59.153782Z  INFO text_generation_launcher: Download: [8/8] -- ETA: 0
2024-08-28T16:31:59.787365Z  INFO download: text_generation_launcher: Successfully downloaded weights for HuggingFaceH4/zephyr-7b-beta
2024-08-28T16:31:59.787579Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-08-28T16:32:09.797351Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-08-28T16:32:12.038243Z  INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2024-08-28T16:32:12.100255Z  INFO shard-manager: text_generation_launcher: Shard ready in 12.312095644s rank=0
2024-08-28T16:32:12.197827Z  INFO text_generation_launcher: Starting Webserver
2024-08-28T16:32:12.211828Z  INFO text_generation_router: router/src/main.rs:228: Using the Hugging Face API
2024-08-28T16:32:12.211862Z  INFO hf_hub: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/hf-hub-0.3.2/src/lib.rs:55: Token file not found "/root/.cache/huggingface/token"
2024-08-28T16:32:12.568741Z  INFO text_generation_router: router/src/main.rs:577: Serving revision b70e0c9a2d9e14bd1e812d3c398e5f313e93b473 of model HuggingFaceH4/zephyr-7b-beta
2024-08-28T16:32:12.611766Z  INFO text_generation_router: router/src/main.rs:342: Overriding LlamaTokenizer with TemplateProcessing to follow python override defined in https://github.com/huggingface/transformers/blob/4aa17d00690b7f82c95bb2949ea57e22c35b4336/src/transformers/models/llama/tokenization_llama_fast.py#L203-L205
2024-08-28T16:32:12.611794Z  INFO text_generation_router: router/src/main.rs:357: Using config Some(Mistral)
2024-08-28T16:32:12.611799Z  WARN text_generation_router: router/src/main.rs:384: Invalid hostname, defaulting to 0.0.0.0
2024-08-28T16:32:12.680568Z  INFO text_generation_router::server: router/src/server.rs:1572: Warming up model
2024-08-28T16:32:14.286393Z  INFO text_generation_launcher: Cuda Graphs are disabled (CUDA_GRAPHS=None).
2024-08-28T16:32:14.286772Z  INFO text_generation_router::server: router/src/server.rs:1599: Using scheduler V3
2024-08-28T16:32:14.286793Z  INFO text_generation_router::server: router/src/server.rs:1651: Setting max batch total tokens to 122080
2024-08-28T16:32:14.299701Z  INFO text_generation_router::server: router/src/server.rs:1889: Connected
```
This is the code I'm using:

```python
import json

import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()  # SageMaker execution role (or pass a role ARN)

instance_type = "ml.g5.4xlarge"
number_of_gpu = 1
health_check_timeout = 300
endpoint_name = sagemaker.utils.name_from_base("zephyr-tgi-endpoint")

config = {
    'HF_MODEL_ID': "HuggingFaceH4/zephyr-7b-beta",  # or "/opt/ml/model" (path to where SageMaker stores the model)
    'SM_NUM_GPUS': json.dumps(1),  # number of GPUs used per replica
    'MAX_INPUT_LENGTH': json.dumps(2048),
    'MAX_TOTAL_TOKENS': json.dumps(4096),
    'QUANTIZE': "bitsandbytes-nf4",
}

llm_image = "763104351884.dkr.ecr.eu-west-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.3.0-tgi2.2.0-gpu-py310-cu121-ubuntu22.04-v2.0"

llm_model = HuggingFaceModel(
    role=role,
    image_uri=llm_image,
    env=config,
)

llm = llm_model.deploy(
    endpoint_name=endpoint_name,
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=health_check_timeout,  # 5 minutes to be able to load the model
)
```
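Once the endpoint is up, I smoke-test it with a simple request. A minimal sketch, assuming `llm` is the predictor returned by `deploy(...)` above; `build_payload` is a hypothetical helper wrapping TGI's standard `inputs`/`parameters` request format:

```python
# Hypothetical helper: build the JSON body TGI expects for generation.
def build_payload(prompt: str, max_new_tokens: int = 128) -> dict:
    return {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens, "temperature": 0.7},
    }

payload = build_payload("What is NF4 quantization?")
# `llm` is the predictor returned by llm_model.deploy(...) above:
# response = llm.predict(payload)
# print(response[0]["generated_text"])
```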
I'm wondering whether the issue is related to this TGI image or whether bitsandbytes-nf4 quantization is not yet supported. Could you provide guidance on this?
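One way I try to judge whether quantization actually took effect is a back-of-the-envelope memory check (the figures below are my own approximations, not from the logs; the parameter count and the one-fp32-absmax-per-64-weight-block overhead are assumptions about Mistral-7B-class models and NF4 storage):

```python
# Approximate weight-memory footprint of a ~7B model, assuming 7.24e9
# parameters, 2 bytes/weight in fp16, and for NF4: 0.5 bytes/weight plus
# one fp32 absmax (4 bytes) per 64-weight quantization block.
params = 7.24e9
fp16_gb = params * 2 / 1e9               # unquantized fp16 weights
nf4_gb = params * (0.5 + 4 / 64) / 1e9   # 4-bit weights + absmax overhead
print(f"fp16 ≈ {fp16_gb:.1f} GB, NF4 ≈ {nf4_gb:.1f} GB")
```

If GPU memory for weights is around the fp16 figure rather than the NF4 one, the model was presumably loaded unquantized.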
### Information
- [ ] Docker
- [ ] The CLI directly
### Tasks
- [ ] An officially supported command
- [ ] My own modifications
### Reproduction
```python
import json

import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()  # SageMaker execution role (or pass a role ARN)

instance_type = "ml.g5.4xlarge"
number_of_gpu = 1
health_check_timeout = 300
endpoint_name = sagemaker.utils.name_from_base("zephyr-tgi-endpoint")

config = {
    'HF_MODEL_ID': "HuggingFaceH4/zephyr-7b-beta",  # or "/opt/ml/model" (path to where SageMaker stores the model)
    'SM_NUM_GPUS': json.dumps(1),  # number of GPUs used per replica
    'MAX_INPUT_LENGTH': json.dumps(2048),
    'MAX_TOTAL_TOKENS': json.dumps(4096),
    'QUANTIZE': "bitsandbytes-nf4",
}

llm_image = "763104351884.dkr.ecr.eu-west-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.3.0-tgi2.2.0-gpu-py310-cu121-ubuntu22.04-v2.0"

llm_model = HuggingFaceModel(
    role=role,
    image_uri=llm_image,
    env=config,
)

llm = llm_model.deploy(
    endpoint_name=endpoint_name,
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=health_check_timeout,  # 5 minutes to be able to load the model
)
```
### Expected behavior
When deploying the Zephyr 7B model on a SageMaker endpoint with this TGI configuration, I would expect the quantization specified in the deployment settings (bitsandbytes-nf4) to be applied: the model should initialize and run with the reduced memory footprint that quantization offers, without compatibility issues or errors regarding CUDA graphs.