Reproduction
docker run -p 18080:80 --runtime=habana \
  -v /data/huggingface/hub:/data \
  -e HABANA_VISIBLE_DEVICES=all \
  -e HUGGING_FACE_HUB_TOKEN=<your_hf_token> \
  -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
  -e TEXT_GENERATION_SERVER_IGNORE_EOS_TOKEN=true \
  -e PREFILL_BATCH_BUCKET_SIZE=2 \
  -e BATCH_BUCKET_SIZE=32 \
  -e PAD_SEQUENCE_TO_MULTIPLE_OF=256 \
  -e ENABLE_HPU_GRAPH=true \
  -e LIMIT_HPU_GRAPH=true \
  -e USE_FLASH_ATTENTION=true \
  -e FLASH_ATTENTION_RECOMPUTE=true \
  --cap-add=sys_nice --ipc=host \
  ghcr.io/huggingface/tgi-gaudi:2.0.4 \
  --model-id meta-llama/Llama-2-7b-chat-hf \
  --max-input-length 2048 --max-total-tokens 4096 \
  --max-batch-prefill-tokens 2048 --max-batch-total-tokens 65536 \
  --max-waiting-tokens 7 --waiting-served-ratio 1.2 \
  --max-concurrent-requests 64
Error log
2024-08-30T02:09:44.146922Z INFO generate_stream{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(500), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None } total_time="52.558739096s" validation_time="2.976352ms" queue_time="23.703336184s" inference_time="28.852426791s" time_per_token="57.704853ms" seed="None"}: text_generation_router::server: router/src/server.rs:513: Success 2024-08-30T02:09:44.877111Z INFO generate_stream{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(500), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None } total_time="52.558665697s" validation_time="1.514834ms" queue_time="23.709660453s" inference_time="28.847490833s" time_per_token="57.694981ms" seed="None"}: text_generation_router::server: router/src/server.rs:513: Success 2024-08-30T02:09:45.863818Z ERROR text_generation_launcher: Method Decode encountered an error. 
Traceback (most recent call last): File "/usr/local/bin/text-generation-server", line 8, in <module> sys.exit(app()) File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 311, in __call__ return get_command(self)(*args, **kwargs) File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__ return self.main(*args, **kwargs) File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 778, in main return _main( File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 216, in _main rv = self.invoke(ctx) File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1688, in invoke return _process_result(sub_ctx.command.invoke(sub_ctx)) File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke return ctx.invoke(self.callback, **ctx.params) File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke return __callback(*args, **kwargs) File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 683, in wrapper return callback(**use_params) # type: ignore File "/usr/local/lib/python3.10/dist-packages/text_generation_server/cli.py", line 137, in serve server.serve( File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 256, in serve asyncio.run( File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run return loop.run_until_complete(main) File "/usr/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete self.run_forever() File "/usr/lib/python3.10/asyncio/base_events.py", line 603, in run_forever self._run_once() File "/usr/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once handle._run() File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run self._context.run(self._callback, *self._args) File "/usr/local/lib/python3.10/dist-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method return await self.intercept( > File 
"/usr/local/lib/python3.10/dist-packages/text_generation_server/interceptor.py", line 25, in intercept return await response File "/usr/local/lib/python3.10/dist-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor raise error File "/usr/local/lib/python3.10/dist-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor return await behavior(request_or_iterator, context) File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 154, in Decode generations, next_batch, timings = self.model.generate_token(batches) File "/usr/lib/python3.10/contextlib.py", line 79, in inner return func(*args, **kwds) File "/usr/local/lib/python3.10/dist-packages/text_generation_server/models/causal_lm.py", line 997, in generate_token batch.logits = self.forward( File "/usr/local/lib/python3.10/dist-packages/text_generation_server/models/causal_lm.py", line 870, in forward return self.model.forward(**kwargs) File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 724, in forward return wrapped_hpugraph_forward( File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 643, in wrapped_hpugraph_forward cached.graph.replayV3(input_tensor_list, cached.asynchronous) File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 76, in replayV3 _hpu_C.replayV3(self.hpu_graph, tlistI, asynchronous) RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_BRIDGE Exception in Launch thread... 
Check $HABANA_LOGS/ for details[Rank:0] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::1073741824 (1024)MB [Rank:0] Habana exception raised from get_pointer at device_memory.cpp:1078 2024-08-30T02:09:46.039747Z ERROR batch{batch_size=16}:decode:decode{size=16}:decode{size=16}: text_generation_client: router/client/src/lib.rs:33: Server error: CANCELLED 2024-08-30T02:09:47.968375Z ERROR batch{batch_size=16}:decode:clear_cache{batch_id=Some(0)}:clear_cache{batch_id=Some(0)}: text_generation_client: router/client/src/lib.rs:33: Server error: transport error 2024-08-30T02:09:47.968553Z ERROR batch{batch_size=16}:decode:clear_cache{batch_id=Some(72)}:clear_cache{batch_id=Some(72)}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111) 2024-08-30T02:09:47.968584Z ERROR generate_stream{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(500), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None }}:async_stream:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:876: Request failed during generation: Server error: CANCELLED 2024-08-30T02:09:47.968613Z ERROR generate_stream{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(500), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None }}:async_stream:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:876: Request failed during generation: Server error: CANCELLED 
2024-08-30T02:09:47.968632Z ERROR generate_stream{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(500), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None }}:async_stream:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:876: Request failed during generation: Server error: CANCELLED 2024-08-30T02:09:47.968649Z ERROR generate_stream{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(500), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None }}:async_stream:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:876: Request failed during generation: Server error: CANCELLED batch{batch_size=1}:prefill:clear_cache{batch_id=Some(74)}:clear_cache{batch_id=Some(74)}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111) 2024-08-30T02:09:47.969441Z ERROR generate_stream{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(500), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None }}:async_stream:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:876: Request failed during generation: Server error: error trying to connect: Connection refused (os error 
111) 2024-08-30T02:09:47.969517Z ERROR batch{batch_size=1}:prefill:prefill{id=75 size=1}:prefill{id=75 size=1}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111) 2024-08-30T02:09:47.969560Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(75)}:clear_cache{batch_id=Some(75)}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111) 2024-08-30T02:09:47.969575Z ERROR generate_stream{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(500), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None }}:async_stream:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:876: Request failed during generation: Server error: error trying to connect: Connection refused (os error 111) 2024-08-30T02:09:47.969645Z ERROR batch{batch_size=1}:prefill:prefill{id=76 size=1}:prefill{id=76 size=1}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111) 2024-08-30T02:09:47.969705Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(76)}:clear_cache{batch_id=Some(76)}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111) 2024-08-30T02:09:47.969720Z ERROR generate_stream{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(500), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: 
None, grammar: None }}:async_stream:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:876: Request failed during generation: Server error: error trying to connect: Connection refused (os error 111) 2024-08-30T02:09:47.969791Z ERROR batch{batch_size=1}:prefill:prefill{id=77 size=1}:prefill{id=77 size=1}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111) 2024-08-30T02:09:47.969834Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(77)}:clear_cache{batch_id=Some(77)}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111) 2024-08-30T02:09:47.969849Z ERROR generate_stream{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(500), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None }}:async_stream:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:876: Request failed during generation: Server error: error trying to connect: Connection refused (os error 111) 2024-08-30T02:09:47.969917Z ERROR batch{batch_size=1}:prefill:prefill{id=78 size=1}:prefill{id=78 size=1}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111) 2024-08-30T02:09:47.969955Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(78)}:clear_cache{batch_id=Some(78)}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111) 2024-08-30T02:09:47.969970Z ERROR generate_stream{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, 
top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(500), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None }}:async_stream:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:876: Request failed during generation: Server error: error trying to connect: Connection refused (os error 111) 2024-08-30T02:09:47.970036Z ERROR batch{batch_size=1}:prefill:prefill{id=79 size=1}:prefill{id=79 size=1}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111) 2024-08-30T02:09:47.970078Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(79)}:clear_cache{batch_id=Some(79)}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111) 2024-08-30T02:09:47.970094Z ERROR generate_stream{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(500), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None }}:async_stream:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:876: Request failed during generation: Server error: error trying to connect: Connection refused (os error 111) 2024-08-30T02:09:47.970160Z ERROR batch{batch_size=1}:prefill:prefill{id=80 size=1}:prefill{id=80 size=1}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111) 2024-08-30T02:09:47.970198Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(80)}:clear_cache{batch_id=Some(80)}: text_generation_client: router/client/src/lib.rs:33: Server error: 
error trying to connect: Connection refused (os error 111) 2024-08-30T02:09:47.970213Z ERROR generate_stream{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(500), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None }}:async_stream:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:876: Request failed during generation: Server error: error trying to connect: Connection refused (os error 111) 2024-08-30T02:09:47.970278Z ERROR batch{batch_size=1}:prefill:prefill{id=81 size=1}:prefill{id=81 size=1}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111) 2024-08-30T02:09:47.970318Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(81)}:clear_cache{batch_id=Some(81)}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111) 2024-08-30T02:09:47.970334Z ERROR generate_stream{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(500), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None }}:async_stream:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:876: Request failed during generation: Server error: error trying to connect: Connection refused (os error 111) 2024-08-30T02:09:48.000537Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output: /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:366: 
UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead warnings.warn( /usr/local/lib/python3.10/dist-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations warnings.warn( ============================= HABANA PT BRIDGE CONFIGURATION =========================== PT_HPU_LAZY_MODE = 1 PT_RECIPE_CACHE_PATH = PT_CACHE_FOLDER_DELETE = 0 PT_HPU_RECIPE_CACHE_CONFIG = PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807 PT_HPU_LAZY_ACC_PAR_MODE = 1 PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0 ---------------------------: System Configuration :--------------------------- Num CPU Cores : 192 CPU RAM : 2113389016 KB ------------------------------------------------------------------------------ Exception ignored in: <function Server.__del__ at 0x7f611e95c790> Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/grpc/aio/_server.py", line 194, in __del__ cygrpc.schedule_coro_threadsafe( File "src/python/grpcio/grpc/_cython/_cygrpc/aio/common.pyx.pxi", line 120, in grpc._cython.cygrpc.schedule_coro_threadsafe File "src/python/grpcio/grpc/_cython/_cygrpc/aio/common.pyx.pxi", line 112, in grpc._cython.cygrpc.schedule_coro_threadsafe File "/usr/lib/python3.10/asyncio/base_events.py", line 436, in create_task self._check_closed() File "/usr/lib/python3.10/asyncio/base_events.py", line 515, in _check_closed raise RuntimeError('Event loop is closed') RuntimeError: Event loop is closed sys:1: RuntimeWarning: coroutine 'AioServer.shutdown' was never awaited Task exception was never retrieved future: <Task finished name='HandleExceptions[/generate.v2.TextGenerationService/Decode]' coro=<<coroutine without __name__>()> exception=SystemExit(1)> Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/text_generation_server/interceptor.py", line 25, 
in intercept return await response File "/usr/local/lib/python3.10/dist-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor raise error File "/usr/local/lib/python3.10/dist-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor return await behavior(request_or_iterator, context) File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 154, in Decode generations, next_batch, timings = self.model.generate_token(batches) File "/usr/lib/python3.10/contextlib.py", line 79, in inner return func(*args, **kwds) File "/usr/local/lib/python3.10/dist-packages/text_generation_server/models/causal_lm.py", line 997, in generate_token batch.logits = self.forward( File "/usr/local/lib/python3.10/dist-packages/text_generation_server/models/causal_lm.py", line 870, in forward return self.model.forward(**kwargs) File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 724, in forward return wrapped_hpugraph_forward( File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 643, in wrapped_hpugraph_forward cached.graph.replayV3(input_tensor_list, cached.asynchronous) File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 76, in replayV3 _hpu_C.replayV3(self.hpu_graph, tlistI, asynchronous) RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_BRIDGE Exception in Launch thread... 
Check $HABANA_LOGS/ for details[Rank:0] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::1073741824 (1024)MB [Rank:0] Habana exception raised from get_pointer at device_memory.cpp:1078 During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 311, in __call__ return get_command(self)(*args, **kwargs) File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__ return self.main(*args, **kwargs) File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 778, in main return _main( File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 216, in _main rv = self.invoke(ctx) File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1688, in invoke return _process_result(sub_ctx.command.invoke(sub_ctx)) File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke return ctx.invoke(self.callback, **ctx.params) File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke return __callback(*args, **kwargs) File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 683, in wrapper return callback(**use_params) # type: ignore File "/usr/local/lib/python3.10/dist-packages/text_generation_server/cli.py", line 137, in serve server.serve( File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 256, in serve asyncio.run( File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run return loop.run_until_complete(main) File "/usr/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete self.run_forever() File "/usr/lib/python3.10/asyncio/base_events.py", line 603, in run_forever self._run_once() File "/usr/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once handle._run() File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run self._context.run(self._callback, *self._args) File 
"src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 702, in _handle_exceptions File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 689, in grpc._cython.cygrpc._handle_exceptions File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 831, in _handle_rpc File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 554, in _handle_unary_unary_rpc File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 408, in _finish_handler_with_unary_response File "/usr/local/lib/python3.10/dist-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method return await self.intercept( File "/usr/local/lib/python3.10/dist-packages/text_generation_server/interceptor.py", line 33, in intercept exit(1) File "/usr/lib/python3.10/_sitebuiltins.py", line 26, in __call__ raise SystemExit(code) SystemExit: 1 rank=0 2024-08-30T02:09:48.006377Z ERROR batch{batch_size=1}:prefill:prefill{id=82 size=1}:prefill{id=82 size=1}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111) 2024-08-30T02:09:48.006461Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(82)}:clear_cache{batch_id=Some(82)}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111) 2024-08-30T02:09:48.006484Z ERROR generate_stream{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(500), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None }}:async_stream:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:876: Request failed during generation: Server error: error trying to connect: Connection refused (os error 111) 
2024-08-30T02:09:48.062231Z ERROR batch{batch_size=1}:prefill:prefill{id=118 size=1}:prefill{id=118 size=1}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111) 2024-08-30T02:09:48.062267Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(118)}:clear_cache{batch_id=Some(118)}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111) 2024-08-30T02:09:48.062276Z ERROR generate_stream{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(500), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None }}:async_stream:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:876: Request failed during generation: Server error: error trying to connect: Connection refused (os error 111) 2024-08-30T02:09:48.062891Z ERROR batch{batch_size=1}:prefill:prefill{id=119 size=1}:prefill{id=119 size=1}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111) 2024-08-30T02:09:48.062914Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(119)}:clear_cache{batch_id=Some(119)}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111) 2024-08-30T02:09:48.062921Z ERROR generate_stream{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(500), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: 
None, grammar: None }}:async_stream:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:876: Request failed during generation: Server error: error trying to connect: Connection refused (os error 111) 2024-08-30T02:09:48.063923Z ERROR batch{batch_size=1}:prefill:prefill{id=120 size=1}:prefill{id=120 size=1}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111) 2024-08-30T02:09:48.063944Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(120)}:clear_cache{batch_id=Some(120)}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111) 2024-08-30T02:09:48.063951Z ERROR generate_stream{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(500), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None }}:async_stream:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:876: Request failed during generation: Server error: error trying to connect: Connection refused (os error 111) 2024-08-30T02:09:48.065252Z ERROR batch{batch_size=1}:prefill:prefill{id=121 size=1}:prefill{id=121 size=1}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111) 2024-08-30T02:09:48.065270Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(121)}:clear_cache{batch_id=Some(121)}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111) 2024-08-30T02:09:48.065275Z ERROR generate_stream{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: 
None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(500), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None }}:async_stream:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:876: Request failed during generation: Server error: error trying to connect: Connection refused (os error 111) 2024-08-30T02:09:48.076439Z ERROR batch{batch_size=1}:prefill:prefill{id=122 size=1}:prefill{id=122 size=1}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111) 2024-08-30T02:09:48.076463Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(122)}:clear_cache{batch_id=Some(122)}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111) 2024-08-30T02:09:48.076470Z ERROR generate_stream{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(500), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None }}:async_stream:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:876: Request failed during generation: Server error: error trying to connect: Connection refused (os error 111) 2024-08-30T02:09:48.157110Z INFO text_generation_launcher: webserver terminated 2024-08-30T02:09:48.157132Z INFO text_generation_launcher: Shutting down shards Error: ShardFailed
Expected behavior
TGI should serve the requests and return correct output.
@yuanwu2017 is looking into it.
I can reproduce this issue. It is an out-of-memory (OOM) issue. Debugging is in progress.
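As a rough illustration of why this configuration can run out of device memory, the sketch below estimates the key/value cache size for Llama-2-7b-chat-hf. It is a back-of-the-envelope calculation, not a description of how tgi-gaudi actually allocates memory. The layer count and hidden size are the published Llama-2-7B values; the 2-byte (bfloat16) element size and the idea that the cache is padded to the full BATCH_BUCKET_SIZE and --max-total-tokens are assumptions made for this example.

```python
# Back-of-the-envelope KV-cache estimate for the reported configuration.
# Architecture values are those of Llama-2-7B; everything about how
# tgi-gaudi lays out memory is an assumption for illustration only.

NUM_LAYERS = 32          # Llama-2-7B decoder layers
HIDDEN_SIZE = 4096       # 32 attention heads x 128 head dimension
BYTES_PER_ELEM = 2       # assumption: bfloat16 cache

BATCH_BUCKET_SIZE = 32   # from -e BATCH_BUCKET_SIZE=32
MAX_TOTAL_TOKENS = 4096  # from --max-total-tokens 4096

GIB = 2 ** 30


def kv_tensor_bytes(batch: int, seq_len: int) -> int:
    """Size of one K (or one V) tensor for a single layer."""
    return batch * seq_len * HIDDEN_SIZE * BYTES_PER_ELEM


def kv_cache_bytes(batch: int, seq_len: int) -> int:
    """Size of the full cache: K and V tensors for every layer."""
    return 2 * NUM_LAYERS * kv_tensor_bytes(batch, seq_len)


per_tensor = kv_tensor_bytes(BATCH_BUCKET_SIZE, MAX_TOTAL_TOKENS)
total = kv_cache_bytes(BATCH_BUCKET_SIZE, MAX_TOTAL_TOKENS)

print(f"one K or V tensor per layer: {per_tensor} bytes ({per_tensor / GIB:.0f} GiB)")
print(f"full KV cache:               {total / GIB:.0f} GiB")
```

Under these assumptions a single per-layer K or V tensor is 1073741824 bytes, the same size as the allocation that failed in the log, and the full cache is 64 GiB before model weights and HPU graph buffers are counted. The match may be coincidental, but it suggests that lowering BATCH_BUCKET_SIZE or --max-total-tokens is worth trying while the root cause is investigated.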