[BUG] OpenAI client takes a long time to receive the last token on every few generations #274

Open
alllexx88 opened this issue Jan 21, 2025 · 4 comments
Labels: bug (Something isn't working)

alllexx88 commented Jan 21, 2025

OS

Linux

GPU Library

CUDA 12.x

Python version

3.10

Describe the bug

I'm trying to serve bartowski/Llama-3.3-70B-Instruct-exl2 with tabbyAPI. Right now I'm using the 8.0-bit quant, but I also tried the smaller versions. The server has six Nvidia RTX A4000 GPUs. Originally, without tweaking the sampling options, an OpenAI client would wait a long while for the last token on every generation. I tried many settings, and right now it happens on every 27th generation of my test query, with an almost exactly 60-second lag. The server-side logs don't report this lag. I discovered that if I halve cache_size and max_seq_len, the lag also drops proportionally to about 30 seconds, and the same pattern holds for 1/4 of the cache size.
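
For reference, a rough back-of-the-envelope check of this proportionality (a sketch using only the numbers above; the ~15 s figure for the quarter-size cache follows the observed pattern rather than a separate measurement):

# Observed last-token lag vs. configured cache_size (numbers from the description above).
observations = {131072: 60.0, 65536: 30.0, 32768: 15.0}

for cache_size, lag_s in observations.items():
    # The ratio stays roughly constant (~0.00046 s per cache token),
    # which is what makes the lag look proportional to cache_size.
    print(f"cache_size={cache_size:6d}  lag={lag_s:5.1f}s  lag/cache_size={lag_s / cache_size:.6f} s/token")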

My current configs

config.yml:

# Sample YAML file for configuration.
# Comment and uncomment values as needed.
# Every value has a default within the application.
# This file serves as a drop-in for config.yml

# Unless specified in the comments, DO NOT put these options in quotes!
# You can use https://www.yamllint.com/ if you want to check your YAML formatting.

# Options for networking
network:
  # The IP to host on (default: 127.0.0.1).
  # Use 0.0.0.0 to expose on all network adapters.
  host: 0.0.0.0

  # The port to host on (default: 5000).
  port: 5000

  # Disable HTTP token authentication with requests.
  # WARNING: This will make your instance vulnerable!
  # Turn on this option if you are ONLY connecting from localhost.
  disable_auth: false

  # Disable fetching external content in response to requests, such as images from URLs.
  disable_fetch_requests: false

  # Send tracebacks over the API (default: False).
  # NOTE: Only enable this for debug purposes.
  send_tracebacks: false

  # Select API servers to enable (default: ["OAI"]).
  # Possible values: OAI, Kobold.
  api_servers: ["OAI"]

# Options for logging
logging:
  # Enable prompt logging (default: False).
  log_prompt: true

  # Enable generation parameter logging (default: False).
  log_generation_params: true

  # Enable request logging (default: False).
  # NOTE: Only use this for debugging!
  log_requests: true

# Options for model overrides and loading
# Please read the comments to understand how arguments are handled
# between initial and API loads
model:
  # Directory to look for models (default: models).
  # Windows users, do NOT put this path in quotes!
  model_dir: ../models

  # Allow direct loading of models from a completion or chat completion request (default: False).
  # This method of loading is strict by default.
  # Enable dummy models to add exceptions for invalid model names.
  inline_model_loading: false

  # Sends dummy model names when the models endpoint is queried. (default: False)
  # Enable this if the client is looking for specific OAI models.
  use_dummy_models: false

  # A list of fake model names that are sent via the /v1/models endpoint. (default: ["gpt-3.5-turbo"])
  # Also used as bypasses for strict mode if inline_model_loading is true.
  dummy_model_names: ["gpt-3.5-turbo"]

  # An initial model to load.
  # Make sure the model is located in the model directory!
  # REQUIRED: This must be filled out to load a model on startup.
  model_name: Llama-3.3-70B-Instruct-exl2

  # Names of args to use as a fallback for API load requests (default: []).
  # For example, if you always want cache_mode to be Q4 instead of on the initial model load, add "cache_mode" to this array.
  # Example: ['max_seq_len', 'cache_mode'].
  use_as_default: []

  # Max sequence length (default: Empty).
  # Fetched from the model's base sequence length in config.json by default.
  max_seq_len: 131072
#  max_seq_len: 65536
#  max_seq_len: 32768

  # Load model with tensor parallelism.
  # Falls back to autosplit if GPU split isn't provided.
  # This ignores the gpu_split_auto value.
  tensor_parallel: true

  # Automatically allocate resources to GPUs (default: True).
  # Not parsed for single GPU users.
  gpu_split_auto: true

  # Reserve VRAM used for autosplit loading (default: 96 MB on GPU 0).
  # Represented as an array of MB per GPU.
  autosplit_reserve: [96]
#  autosplit_reserve: [96, 96, 96, 96, 96, 96]

  # An integer array of GBs of VRAM to split between GPUs (default: []).
  # Used with tensor parallelism.
  gpu_split: []
#  gpu_split: [20, 20, 20, 20, 20, 16]

  # Rope scale (default: 1.0).
  # Same as compress_pos_emb.
  # Use if the model was trained on long context with rope.
  # Leave blank to pull the value from the model.
  rope_scale:

  # Rope alpha (default: None).
  # Same as alpha_value. Set to "auto" to auto-calculate.
  # Leaving this value blank will either pull from the model or auto-calculate.
  rope_alpha:

  # Enable different cache modes for VRAM savings (default: FP16).
  # Possible values: 'FP16', 'Q8', 'Q6', 'Q4'.
  cache_mode: Q4

  # Size of the prompt cache to allocate (default: max_seq_len).
  # Must be a multiple of 256 and can't be less than max_seq_len.
  # For CFG, set this to 2 * max_seq_len.
#  cache_size: 32768
#  cache_size: 65536
  cache_size: 131072

  # Chunk size for prompt ingestion (default: 2048).
  # A lower value reduces VRAM usage but decreases ingestion speed.
  # NOTE: Effects vary depending on the model.
  # An ideal value is between 512 and 4096.
#  chunk_size: 256
  chunk_size: 2048

  # Set the maximum number of prompts to process at one time (default: None/Automatic).
  # Automatically calculated if left blank.
  # NOTE: Only available for Nvidia ampere (30 series) and above GPUs.
  max_batch_size: 1

  # Set the prompt template for this model. (default: None)
  # If empty, attempts to look for the model's chat template.
  # If a model contains multiple templates in its tokenizer_config.json,
  # set prompt_template to the name of the template you want to use.
  # NOTE: Only works with chat completion message lists!
  prompt_template:

  # Enables vision support if the model supports it. (default: False)
  vision: false

  # Number of experts to use per token.
  # Fetched from the model's config.json if empty.
  # NOTE: For MoE models only.
  # WARNING: Don't set this unless you know what you're doing!
  num_experts_per_token:

# Options for Loras
lora:
  # Directory to look for LoRAs (default: loras).
  lora_dir: loras

  # List of LoRAs to load and associated scaling factors (default scale: 1.0).
  # For the YAML file, add each entry as a YAML list:
  # - name: lora1
  #   scaling: 1.0
  loras:

# Options for Sampling
sampling:
  # Select a sampler override preset (default: None).
  # Find this in the sampler-overrides folder.
  # This overrides default fallbacks for sampler values that are passed to the API.
  override_preset: fast_streaming

# Options for development and experimentation
developer:
  # Skip Exllamav2 version check (default: False).
  # WARNING: It's highly recommended to update your dependencies rather than enabling this flag.
  unsafe_launch: false

  # Disable API request streaming (default: False).
  disable_request_streaming: false

  # Enable the torch CUDA malloc backend (default: False).
  cuda_malloc_backend: false

  # Run asyncio using Uvloop or Winloop which can improve performance.
  # NOTE: It's recommended to enable this, but if something breaks turn this off.
  uvloop: false

  # Set process to use a higher priority.
  # For realtime process priority, run as administrator or sudo.
  # Otherwise, the priority will be set to high.
  realtime_process_priority: false

sampler_overrides/fast_streaming.yml:

temperature:
  override: 0.01
  force: true
top_k:
  override: 1
  force: true
top_p:
  override: 0.1
  force: true
typical:
  override: 0.0
  force: true
tfs:
  override: 0.0
  force: true
max_tokens:
  override: 4096
  force: true
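
For context, here is a minimal sketch of how such a preset appears to be applied to request sampler values (my own illustration, not tabbyAPI's actual code; the force/override semantics are inferred from the logged generation options below, where the request's temperature of 0.0 shows up as 0.01):

import yaml  # PyYAML, assumed to be installed

def apply_overrides(request_params: dict, preset_path: str) -> dict:
    """Apply a sampler override preset to request parameters.

    A value with force: true replaces whatever the client sent;
    otherwise the override only fills in a missing value.
    (Sketch of the apparent behaviour, not the real implementation.)
    """
    with open(preset_path) as f:
        preset = yaml.safe_load(f)

    effective = dict(request_params)
    for key, rule in preset.items():
        if rule.get("force") or key not in effective:
            effective[key] = rule["override"]
    return effective

# The test request sends temperature=0.0, but the preset forces 0.01:
print(apply_overrides({"temperature": 0.0}, "sampler_overrides/fast_streaming.yml"))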

My test case script:
import time
import asyncio
from openai import AsyncOpenAI, OpenAI
from datetime import datetime
import numpy as np

# Configuration
BASE_URL = "http://ml-x8-rtx4000.amb.corp:5000/v1"
API_KEY = "*******************************"

# The exact messages from the server log
MESSAGES = [
    {
        "content": """You are a Llama3.3 model, and your task is CONTEXTUALIZATION. Your goal is to rephrase the user's last message so it becomes fully understandable without the chat history.

**Key Instructions**:
1. **Do NOT answer or explain**. Only rephrase the user's last message into a clear, standalone query or statement.
2. **Extract relevant context** from the conversation, if it is essential to make the last message self-contained. Do not contradict or ignore important details from the chat history.
3. **Avoid adding unnecessary details** or expanding beyond the original scope. If the needed context is missing, do not invent it.
4. **Preserve meaning and formatting**:
   - Keep the user's intent and any technical/emotional details.
   - Maintain the message format: if it's a question, keep it a question; if it's a statement, keep it a statement.
5. **Ensure grammatical accuracy**. The rephrased text must be error-free, unambiguous, and syntactically clear.
6. **No context needed → return original**. If there is no history or it's irrelevant, provide the user's exact message with no changes.

**Additional Clarifications**:
- **Never provide a direct answer**. If the user asks, "What is X?", do not define X; only restate, "What is X?"
- **Short follow-ups**: For "What about X?" or "Why?" that refer to something from the history, rephrase it to include the necessary detail, but do not add new information. Example: "Why is Python simpler than JavaScript for beginners?"
- **Descriptive queries** ("What's special about it?"): rephrase by specifying the subject from the history, not by listing details.
- **No redundancy**: do not repeat details from previous messages unless explicitly asked.

**Important Notes**:
- Always respond in **Ukrainian** (the final rephrased text).
- Never provide an answer or definition; only rephrase to make the user's last message standalone.
- Preserve the user's tone, key terms, and any essential details from the history.""",
        "role": "system"
    },
    {
        "content": "Як працює блокчейн?",
        "role": "user"
    },
    {
        "content": "Це централізована система для зберігання даних.",
        "role": "assistant"
    },
    {
        "content": "А безпека?",
        "role": "user"
    }
]

async def test_streaming():
    """Test streaming completion with detailed token timing"""
    from collections import deque

    # Keep track of last 5 token timings for variance analysis
    recent_timings = deque(maxlen=5)
    client = AsyncOpenAI(
        base_url=BASE_URL,
        api_key=API_KEY
    )

    print(f"\nStarting streaming test at {datetime.now()}")
    start_time = time.time()

    try:
        stream = await client.chat.completions.create(
            model="Llama-3.3-70B-Instruct-exl2",
            messages=MESSAGES,
            stream=True,
            temperature=0
        )

        print("Receiving stream:")
        last_token_time = start_time
        total_tokens = 0

        async for chunk in stream:
            current_time = time.time()
            if chunk.choices[0].delta.content:
                token = chunk.choices[0].delta.content
                token_time = current_time - last_token_time
                total_tokens += 1

                # Add timing to recent timings
                recent_timings.append(token_time)

                # Calculate variance if we have enough samples
                variance = f", variance: {np.var(recent_timings):.4f}" if len(recent_timings) > 1 else ""

                print(f"Token {total_tokens:2d} at {current_time - start_time:6.2f}s (delta: {token_time:5.2f}s{variance}): {token!r}", flush=True)
                last_token_time = current_time

        print(f"\nStreaming completed in {time.time() - start_time:.2f} seconds")

    except Exception as e:
        print(f"Error during streaming: {e}")

def test_non_streaming():
    """Test non-streaming completion"""
    client = OpenAI(
        base_url=BASE_URL,
        api_key=API_KEY
    )

    print(f"\nStarting non-streaming test at {datetime.now()}")
    start_time = time.time()

    try:
        response = client.chat.completions.create(
            model="Llama-3.3-70B-Instruct-exl2",
            messages=MESSAGES,
            stream=False,
            temperature=0.0
        )

        print(f"Response received in {time.time() - start_time:.2f} seconds:")
        print(response.choices[0].message.content)

    except Exception as e:
        print(f"Error during non-streaming request: {e}")

async def run_tests(iterations=20):
    for i in range(iterations):
        print(f"\n=== Iteration {i+1}/{iterations} ===")

        # Run streaming test
        await test_streaming()

        # Run non-streaming test
        #test_non_streaming()

        # Small delay between iterations
        await asyncio.sleep(1)

if __name__ == "__main__":
    asyncio.run(run_tests(100))

The test script above uses the streaming API, but the issue also manifests with non-streaming requests.
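
For the non-streaming case, a small driver like the one below can be appended to the script above to flag the slow iterations (the helper name and the 10-second threshold are arbitrary, not part of the original script):

async def run_non_streaming_tests(iterations=100, slow_threshold_s=10.0):
    """Run the non-streaming test repeatedly and report the iterations that lag."""
    slow_iterations = []
    for i in range(iterations):
        print(f"\n=== Non-streaming iteration {i + 1}/{iterations} ===")
        start = time.time()
        # test_non_streaming() is synchronous, so run it off the event loop
        await asyncio.to_thread(test_non_streaming)
        elapsed = time.time() - start
        if elapsed > slow_threshold_s:
            slow_iterations.append((i + 1, elapsed))
        await asyncio.sleep(1)
    print(f"\nIterations slower than {slow_threshold_s}s: {slow_iterations}")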

Reproduction steps

Maybe the bug is specific to my hardware/setup, I don't know. For me it's enough to just launch the server, run the test case script, and wait until it lags. Right now it happens on every 27th query.

Expected behavior

The last token is expected to arrive without a lag, as it does for the majority of the queries.

Logs

Client log when it lags

Starting streaming test at 2025-01-21 11:02:12.575385
Receiving stream:
Token  1 at   0.52s (delta:  0.52s): 'Як'
Token  2 at   0.73s (delta:  0.21s, variance: 0.0239): ' забезпеч'
Token  3 at   0.94s (delta:  0.21s, variance: 0.0211): 'ується'
Token  4 at   1.15s (delta:  0.21s, variance: 0.0178): ' без'
Token  5 at   1.36s (delta:  0.21s, variance: 0.0151): 'п'
Token  6 at   1.57s (delta:  0.21s, variance: 0.0000): 'ека'
Token  7 at   1.78s (delta:  0.21s, variance: 0.0000): ' в'
Token  8 at   1.99s (delta:  0.21s, variance: 0.0000): ' систем'
Token  9 at   2.20s (delta:  0.21s, variance: 0.0000): 'і'
Token 10 at   2.41s (delta:  0.21s, variance: 0.0000): ' блок'
Token 11 at   2.62s (delta:  0.21s, variance: 0.0000): 'ч'
Token 12 at   2.83s (delta:  0.21s, variance: 0.0000): 'ейн'
Token 13 at  59.79s (delta: 56.96s, variance: 515.3742): '?'

Streaming completed in 59.80 seconds

Server log when it lags

INFO:     Information for POST request 438367124f6440fd9036cbd0491ea3c5:
INFO:     URL: http://ml-x8-rtx4000.amb.corp:5000/v1/chat/completions
INFO:     Headers: {'host': 'ml-x8-rtx4000.amb.corp:5000', 'accept-encoding': 'gzip, deflate', 'connection': 'keep-alive', 'accept': 'application/json', 'content-type': 'application/json', 'user-agent': 'AsyncOpenAI/Python
1.55.0', 'x-stainless-lang': 'python', 'x-stainless-package-version': '1.55.0', 'x-stainless-os': 'Linux', 'x-stainless-arch': 'x64', 'x-stainless-runtime': 'CPython', 'x-stainless-runtime-version': '3.12.7', 'authorization':
'Bearer *******************************', 'x-stainless-async': 'async:asyncio', 'x-stainless-retry-count': '0', 'content-length': '2641'}
INFO:     Body: {'messages': [{'content': 'You are a Llama3.3 model, and your task is CONTEXTUALIZATION. Your goal is to rephrase the user\'s last message so it becomes fully understandable without the chat history.\n\n**Key
Instructions**:\n1. **Do NOT answer or explain**. Only rephrase the user\'s last message into a clear, standalone query or statement.\n2. **Extract relevant context** from the conversation, if it is essential to make the last
message self-contained. Do not contradict or ignore important details from the chat history.\n3. **Avoid adding unnecessary details** or expanding beyond the original scope. If the needed context is missing, do not invent it.\n4.
**Preserve meaning and formatting**:\n   - Keep the user\'s intent and any technical/emotional details.\n   - Maintain the message format: if it\'s a question, keep it a question; if it\'s a statement, keep it a statement.\n5.
**Ensure grammatical accuracy**. The rephrased text must be error-free, unambiguous, and syntactically clear.\n6. **No context needed → return original**. If there is no history or it\'s irrelevant, provide the user\'s exact
message with no changes.\n\n**Additional Clarifications**:\n- **Never provide a direct answer**. If the user asks, "What is X?", do not define X; only restate, "What is X?"  \n- **Short follow-ups**: For "What about X?" or "Why?"
that refer to something from the history, rephrase it to include the necessary detail, but do not add new information. Example: "Why is Python simpler than JavaScript for beginners?"\n- **Descriptive queries** ("What\'s special
about it?"): rephrase by specifying the subject from the history, not by listing details.  \n- **No redundancy**: do not repeat details from previous messages unless explicitly asked.\n\n**Important Notes**:\n- Always respond in
**Ukrainian** (the final rephrased text).\n- Never provide an answer or definition; only rephrase to make the user\'s last message standalone.\n- Preserve the user\'s tone, key terms, and any essential details from the history.',
'role': 'system'}, {'content': 'Як працює блокчейн?', 'role': 'user'}, {'content': 'Це централізована система для зберігання даних.', 'role': 'assistant'}, {'content': 'А безпека?', 'role': 'user'}], 'model':
'Llama-3.3-70B-Instruct-exl2', 'stream': True, 'temperature': 0.0}
INFO:     10.30.24.14:51836 - "POST /v1/chat/completions HTTP/1.1" 200
INFO:     Received chat completion streaming request 438367124f6440fd9036cbd0491ea3c5
INFO:     Prompt (ID: 438367124f6440fd9036cbd0491ea3c5):
INFO:     <|begin_of_text|><|start_header_id|>system<|end_header_id|>
INFO:
INFO:     Cutting Knowledge Date: December 2023
INFO:     Today Date: 26 Jul 2024
INFO:
INFO:     You are a Llama3.3 model, and your task is CONTEXTUALIZATION. Your goal is to rephrase the user's last message so it becomes fully understandable without the chat history.
INFO:
INFO:     **Key Instructions**:
INFO:     1. **Do NOT answer or explain**. Only rephrase the user's last message into a clear, standalone query or statement.
INFO:     2. **Extract relevant context** from the conversation, if it is essential to make the last message self-contained. Do not contradict or ignore important details from the chat history.
INFO:     3. **Avoid adding unnecessary details** or expanding beyond the original scope. If the needed context is missing, do not invent it.
INFO:     4. **Preserve meaning and formatting**:
INFO:        - Keep the user's intent and any technical/emotional details.
INFO:        - Maintain the message format: if it's a question, keep it a question; if it's a statement, keep it a statement.
INFO:     5. **Ensure grammatical accuracy**. The rephrased text must be error-free, unambiguous, and syntactically clear.
INFO:     6. **No context needed → return original**. If there is no history or it's irrelevant, provide the user's exact message with no changes.
INFO:
INFO:     **Additional Clarifications**:
INFO:     - **Never provide a direct answer**. If the user asks, "What is X?", do not define X; only restate, "What is X?"
INFO:     - **Short follow-ups**: For "What about X?" or "Why?" that refer to something from the history, rephrase it to include the necessary detail, but do not add new information. Example: "Why is Python simpler than JavaScript
for beginners?"
INFO:     - **Descriptive queries** ("What's special about it?"): rephrase by specifying the subject from the history, not by listing details.
INFO:     - **No redundancy**: do not repeat details from previous messages unless explicitly asked.
INFO:
INFO:     **Important Notes**:
INFO:     - Always respond in **Ukrainian** (the final rephrased text).
INFO:     - Never provide an answer or definition; only rephrase to make the user's last message standalone.
INFO:     - Preserve the user's tone, key terms, and any essential details from the history.<|eot_id|><|start_header_id|>user<|end_header_id|>
INFO:
INFO:     Як працює блокчейн?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
INFO:
INFO:     Це централізована система для зберігання даних.<|eot_id|><|start_header_id|>user<|end_header_id|>
INFO:
INFO:     А безпека?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
INFO:
INFO:
INFO:     Response (ID: 438367124f6440fd9036cbd0491ea3c5):
INFO:     Як забезпечується безпека в системі блокчейн?
INFO:     Finished chat completion streaming request 438367124f6440fd9036cbd0491ea3c5
INFO:     Generation options: {'request_id': '438367124f6440fd9036cbd0491ea3c5', 'max_tokens': 4096, 'min_tokens': 0, 'stream': True, 'token_repetition_penalty': 1.0, 'token_repetition_range': -1, 'token_repetition_decay': 0,
'token_frequency_penalty': 0.0, 'token_presence_penalty': 0.0, 'temperature': 0.01, 'smoothing_factor': 0.0, 'min_temp': 1.0, 'max_temp': 1.0, 'temp_exponent': 1.0, 'top_k': 1, 'top_p': 0.1, 'top_a': 0.0, 'min_p': 0.0, 'tfs': 0.0,
'typical': 0.0, 'skew': 0.0, 'temperature_last': False, 'mirostat': False, 'mirostat_tau': 1.5, 'mirostat_eta': 0.3, 'mirostat_mu': None, 'token_bias': None, 'cfg_scale': None, 'post_sampling_hooks': [], 'dry_allowed_length': 2,
'dry_base': 1.75, 'dry_multiplier': 0.0, 'dry_sequence_breakers': None, 'dry_range': 0, 'dry_max_ngram': 20, 'ngram_trie': None, 'ngram_index': 0, 'ngram_history': deque([]), 'xtc_probability': 0.0, 'xtc_threshold': 0.1,
'xtc_ignore_tokens': None, 'token_healing': False, 'auto_scale_penalty_range': False, 'generate_window': 16384, 'bos_token_id': 128000, 'eos_token_id': [128001, 128008, 128009], 'add_bos_token': True, 'ban_eos_token': False,
'skip_special_tokens': True, 'speculative_ngram': False, 'logprobs': 0, 'stop_conditions': [128001, 128008, 128009], 'banned_tokens': [], 'allowed_tokens': [], 'banned_strings': [], 'logit_bias': None, 'filters': []}

INFO:     Metrics (ID: 438367124f6440fd9036cbd0491ea3c5): 14 tokens generated in 2.98 seconds (Queue: 0.0 s, Process: 512 cached tokens and 1 new tokens at 414.66 T/s, Generate: 4.71 T/s, Context: 513 tokens)
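
Putting the two logs side by side (a quick comparison using only the values already shown above), nearly all of the client-side wait is invisible to the server:

# Values copied from the client and server logs above.
client_total_s = 59.80        # "Streaming completed in 59.80 seconds"
last_token_delta_s = 56.96    # delta before token 13 finally arrived
server_generation_s = 2.98    # "14 tokens generated in 2.98 seconds"

# The gap between client wall clock and server-reported generation time
# matches the stall before the last token almost exactly.
print(f"Unaccounted for on the server: {client_total_s - server_generation_s:.2f} s")  # ~56.82 s
print(f"Stall before the last token:   {last_token_delta_s:.2f} s")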

Additional context

No response

Acknowledgements

  • I have looked for similar issues before submitting this one.
  • I have read the disclaimer, and this issue is related to a code bug. If I have a question, I will use the Discord server.
  • I understand that the developers have lives and my issue will be answered when possible.
  • I understand the developers of this program are human, and I will ask my questions politely.

I hope this can be fixed; tabbyAPI performs great in all other aspects. Thank you in advance!

alllexx88 added the bug (Something isn't working) label on Jan 21, 2025
alllexx88 (Author) commented:

To rule out deployment issues, I tried the same configs with the ghcr.io/theroyallab/tabbyapi:latest Docker image:

services:
  tabbyapi:
    image: ghcr.io/theroyallab/tabbyapi:latest
    ports:
      - "5000:5000"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://127.0.0.1:5000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
    environment:
      - NAME=TabbyAPI
      - NVIDIA_VISIBLE_DEVICES=all
    volumes:
      - ../models:/models
      - ./config.yml:/app/config.yml
      - ./api_tokens.yml:/app/api_tokens.yml
      - ./sampler_overrides/fast_streaming.yml:/app/sampler_overrides/fast_streaming.yml
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

I'm experiencing the exact same lags.


Ph0rk0z commented Jan 23, 2025

Do you have a similar bug to mine? turboderp-org/exllamav2#630

alllexx88 (Author) commented:

@Ph0rk0z I think it's different. In my case the logged T/s don't drop, but I see a long delay on the client side. You also wrote that you sometimes have to restart the server to get good T/s back, but in my case the next prompt is fine until the delay comes back after more requests.


Ph0rk0z commented Jan 27, 2025

I can also have it reprocess the context and the delay goes away, but with a cached context it can't continue.
