Guide on how to use TensorRT-LLM Backend #2466

Open
michaelthreet opened this issue Aug 28, 2024 · 9 comments

Comments

@michaelthreet

Feature request

Does any documentation exist, or would it be possible to add some, on how to use the TensorRT-LLM backend? #2458 mentions that the TRT-LLM backend exists, and I can see that there's a Dockerfile for TRT-LLM, but I don't see any guides on how to build or use it.

Motivation

I would like to run TensorRT-LLM models using TGI.

Your contribution

I'm willing to test any builds/processes/pipelines that are available.

@ErikKaum
Member

Hi @michaelthreet 👋

Very good question. Indeed, we haven't yet documented well how the new backend design works. For now, the best guide is the info in the Dockerfile.

But I'll loop in @mfuntowicz, who can better point you in the right direction and explain the system requirements 👍

@mfuntowicz
Member

Hi @michaelthreet - thanks for your interest in the TRTLLM backend.

The backend is pretty new and might suffer from unhandled edge cases, but it should be usable.
I would advise moving to this branch, which refactors the backend to avoid all the locks and significantly improves overall throughput:

➡️ #2357

As I mentioned, the backend is still WIP and I would not qualify it as "stable", so we do not offer prebuilt images yet.
Still, it should be fairly easy to build the Docker container locally from the TGI repository:

docker build -t huggingface/text-generation-inference-trtllm:v2.1.1 -f backends/trtllm/Dockerfile .
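
For clarity, that command assumes you're running from the root of a TGI checkout, e.g.:

git clone https://github.com/huggingface/text-generation-inference.git
cd text-generation-inference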

Let us know if you encounter any issues while building 😊.

Finally, when you've got the container ready, you should be able to deploy it using the following:

docker run --gpus all --shm-size=16gb -v <host/path/to/engines/folder>:/repository --tokenizer-name <model_id_or_path> /repository

Please let us know if you encounter any blockers; we're more than happy to help and hear your feedback.

@michaelthreet
Author

michaelthreet commented Aug 29, 2024

Thanks @mfuntowicz, that's all great info! I was able to build the image and run it, although with a modified command to account for the required args. I have a directory within the engine directory that contains the tokenizer, hence using /repository/tokenizer for the --tokenizer-name arg:

docker run --gpus all --shm-size=16gb -v </host/path/to/engines/folder>:/repository huggingface/text-generation-inference-trtllm:v2.1.1 --tokenizer-name /repository/tokenizer --model-id /repository --executor-worker /usr/local/tgi/bin/executorWorker

I'm seeing this error, however, and I'm assuming it's due to a mismatch between the TRT-LLM version the engine was compiled with and the version running in this TGI image.

[2024-08-29 16:11:07.548] [info] [ffi.cpp:75] Creating TensorRT-LLM Backend
[2024-08-29 16:11:07.548] [info] [backend.cpp:11] Initializing Backend...
[2024-08-29 16:11:07.631] [info] [backend.cpp:15] Backend Executor Version: 0.12.0.dev2024073000
[2024-08-29 16:11:07.631] [info] [backend.cpp:18] Detected 4 Nvidia GPU(s)
[2024-08-29 16:11:07.639] [info] [hardware.h:38] Detected sm_90 compute capabilities
[2024-08-29 16:11:07.639] [info] [backend.cpp:33] Detected single engine deployment, using leader mode
[TensorRT-LLM][INFO] Engine version 0.11.1.dev20240720 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][INFO] Initialized MPI
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 512
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 512
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 131072
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 131072
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 1024
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 131071  = maxSequenceLen - 1 since chunked context is enabled
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 9996 MiB
[TensorRT-LLM][ERROR] IRuntime::deserializeCudaEngine: Error Code 1: Serialization (Serialization assertion stdVersionRead == kSERIALIZATION_VERSION failed.Version tag does not match. Note: Current Version: 238, Serialized Engine Version: 237)
Error: Runtime("[TensorRT-LLM][ERROR] Assertion failed: Failed to deserialize cuda engine. (/usr/src/text-generation-inference/target/release/build/text-generation-backends-trtllm-27bca2115f4a55c3/out/build/_deps/trtllm-src/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:129)")

Is there a recommended TRT-LLM version? Or a way to make it compatible?
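
(Side note: the backend log above reads the engine version from the engine folder's config.json, so a quick sanity check before deploying is something like the following; the top-level "version" key is an assumption based on that log line.)

python -c 'import json; print(json.load(open("</host/path/to/engines/folder>/config.json"))["version"])'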

@mfuntowicz
Member

Awesome to hear it built successfully, and cool that you were able to figure out the required adaptations 😍.

Indeed, TensorRT-LLM engines are not necessarily compatible from one release to another 🤐

You can find the exact TRTLLM version we are building against here: https://github.com/huggingface/text-generation-inference/blob/main/backends/trtllm/cmake/trtllm.cmake#L26 - we should document this more clearly and potentially warn the user if a discrepancy is detected when loading the engine - adding it to my todo.

The commit a681853d3803ee5893307e812530b5e7004bb6e1 might correspond to TRTLLM 0.12.0.dev2024073000 if I'm not mistaken
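
If it helps, here's a hedged sketch of pinning that builder version when (re)building engines; the pypi.nvidia.com extra index is where the TRT-LLM dev wheels are published, but double-check wheel availability for your environment:

pip install "tensorrt_llm==0.12.0.dev2024073000" --extra-index-url https://pypi.nvidia.com
python -c "import tensorrt_llm; print(tensorrt_llm.__version__)"  # should print 0.12.0.dev2024073000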

Please let me know if you need any additional follow-up.

@michaelthreet
Author

I was able to get it to load the model by building a TensorRT-LLM engine (Llama 3.1 8B Instruct, for reference) using that matching TRTLLM version (0.12.0.dev2024073000) and the TRTLLM llama example.
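
(Roughly the llama-example flow, for reference; the paths and exact flags below are illustrative and may differ between TRT-LLM versions:)

# from the TensorRT-LLM repo, examples/llama
python convert_checkpoint.py --model_dir ./Meta-Llama-3.1-8B-Instruct \
    --output_dir ./tllm_ckpt --dtype float16
trtllm-build --checkpoint_dir ./tllm_ckpt --output_dir ./engines --gemm_plugin float16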

When I send requests to the /generate endpoint, however, I'm getting some odd behavior. For example:

curl -X 'POST' \
  'http://localhost:3000/generate' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "inputs": "My name is Olivier and I"
}'

{
  "generated_text": " them"
}

@mfuntowicz
Member

Argh, interesting... I'm developing with the same model and haven't seen this output.

Anyway, I'm going to dig in tomorrow morning and will report back here. Sorry for the inconvenience, @michaelthreet.

@michaelthreet
Author

No worries! If you could share the model you're using (or commands you used to convert it) that might help as well. It could be that I missed a flag/parameter in the conversion process.

@michaelthreet
Author

Some (hopefully useful) follow-up: it looks like the /generate path only returns the final token in the generated_text field. The same thing happens with the /v1/chat/completions path when the stream parameter is set to false instead of true. I've also noticed that finish_reason is always set to eos_token, even when hitting the max_new_tokens limit. Some examples below:

/generate with details: true and then details: false

curl -X 'POST' \
  'http://localhost:3000/generate' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "inputs": "Why is the sky blue?",
  "parameters": {
    "details": true,
    "max_new_tokens": 5,
    "temperature": 0.01
  }
}'

{
  "generated_text": " that",
  "details": {
    "finish_reason": "eos_token",
    "generated_tokens": 1,
    "seed": null,
    "prefill": [],
    "tokens": [
      {
        "id": 1115,
        "text": " This",
        "logprob": -3.9258082,
        "special": false
      },
      {
        "id": 374,
        "text": " is",
        "logprob": -3.9258082,
        "special": false
      },
      {
        "id": 264,
        "text": " a",
        "logprob": -3.9258082,
        "special": false
      },
      {
        "id": 3488,
        "text": " question",
        "logprob": -3.9258082,
        "special": false
      },
      {
        "id": 430,
        "text": " that",
        "logprob": -3.9258082,
        "special": false
      }
    ]
  }
}
curl -X 'POST' \
  'http://localhost:3000/generate' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "inputs": "Why is the sky blue?",
  "parameters": {
    "details": false,
    "max_new_tokens": 5,
    "temperature": 0.01
  }
}'

{
  "generated_text": " that"
}

/v1/chat/completions with stream: true and then stream: false

curl -X 'POST' \
  'http://localhost:3000/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "logprobs": false,
  "max_tokens": 5,
  "messages": [
    {
      "role": "user",
      "content": "Why is the sky blue?"
    }
  ],
  "model": "tgi",
  "stop": null,
  "temperature": 0.01,
  "stream": true
}'

data: {"object":"chat.completion.chunk","id":"","created":1725029518,"model":"/repository/tokenizer","system_fingerprint":"2.2.1-dev0-native","choices":[{"index":0,"delta":{"role":"assistant","content":"The"},"logprobs":null,"finish_reason":null}]}

data: {"object":"chat.completion.chunk","id":"","created":1725029518,"model":"/repository/tokenizer","system_fingerprint":"2.2.1-dev0-native","choices":[{"index":0,"delta":{"role":"assistant","content":" sky"},"logprobs":null,"finish_reason":null}]}

data: {"object":"chat.completion.chunk","id":"","created":1725029518,"model":"/repository/tokenizer","system_fingerprint":"2.2.1-dev0-native","choices":[{"index":0,"delta":{"role":"assistant","content":" appears"},"logprobs":null,"finish_reason":null}]}

data: {"object":"chat.completion.chunk","id":"","created":1725029518,"model":"/repository/tokenizer","system_fingerprint":"2.2.1-dev0-native","choices":[{"index":0,"delta":{"role":"assistant","content":" blue"},"logprobs":null,"finish_reason":null}]}

data: {"object":"chat.completion.chunk","id":"","created":1725029518,"model":"/repository/tokenizer","system_fingerprint":"2.2.1-dev0-native","choices":[{"index":0,"delta":{"role":"assistant","content":" because"},"logprobs":null,"finish_reason":"eos_token"}]}

data: [DONE]
curl -X 'POST' \
  'http://localhost:3000/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "logprobs": false,
  "max_tokens": 5,
  "messages": [
    {
      "role": "user",
      "content": "Why is the sky blue?"
    }
  ],
  "model": "tgi",
  "stop": null,
  "temperature": 0.01,
  "stream": false
}'

{
  "object": "chat.completion",
  "id": "",
  "created": 1725029551,
  "model": "/repository/tokenizer",
  "system_fingerprint": "2.2.1-dev0-native",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": " because"
      },
      "logprobs": null,
      "finish_reason": "eos_token"
    }
  ],
  "usage": {
    "prompt_tokens": 0,
    "completion_tokens": 1,
    "total_tokens": 1
  }
}

@mfuntowicz
Member

Sorry for the delay @michaelthreet, I got sidetracked by something else.

Going to take a look tomorrow; thanks a ton for the additional inputs.
Will report back here shortly 🤗
