Guide on how to use TensorRT-LLM Backend #2466
Hi @michaelthreet 👋 Very good questions! And indeed, we haven't yet documented well how the new backend design works. For now, the best guide is the info in the Dockerfile. But I'll loop in @mfuntowicz; he can better point you in the right direction and explain the system requirements 👍
Hi @michaelthreet - thanks for your interest in the TRTLLM backend. The backend is pretty new and might suffer from unhandled edge cases, but it should be usable. ➡️ #2357 As I mentioned, the backend is still a WIP and I would not qualify it as "stable" yet, so we do not offer prebuilt images.
Let us know if you run into any issues while building 😊. Finally, when you've got the container ready, you should be able to deploy it with a command along the lines of the sketch below.
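The deploy command itself was lost in extraction; here is a minimal sketch of what it likely looked like, assuming a locally built image tagged `tgi-trtllm:latest` and TGI's usual launcher conventions (image tag, mount paths, and engine directory are illustrative):

```bash
# Illustrative only: image tag and paths are placeholders for your local build.
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v /path/to/trtllm-engines/llama-3.1-8b:/data \
  tgi-trtllm:latest \
  --model-id /data
```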
Please let us know if you encounter any blockers; more than happy to help and to get your feedback.
Thanks @mfuntowicz, that's all great info! I was able to build the image and run it, albeit with a modified command to account for the required args. I have a directory within the engine directory that contains the tokenizer, hence pointing the tokenizer argument at it (see the sketch below).
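The exact argument was stripped from the comment; this is a sketch of the kind of modification described, assuming a hypothetical `--tokenizer-name` flag pointing at the tokenizer directory (the real flag name may differ):

```bash
# The --tokenizer-name flag here is an assumption; the original command was lost.
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v /path/to/trtllm-engines/llama-3.1-8b:/data \
  tgi-trtllm:latest \
  --model-id /data \
  --tokenizer-name /data/tokenizer
```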
I'm seeing this error, however, and I'm assuming it's due to a mismatch between the TRT-LLM version the engine was compiled with and the version running in this TGI image.
Is there a recommended TRT-LLM version? Or a way to make them compatible?
Awesome to hear it built successfully, and cool that you were able to figure out the required adaptations 😍. Effectively, TensorRT-LLM engines are not necessarily compatible from one release to another 🤐. You can find the exact TRTLLM version we are building against here: https://github.com/huggingface/text-generation-inference/blob/main/backends/trtllm/cmake/trtllm.cmake#L26 - we should document this more clearly and potentially emit a warning if a discrepancy is detected when loading the engine, to better inform the user - adding this to my todo. The commit a681853d3803ee5893307e812530b5e7004bb6e1 should correspond to the TRTLLM revision pinned there. Please let me know if you need any additional follow-up.
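To build engines against that exact pin, checking out the quoted commit in the TensorRT-LLM repo is the straightforward approach (a sketch; the commit below is the one referenced above):

```bash
# Pin the TensorRT-LLM checkout to the revision TGI builds against,
# so engines are compiled with a matching runtime.
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git checkout a681853d3803ee5893307e812530b5e7004bb6e1
```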
I was able to get it to load the model by building a TensorRT-LLM engine (Llama 3.1 8B Instruct, for reference) using that matched TRTLLM version. When I send requests to the running server, however, the output I get back is not what I'd expect.
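For context, a sketch of the standard TRT-LLM build flow for a Llama checkpoint, run from the TensorRT-LLM checkout above (paths are placeholders, exact flags vary between TRT-LLM releases, and this is not necessarily the command set used here, which was lost in extraction):

```bash
# Convert the HF checkpoint to TRT-LLM format, then compile the engine.
python examples/llama/convert_checkpoint.py \
  --model_dir /models/Meta-Llama-3.1-8B-Instruct \
  --output_dir /tmp/llama-ckpt \
  --dtype float16

trtllm-build \
  --checkpoint_dir /tmp/llama-ckpt \
  --output_dir /engines/llama-3.1-8b \
  --gemm_plugin float16
```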
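And the kind of request that triggers the bad output, assuming the TRTLLM backend serves TGI's standard `/generate` route (the original request and response were stripped from the comment):

```bash
# Minimal request against TGI's /generate endpoint.
curl http://localhost:8080/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "What is TensorRT-LLM?", "parameters": {"max_new_tokens": 64}}'
```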
Argh, interesting... I'm developing with the same model and haven't gotten this output. Anyway, going to dig in tomorrow morning and will report back here; sorry for the inconvenience @michaelthreet
No worries! If you could share the model you're using (or the commands you used to convert it), that might help as well. It could be that I missed a flag/parameter in the conversion process.
Some (hopefully useful) followup with a few more details on the output I'm seeing.
Sorry for the delay @michaelthreet, I got sidetracked by something else. Going to take a look tomorrow; thanks a ton for the additional inputs.
Feature request
Does any documentation exist, or would it be possible to add documentation, on how to use the TensorRT-LLM backend? #2458 mentions that the TRT-LLM backend exists, and I can see that there's a Dockerfile for TRT-LLM, but I don't see any guides on how to build or use it.
Motivation
I would like to run TensorRT-LLM models using TGI.
Your contribution
I'm willing to test any builds/processes/pipelines that are available.