|
2 | 2 |
|
3 | 3 | After starting the service with the scripts supplied in the TensorRT-LLM, vLLM, and In-Framework sections, the service will be in standby mode, ready to receive incoming requests. There are multiple methods available for sending queries to this service. |
4 | 4 |
|
5 | | -* Use the Query Script: Execute the query script within the currently running container. |
| 5 | +* Use the Query Script or Classes: Execute the query script or classes within the currently running container. |
6 | 6 | * PyTriton: Utilize PyTriton to send requests directly. |
7 | 7 | * HTTP Requests: Make HTTP requests using various tools or libraries. |
8 | 8 |
|
9 | 9 |
|
10 | 10 | ## Send a Query using the Script |
11 | 11 |
|
12 | | -The following example shows how to execute the query script within the currently running container. |
| 12 | +Choose the appropriate query script based on your deployment type. Each deployment method has its own specialized query script with relevant parameters. |
13 | 13 |
|
14 | | -1. To use a query script, run the following command: |
15 | 14 |
|
16 | | - ```shell |
17 | | - python /opt/Export-Deploy/scripts/deploy/nlp/query.py --url "http://localhost:8000" --model_name nemotron --prompt "What is the capital of United States?" |
18 | | - ``` |
19 | | - |
20 | | -2. Change the url and the ``model_name`` based on your server and the model name of your service. The code in the script can be used as a basis for your client code as well. |
| 15 | +### General TensorRT-LLM Models |
21 | 16 |
|
22 | | -3. If the there is a prompt embedding table, run the following command to send a query: |
| 17 | +For the models deployed with TensorRT-LLM using the [deployment script described here](../nemo_models/optimized/tensorrt-llm.md): |
| 18 | + |
| 19 | +```shell |
| 20 | +python /opt/Export-Deploy/scripts/deploy/nlp/query.py --url "http://localhost:8000" --model_name llama --prompt "What is the capital of United States?" |
| 21 | +``` |
| 22 | + |
| 23 | +**Additional parameters:** |
| 24 | +- `--prompt_file`: Read prompt from file instead of command line |
| 25 | +- `--max_output_len`: Max output token length (default: 128) |
| 26 | +- `--top_k`: Top-k sampling (default: 1) |
| 27 | +- `--top_p`: Top-p sampling (default: 0.0) |
| 28 | +- `--temperature`: Sampling temperature (default: 1.0) |
| 29 | +- `--lora_task_uids`: LoRA task UIDs for LoRA-enabled models |
| 30 | +- `--stop_words_list`: List of stop words |
| 31 | +- `--bad_words_list`: List of words to avoid |
| 32 | +- `--no_repeat_ngram_size`: N-gram size for repetition penalty |
| 33 | + |
| 34 | +### In-Framework PyTorch NeMo Models |
| 35 | + |
| 36 | +For NeMo models deployed with PyTorch in-framework using the [deployment script described here](../nemo_models/in-framework.md): |
| 37 | + |
| 38 | +```shell |
| 39 | +python /opt/Export-Deploy/scripts/deploy/nlp/query_inframework.py --url "http://localhost:8000" --model_name llama --prompt "What is the capital of United States?" |
| 40 | +``` |
| 41 | + |
| 42 | +**Specific parameters:** |
| 43 | +- `--compute_logprob`: Return log probabilities |
| 44 | + |
| 45 | + |
| 46 | +### In-Framework HuggingFace Models |
| 47 | + |
| 48 | +For HuggingFace models deployed with the in-framework backend using the [deployment script described here](../automodel/automodel-in-framework.md): |
| 49 | + |
| 50 | +```shell |
| 51 | +python /opt/Export-Deploy/scripts/deploy/nlp/query_inframework_hf.py --url "http://localhost:8000" --model_name llama --prompt "What is the capital of United States?" |
| 52 | +``` |
| 53 | + |
| 54 | +**Additional parameters:** |
| 55 | +- `--output_logits`: Return raw logits from the model output |
| 56 | +- `--output_scores`: Return token probability scores from the model output |
| 57 | + |
| 58 | + |
| 59 | +### vLLM Deployments |
| 60 | + |
| 61 | +For models deployed with vLLM using the [deployment script described here](../nemo_models/optimized/vllm.md): |
| 62 | + |
| 63 | +```shell |
| 64 | +python /opt/Export-Deploy/scripts/deploy/nlp/query_vllm.py --url "http://localhost:8000" --model_name llama --prompt "What is the capital of United States?" |
| 65 | +``` |
| 66 | + |
| 67 | +**vLLM-specific parameters:** |
| 68 | +- `--max_tokens`: Maximum tokens to generate (default: 16) |
| 69 | +- `--min_tokens`: Minimum tokens to generate (default: 0) |
| 70 | +- `--n_log_probs`: Number of log probabilities per output token |
| 71 | +- `--n_prompt_log_probs`: Number of log probabilities per prompt token |
| 72 | +- `--seed`: Random seed for generation |
| 73 | + |
| 74 | +**Note:** The `--max_output_len` parameter is not available in the `query_vllm.py` script. Instead, use `--max_tokens` to control the maximum number of output tokens. |
| 75 | + |
| 76 | + |
| 77 | +### TensorRT-LLM API Deployments |
| 78 | + |
| 79 | +For models deployed through the TensorRT-LLM API with the [deployment script described here](../nemo_models/optimized/tensorrt-llm.md): |
| 80 | + |
| 81 | +```shell |
| 82 | +python /opt/Export-Deploy/scripts/deploy/nlp/query_trtllm_api.py --url "http://localhost:8000" --model_name llama --prompt "What is the capital of United States?" |
| 83 | +``` |
| 84 | + |
| 85 | +**TensorRT-LLM API parameters:** |
| 86 | +- `--max_length`: Maximum length of generated sequence (default: 256) |
23 | 87 |
|
24 | | - ```shell |
25 | | - python /opt/Export-Deploy/scripts/deploy/nlp/query.py --url "http://localhost:8000" --model_name nemotron --prompt "What is the capital of United States?" --task_id "task 1" |
26 | | - ``` |
27 | | - |
28 | | -4. The following parameters are defined in the ``deploy_triton.py`` script: |
29 | | - |
30 | | - - ``--url``: url for the triton server. Default="0.0.0.0". |
31 | | - - ``--model_name``: name of the triton model to query. |
32 | | - - ``--prompt``: user prompt. |
33 | | - - ``--max_output_len``: Max output token length. Default=128. |
34 | | - - ``--top_k``: considers only the top N most likely tokens at each step. |
35 | | - - ``--top_p``: determines the cumulative probability distribution used for sampling the next token in the generated response. Controls the diversity of the output. |
36 | | - - ``--temperature``: controls the randomness of the generated output. Higher value, such as 1.0, leads to more randomness and diversity in the generated text, a lower value, like 0.2, produces more focused and deterministic responses. |
37 | | - - ``--task_id``: id of a task if ptuning is enabled. |
38 | 88 |
|
39 | 89 |
|
40 | 90 | ## Send a Query using the NeMo APIs |
41 | 91 |
|
42 | | -The NeMo Framework provides NemoQueryLLM APIs to send a query to the Triton server for convenience. These APIs are only accessible from the NeMo Framework container. |
| 92 | +The NeMo Framework provides multiple query APIs to send requests to the Triton server for different deployment types. These APIs are only accessible from the NeMo Framework container. Choose the appropriate query class based on your deployment method: |
43 | 93 |
|
44 | | -1. To run the request example using NeMo APIs, run the following command: |
| 94 | +### NemoQueryLLM (TensorRT-LLM Models) |
| 95 | + |
| 96 | +For deployed TensorRT-LLM models with comprehensive parameter support: |
| 97 | + |
| 98 | +1. To run the request example using the general NeMo API, run the following command: |
45 | 99 |
|
46 | 100 | ```python |
47 | 101 | from nemo_deploy.nlp import NemoQueryLLM |
48 | 102 |
|
49 | | - nq = NemoQueryLLM(url="localhost:8000", model_name="nemotron") |
| 103 | + nq = NemoQueryLLM(url="localhost:8000", model_name="llama") |
50 | 104 | output = nq.query_llm(prompts=["What is the capital of United States?"], max_output_len=10, top_k=1, top_p=0.0, temperature=1.0) |
51 | 105 | print(output) |
52 | 106 | ``` |
53 | 107 |
|
54 | | -2. Change the url and the ``model_name`` based on your server and the model name of your service. Please check the NeMoQuery docstrings for details. |
55 | | - |
56 | | -3. If there is a prompt embedding table, run the following command to send a query: |
| 108 | +2. If the deployed model uses LoRA, run the following command to send a query: |
57 | 109 |
|
58 | 110 | ```python |
59 | | - output = nq.query_llm(prompts=["What is the capital of United States?"], max_output_len=10, top_k=1, top_p=0.0, temperature=1.0, task_id="0") |
60 | | - ``` |
| 111 | + output = nq.query_llm(prompts=["What is the capital of United States?"], max_output_len=10, top_k=1, top_p=0.0, temperature=1.0, lora_uids=["0"]) |
| 112 | + ``` |
| 113 | + |
| 114 | +### NemoQueryLLMPyTorch (PyTorch-based Models) |
| 115 | + |
| 116 | +For PyTorch-based LLM deployments with extended parameter support: |
| 117 | + |
| 118 | +```python |
| 119 | +from nemo_deploy.nlp import NemoQueryLLMPyTorch |
| 120 | + |
| 121 | +nq = NemoQueryLLMPyTorch(url="localhost:8000", model_name="llama") |
| 122 | +output = nq.query_llm( |
| 123 | + prompts=["What is the capital of United States?"], |
| 124 | + max_length=100, |
| 125 | + top_k=1, |
| 126 | + top_p=0.0, |
| 127 | + temperature=1.0, |
| 128 | + use_greedy=True, |
| 129 | + repetition_penalty=1.0 |
| 130 | +) |
| 131 | +print(output) |
| 132 | +``` |
| 133 | + |
| 134 | +### NemoQueryLLMHF (HuggingFace Models) |
| 135 | + |
| 136 | +For HuggingFace model deployments: |
| 137 | + |
| 138 | +```python |
| 139 | +from nemo_deploy.nlp import NemoQueryLLMHF |
| 140 | + |
| 141 | +nq = NemoQueryLLMHF(url="localhost:8000", model_name="llama") |
| 142 | +output = nq.query_llm( |
| 143 | + prompts=["What is the capital of United States?"], |
| 144 | + max_length=100, |
| 145 | + top_k=1, |
| 146 | + top_p=0.0, |
| 147 | + temperature=1.0 |
| 148 | +) |
| 149 | +print(output) |
| 150 | +``` |
| 151 | + |
| 152 | +### NemoQueryTRTLLMAPI (TensorRT-LLM API) |
| 153 | + |
| 154 | +For TensorRT-LLM API deployments: |
| 155 | + |
| 156 | +```python |
| 157 | +from nemo_deploy.nlp import NemoQueryTRTLLMAPI |
| 158 | + |
| 159 | +nq = NemoQueryTRTLLMAPI(url="localhost:8000", model_name="llama") |
| 160 | +output = nq.query_llm( |
| 161 | + prompts=["What is the capital of United States?"], |
| 162 | + max_length=100, |
| 163 | + top_k=1, |
| 164 | + top_p=0.8, |
| 165 | + temperature=1.0 |
| 166 | +) |
| 167 | +print(output) |
| 168 | +``` |
| 169 | + |
| 170 | +### NemoQueryvLLM (vLLM Deployments) |
| 171 | + |
| 172 | +For vLLM deployments with OpenAI-compatible responses: |
| 173 | + |
| 174 | +```python |
| 175 | +from nemo_deploy.nlp import NemoQueryvLLM |
| 176 | + |
| 177 | +nq = NemoQueryvLLM(url="localhost:8000", model_name="llama") |
| 178 | +output = nq.query_llm( |
| 179 | + prompts=["What is the capital of United States?"], |
| 180 | + max_tokens=100, |
| 181 | + top_k=1, |
| 182 | + top_p=0.8, |
| 183 | + temperature=1.0, |
| 184 | + seed=42 |
| 185 | +) |
| 186 | +print(output) |
| 187 | +``` |
| 188 | + |
| 189 | +## Query Class Selection Guide |
| 190 | + |
| 191 | +Choose the appropriate query class based on your deployment type: |
| 192 | + |
| 193 | +- **NemoQueryLLM**: TensorRT-LLM model deployments using TensorRT-LLM engine |
| 194 | +- **NemoQueryTRTLLMAPI**: TensorRT-LLM API deployments with a simplified parameter set. Use this class for models exported through TensorRT-LLM's newer export API |
| 195 | +- **NemoQueryLLMPyTorch**: PyTorch-based model deployments |
| 196 | +- **NemoQueryLLMHF**: HuggingFace model deployments |
| 197 | +- **NemoQueryvLLM**: vLLM deployments that return OpenAI-compatible responses |
| 198 | + |
| 199 | + |
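The selection guide added in this change can be read as a lookup table. The sketch below is only an illustration of that reading: the deployment-type keys (`"trtllm"`, `"vllm"`, and so on) and the helper `pick_query_client` are hypothetical names, not part of NeMo Export-Deploy, and the pairing of each class with a query script is inferred from how the sections above line up rather than stated outright in the docs.

```python
# Hypothetical lookup built from the selection guide in this diff.
# Keys are illustrative labels; values are (query class name, query script name)
# as they appear in the documentation text.
QUERY_CLIENTS = {
    "trtllm": ("NemoQueryLLM", "query.py"),
    "trtllm_api": ("NemoQueryTRTLLMAPI", "query_trtllm_api.py"),
    "pytorch": ("NemoQueryLLMPyTorch", "query_inframework.py"),
    "hf": ("NemoQueryLLMHF", "query_inframework_hf.py"),
    "vllm": ("NemoQueryvLLM", "query_vllm.py"),
}


def pick_query_client(deployment_type: str) -> tuple[str, str]:
    """Return the (class name, script name) pair for a deployment type label."""
    key = deployment_type.strip().lower()
    if key not in QUERY_CLIENTS:
        known = ", ".join(sorted(QUERY_CLIENTS))
        raise ValueError(f"unknown deployment type {deployment_type!r}; expected one of: {known}")
    return QUERY_CLIENTS[key]


if __name__ == "__main__":
    class_name, script_name = pick_query_client("vllm")
    print(f"Use {class_name} (or the {script_name} script)")
```

The names are returned as plain strings so the sketch runs outside the NeMo Framework container; inside the container, the class itself would be imported from `nemo_deploy.nlp` as in the examples in the diff.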