
Commit ac4f545

Add details regarding different query scripts and classes (#356)
Signed-off-by: Onur Yilmaz <[email protected]>
1 parent e043e13 commit ac4f545

1 file changed: +170, -31 lines changed

docs/llm/nemo_models/send-query.md

After starting the service with the scripts supplied in the TensorRT-LLM, vLLM, and In-Framework sections, the service will be in standby mode, ready to receive incoming requests. There are multiple methods available for sending queries to this service.

* Use the Query Script or Classes: Execute the query script or classes within the currently running container.
* PyTriton: Utilize PyTriton to send requests directly.
* HTTP Requests: Make HTTP requests using various tools or libraries.

## Send a Query using the Script

Choose the appropriate query script based on your deployment type. Each deployment method has its own specialized query script with relevant parameters.

### General TensorRT-LLM Models

For the models deployed with TensorRT-LLM using the [deployment script described here](../nemo_models/optimized/tensorrt-llm.md):

```shell
python /opt/Export-Deploy/scripts/deploy/nlp/query.py --url "http://localhost:8000" --model_name llama --prompt "What is the capital of United States?"
```

**Additional parameters** (several are combined in the sketch after this list):
- `--prompt_file`: Read prompt from file instead of command line
- `--max_output_len`: Max output token length (default: 128)
- `--top_k`: Top-k sampling (default: 1)
- `--top_p`: Top-p sampling (default: 0.0)
- `--temperature`: Sampling temperature (default: 1.0)
- `--lora_task_uids`: LoRA task UIDs for LoRA-enabled models
- `--stop_words_list`: List of stop words
- `--bad_words_list`: List of words to avoid
- `--no_repeat_ngram_size`: N-gram size for repetition penalty
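
For example, a minimal sketch that combines a few of these flags (the values are illustrative, not recommendations):

```shell
# Cap the response at 64 tokens and sample with a lower temperature.
python /opt/Export-Deploy/scripts/deploy/nlp/query.py \
    --url "http://localhost:8000" \
    --model_name llama \
    --prompt "What is the capital of United States?" \
    --max_output_len 64 \
    --top_k 1 \
    --temperature 0.7
```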

### In-Framework PyTorch NeMo Models

For NeMo models deployed with PyTorch in-framework using the [deployment script described here](../nemo_models/in-framework.md):

```shell
python /opt/Export-Deploy/scripts/deploy/nlp/query_inframework.py --url "http://localhost:8000" --model_name llama --prompt "What is the capital of United States?"
```

**Specific parameters** (see the sketch below):
- `--compute_logprob`: Return log probabilities
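
To also retrieve log probabilities, pass the flag alongside the usual arguments; a minimal sketch, assuming `--compute_logprob` is a plain on/off switch:

```shell
# Request log probabilities in addition to the generated text.
python /opt/Export-Deploy/scripts/deploy/nlp/query_inframework.py \
    --url "http://localhost:8000" \
    --model_name llama \
    --prompt "What is the capital of United States?" \
    --compute_logprob
```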

### In-Framework HuggingFace Models

For HuggingFace models deployed with the in-framework backend using the [deployment script described here](../automodel/automodel-in-framework.md):

```shell
python /opt/Export-Deploy/scripts/deploy/nlp/query_inframework_hf.py --url "http://localhost:8000" --model_name llama --prompt "What is the capital of United States?"
```

**Additional parameters** (see the sketch below):
- `--output_logits`: Return raw logits from the model output
- `--output_scores`: Return token probability scores from the model output
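
A minimal sketch that requests token probability scores as well, assuming `--output_scores` is a plain on/off switch:

```shell
# Return token probability scores along with the generated text.
python /opt/Export-Deploy/scripts/deploy/nlp/query_inframework_hf.py \
    --url "http://localhost:8000" \
    --model_name llama \
    --prompt "What is the capital of United States?" \
    --output_scores
```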

### vLLM Deployments

For models deployed with vLLM using the [deployment script described here](../nemo_models/optimized/vllm.md):

```shell
python /opt/Export-Deploy/scripts/deploy/nlp/query_vllm.py --url "http://localhost:8000" --model_name llama --prompt "What is the capital of United States?"
```

**vLLM-specific parameters:**
- `--max_tokens`: Maximum tokens to generate (default: 16)
- `--min_tokens`: Minimum tokens to generate (default: 0)
- `--n_log_probs`: Number of log probabilities per output token
- `--n_prompt_log_probs`: Number of log probabilities per prompt token
- `--seed`: Random seed for generation

**Note:** The `--max_output_len` parameter is not available in the `query_vllm.py` script. Instead, use `--max_tokens` to control the maximum number of output tokens, as in the sketch below.
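
For example, a minimal sketch that raises the token cap above the default of 16 and fixes the seed for reproducibility (the values are illustrative):

```shell
# Generate up to 64 tokens with a fixed seed.
python /opt/Export-Deploy/scripts/deploy/nlp/query_vllm.py \
    --url "http://localhost:8000" \
    --model_name llama \
    --prompt "What is the capital of United States?" \
    --max_tokens 64 \
    --seed 42
```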

### TensorRT-LLM API Deployments

For models deployed through the TensorRT-LLM API using the [deployment script described here](../nemo_models/optimized/tensorrt-llm.md):

```shell
python /opt/Export-Deploy/scripts/deploy/nlp/query_trtllm_api.py --url "http://localhost:8000" --model_name llama --prompt "What is the capital of United States?"
```

**TensorRT-LLM API parameters** (see the sketch below):
- `--max_length`: Maximum length of the generated sequence (default: 256)
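
A minimal sketch that raises the sequence-length cap above the default of 256 (the value is illustrative):

```shell
# Allow a longer generated sequence than the 256-token default.
python /opt/Export-Deploy/scripts/deploy/nlp/query_trtllm_api.py \
    --url "http://localhost:8000" \
    --model_name llama \
    --prompt "What is the capital of United States?" \
    --max_length 512
```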
## Send a Query using the NeMo APIs

The NeMo Framework provides multiple query APIs to send requests to the Triton server for different deployment types. These APIs are only accessible from the NeMo Framework container. Choose the appropriate query class based on your deployment method:

### NemoQueryLLM (TensorRT-LLM Models)

For deployed TensorRT-LLM models with comprehensive parameter support:

1. To run the request example using the general NeMo API, run the following command:

```python
from nemo_deploy.nlp import NemoQueryLLM

nq = NemoQueryLLM(url="localhost:8000", model_name="llama")
output = nq.query_llm(prompts=["What is the capital of United States?"], max_output_len=10, top_k=1, top_p=0.0, temperature=1.0)
print(output)
```

2. If a LoRA model is deployed, run the following command to send a query:

```python
output = nq.query_llm(prompts=["What is the capital of United States?"], max_output_len=10, top_k=1, top_p=0.0, temperature=1.0, lora_uids=["0"])
```

### NemoQueryLLMPyTorch (PyTorch-based Models)

For PyTorch-based LLM deployments with extended parameter support:

```python
from nemo_deploy.nlp import NemoQueryLLMPyTorch

nq = NemoQueryLLMPyTorch(url="localhost:8000", model_name="llama")
output = nq.query_llm(
    prompts=["What is the capital of United States?"],
    max_length=100,
    top_k=1,
    top_p=0.0,
    temperature=1.0,
    use_greedy=True,
    repetition_penalty=1.0
)
print(output)
```

### NemoQueryLLMHF (HuggingFace Models)

For HuggingFace model deployments:

```python
from nemo_deploy.nlp import NemoQueryLLMHF

nq = NemoQueryLLMHF(url="localhost:8000", model_name="llama")
output = nq.query_llm(
    prompts=["What is the capital of United States?"],
    max_length=100,
    top_k=1,
    top_p=0.0,
    temperature=1.0
)
print(output)
```

### NemoQueryTRTLLMAPI (TensorRT-LLM API)

For TensorRT-LLM API deployments:

```python
from nemo_deploy.nlp import NemoQueryTRTLLMAPI

nq = NemoQueryTRTLLMAPI(url="localhost:8000", model_name="llama")
output = nq.query_llm(
    prompts=["What is the capital of United States?"],
    max_length=100,
    top_k=1,
    top_p=0.8,
    temperature=1.0
)
print(output)
```

### NemoQueryvLLM (vLLM Deployments)

For vLLM deployments with OpenAI-compatible responses:

```python
from nemo_deploy.nlp import NemoQueryvLLM

nq = NemoQueryvLLM(url="localhost:8000", model_name="llama")
output = nq.query_llm(
    prompts=["What is the capital of United States?"],
    max_tokens=100,
    top_k=1,
    top_p=0.8,
    temperature=1.0,
    seed=42
)
print(output)
```

## Query Class Selection Guide

Choose the appropriate query class based on your deployment type:

- **NemoQueryLLM**: TensorRT-LLM model deployments using the TensorRT-LLM engine
- **NemoQueryTRTLLMAPI**: TensorRT-LLM API deployments with a simplified parameter set; specific to the new TensorRT-LLM API for exporting models
- **NemoQueryLLMPyTorch**: PyTorch-based model deployments
- **NemoQueryLLMHF**: HuggingFace model deployments
- **NemoQueryvLLM**: vLLM deployments that return OpenAI-compatible responses
