
Commit 406af80

crypdick, edoakes, and angelinalg authored
Various improvements to Serve Request Batching tutorial (ray-project#50400)
## Why are these changes needed?

* Separated the inference discussion to make the article flow better
* Added discussion on named deployments
* Added recreating a new Serve handle from different processes
* Added an example using curl directly against the HTTP endpoint
* Fixed imports in the Serve API example
* Cleaned up a massive block of outputs
* Added a prerequisites section to the tutorial
* Clarified that the tutorial teaches two separate options for serving models
* Fixed an orphaned `)`

## Related issue number

n/a

## Checks

- [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [x] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [x] Unit tests
  - [ ] Release tests
  - [ ] This PR is not tested :(

---------

Signed-off-by: Ricardo Decal <[email protected]>
Signed-off-by: Ricardo Decal <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
Co-authored-by: angelinalg <[email protected]>
1 parent 8e02bce commit 406af80
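The description above mentions named deployments and recreating a Serve handle from a different process. A minimal sketch of that workflow, assuming the app and deployment names that appear in the diff below (`Text-Completion-App` and `BatchTextGenerator`); this is illustrative scaffolding, not the exact code added by the commit:

```python
# Sketch: query a named Serve app from a separate Python process.
# Assumes the app was started elsewhere with:
#   serve run tutorial_batch:generator --name "Text-Completion-App"
# and contains a deployment named "BatchTextGenerator" (both names are taken
# from the diff below).
import ray
from ray import serve

# Connect to the already-running Ray cluster instead of starting a new one.
ray.init(address="auto")

# Recreate a handle to the named deployment from this new process.
handle = serve.get_deployment_handle(
    "BatchTextGenerator", app_name="Text-Completion-App"
)

# Enqueue one request and block on its result.
response = handle.handle_batch.remote("Once upon a time,")
print(response.result())
```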

File tree: 3 files changed, +87 -99 lines changed


.gitignore (1 addition & 0 deletions)

@@ -126,6 +126,7 @@ scripts/nodes.txt
 .idea/**/tasks.xml
 .idea/dictionaries
 .llvm-local.bazelrc
+.aider*
 
 # Sensitive or high-churn files:
 .idea/**/dataSources/

.vale/styles/config/vocabularies/General/accept.txt (1 addition & 0 deletions)

@@ -23,5 +23,6 @@ Alibaba
 LSH
 BTS
 [Mm]ultimodal
+Pythonic
 [Gg]rafana
 CLI

doc/source/serve/tutorials/batch.md (85 additions & 99 deletions)
@@ -6,17 +6,25 @@ orphan: true
 
 # Serve a Text Generator with Request Batching
 
-This example deploys a simple text generator that takes in
-a batch of queries and processes them at once. In particular, it shows:
+This tutorial shows how to deploy a text generator that processes multiple queries simultaneously using batching.
+Learn how to:
 
-- How to implement and deploy a Ray Serve deployment that accepts batches.
-- How to configure the batch size.
-- How to query the model in Python.
+- Implement a Ray Serve deployment that handles batched requests
+- Configure and optimize batch processing
+- Query the model from HTTP and Python
 
-This tutorial is a guide for serving online queries when your model can take advantage of batching. For example, linear regressions and neural networks use CPU and GPU's vectorized instructions to perform computation in parallel. Performing inference with batching can increase the *throughput* of the model as well as *utilization* of the hardware.
+Batching can significantly improve performance when your model supports parallel processing like GPU acceleration or vectorized operations.
+It increases both throughput and hardware utilization by processing multiple requests together.
 
-For _offline_ batch inference with large datasets, see [batch inference with Ray Data](batch_inference_home).
+:::{note}
+This tutorial focuses on online serving with batching. For offline batch processing of large datasets, see [batch inference with Ray Data](batch_inference_home).
+:::
+
+## Prerequisites
 
+```python
+pip install "ray[serve] transformers"
+```
 
 ## Define the Deployment
 Open a new Python file called `tutorial_batch.py`. First, import Ray Serve and some other helpers.
2634
:start-after: __doc_import_begin__
2735
```
2836

29-
You can use the `@serve.batch` decorator to annotate a function or a method.
30-
This annotation automatically causes calls to the function to be batched together.
31-
The function must handle a list of objects and is called with a single object.
32-
This function must also be `async def` so that you can handle multiple queries concurrently:
37+
Ray Serve provides the `@serve.batch` decorator to automatically batch individual requests to
38+
a function or class method.
39+
40+
The decorated method:
41+
- Must be `async def` to handle concurrent requests
42+
- Receives a list of requests to process together
43+
- Returns a list of results of equal length, one for each request
3344

3445
```python
3546
@serve.batch
3647
async def my_batch_handler(self, requests: List):
37-
pass
48+
# Process multiple requests together
49+
results = []
50+
for request in requests:
51+
results.append(request) # processing logic here
52+
return results
3853
```
3954

40-
The batch handler can then be called from another `async def` method in your deployment.
41-
These calls together are batched and executed together, but return an individual result as if
42-
they were a normal function call:
55+
You can call the batch handler from another `async def` method in your deployment.
56+
Ray Serve batches and executes these calls together, but returns individual results just like
57+
normal function calls:
4358

4459
```python
4560
class BatchingDeployment:
4661
@serve.batch
4762
async def my_batch_handler(self, requests: List):
4863
results = []
4964
for request in requests:
50-
results.append(request.json())
65+
results.append(request.json()) # processing logic here
5166
return results
5267

5368
async def __call__(self, request):
5469
return await self.my_batch_handler(request)
5570
```
5671

5772
:::{note}
58-
By default, Ray Serve performs *opportunistic batching*. This means that as
59-
soon as the batch handler is called, the method is executed without
60-
waiting for a full batch. If there are more queries available after this call
61-
finishes, the larger batch may be executed. You can tune this behavior using the
62-
`batch_wait_timeout_s` option to `@serve.batch` (defaults to 0). Increasing this
63-
timeout may improve throughput at the cost of latency under low load.
73+
Ray Serve uses *opportunistic batching* by default - executing requests as
74+
soon as they arrive without waiting for a full batch. You can adjust this behavior using
75+
`batch_wait_timeout_s` in the `@serve.batch` decorator to trade increased latency
76+
for increased throughput (defaults to 0). Increasing this value may improve throughput
77+
at the cost of latency under low load.
6478
:::
6579

6680
Next, define a deployment that takes in a list of input strings and runs
@@ -80,113 +94,85 @@ the maximum possible batch size that Ray Serve executes at once.
8094
:start-after: __doc_deploy_begin__
8195
```
8296

83-
## Deploy the Deployment
84-
Deploy the deployment by running the following through the terminal.
97+
## Deployment Options
98+
99+
You can deploy your app in two ways:
100+
101+
### Option 1: Deploying with the Serve Command-Line Interface
102+
```console
103+
$ serve run tutorial_batch:generator --name "Text-Completion-App"
104+
```
105+
106+
### Option 2: Deploying with the Python API
107+
108+
Alternatively, you can deploy the app using the Python API using the `serve.run` function.
109+
This command returns a handle that you can use to query the deployment.
110+
111+
```python
112+
from ray.serve.handle import DeploymentHandle
113+
114+
handle: DeploymentHandle = serve.run(generator, name="Text-Completion-App")
115+
```
116+
117+
You can now use this handle to query the model. See the [Querying the Model](#querying-the-model) section below.
118+
119+
120+
## Querying the Model
121+
122+
There are multiple ways to interact with your deployed model:
123+
124+
### 1. Simple HTTP Queries
125+
For basic testing, use curl:
126+
85127
```console
86-
$ serve run tutorial_batch:generator
128+
$ curl "http://localhost:8000/?text=Once+upon+a+time"
87129
```
88130

89-
Define a [Ray remote task](ray-remote-functions) to send queries in
90-
parallel. While Serve is running, open a separate terminal window, and run the
91-
following in an interactive Python shell or a separate Python script:
131+
### 2. Send HTTP requests in parallel with Ray
132+
For higher throughput, use [Ray remote tasks](ray-remote-functions) to send parallel requests:
92133

93134
```python
94135
import ray
95136
import requests
96-
import numpy as np
97137

98138
@ray.remote
99139
def send_query(text):
100-
resp = requests.get("http://localhost:8000/?text={}".format(text))
140+
resp = requests.post("http://localhost:8000/", params={"text": text})
101141
return resp.text
102142

103-
# Use Ray to send all queries in parallel
143+
# Example batch of queries
104144
texts = [
105145
'Once upon a time,',
106146
'Hi my name is Lewis and I like to',
107-
'My name is Mary, and my favorite',
108-
'My name is Clara and I am',
109-
'My name is Julien and I like to',
110-
'Today I accidentally',
111-
'My greatest wish is to',
112147
'In a galaxy far far away',
113-
'My best talent is',
114148
]
115-
results = ray.get([send_query.remote(text) for text in texts])
116-
print("Result returned:", results)
117-
```
118-
119-
You should get an output like the following. The first batch has a
120-
batch size of 1, and the subsequent queries have a batch size of 4. Even though the client script issues each
121-
query independently, Ray Serve evaluates them in batches.
122-
```python
123-
(pid=...) Our input array has length: 1
124-
(pid=...) Our input array has length: 4
125-
(pid=...) Our input array has length: 4
126-
Result returned: [
127-
'Once upon a time, when I got to look at and see the work of my parents (I still can\'t stand them,) they said, "Boys, you\'re going to like it if you\'ll stay away from him or make him look',
128-
129-
"Hi my name is Lewis and I like to look great. When I'm not playing against, it's when I play my best and always feel most comfortable. I get paid by the same people who make my games, who work hardest for me.",
130-
131-
"My name is Mary, and my favorite person in these two universes, the Green Lantern and the Red Lantern, are the same, except they're two of the Green Lanterns, but they also have their own different traits. Now their relationship is known",
132-
133-
'My name is Clara and I am married and live in Philadelphia. I am an English language teacher and translator. I am passionate about the issues that have so inspired me and my journey. My story begins with the discovery of my own child having been born',
134149

135-
'My name is Julien and I like to travel with my son on vacations... In fact I really prefer to spend more time with my son."\n\nIn 2011, the following year he was diagnosed with terminal Alzheimer\'s disease, and since then,',
136-
137-
"Today I accidentally got lost and went on another tour in August. My story was different, but it had so many emotions that it made me happy. I'm proud to still be able to go back to Oregon for work.\n\nFor the longest",
138-
139-
'My greatest wish is to return your loved ones to this earth where they can begin their own free and prosperous lives. This is true only on occasion as it is not intended or even encouraged to be so.\n\nThe Gospel of Luke 8:29',
140-
141-
'In a galaxy far far away, the most brilliant and powerful beings known would soon enter upon New York, setting out to restore order to the state. When the world turned against them, Darth Vader himself and Obi-Wan Kenobi, along with the Jedi',
142-
143-
'My best talent is that I can make a movie with somebody who really has a big and strong voice. I do believe that they would be great writers. I can tell you that to make sure."\n\n\nWith this in mind, "Ghostbusters'
144-
]
150+
# Send all queries in parallel
151+
results = ray.get([send_query.remote(text) for text in texts])
145152
```
146153

147-
## Deploy the Deployment using Python API
148-
If you want to evaluate a whole batch in Python, Ray Serve allows you to send
149-
queries with the Python API. A batch of queries can either come from the web server
150-
or the Python API.
151-
152-
To query the deployment with the Python API, use `serve.run()`, which is part
153-
of the Python API, instead of running `serve run` from the console. Add the following
154-
to the Python script `tutorial_batch.py`:
154+
### 3. Sending requests using DeploymentHandle
155+
For a more Pythonic way to query the model, you can use the deployment handle directly:
155156

156157
```python
157-
from ray.serve.handle import DeploymentHandle
158-
159-
handle: DeploymentHandle = serve.run(generator)
160-
)
161-
```
162-
163-
Generally, to enqueue a query, you can call `handle.method.remote(data)`. This call
164-
immediately returns a `DeploymentResponse`. You can call `.result()` to
165-
retrieve the result. Add the following to the same Python script.
158+
import ray
159+
from ray import serve
166160

167-
```python
168161
input_batch = [
169162
'Once upon a time,',
170163
'Hi my name is Lewis and I like to',
171-
'My name is Mary, and my favorite',
172-
'My name is Clara and I am',
173-
'My name is Julien and I like to',
174-
'Today I accidentally',
175-
'My greatest wish is to',
176164
'In a galaxy far far away',
177-
'My best talent is',
178165
]
179-
print("Input batch is", input_batch)
180166

181-
import ray
182-
responses = [handle.handle_batch.remote(batch) for batch in input_batch]
167+
# initialize using the 'auto' option to connect to the already-running Ray cluster
168+
ray.init(address="auto")
169+
170+
handle = serve.get_deployment_handle("BatchTextGenerator", app_name="Text-Completion-App")
171+
responses = [handle.handle_batch.remote(text) for text in input_batch]
183172
results = [r.result() for r in responses]
184-
print("Result batch is", results)
185173
```
186174

187-
Finally, run the script.
188-
```console
189-
$ python tutorial_batch.py
190-
```
175+
## Performance Considerations
191176

192-
You should get an output similar to the previous example.
177+
- Increase `max_batch_size` if you have sufficient memory and want higher throughput - this may increase latency
178+
- Increase `batch_wait_timeout_s` if throughput is more important than latency
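The "Performance Considerations" bullets added at the end of the tutorial refer to the two knobs on the `@serve.batch` decorator. A minimal sketch of how they might be set, with illustrative parameter values and placeholder processing that are not part of the commit:

```python
from typing import List

from ray import serve


@serve.deployment
class BatchTextGenerator:
    # Collect up to 8 requests per batch and wait up to 100 ms for a batch to
    # fill before running the handler. A larger max_batch_size can raise
    # throughput (and memory use); a longer batch_wait_timeout_s trades
    # per-request latency for fuller batches. Values here are illustrative.
    @serve.batch(max_batch_size=8, batch_wait_timeout_s=0.1)
    async def handle_batch(self, inputs: List[str]) -> List[str]:
        # Placeholder batch processing; the real tutorial runs a Hugging Face
        # text-generation pipeline over the whole batch here.
        return [text.upper() for text in inputs]

    async def __call__(self, request):
        return await self.handle_batch(request.query_params["text"])


generator = BatchTextGenerator.bind()
```

Running `serve run tutorial_batch:generator --name "Text-Completion-App"` against a file containing a class like this would exercise the batching path the tutorial describes.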
