Tip
For information about how to run the HTTP server, see our documentation to deploying models.
When you run a Docker image built by Cog, it serves an HTTP API for making predictions.
The server supports both synchronous and asynchronous prediction creation:
- Synchronous: The server waits until the prediction is completed and responds with the result.
- Asynchronous: The server immediately returns a response and processes the prediction in the background.
The client can create a prediction asynchronously
by setting the Prefer: respond-async
header in their request.
When provided, the server responds immediately after starting the prediction
with 202 Accepted
status and a prediction object in status processing
.
Note
The only supported way to receive updates on the status of predictions started asynchronously is using webhooks. Polling for prediction status is not currently supported.
You can also use certain server endpoints to create predictions idempotently,
such that if a client calls this endpoint more than once with the same ID
(for example, due to a network interruption)
while the prediction is still running,
no new prediction is created.
Instead, the client receives a 202 Accepted
response
with the initial state of the prediction.
Here's a summary of the prediction creation endpoints:
Endpoint | Header | Behavior |
---|---|---|
POST /predictions |
- | Synchronous, non-idempotent |
POST /predictions |
Prefer: respond-async |
Asynchronous, non-idempotent |
PUT /predictions/<prediction_id> |
- | Synchronous, idempotent |
PUT /predictions/<prediction_id> |
Prefer: respond-async |
Asynchronous, idempotent |
Choose the endpoint that best fits your needs:
- Use synchronous endpoints when you want to wait for the prediction result.
- Use asynchronous endpoints when you want to start a prediction and receive updates via webhooks.
- Use idempotent endpoints when you need to safely retry requests without creating duplicate predictions.
You can provide a webhook
parameter in the client request body
when creating a prediction.
POST /predictions HTTP/1.1
Content-Type: application/json; charset=utf-8
Prefer: respond-async
{
"input": {"prompt": "A picture of an onion with sunglasses"},
"webhook": "https://example.com/webhook/prediction"
}
The server makes requests to the provided URL with the current state of the prediction object in the request body at the following times.
start
: Once, when the prediction starts (status
isstarting
).output
: Each time a predict function generates an output (either once usingreturn
or multiple times usingyield
)logs
: Each time the predict function writes tostdout
completed
: Once, when the prediction reaches a terminal state (status
issucceeded
,canceled
, orfailed
)
Webhook requests for start
and completed
event types
are sent immediately.
Webhook requests for output
and logs
event types
are sent at most once every 500ms.
This interval is not configurable.
By default, the server sends requests for all event types.
Clients can specify which events trigger webhook requests
with the webhook_events_filter
parameter in the prediction request body.
For example,
the following request specifies that webhooks are sent by the server
only at the start and end of the prediction:
POST /predictions HTTP/1.1
Content-Type: application/json; charset=utf-8
Prefer: respond-async
{
"input": {"prompt": "A picture of an onion with sunglasses"},
"webhook": "https://example.com/webhook/prediction",
"webhook_events_filter": ["start", "completed"]
}
Endpoints for creating and canceling a prediction idempotently
accept a prediction_id
parameter in their path.
The server can run only one prediction at a time.
The client must ensure that running prediction is complete
before creating a new one with a different ID.
Clients are responsible for providing unique prediction IDs.
We recommend generating a UUIDv4 or UUIDv7,
base32-encoding that value,
and removing padding characters (==
).
This produces a random identifier that is 26 ASCII characters long.
>> from uuid import uuid4
>> from base64 import b32encode
>> b32encode(uuid4().bytes).decode('utf-8').lower().rstrip('=')
'wjx3whax6rf4vphkegkhcvpv6a'
A model's predict
function can produce file output by yielding or returning
a cog.Path
or cog.File
value.
By default, files are returned as a base64-encoded data URL.
POST /predictions HTTP/1.1
Content-Type: application/json; charset=utf-8
{
"input": {"prompt": "A picture of an onion with sunglasses"},
}
HTTP/1.1 200 OK
Content-Type: application/json
{
"status": "succeeded",
"output": "data:image/png;base64,..."
}
When creating a prediction synchronously,
the client can configure a base URL to upload output files to instead
by setting the output_file_prefix
parameter in the request body:
POST /predictions HTTP/1.1
Content-Type: application/json; charset=utf-8
{
"input": {"prompt": "A picture of an onion with sunglasses"},
"output_file_prefix": "https://example.com/upload",
}
When the model produces a file output, the server sends the following request to upload the file to the configured URL:
PUT /upload HTTP/1.1
Host: example.com
Content-Type: multipart/form-data
--boundary
Content-Disposition: form-data; name="file"; filename="image.png"
Content-Type: image/png
<binary data>
--boundary--
If the upload succeeds, the server responds with output:
HTTP/1.1 200 OK
Content-Type: application/json
{
"status": "succeeded",
"output": "http://example.com/upload/image.png"
}
If the upload fails, the server responds with an error.
Important
File uploads for predictions created asynchronously
require --upload-url
to be specified when starting the HTTP server.
The OpenAPI specification of the API, which is derived from the input and output types specified in your model's Predictor and Training objects.
Makes a single prediction.
The request body is a JSON object with the following fields:
input
: A JSON object with the same keys as the arguments to thepredict()
function. AnyFile
orPath
inputs are passed as URLs.
The response body is a JSON object with the following fields:
status
: Eithersucceeded
orfailed
.output
: The return value of thepredict()
function.error
: Ifstatus
isfailed
, the error message.
POST /predictions HTTP/1.1
Content-Type: application/json; charset=utf-8
{
"input": {
"image": "https://example.com/image.jpg",
"text": "Hello world!"
}
}
HTTP/1.1 200 OK
Content-Type: application/json
{
"status": "succeeded",
"output": "data:image/png;base64,..."
}
If the client sets the Prefer: respond-async
header in their request,
the server responds immediately after starting the prediction
with 202 Accepted
status and a prediction object in status processing
.
POST /predictions HTTP/1.1
Content-Type: application/json; charset=utf-8
Prefer: respond-async
{
"input": {"prompt": "A picture of an onion with sunglasses"}
}
HTTP/1.1 202 Accepted
Content-Type: application/json
{
"status": "starting",
}
Make a single prediction.
This is the idempotent version of the POST /predictions
endpoint.
PUT /predictions/wjx3whax6rf4vphkegkhcvpv6a HTTP/1.1
Content-Type: application/json; charset=utf-8
{
"input": {"prompt": "A picture of an onion with sunglasses"}
}
HTTP/1.1 200 OK
Content-Type: application/json
{
"status": "succeeded",
"output": "data:image/png;base64,..."
}
If the client sets the Prefer: respond-async
header in their request,
the server responds immediately after starting the prediction
with 202 Accepted
status and a prediction object in status processing
.
PUT /predictions/wjx3whax6rf4vphkegkhcvpv6a HTTP/1.1
Content-Type: application/json; charset=utf-8
Prefer: respond-async
{
"input": {"prompt": "A picture of an onion with sunglasses"}
}
HTTP/1.1 202 Accepted
Content-Type: application/json
{
"id": "wjx3whax6rf4vphkegkhcvpv6a",
"status": "starting"
}
A client can cancel an asynchronous prediction by making a
POST /predictions/<prediction_id>/cancel
request
using the prediction id
provided when the prediction was created.
For example, if the client creates a prediction by sending the request:
POST /predictions HTTP/1.1
Content-Type: application/json; charset=utf-8
Prefer: respond-async
{
"id": "abcd1234",
"input": {"prompt": "A picture of an onion with sunglasses"},
}
The client can cancel the prediction by sending the request:
POST /predictions/abcd1234/cancel HTTP/1.1
A prediction cannot be canceled if it's
created synchronously, without the Prefer: respond-async
header,
or created without a provided id
.
If a prediction exists with the provided id
,
the server responds with status 200 OK
.
Otherwise, the server responds with status 404 Not Found
.
When a prediction is canceled,
Cog raises cog.server.exceptions.CancelationException
in the model's predict
function.
This exception may be caught by the model to perform necessary cleanup.
The cleanup should be brief, ideally completing within a few seconds.
After cleanup, the exception must be re-raised using a bare raise statement.
Failure to re-raise the exception may result in the termination of the container.
from cog import Path
from cog.server.exceptions import CancelationException
def predict(image: Path) -> Path:
try:
return process(image)
except CancelationException as e:
cleanup()
raise e