Merge pull request #183 from CerebriumAI/eli/custom-runtime-docs
Eli/custom runtime docs
elijah-rou authored Oct 24, 2024
2 parents 858d361 + 62cfc14 commit e803b30
Showing 3 changed files with 122 additions and 16 deletions.
25 changes: 10 additions & 15 deletions cerebrium/environments/config-files.mdx
@@ -29,10 +29,10 @@ The available deployment parameters are:
| --- | --- | --- | --- |
| `name` | The name of your app | string | my-app |
| `python_version` | The Python version available for your runtime | float | {interpreter_version}|
| `include` | Local files to include in the deployment. | string | '[./\*, main.py]' |
| `exclude` | Local Files to exclude from the deployment. | string | '[./.\*]' |
| `include` | Local files to include in the deployment. | list\[string] | \["./\*", "main.py"] |
| `exclude` | Local files to exclude from the deployment. | list\[string] | \["./.\*"] |
| `docker_base_image_url` | The docker base image you would like to run | string | 'debian:bookworm-slim' |
| `shell_commands` | A list of commands to run an app entrypoint script | list[string] | []
| `shell_commands` | A list of commands to run an app entrypoint script | list\[string] | [] |
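
For reference, a minimal sketch of a `[cerebrium.deployment]` section using these parameters might look like the following (the values shown are illustrative):

```toml
[cerebrium.deployment]
name = "my-app"
python_version = "3.10"
include = ["./*", "main.py"]
exclude = ["./.*"]
docker_base_image_url = "debian:bookworm-slim"
shell_commands = []
```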

## Hardware Parameters

@@ -72,11 +72,12 @@ This section lets you configure how you would like your deployment to scale. You

These parameters are specified under the `cerebrium.scaling` section of your config file.

| parameter | description | type | default |
| -------------- | --------------------------------------------------------------------------------------------------- | ---- | ---------- |
| `min_replicas` | The minimum number of replicas to run at all times. | int | 0 |
| `max_replicas` | The maximum number of replicas to scale to. | int | plan limit |
| `cooldown` | The number of seconds to keep your app warm after each request. It resets after every request ends. | int | 60 |
| parameter | description | type | default |
| --------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---- | ---------- |
| `min_replicas` | The minimum number of replicas to run at all times. | int | 0 |
| `max_replicas` | The maximum number of replicas to scale to. | int | plan limit |
| `cooldown` | The number of seconds to keep your app warm after each request. It resets after every request ends. | int | 60 |
| `replica_concurrency` | The maximum number of requests an instance of your app can handle at a time. You should ensure your deployment can handle the concurrency before setting this above 1. | int | 1 |
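
Taken together, a sketch of the corresponding `[cerebrium.scaling]` section (mirroring the defaults above and the example later in this file) could look like this:

```toml
[cerebrium.scaling]
min_replicas = 0
max_replicas = 2
cooldown = 60
replica_concurrency = 1
```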

## Adding Dependencies

@@ -120,13 +121,6 @@ An example of an apt dependency is shown below:
"libglib2.0-0" = "latest"
```

### Integrate existing requirements files

If you have existing **requirements.txt**, **pkglist.txt**, or **conda_pkglist.txt** files in your project, we'll prompt you to integrate these into your config file automatically when you run `cerebrium deploy`.

This way, you can leverage external tools to manage your dependencies and have them automatically integrated into your deployment.
For example, you can use the following command to generate a `requirements.txt` file from your current environment:

## Config File Example

That was a lot of information!
@@ -155,6 +149,7 @@ region = "us-east-1"
min_replicas = 0
max_replicas = 2
cooldown = 60
replica_concurrency = 1

[cerebrium.dependencies.pip]
torch = ">=2.0.0"
110 changes: 110 additions & 0 deletions cerebrium/environments/custom-runtime.mdx
@@ -0,0 +1,110 @@
---
title: Using Custom Runtimes (Beta)
description: Configure custom ASGI or WSGI runtimes
---

<Note>
This is a new feature! As such, the API is still subject to change.

Most applications are expected to work with the current implementation.
However, should you encounter an issue deploying a custom runtime, please
reach out to us on Discord!

Still on the way:

- Websocket support
- Healthcheck grace period to prevent a healthy app from being taken down if it briefly registers as unhealthy
</Note>

The default Cortex runtime is great for getting up and running and for simple use cases. However, you may already have an application built, or need
more complex functionality in your app such as custom authentication, dynamic batching, public endpoints or websockets.
The Cerebrium platform allows you to deploy a custom Python-based runtime to achieve this. To illustrate how this works, let's
take a straightforward example: an ASGI webserver written with FastAPI, in a file called `main.py`:

```python
from fastapi import FastAPI

server = FastAPI()

# This function would map to a request to api.cortex.cerebrium.ai/project-id/app-name/hello
@server.get("/hello")
async def hello():
    return {"message": "Hello Cerebrium!"}

# You must define an endpoint that can relay to Cerebrium that the app is ready to receive requests
@server.get("/health")
async def health():
    return "Ok"
```

To deploy this application, we modify our `cerebrium.toml` with a `[cerebrium.runtime.custom]` section.
There are three parameters in this section:

| parameter | description | type | default |
| ---------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ---------- | ---------------------------------------------------------------------- |
| `entrypoint` | The command used to start your application, as either a list of strings or a single string. This is run from the `/cortex` directory. | list\[str] | \["uvicorn", "app.main:server", "--host", "0.0.0.0", "--port", "8000"] |
| `port` | The port your application runs on. You must ensure this is the same port your app exposes and expects to receive traffic on. | int | 8000 |
| `healthcheck_endpoint` | The endpoint the application uses to signal that it is ready to receive requests. It must return a _200_ response for Cerebrium to know the app can receive requests. | string | "/readyz" |

A config section for a custom runtime serving our `main.py` might look something like this:

```toml
[cerebrium.deployment]
name = "my-app"
python_version = "3.10"
...

[cerebrium.runtime.custom]
entrypoint = ["uvicorn", "app.main:server", "--host", "0.0.0.0", "--port", "8080"]
port = 8080
healthcheck_endpoint = "/health"

...
```

An important note about entrypoints: since your source code is copied to `/cortex/app` and the entrypoint is run from `/cortex`, any paths in your entrypoint
must go through the `app` directory (e.g. to run `main.py` directly, the entrypoint would be `python app/main.py`). Furthermore, make sure that any port used in the entrypoint
matches the specified `port`.
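
For instance, a minimal sketch of a custom runtime section that launches a script directly, assuming `main.py` starts its own server on port 8000 and exposes the `/health` endpoint shown earlier, could look like this:

```toml
[cerebrium.runtime.custom]
# Run from /cortex, so the script path goes through app/
entrypoint = "python app/main.py"
port = 8000
healthcheck_endpoint = "/health"
```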

Depending on whether you deploy an ASGI application or an app with a self-contained webserver, you may need to install an ASGI server
to run your app, just as you would usually. In this case, we are using an ASGI framework (FastAPI), so we will need to install `uvicorn`.
Specify this in your dependencies:

```toml
...

[cerebrium.dependencies.pip]
fastapi = "latest"
uvicorn = "latest"

...
```

Conversely, it is also possible to run WSGI apps or apps with self-contained servers. For example, you could deploy
a vLLM app using only the `[cerebrium.runtime.custom]` and `[cerebrium.dependencies.pip]` sections and **no**
Python code!

```toml
...
# Note you can specify the entrypoint as a single string!
[cerebrium.runtime.custom]
entrypoint = "vllm serve meta-llama/Meta-Llama-3-8B-Instruct --host 0.0.0.0 --port 8000 --device cuda"
port = 8000
healthcheck_endpoint = "/health"

[cerebrium.dependencies.pip]
torch = "latest"
vllm = "latest"

...
```

Once you have made the necessary changes to your configuration, you are ready to deploy! You can deploy as normal,
and our system will automatically detect that you are running a custom runtime.

```bash
cerebrium deploy -y
```

Your call signature is exactly the same as when you deploy a Cortex application. Every endpoint your custom server exposes will be available at
`api.cortex.cerebrium.ai/{project-id}/{app-name}/an/example/endpoint`.
3 changes: 2 additions & 1 deletion mint.json
@@ -82,7 +82,8 @@
"cerebrium/environments/custom-images",
"cerebrium/environments/using-secrets",
"cerebrium/environments/multi-gpu-inferencing",
"cerebrium/environments/warm-models"
"cerebrium/environments/warm-models",
"cerebrium/environments/custom-runtime"
]
},
{
