Merge pull request #183 from CerebriumAI/eli/custom-runtime-docs
Eli/custom runtime docs
elijah-rou authored Oct 24, 2024
2 parents 858d361 + 62cfc14 commit e803b30
Showing 3 changed files with 122 additions and 16 deletions.
25 changes: 10 additions & 15 deletions cerebrium/environments/config-files.mdx
@@ -29,10 +29,10 @@ The available deployment parameters are:
| --- | --- | --- | --- |
| `name` | The name of your app | string | my-app |
| `python_version` | The Python version available for your runtime | float | {interpreter_version}|
| `include` | Local files to include in the deployment. | string | '[./\*, main.py]' |
| `exclude` | Local Files to exclude from the deployment. | string | '[./.\*]' |
| `include` | Local files to include in the deployment. | list\[string] | \["./\*", "main.py"] |
| `exclude` | Local files to exclude from the deployment. | list\[string] | \["./.\*"] |
| `docker_base_image_url` | The docker base image you would like to run | string | 'debian:bookworm-slim' |
| `shell_commands` | A list of commands to run an app entrypoint script | list[string] | []
| `shell_commands` | A list of commands to run an app entrypoint script | list\[string] | [] |
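
For reference, a minimal sketch of a `[cerebrium.deployment]` section using these parameters might look like the following (the values shown are illustrative):

```toml
[cerebrium.deployment]
name = "my-app"
python_version = "3.10"
include = ["./*", "main.py"]
exclude = ["./.*"]
docker_base_image_url = "debian:bookworm-slim"
shell_commands = []
```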

## Hardware Parameters

@@ -72,11 +72,12 @@ This section lets you configure how you would like your deployment to scale. You

These parameters are specified under the `cerebrium.scaling` section of your config file.

| parameter | description | type | default |
| -------------- | --------------------------------------------------------------------------------------------------- | ---- | ---------- |
| `min_replicas` | The minimum number of replicas to run at all times. | int | 0 |
| `max_replicas` | The maximum number of replicas to scale to. | int | plan limit |
| `cooldown` | The number of seconds to keep your app warm after each request. It resets after every request ends. | int | 60 |
| parameter | description | type | default |
| --------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---- | ---------- |
| `min_replicas` | The minimum number of replicas to run at all times. | int | 0 |
| `max_replicas` | The maximum number of replicas to scale to. | int | plan limit |
| `cooldown` | The number of seconds to keep your app warm after each request. It resets after every request ends. | int | 60 |
| `replica_concurrency` | The maximum number of requests an instance of your app can handle at a time. You should ensure your deployment can handle the concurrency before setting this above 1. | int | 1 |
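
Taken together, a sketch of the corresponding `[cerebrium.scaling]` section (mirroring the defaults above and the example later in this file) could look like this:

```toml
[cerebrium.scaling]
min_replicas = 0
max_replicas = 2
cooldown = 60
replica_concurrency = 1
```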

## Adding Dependencies

@@ -120,13 +121,6 @@ An example of an apt dependency is shown below:
"libglib2.0-0" = "latest"
```

### Integrate existing requirements files

If you have existing **requirements.txt**, **pkglist.txt**, or **conda_pkglist.txt** files in your project, we'll prompt you to integrate these into your config file automatically when you run `cerebrium deploy`.

This way, you can leverage external tools to manage your dependencies and have them automatically integrated into your deployment.
For example, you can use the following command to generate a `requirements.txt` file from your current environment:

## Config File Example

That was a lot of information!
@@ -155,6 +149,7 @@ region = "us-east-1"
min_replicas = 0
max_replicas = 2
cooldown = 60
replica_concurrency = 1

[cerebrium.dependencies.pip]
torch = ">=2.0.0"
110 changes: 110 additions & 0 deletions cerebrium/environments/custom-runtime.mdx
@@ -0,0 +1,110 @@
---
title: Using Custom Runtimes (Beta)
description: Configure custom ASGI or WSGI runtimes
---

<Note>
This is a new feature! As such, the API is still subject to change.

Most applications are expected to work with the current implementation.
However, should you encounter an issue deploying a custom runtime, please
reach out to us on Discord!

Still on the way:

- Websocket support
- Healthcheck grace period to prevent a healthy app from being taken down if it briefly registers as unhealthy
</Note>

The default Cortex runtime is great for getting up and running and for simple use cases. However, you may already have an application built, or need
more complex functionality in your app such as custom authentication, dynamic batching, public endpoints or websockets.
The Cerebrium platform allows you to deploy a custom Python-based runtime to achieve this. To illustrate how this works, let's
take a straightforward example: an ASGI webserver written with FastAPI, in a file called `main.py`:

```python
from fastapi import FastAPI

server = FastAPI()

# This function would map to a request to api.cortex.cerebrium.ai/project-id/app-name/hello
@server.get("/hello")
async def hello():
    return {"message": "Hello Cerebrium!"}

# You must define an endpoint that can relay to Cerebrium that the app is ready to receive requests
@server.get("/health")
async def health():
    return "Ok"
```

To deploy this application, we modify our `cerebrium.toml` with a `[cerebrium.runtime.custom]` section.
There are three parameters in this section:

| parameter | description | type | default |
| ---------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ---------- | ---------------------------------------------------------------------- |
| `entrypoint` | The command used to start your application, as either a list of strings or a single string. This is run from the `/cortex` directory. | list\[str] | \["uvicorn", "app.main:server", "--host", "0.0.0.0", "--port", "8000"] |
| `port` | The port your application runs on. You must ensure this is the same port your app exposes and expects to receive traffic on. | int | 8000 |
| `healthcheck_endpoint` | The endpoint the application uses to signal that it is ready to receive requests. It must return a _200_ response for Cerebrium to know the app can receive requests. | string | "/readyz" |

A config section for a custom runtime serving our `main.py` might look something like this:

```toml
[cerebrium.deployment]
name = "my-app"
python_version = "3.10"
...

[cerebrium.runtime.custom]
entrypoint = ["uvicorn", "app.main:server", "--host", "0.0.0.0", "--port", "8080"]
port = 8080
healthcheck_endpoint = "/health"

...
```

An important note about entrypoints: since your source code is copied to `/cortex/app` and the entrypoint is run from `/cortex`, any paths in your entrypoint
must go through the `app` directory (e.g. to run `main.py` directly, the entrypoint would be `python app/main.py`). Furthermore, make sure that any port used in the entrypoint
matches the specified `port`.
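
For instance, a minimal sketch of a custom runtime section that launches a script directly, assuming `main.py` starts its own server on port 8000 and exposes the `/health` endpoint shown earlier, could look like this:

```toml
[cerebrium.runtime.custom]
# Run from /cortex, so the script path goes through app/
entrypoint = "python app/main.py"
port = 8000
healthcheck_endpoint = "/health"
```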

Depending on whether you deploy an ASGI application or an app with a self-contained webserver, you may need to install an ASGI server
to run your app, just as you would usually. In this case, we are using an ASGI framework (FastAPI), so we will need to install `uvicorn`.
Specify this in your dependencies:

```toml
...

[cerebrium.dependencies.pip]
fastapi = "latest"
uvicorn = "latest"

...
```

Conversely, it is also possible to run WSGI apps or apps with self-contained servers. For example, you could deploy
a vLLM app using only the `[cerebrium.runtime.custom]` and `[cerebrium.dependencies.pip]` sections and **no**
Python code!

```toml
...
# Note you can specify the entrypoint as a single string!
[cerebrium.runtime.custom]
entrypoint = "vllm serve meta-llama/Meta-Llama-3-8B-Instruct --host 0.0.0.0 --port 8000 --device cuda"
port = 8000
healthcheck_endpoint = "/health"

[cerebrium.dependencies.pip]
torch = "latest"
vllm = "latest"

...
```

Once you have made the necessary changes to your configuration, you are ready to deploy! You can deploy as normal,
and our system will automatically detect that you are running a custom runtime.

```bash
cerebrium deploy -y
```

Your call signature is exactly the same as when you deploy a Cortex application. Every endpoint your custom server exposes will be available at
`api.cortex.cerebrium.ai/{project-id}/{app-name}/an/example/endpoint`.
3 changes: 2 additions & 1 deletion mint.json
@@ -82,7 +82,8 @@
"cerebrium/environments/custom-images",
"cerebrium/environments/using-secrets",
"cerebrium/environments/multi-gpu-inferencing",
"cerebrium/environments/warm-models"
"cerebrium/environments/warm-models",
"cerebrium/environments/custom-runtime"
]
},
{
