diff --git a/cerebrium/environments/config-files.mdx b/cerebrium/environments/config-files.mdx
index f6163fd1..35e1faaa 100644
--- a/cerebrium/environments/config-files.mdx
+++ b/cerebrium/environments/config-files.mdx
@@ -29,10 +29,10 @@ The available deployment parameters are:
 | --- | --- | --- | --- |
 | `name` | The name of your app | string | my-app |
 | `python_version` | The Python version available for your runtime | float | {interpreter_version}|
-| `include` | Local files to include in the deployment. | string | '[./\*, main.py]' |
-| `exclude` | Local Files to exclude from the deployment. | string | '[./.\*]' |
+| `include` | Local files to include in the deployment. | list\[string] | \["./\*", "main.py"] |
+| `exclude` | Local files to exclude from the deployment. | list\[string] | \["./.\*"] |
 | `docker_base_image_url` | The docker base image you would like to run | string | 'debian:bookworm-slim' |
-| `shell_commands` | A list of commands to run an app entrypoint script | list[string] | []
+| `shell_commands` | A list of commands to run an app entrypoint script | list\[string] | [] |

 ## Hardware Parameters

@@ -72,11 +72,12 @@ This section lets you configure how you would like your deployment to scale. You

 These parameters are specified under the `cerebrium.scaling` section of your config file.

-| parameter | description | type | default |
-| -------------- | --------------------------------------------------------------------------------------------------- | ---- | ---------- |
-| `min_replicas` | The minimum number of replicas to run at all times. | int | 0 |
-| `max_replicas` | The maximum number of replicas to scale to. | int | plan limit |
-| `cooldown` | The number of seconds to keep your app warm after each request. It resets after every request ends. | int | 60 |
+| parameter | description | type | default |
+| --------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------ | ---- | ---------- |
+| `min_replicas` | The minimum number of replicas to run at all times. | int | 0 |
+| `max_replicas` | The maximum number of replicas to scale to. | int | plan limit |
+| `cooldown` | The number of seconds to keep your app warm after each request. It resets after every request ends. | int | 60 |
+| `replica_concurrency` | The maximum number of requests an instance of your app can handle at a time. Ensure your deployment can handle the concurrency before setting this above 1. | int | 1 |

 ## Adding Dependencies

@@ -120,13 +121,6 @@ An example of an apt dependency is shown below:
 "libglib2.0-0" = "latest"
 ```

-### Integrate existing requirements files
-
-If you have an existing **requirements.txt**, **pkglist.txt** or **conda_pkglist.txt**, files in your project, we'll prompt you to automatically integrate these into your config file when you run `cerebrium deploy`.
-
-This way, you can leverage external tools to manage your dependencies and have them automatically integrated into your deployment.
-For example, you can use the following command to generate a `requirements.txt` file from your current environment:
-
 ## Config File Example

 That was a lot of information!
@@ -155,6 +149,7 @@ region = "us-east-1"
 min_replicas = 0
 max_replicas = 2
 cooldown = 60
+replica_concurrency = 1

 [cerebrium.dependencies.pip]
 torch = ">=2.0.0"
diff --git a/mint.json b/mint.json
index 8fe8aa28..8b47534b 100644
--- a/mint.json
+++ b/mint.json
@@ -82,7 +82,8 @@
         "cerebrium/environments/custom-images",
         "cerebrium/environments/using-secrets",
         "cerebrium/environments/multi-gpu-inferencing",
-        "cerebrium/environments/warm-models"
+        "cerebrium/environments/warm-models",
+        "cerebrium/environments/custom-runtime"
       ]
     },
     {
diff --git a/cerebrium/environments/custom-runtime.mdx b/cerebrium/environments/custom-runtime.mdx
new file mode 100644
index 00000000..3b2883fa
--- /dev/null
+++ b/cerebrium/environments/custom-runtime.mdx
@@ -0,0 +1,110 @@
+---
+title: Using Custom Runtimes (Beta)
+description: Configure custom ASGI or WSGI runtimes
+---
+
+<Note>
+  This is a new feature! As such, the API is still subject to change.
+
+  Most applications are expected to work with the current implementation.
+  However, should you encounter an issue deploying a Custom Runtime, please
+  reach out to us on Discord!
+
+  Still on the way:
+
+  - Websocket support
+  - Healthcheck grace period, to prevent takedown of a healthy app that registers as unhealthy
+</Note>
+
+The default Cortex runtime is great for getting up and running quickly and for simple use cases. However, you may already have an application built, or need
+more complex functionality in your app, such as custom authentication, dynamic batching, public endpoints or websockets.
+The Cerebrium platform allows you to deploy a custom Python-based runtime to achieve this. To illustrate how this works, let's
+take a straightforward example: an ASGI webserver written in FastAPI, called `main.py`:
+
+```python
+from fastapi import FastAPI
+
+server = FastAPI()
+
+# This function would map to a request to api.cortex.cerebrium.ai/project-id/app-name/hello
+@server.get("/hello")
+async def hello():
+    return {"message": "Hello Cerebrium!"}
+
+# You must define an endpoint that relays to Cerebrium that the app is ready to receive requests
+@server.get("/health")
+async def health():
+    return "Ok"
+```
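+Before touching any configuration, you can sanity-check a server like this on your own machine. A minimal local smoke
+test, assuming `uvicorn` and `curl` are installed (on Cerebrium the module path becomes `app.main:server`, as shown below):
+
+```bash
+# Run the example server from the directory containing main.py
+uvicorn main:server --host 0.0.0.0 --port 8000
+
+# In a second terminal: the healthcheck should return a 200 response with body "Ok"
+curl -i http://localhost:8000/health
+```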
+To deploy this application, we add a `cerebrium.runtime.custom` section to our `cerebrium.toml`.
+There are 3 parameters in this section:
+
+| parameter | description | type | default |
+| ---------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------ | ---------- | ---------------------------------------------------------------------- |
+| `entrypoint` | The command used to start your application, as either a list of strings or a single string. This is run from the `/cortex` directory | list\[str] | \["uvicorn", "app.main:server", "--host", "0.0.0.0", "--port", "8000"] |
+| `port` | The port your application runs on. You must ensure this is the same port your app exposes and expects to receive traffic on | int | 8000 |
+| `healthcheck_endpoint` | The endpoint your application uses to relay that it is ready to receive requests. A _200_ response from this endpoint tells Cerebrium that the app can receive requests | string | "/readyz" |
+
+A config section for a custom runtime serving our `main.py` might look something like this:
+
+```toml
+[cerebrium.deployment]
+name = "my-app"
+python_version = "3.10"
+...
+
+[cerebrium.runtime.custom]
+entrypoint = ["uvicorn", "app.main:server", "--host", "0.0.0.0", "--port", "8080"]
+port = 8080
+healthcheck_endpoint = "/health"
+
+...
+```
+
+An important note about entrypoints: your source code is copied to `/cortex/app`, while the entrypoint runs from `/cortex`, so
+any paths in your entrypoint must go through the `app` directory (e.g. to run `main.py` directly, the entrypoint would be
+`python app/main.py`). Also ensure that any port used in the entrypoint matches the `port` parameter.
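+For instance, if you prefer a plain `python app/main.py` entrypoint, `main.py` has to start its own server. A minimal
+sketch of what could be appended to the FastAPI example above (one possible approach, not a platform requirement):
+
+```python
+# Appended to the main.py shown above; `server` is the FastAPI instance defined there.
+import uvicorn
+
+if __name__ == "__main__":
+    # This port must match `port` under [cerebrium.runtime.custom] (8080 in the example config).
+    uvicorn.run(server, host="0.0.0.0", port=8080)
+```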
Furthermore, notice that any port used in the entrypoint +matches the specified port. + +Depending on whether you deploy an ASGI application or an app with a self-container webserver, you may need to install an ASGI runtime +to run your app just as you would usually. In this case, we are using an ASGI server (FastAPI), so we will need to install `uvicorn`. +Specify this in your dependencies: + +```toml +... + +[cerebrium.dependencies.pip] +fastapi = "latest" +uvicorn = "latest" + +... +``` + +Conversely, it is possible to run WSGI or apps with self contained servers. For example, you could deploy +a VLLM app using only the 'cerebrium.runtime.custom' and 'cerebrium.dependencies.pip' sections and **no** +Python code! + +```toml +... +# Note you can specify the entrypoint as a single string! +[cerebrium.runtime.custom] +entrypoint = "vllm serve meta-llama/Meta-Llama-3-8B-Instruct --host 0.0.0.0 --port 8000 --device cuda" +port = 8000 +healthcheck_endpoint = "/health" + +[cerebrium.dependencies.pip] +torch = "latest" +vllm = "latest" + +... +``` + +Once you have made the necessary changes to your configuration, you are ready to deploy! You can deploy as normal +and our system will detect you are running a custom runtime automatically. + +```bash +cerebrium deploy -y +``` + +Your call signature is exactly the same as when you deploy a Cortex application. Every endpoint your custom server exposes will be available on +`api.cortex.cerebrium/{project-id}/{app-name}/an/example/endpoint` diff --git a/mint.json b/mint.json index 8fe8aa28..8b47534b 100644 --- a/mint.json +++ b/mint.json @@ -82,7 +82,8 @@ "cerebrium/environments/custom-images", "cerebrium/environments/using-secrets", "cerebrium/environments/multi-gpu-inferencing", - "cerebrium/environments/warm-models" + "cerebrium/environments/warm-models", + "cerebrium/environments/custom-runtime" ] }, {