From 912a6d03b67dbd8a51de21df3dbda1d5125cd773 Mon Sep 17 00:00:00 2001
From: Elijah Roussos <elijah.rou@gmail.com>
Date: Tue, 22 Oct 2024 15:11:19 -0400
Subject: [PATCH 1/6] feat: custom runtime docs

---
 cerebrium/environments/config-files.mdx   |  25 ++---
 cerebrium/environments/custom-runtime.mdx | 112 ++++++++++++++++++++++
 mint.json                                 |  32 +++++--
 3 files changed, 146 insertions(+), 23 deletions(-)
 create mode 100644 cerebrium/environments/custom-runtime.mdx

diff --git a/cerebrium/environments/config-files.mdx b/cerebrium/environments/config-files.mdx
index b99a3516..6d9f66da 100644
--- a/cerebrium/environments/config-files.mdx
+++ b/cerebrium/environments/config-files.mdx
@@ -29,10 +29,10 @@ The available deployment parameters are:
 | --- | --- | --- | --- |
 | `name` | The name of your app | string | my-app |
 | `python_version` | The Python version available for your runtime | float | {interpreter_version}|
-| `include` | Local files to include in the deployment. | string | '[./\*, main.py]' |
-| `exclude` | Local Files to exclude from the deployment. | string | '[./.\*]' |
+| `include` | Local files to include in the deployment. | list\[string] | \["./\*", "main.py"] |
+| `exclude` | Local files to exclude from the deployment. | list\[string] | \["./.\*"] |
 | `docker_base_image_url` | The docker base image you would like to run | string | 'debian:bookworm-slim' |
-| `shell_commands` | A list of commands to run an app entrypoint script | list[string] | []
+| `shell_commands` | A list of commands to run an app entrypoint script | list\[string] | []
 
 ## Hardware Parameters
 
@@ -72,11 +72,12 @@ This section lets you configure how you would like your deployment to scale. You
 
 These parameters are specified under the `cerebrium.scaling` section of your config file.
 
-| parameter      | description                                                                                         | type | default    |
-| -------------- | --------------------------------------------------------------------------------------------------- | ---- | ---------- |
-| `min_replicas` | The minimum number of replicas to run at all times.                                                 | int  | 0          |
-| `max_replicas` | The maximum number of replicas to scale to.                                                         | int  | plan limit |
-| `cooldown`     | The number of seconds to keep your app warm after each request. It resets after every request ends. | int  | 60         |
+| parameter             | description                                                                                                                                                            | type | default    |
+| --------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---- | ---------- |
+| `min_replicas`        | The minimum number of replicas to run at all times.                                                                                                                    | int  | 0          |
+| `max_replicas`        | The maximum number of replicas to scale to.                                                                                                                            | int  | plan limit |
+| `cooldown`            | The number of seconds to keep your app warm after each request. It resets after every request ends.                                                                    | int  | 60         |
+| `replica_concurrency` | The maximum number of requests an instance of your app can handle at a time. You should ensure your deployment can handle the concurrency before setting this above 1. | int  | 1          |
 
 ## Adding Dependencies
 
@@ -120,13 +121,6 @@ An example of an apt dependency is shown below:
 "libglib2.0-0" = "latest"
 ```
 
-### Integrate existing requirements files
-
-If you have an existing **requirements.txt**, **pkglist.txt** or **conda_pkglist.txt**, files in your project, we'll prompt you to automatically integrate these into your config file when you run `cerebrium deploy`.
-
-This way, you can leverage external tools to manage your dependencies and have them automatically integrated into your deployment.  
-For example, you can use the following command to generate a `requirements.txt` file from your current environment:
-
 ## Config File Example
 
 That was a lot of information!  
@@ -137,6 +131,7 @@ Below is an example of a config file that takes advantage of all the features we
 ```toml
 [cerebrium.deployment]
 name = "my-app"
+runtime = "custom"
 python_version = "3.10"
 include = "[./*, main.py]"
 exclude = "[./.*, ./__*]"
diff --git a/cerebrium/environments/custom-runtime.mdx b/cerebrium/environments/custom-runtime.mdx
new file mode 100644
index 00000000..1d6499d2
--- /dev/null
+++ b/cerebrium/environments/custom-runtime.mdx
@@ -0,0 +1,112 @@
+---
+title: Using Custom Runtimes (Beta)
+description: Configure custom ASGI or WSGI runtimes
+---
+
+The default Cortex runtime can be great for getting up and running and simple use cases. However, you may already have an application built, or need
+more complex functionality built into your app such as custom authentication, dynamic batching, public endpoints or websockets.
+The Cerebrium platform allows you to deploy a custom python-based runtime to achieve this. To illustrate how this works, let's
+take a straightforward example ASGI webserver written in FastAPI called `main.py`:
+
+```python
+from fastapi import FastAPI
+
+server = FastAPI()
+
+@server.get("/hello")
+async def hello():
+    return {"message": "Hello Cerebrium!"}
+
+@server.get("/health")
+asycn def health():
+    return "Ok"
+```
+
+To enable us to deploy this application, we modify our `cerebrium.toml` with a 'cerebrium.runtime.custom' section.
+There are 3 parameters in this section:
+
+| parameter              | description                                                                         | type       | default                                                                |
+| ---------------------- | ----------------------------------------------------------------------------------- | ---------- | ---------------------------------------------------------------------- |
+| `entrypoint`           | The command used to enter your application as a list of strings, run from `/cortex` | list\[str] | \["uvicorn", "app.main:server", "--host", "0.0.0.0", "--port", "8000"] |
+| `port`                 | The port your application runs on                                                   | int        | 8000                                                                   |
+| `healthcheck_endpoint` | The endpoint the application uses to relay that it is ready to receive requests.    | string     | "/readyz"                                                              |
+
+An example of a config section for a custom runtime for our main file may look something like this:
+
+```toml
+[cerebrium.deployment]
+name = "my-app"
+python_version = "3.10"
+...
+
+[cerebrium.runtime.custom]
+entrypoint = ["uvicorn", "app.main:server", "--host", "0.0.0.0", "--port", "8080"]
+port = 8080
+healthcheck_endpoint = "/health"
+
+...
+```
+
+An important note about entrypoints: the entrypoint is run from `/cortex`, and your source code lives in `/cortex/app`, so any path in the
+entrypoint must include the `app` prefix (e.g. to run `main.py`, the entrypoint would be `python app/main.py`). Also make sure that any port
+used in the entrypoint matches the `port` you specify.
+
+Depending on whether you deploy an ASGI application or an app with a self-contained webserver, you may need to install an ASGI server
+to run your app, just as you usually would. In this case, we are using an ASGI framework (FastAPI), so we will need to install `uvicorn`.
+Specify this in your dependencies:
+
+```toml
+...
+
+[cerebrium.dependencies.pip]
+fastapi = "latest"
+uvicorn = "latest"
+
+...
+```
+
+Alternatively, it is possible to run WSGI apps or apps with self-contained servers. For example, you could deploy
+a vLLM app using only the 'cerebrium.runtime.custom' and 'cerebrium.dependencies.pip' sections and **no**
+Python code!
+
+```toml
+...
+
+[cerebrium.runtime.custom]
+entrypoint = [
+  'vllm',
+  'serve',
+  'meta-llama/Meta-Llama-3-8B-Instruct',
+  '--host',
+  '0.0.0.0',
+  '--port',
+  '8000',
+  '--device',
+  'cuda',
+  '--dtype',
+  'auto',
+  '--trust-remote-code',
+  '--enable-chunked-prefill',
+  '--disable-fastapi-docs',
+  '--max_num_batched_tokens',
+  '1024'
+]
+port = 8000
+healthcheck_endpoint = "/health"
+
+[cerebrium.dependencies.pip]
+torch = "latest"
+vllm = "latest"
+
+...
+```
+
+Once you have made the necessary changes to your configuration, you are ready to deploy! You can deploy as normal
+and our system will automatically detect that you are running a custom runtime.
+
+```bash
+cerebrium deploy -y
+```
+
+Your call signature is exactly the same as when you deploy a Cortex application. Every endpoint your custom server exposes will be available on
+`api.cortex.cerebrium.ai/{project-id}/{app-name}/an/example/endpoint`
diff --git a/mint.json b/mint.json
index 8fe8aa28..9fa987d3 100644
--- a/mint.json
+++ b/mint.json
@@ -4,7 +4,9 @@
     "light": "/logo/light.svg",
     "dark": "/logo/dark.svg"
   },
-  "versions": ["v4"],
+  "versions": [
+    "v4"
+  ],
   "favicon": "/favicon.png",
   "colors": {
     "primary": "#EB3A6F",
@@ -82,16 +84,21 @@
         "cerebrium/environments/custom-images",
         "cerebrium/environments/using-secrets",
         "cerebrium/environments/multi-gpu-inferencing",
-        "cerebrium/environments/warm-models"
+        "cerebrium/environments/warm-models",
+        "cerebrium/environments/custom-runtime"
       ]
     },
     {
       "group": "Data Storage",
-      "pages": ["cerebrium/data-sharing-storage/persistent-storage"]
+      "pages": [
+        "cerebrium/data-sharing-storage/persistent-storage"
+      ]
     },
     {
       "group": "Deployments",
-      "pages": ["cerebrium/deployments/ci-cd"]
+      "pages": [
+        "cerebrium/deployments/ci-cd"
+      ]
     },
     {
       "group": "Endpoints",
@@ -103,15 +110,21 @@
     },
     {
       "group": "Integrations",
-      "pages": ["cerebrium/integrations/vercel"]
+      "pages": [
+        "cerebrium/integrations/vercel"
+      ]
     },
     {
       "group": "Misc",
-      "pages": ["cerebrium/misc/faster-model-loading"]
+      "pages": [
+        "cerebrium/misc/faster-model-loading"
+      ]
     },
     {
       "group": "Prebuilt Models",
-      "pages": ["cerebrium/prebuilt-models/introduction"]
+      "pages": [
+        "cerebrium/prebuilt-models/introduction"
+      ]
     },
     {
       "group": "FAQs and Tips",
@@ -140,7 +153,10 @@
     },
     {
       "group": "Migrations",
-      "pages": ["migrations/replicate", "migrations/hugging-face"]
+      "pages": [
+        "migrations/replicate",
+        "migrations/hugging-face"
+      ]
     }
   ],
   "analytics": {

From 046f91e844a71c3ca623bb97299427d5e9b1626c Mon Sep 17 00:00:00 2001
From: elijah-rou <elijah-rou@users.noreply.github.com>
Date: Tue, 22 Oct 2024 19:12:42 +0000
Subject: [PATCH 2/6] Prettified Code!

---
 mint.json | 29 +++++++----------------------
 1 file changed, 7 insertions(+), 22 deletions(-)

diff --git a/mint.json b/mint.json
index 9fa987d3..8b47534b 100644
--- a/mint.json
+++ b/mint.json
@@ -4,9 +4,7 @@
     "light": "/logo/light.svg",
     "dark": "/logo/dark.svg"
   },
-  "versions": [
-    "v4"
-  ],
+  "versions": ["v4"],
   "favicon": "/favicon.png",
   "colors": {
     "primary": "#EB3A6F",
@@ -90,15 +88,11 @@
     },
     {
       "group": "Data Storage",
-      "pages": [
-        "cerebrium/data-sharing-storage/persistent-storage"
-      ]
+      "pages": ["cerebrium/data-sharing-storage/persistent-storage"]
     },
     {
       "group": "Deployments",
-      "pages": [
-        "cerebrium/deployments/ci-cd"
-      ]
+      "pages": ["cerebrium/deployments/ci-cd"]
     },
     {
       "group": "Endpoints",
@@ -110,21 +104,15 @@
     },
     {
       "group": "Integrations",
-      "pages": [
-        "cerebrium/integrations/vercel"
-      ]
+      "pages": ["cerebrium/integrations/vercel"]
     },
     {
       "group": "Misc",
-      "pages": [
-        "cerebrium/misc/faster-model-loading"
-      ]
+      "pages": ["cerebrium/misc/faster-model-loading"]
     },
     {
       "group": "Prebuilt Models",
-      "pages": [
-        "cerebrium/prebuilt-models/introduction"
-      ]
+      "pages": ["cerebrium/prebuilt-models/introduction"]
     },
     {
       "group": "FAQs and Tips",
@@ -153,10 +141,7 @@
     },
     {
       "group": "Migrations",
-      "pages": [
-        "migrations/replicate",
-        "migrations/hugging-face"
-      ]
+      "pages": ["migrations/replicate", "migrations/hugging-face"]
     }
   ],
   "analytics": {

From 2d86c44b8b1310c066f04d044b6fb51a2b3b0fa3 Mon Sep 17 00:00:00 2001
From: Elijah Roussos <elijah.rou@gmail.com>
Date: Tue, 22 Oct 2024 15:23:15 -0400
Subject: [PATCH 3/6] fix config

---
 cerebrium/environments/config-files.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/cerebrium/environments/config-files.mdx b/cerebrium/environments/config-files.mdx
index 96805ba4..35e1faaa 100644
--- a/cerebrium/environments/config-files.mdx
+++ b/cerebrium/environments/config-files.mdx
@@ -131,7 +131,6 @@ Below is an example of a config file that takes advantage of all the features we
 ```toml
 [cerebrium.deployment]
 name = "my-app"
-runtime = "custom"
 python_version = "3.10"
 include = ["./*", "main.py"]
 exclude = ["./.*", "./__*"]
@@ -150,6 +149,7 @@ region = "us-east-1"
 min_replicas = 0
 max_replicas = 2
 cooldown = 60
+replica_concurrency = 1
 
 [cerebrium.dependencies.pip]
 torch = ">=2.0.0"

From 3cae4aaa5f3495581d1a51a17c093e1b8de34d7d Mon Sep 17 00:00:00 2001
From: Elijah Roussos <elijah.rou@gmail.com>
Date: Tue, 22 Oct 2024 15:49:51 -0400
Subject: [PATCH 4/6] single string example

---
 cerebrium/environments/custom-runtime.mdx | 21 ++-------------------
 1 file changed, 2 insertions(+), 19 deletions(-)
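
Reviewer note (not part of the commit): with this change the docs show `entrypoint` in two forms, a list of strings and a single string, and they also ask that any port in the entrypoint match `port`. The sketch below is one way to sanity-check a parsed `[cerebrium.runtime.custom]` section locally before deploying. The helper names are hypothetical, splitting the string form with `shlex` is an assumption about how the platform tokenizes it, and only the `--port <n>` spelling is recognised (not `--port=<n>`).

```python
import shlex


def normalize_entrypoint(entrypoint):
    """Return the entrypoint as a list of arguments, whichever form was given."""
    if isinstance(entrypoint, str):
        # Assumption: the single-string form is split like a shell command line.
        return shlex.split(entrypoint)
    return list(entrypoint)


def entrypoint_port(args):
    """Return the integer that follows a `--port` flag, or None if there is none."""
    for i, arg in enumerate(args[:-1]):
        if arg == "--port":
            return int(args[i + 1])
    return None


def check_custom_runtime(custom):
    """Validate a parsed `[cerebrium.runtime.custom]` table given as a dict.

    Raises ValueError when the entrypoint's --port disagrees with `port`;
    otherwise returns the normalized argument list.
    """
    args = normalize_entrypoint(custom["entrypoint"])
    flag_port = entrypoint_port(args)
    if flag_port is not None and flag_port != custom["port"]:
        raise ValueError(
            f"entrypoint uses port {flag_port}, but port is set to {custom['port']}"
        )
    return args
```

For example, passing `{"entrypoint": "uvicorn app.main:server --port 8080", "port": 8000}` raises `ValueError`, while the same entrypoint with `port` set to 8080 returns the argument list.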

diff --git a/cerebrium/environments/custom-runtime.mdx b/cerebrium/environments/custom-runtime.mdx
index 1d6499d2..098b34b0 100644
--- a/cerebrium/environments/custom-runtime.mdx
+++ b/cerebrium/environments/custom-runtime.mdx
@@ -71,26 +71,9 @@ Python code!
 
 ```toml
 ...
-
+# Note you can specify the entrypoint as a single string!
 [cerebrium.runtime.custom]
-entrypoint = [
-  'vllm',
-  'serve',
-  'meta-llama/Meta-Llama-3-8B-Instruct',
-  '--host',
-  '0.0.0.0',
-  '--port',
-  '8000',
-  '--device',
-  'cuda',
-  '--dtype',
-  'auto',
-  '--trust-remote-code',
-  '--enable-chunked-prefill',
-  '--disable-fastapi-docs',
-  '--max_num_batched_tokens',
-  '1024'
-]
+entrypoint = "vllm serve meta-llama/Meta-Llama-3-8B-Instruct --host 0.0.0.0 --port 8000 --device cuda"
 port = 8000
 healthcheck_endpoint = "/health"
 

From 8227c130b4e3e90b8487a29aa697270f1e882477 Mon Sep 17 00:00:00 2001
From: Elijah Roussos <elijah.rou@gmail.com>
Date: Wed, 23 Oct 2024 22:25:00 -0400
Subject: [PATCH 5/6] fix typos and add info

---
 cerebrium/environments/custom-runtime.mdx | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)
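
Reviewer note (not part of the commit): the updated table says the healthcheck endpoint must return a 200 before the app is treated as ready. The patch does not describe how Cerebrium probes the endpoint, so the following is only a local sketch of the same idea using the standard library: poll a URL until it answers 200 or give up. The function name, retry count, and delays are illustrative assumptions.

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(url, attempts=10, delay=0.5, timeout=2.0):
    """Poll `url` until it answers with HTTP 200; return False if it never does."""
    for attempt in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Connection refused, timeout, or an HTTP error status: not ready yet.
            pass
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

Running the FastAPI example locally on port 8080 and calling `wait_until_ready("http://127.0.0.1:8080/health")` is a quick way to confirm that `healthcheck_endpoint` points at a route that really returns 200.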

diff --git a/cerebrium/environments/custom-runtime.mdx b/cerebrium/environments/custom-runtime.mdx
index 098b34b0..36c1461b 100644
--- a/cerebrium/environments/custom-runtime.mdx
+++ b/cerebrium/environments/custom-runtime.mdx
@@ -13,23 +13,25 @@ from fastapi import FastAPI
 
 server = FastAPI()
 
+# This function would map to a request to api.cortex.cerebrium.ai/project-id/app-name/hello
 @server.get("/hello")
 async def hello():
     return {"message": "Hello Cerebrium!"}
 
+# You must define an endpoint that can relay to Cerebrium that the app is ready to receive requests
 @server.get("/health")
-asycn def health():
+async def health():
     return "Ok"
 ```
 
 To enable us to deploy this application, we modify our `cerebrium.toml` with a 'cerebrium.runtime.custom' section.
 There are 3 parameters in this section:
 
-| parameter              | description                                                                         | type       | default                                                                |
-| ---------------------- | ----------------------------------------------------------------------------------- | ---------- | ---------------------------------------------------------------------- |
-| `entrypoint`           | The command used to enter your application as a list of strings, run from `/cortex` | list\[str] | \["uvicorn", "app.main:server", "--host", "0.0.0.0", "--port", "8000"] |
-| `port`                 | The port your application runs on                                                   | int        | 8000                                                                   |
-| `healthcheck_endpoint` | The endpoint the application uses to relay that it is ready to receive requests.    | string     | "/readyz"                                                              |
+| parameter              | description                                                                                                                                                                                      | type       | default                                                                |
+| ---------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ---------- | ---------------------------------------------------------------------- |
+| `entrypoint`           | The command used to enter your application as either a list of strings or a single string. This is run from the `/cortex` directory                                                              | list\[str] | \["uvicorn", "app.main:server", "--host", "0.0.0.0", "--port", "8000"] |
+| `port`                 | The port your application runs on. You must ensure this port is the same one your app exposes and expects to receive traffic on                                                                  | int        | 8000                                                                   |
+| `healthcheck_endpoint` | The endpoint the application uses to relay that it is ready to receive requests. A _200_ response is required from the endpoint in order for Cerebrium to know when the app can receive requests | string     | "/readyz"                                                              |
 
 An example of a config section for a custom runtime for our main file may look something like this:
 

From 62cfc14c463a07f04262b1bd3865dd0c047fc88d Mon Sep 17 00:00:00 2001
From: Elijah Roussos <elijah.rou@gmail.com>
Date: Wed, 23 Oct 2024 22:33:22 -0400
Subject: [PATCH 6/6] top note

---
 cerebrium/environments/custom-runtime.mdx | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/cerebrium/environments/custom-runtime.mdx b/cerebrium/environments/custom-runtime.mdx
index 36c1461b..3b2883fa 100644
--- a/cerebrium/environments/custom-runtime.mdx
+++ b/cerebrium/environments/custom-runtime.mdx
@@ -3,6 +3,19 @@ title: Using Custom Runtimes (Beta)
 description: Configure custom ASGI or WSGI runtimes
 ---
 
+<Note>
+  This is a new feature! As such, the API is still subject to change.
+
+Most applications are expected to work with the current implementation.
+However, should you encounter an issue deploying a Custom Runtime, please
+reach out to us on Discord!
+
+Still on the way:
+
+- Websocket support
+- Healthcheck grace period to prevent takedown of healthy app that registers as unhealthy
+  </Note>
+
 The default Cortex runtime can be great for getting up and running and simple use cases. However, you may already have an application built, or need
 more complex functionality built into your app such as custom authentication, dynamic batching, public endpoints or websockets.
 The Cerebrium platform allows you to deploy a custom python-based runtime to achieve this. To illustrate how this works, let's