From 130c6c23419d4b7df282bc41e9def6d6d32cd4b2 Mon Sep 17 00:00:00 2001 From: Kyle Gani Date: Thu, 12 Dec 2024 12:57:19 +0200 Subject: [PATCH 1/6] Updated: Layout and added missing docs --- available-hardware.mdx | 135 ------------------ .../persistent-storage.mdx | 99 ------------- cerebrium/endpoints/async.mdx | 4 +- cerebrium/endpoints/custom-web-servers.mdx | 50 +++++++ cerebrium/endpoints/inference-api.mdx | 2 +- cerebrium/getting-started/collaborating.mdx | 36 +++++ .../fast-deployments-dos-and-donts.mdx | 0 .../faster-model-loading.mdx | 0 .../fitting-large-models-on-small-gpus.mdx | 0 .../prebuilt-models.mdx} | 0 .../{misc => other-topics}/using-secrets.mdx | 0 cerebrium/scaling/batching-concurrency.mdx | 75 ++++++++++ cerebrium/scaling/scaling-apps.mdx | 62 ++++++++ cerebrium/storage/managing-files.mdx | 77 ++++++++++ mint.json | 39 ++--- reliability-scalability.mdx | 32 ----- 16 files changed, 324 insertions(+), 287 deletions(-) delete mode 100644 available-hardware.mdx delete mode 100644 cerebrium/data-sharing-storage/persistent-storage.mdx create mode 100644 cerebrium/endpoints/custom-web-servers.mdx create mode 100644 cerebrium/getting-started/collaborating.mdx rename cerebrium/{faqs-and-help => other-topics}/fast-deployments-dos-and-donts.mdx (100%) rename cerebrium/{misc => other-topics}/faster-model-loading.mdx (100%) rename cerebrium/{faqs-and-help => other-topics}/fitting-large-models-on-small-gpus.mdx (100%) rename cerebrium/{prebuilt-models/introduction.mdx => other-topics/prebuilt-models.mdx} (100%) rename cerebrium/{misc => other-topics}/using-secrets.mdx (100%) create mode 100644 cerebrium/scaling/batching-concurrency.mdx create mode 100644 cerebrium/scaling/scaling-apps.mdx create mode 100644 cerebrium/storage/managing-files.mdx delete mode 100644 reliability-scalability.mdx diff --git a/available-hardware.mdx b/available-hardware.mdx deleted file mode 100644 index a38e78cf..00000000 --- a/available-hardware.mdx +++ /dev/null @@ -1,135 +0,0 @@ ---- -title: "Available Hardware" -description: "A list of hardware that is available on Cerebrium's platform." ---- - -The Cerebrium platform allows you to quickly and easily deploy machine learning workloads on a variety of different hardware. -We take care of all the hard work so you don't have to. We manage everything from the hardware drivers to the scaling of your deployments so that you can focus on what matters most: your use case. - -This page lists the hardware that is currently available on the platform. If you would like us to support additional hardware options on the platform please reach out to [Support](mailto:support@cerebrium.ai) - -# Hardware - -## GPUs - -We have the following graphics cards available on the platform. 
The Cerebrium identifier (used in your config file), -is usually a combination of the *GPU generation* and *model name* to avoid ambiguity: - -| GPU Model | Cerebrium Identifier | VRAM | Minimum Plan | Provider -| --------------------------------------------------------------------------------------------------- | :------: |------ | :-------------------: | :-------------------: | -| [NVIDIA H100](https://www.nvidia.com/en-us/data-center/h100/) | HOPPER_H100 | 80GB | Enterprise | [AWS] -| [NVIDIA A100_80GB](https://www.nvidia.com/en-us/data-center/a100/) | AMPERE_A100 | 80GB | Enterprise | [AWS] -| [NVIDIA A100_40GB](https://www.nvidia.com/en-us/data-center/a100/) | AMPERE_A100_40GB | 40GB | Enterprise | [AWS] -| [NVIDIA A10](https://www.nvidia.com/en-us/data-center/products/a10-gpu/) | AMPERE_A10 | 24GB | Hobby | [AWS] -| [NVIDIA L40s](https://www.nvidia.com/en-us/data-center/l40s/) | ADA_L40 | 48GB | Hobby | [AWS] -| [NVIDIA L4](https://www.nvidia.com/en-us/data-center/l4/) | ADA_L4 | 24GB | Hobby | [AWS] -| [NVIDIA T4](https://www.nvidia.com/en-us/data-center/tesla-t4/) | TURING_T4 | 16GB | Hobby | [AWS] -| [AWS INFERENTIA](https://aws.amazon.com/machine-learning/inferentia/) | INF2 | 32GB | Hobby | [AWS] -| [AWS TRANIUM](https://aws.amazon.com/machine-learning/trainium/) | TRN1 | 32GB | Hobby | [AWS] - -These GPUs can be selected using the `--gpu` flag when deploying your app on Cortex or can be specified in your `cerebrium.toml`. -For more help with deciding which GPU you require, see this section [here](#choosing-a-gpu). - -## CPUs - -We select the CPU based on your choice of hardware, choosing the best available options so you can get the performance you need. - -You can choose the number of CPU cores you require for your deployment. If you don't need the cores, choose how many you require and pay for only what you need! - -## CPU Memory - -We let you select the amount of memory you require for your deployment. -All the memory you request is dedicated to your deployment and is not shared with any other deployments, ensuring that you get the performance you need. -This is the amount of memory that is available to your code when it is running, and you should choose an adequate amount for your model to be loaded into VRAM if you are deploying onto a GPU. -Once again, you only pay for what you need! - -## Storage - -We provide you with a persistent storage volume attached to your deployment. -You can use this storage volume to store any data that you need to persist between deployments. Accessing your persistent storage is covered in depth for [cortex here](/cerebrium/data-sharing-storage/persistent-storage). - -The storage volume is backed by high-performance SSDs so that you can get the best performance possible -Pricing for storage is based on the amount of storage you use and is charged per GB per month. - -# Determine your Hardware Requirements - -Deciding which hardware you require for your deployment can be a daunting task. -On one hand, you want the best performance possible, but on the other hand, you don't want to pay for more resources than you need. - -## Choosing a GPU - -Choosing a GPU can be a complicated task of calculating VRAM usage based on the number of parameters you have as well as the length of your inputs. Additionally, some variables are dependent on your inputs to your app which will affect the VRAM usage substantially. 
For example, with LLMs and transformer-based architectures, you need to factor in attention processes as well as any memory-heavy positional encoding that may be happening which can increase VRAM usage exponentially for some methods. Similarly, for CNNs, you need to look at the number of filters you are using as well as the size of your inputs. - -As a rule of thumb, the easiest way is to choose the GPU that has at least 1.5x the minimum amount of VRAM that your model requires. -This approach is conservative and will ensure that your model will fit on the GPU you choose even if you have longer inputs than you expect. However, it's just a rule of thumb, and you should test the VRAM usage of your model to ensure that it will fit on the GPU you choose. - -You can calculate the VRAM usage of your model by using the following formula: - -```python -modelVRAM = numParams x numBytesPerDataType -``` - -For example, if you have a model that is 7B parameters, and you decide to use 32-bit Floating point precision, you can calculate the VRAM usage as follows: - -```python -modelVRAM = 7B x 4 = 28GB -``` - -When you include the 1.5x multiplier from our rule of thumb, this means that you should choose a GPU with at least ~40GB of VRAM to ensure that your model will fit on the GPU you choose. - -Alternatively, if you were happy with the slight precision penalty of using quantisation, your model would have required 7GB of VRAM for 8-bit quantisation. So you could have chosen a GPU with 16GB of VRAM. This is the approach we recommend, especially with large models (>20B parameters) as the precision penalty is minimal and your cost savings are substantial. - - - Pro tip: The precision loss from quantisation is negligible in comparison to - the performance gains you get from the larger model that can fit on the same - GPU. - - -## Setting your number of CPU Cores - -If you are unsure of how many CPU cores you need, it's best to start with either 2 or 4 cores and scale up if you find you need more. For most use cases, 4 cores are sufficient. - -## Calculating your required CPU Memory - -A guideline for calculating the amount of CPU memory that you require is to use the same amount of CPU memory as you have VRAM. This is because the model will generally be loaded into CPU memory first during initialisation. This happens while the weights are being loaded from storage and the model is being compiled. Once the model is compiled, it is loaded into VRAM and the CPU memory is freed up. -Certain libraries, such as *transformers*, do have methods that allow you to use less CPU memory by setting flags such as `low_cpu_mem_usage` which trade off CPU memory for longer initialisation times. - -## Storage Requirements - -Storage is calculated on an ad-hoc basis and is calculated as you use it. -This storage is persistent and will be available to you for as long as you need it. - -# Advanced Parameters - -This section is for those users that have a large volume of requests and want to optimise their deployments for cost and performance. -The following parameters are available to you when deploying your model on Cortex. - -## Setting your minimum number of instances after a deployment - -When a normal deployment starts, we start with a single instance of your deployment and scale this up & down as required. This instance is used to handle all the requests that come in. If the single instance is inadequate, we will scale up your deployment to handle the load automatically. 
-This approach allows us to keep your costs as low as possible while still providing you with the performance you need. -However, if you have sustained, high-volume (1000s of requests per second) you can set the scale of your deployment using a `min_replicas` parameter so you don't incur a cold-start/startup time. - -This lets you start your deployment with a minimum number of instances. This means, you skip the scaling up of your deployment and jump straight to the performance you need. -For example, if you set `min_replicas=4`, your deployment will start with 4 instances and will scale up from there if required. - -Note that the number of instances you can have running simultaneously for each of your models is determined by the subscription plan that you are on. - -## Calculating your minimum number of instances - -If you know the volume of requests you are expecting as well as the amount of time each request takes (in seconds), you can calculate the minimum number of instances you require as follows: - -```python -minReplicas = ceil((requestsPerMin * requestDuration) ÷ 60) -``` - -So, if you're receiving 10 000 requests\min and each request takes 0.5s, you can calculate the minimum number of instances you require as follows: - -```python -minReplicas = ceil((10 000 * 0.5) ÷ 60) -# minReplicas = 84 -``` - -## Setting your maximum number of instances - -You may want to limit the number of instances, in this case, you can limit the number of instances that your deployment can scale up to using the `max_replicas` parameter. diff --git a/cerebrium/data-sharing-storage/persistent-storage.mdx b/cerebrium/data-sharing-storage/persistent-storage.mdx deleted file mode 100644 index da4af03a..00000000 --- a/cerebrium/data-sharing-storage/persistent-storage.mdx +++ /dev/null @@ -1,99 +0,0 @@ ---- -title: "Persistent Volumes" ---- - -Cerebrium gives you access to persistent volumes to store model weights and files. -This volume persists across your project, meaning that if -you refer to model weights or files created in a different app (but in the same project), you're able to access them. - -This allows model weights to be loaded in more efficiently, as well as reduce the size of your App container image. - -### How it works - -Every Cerebrium Project comes with a 50GB volume by default. This volume is mounted on all apps as `/persistent-storage`. - -### Uploading files - -To upload files to your persistent volume, you can use the `cerebrium cp local_path dest_path` command. This command copies files from your local machine to the specified destination path in the volume. The dest_path is optional; if not provided, the files will be uploaded to the root of the persistent volume. - -```bash -Usage: cerebrium cp [OPTIONS] LOCAL_PATH REMOTE_PATH (Optional) - - Copy contents to persistent volume. - -Options: - -h, --help Show this message and exit. - -Examples: - # Copy a single file - cerebrium cp src_file_name.txt # copies to /src_file_name.txt - - cerebrium cp src_file_name.txt dest_file_name.txt # copies to /dest_file_name.txt - - # Copy a directory - cerebrium cp dir_name # copies to the root directory - cerebrium cp dir_name sub_folder/ # copies to sub_folder/ -``` - -### Listing files - -To list the files on your persistent volume, you can use the cerebrium ls [remote_path] command. This command lists all files and directories within the specified remote_path. If no remote_path is provided, it lists the contents of the root directory of the persistent volume. 
- -```bash -Usage: cerebrium ls [OPTIONS] REMOTE_PATH (Optional) - - List contents of persistent volume. - -Options: - -h, --help Show this message and exit. - -Examples: - # List all files in the root directory - cerebrium ls - - # List all files in a specific folder - cerebrium ls sub_folder/ -``` - -### Deleting files - -To delete files or directories from your persistent volume, use the `cerebrium rm remote_path` command. This command removes the specified file or directory from the persistent volume. Be careful, as this operation is irreversible. - -```bash -Usage: cerebrium rm [OPTIONS] REMOTE_PATH - - Remove a file or directory from persistent volume. - -Options: - -h, --help Show this message and exit. - -Examples: - # Remove a specific file - cerebrium rm /file_name.txt - - # Remove a directory and all its contents - cerebrium rm /sub_folder/ -``` - -### Real world example - -```bash -wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth -cerebrium cp sam_vit_h_4b8939.pth segment-anything/sam_vit_h_4b8939.pth -``` - -As a simple example, suppose you have an external SAM model that you want to use in your custom deployment. You can download it to a cache directory on your persistent volume. -as such: - -```python -import os -import torch - -file_path = "/persistent-storage/segment-anything/sam_vit_h_4b8939.pth" - -# Load the model -model = torch.jit.load(file_path) -... # Continue with your initialization -``` - -Now, in later inference requests, the model loads from the persistent volume instead of downloading again. diff --git a/cerebrium/endpoints/async.mdx b/cerebrium/endpoints/async.mdx index 34bc7bbd..ec5bd233 100644 --- a/cerebrium/endpoints/async.mdx +++ b/cerebrium/endpoints/async.mdx @@ -1,6 +1,6 @@ --- -title: "Asynchronous Execution (Preview)" -description: "Execute calls to a Cerebrium app to be run asynchroously" +title: "Asynchronous Execution" +description: "Execute calls to a Cerebrium app to be run asynchronously" --- diff --git a/cerebrium/endpoints/custom-web-servers.mdx b/cerebrium/endpoints/custom-web-servers.mdx new file mode 100644 index 00000000..c3690123 --- /dev/null +++ b/cerebrium/endpoints/custom-web-servers.mdx @@ -0,0 +1,50 @@ +--- +title: "Custom Web Servers" +description: "Run ASGI/WSGI python apps on Cerebrium" +--- + +While Cerebrium's default runtime works well for most app needs, teams sometimes need more control over their web server implementation. Using ASGI or WSGI servers through Cerebrium's custom runtime feature enables capabilities like custom authentication, dynamic batching, frontend dashboards, public endpoints, and websockets. 
+ +## Setting Up Custom Servers + +Here's a simple FastAPI server implementation that shows how custom servers work in Cerebrium: + +```python +from fastapi import FastAPI + +server = FastAPI() + +@server.post("/hello") +def hello(): + return {"message": "Hello Cerebrium!"} + +@server.get("/health") +def health(): + return "Ok" +``` + +Configure this server in `cerebrium.toml` by adding a custom runtime section: + +```toml +[cerebrium.runtime.custom] +port = 5000 +entrypoint = ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "5000"] +healthcheck_endpoint = "/health" + +[cerebrium.dependencies.pip] +pydantic = "latest" +numpy = "latest" +loguru = "latest" +fastapi = "latest" +``` + +The configuration requires three key parameters: +- `entrypoint`: The command that starts your server +- `port`: The port your server listens on +- `healthcheck_endpoint`: The endpoint that confirms server health + + +For ASGI applications like FastAPI, include the appropriate server package (like `uvicorn`) in your dependencies. After deployment, your endpoints become available at `https://api.cortex.cerebrium.ai/v4/{project-id}/{app-name}/your/endpoint`. + + +Our [FastAPI Server Example](https://github.com/CerebriumAI/examples) provides a complete implementation. \ No newline at end of file diff --git a/cerebrium/endpoints/inference-api.mdx b/cerebrium/endpoints/inference-api.mdx index 08c89702..e49b90f4 100644 --- a/cerebrium/endpoints/inference-api.mdx +++ b/cerebrium/endpoints/inference-api.mdx @@ -1,5 +1,5 @@ --- -title: "Inference API" +title: "Rest API" description: "" --- diff --git a/cerebrium/getting-started/collaborating.mdx b/cerebrium/getting-started/collaborating.mdx new file mode 100644 index 00000000..a319d806 --- /dev/null +++ b/cerebrium/getting-started/collaborating.mdx @@ -0,0 +1,36 @@ +--- +title: Collaborating on Cerebrium +description: Learn how to manage your team on the platform +--- + +## Managing Team Members + +The Users section in the Cerebrium dashboard provides tools for managing project access and collaboration. From this central location, project administrators can invite team members and control their access levels through role assignments. + +## Adding Team Members + +To add new members to a project: + +1. Navigate to the Users section in the dashboard sidebar +2. Select "Invite New User" in the top right corner +3. Enter the new member's email address +4. Choose their role (Member or Billing) +5. Select "Send Invitation" + +The platform sends an email invitation automatically and tracks its status in the Users table. + +## Access Roles + +Each team member receives a specific role that defines their project access. The Member role enables teams to deploy applications and monitor their performance. For financial management, the Billing role grants additional access to payment settings and usage reporting. + +## Managing Access + +The Users table displays member details, including names, email addresses, roles, and when they joined the project. From this view, administrators can: + +1. Monitor invitation status for pending members +2. Track when members joined the project +3. View current roles and access levels +4. Adjust roles as team needs change +5. Resend invitations when needed + +Once members accept their invitations, they gain immediate access based on their assigned roles and can access their authorised project(s) from the dashboards. 
\ No newline at end of file diff --git a/cerebrium/faqs-and-help/fast-deployments-dos-and-donts.mdx b/cerebrium/other-topics/fast-deployments-dos-and-donts.mdx similarity index 100% rename from cerebrium/faqs-and-help/fast-deployments-dos-and-donts.mdx rename to cerebrium/other-topics/fast-deployments-dos-and-donts.mdx diff --git a/cerebrium/misc/faster-model-loading.mdx b/cerebrium/other-topics/faster-model-loading.mdx similarity index 100% rename from cerebrium/misc/faster-model-loading.mdx rename to cerebrium/other-topics/faster-model-loading.mdx diff --git a/cerebrium/faqs-and-help/fitting-large-models-on-small-gpus.mdx b/cerebrium/other-topics/fitting-large-models-on-small-gpus.mdx similarity index 100% rename from cerebrium/faqs-and-help/fitting-large-models-on-small-gpus.mdx rename to cerebrium/other-topics/fitting-large-models-on-small-gpus.mdx diff --git a/cerebrium/prebuilt-models/introduction.mdx b/cerebrium/other-topics/prebuilt-models.mdx similarity index 100% rename from cerebrium/prebuilt-models/introduction.mdx rename to cerebrium/other-topics/prebuilt-models.mdx diff --git a/cerebrium/misc/using-secrets.mdx b/cerebrium/other-topics/using-secrets.mdx similarity index 100% rename from cerebrium/misc/using-secrets.mdx rename to cerebrium/other-topics/using-secrets.mdx diff --git a/cerebrium/scaling/batching-concurrency.mdx b/cerebrium/scaling/batching-concurrency.mdx new file mode 100644 index 00000000..a378e9d6 --- /dev/null +++ b/cerebrium/scaling/batching-concurrency.mdx @@ -0,0 +1,75 @@ +--- +title: "Batching & Concurrency" +description: "Improve throughput and cost performance with batching & concurrency" +--- + +## Understanding Concurrency + +Concurrency in Cerebrium allows each instance to process multiple requests simultaneously. The `replica_concurrency` setting in the `cerebrium.toml` file determines how many requests each instance handles in parallel: + +```toml +[cerebrium.scaling] +replica_concurrency = 4 # Process up to 4 requests simultaneously +``` + +When requests arrive at an instance that hasn't reached its concurrency limit, they begin processing immediately. Once an instance reaches its maximum concurrent requests, additional requests queue until capacity becomes available. This parallel processing capability helps applications maintain consistent performance during periods of high traffic. + +Modern GPUs excel at parallel processing, making concurrent request handling particularly effective for workloads. For instance, when an instance processes multiple image classification requests concurrently, it utilizes GPU resources more efficiently than processing requests sequentially. + +## Understanding Batching + +Batching determines how concurrent requests are processed together within an instance. While concurrency controls the number of simultaneous requests, batching manages how these requests are grouped and executed. + +Cerebrium supports two approaches to request batching. + +### Framework-Native Batching + +Many frameworks include features for processing multiple requests efficiently. 
vLLM, for example, automatically handles batched model inference requests: + +```toml +[cerebrium.scaling] +min_replicas = 0 +max_replicas = 2 +cooldown = 10 +replica_concurrency = 4 # Each container can now handle multiple requests + +[cerebrium.dependencies.pip] +sentencepiece = "latest" +torch = "latest" +vllm = "latest" +transformers = "latest" +accelerate = "latest" +xformers = "latest" +``` + +When multiple requests arrive, vLLM automatically combines them into optimal batch sizes and processes them together, maximizing GPU utilization through its internal batching functionality. + + + Check out the complete [vLLM batching example](https://github.com/CerebriumAI/examples/tree/master/10-batching/3-vllm-batching-gpu) for more information. + + +### Custom Batching + +Applications requiring precise control over request processing can implement custom batching through Cerebrium's [custom runtime feature](/cerebrium/container-images/defining-container-images#custom-runtimes). This approach allows for specific batching strategies and custom processing logic. + +As an example, implementation with LitServe requires additional configuration in the `cerebrium.toml` file: + +```toml +[cerebrium.runtime.custom] +port = 8000 +entrypoint = ["python", "app/main.py"] +healthcheck_endpoint = "/health" + +[cerebrium.dependencies.pip] +litserve = "latest" +fastapi = "latest" +``` + + + Check out the complete [Litserve example](https://github.com/CerebriumAI/examples/tree/master/10-batching/2-litserve-batching-gpu) for more information. + + +Custom batching provides complete control over request grouping and processing, particularly valuable for frameworks without native batching support or applications with specific processing requirements. The [Container Images Guide](/cerebrium/container-images/defining-container-images#custom-runtimes) provides detailed implementation instructions. + +Together, batching and concurrency create an efficient request processing system. Concurrency enables parallel request handling, while batching optimizes how these concurrent requests are processed, leading to better resource utilization and application performance. + diff --git a/cerebrium/scaling/scaling-apps.mdx b/cerebrium/scaling/scaling-apps.mdx new file mode 100644 index 00000000..cfb23135 --- /dev/null +++ b/cerebrium/scaling/scaling-apps.mdx @@ -0,0 +1,62 @@ +--- +title: "Scaling Apps" +description: "Learn to optimise for cost and performance by scaling out apps" +--- + +Cerebrium's scaling system automatically manages computing resources to match app demand. The system handles everything from a few simple requests, to the processing of multiple requests simultaneously, while optimizing for both performance and cost. + +## How Autoscaling Works + +The scaling system monitors two key metrics to make scaling decisions: + +The **number of requests** currently waiting for processing in the queue indicates immediate demand. Additionally, the system tracks **how long each request has waited in the queue**. When either of these metrics exceeds their thresholds, new instances start within 3 seconds to handle the increased load. + +As traffic decreases, instances enter a cooldown period after processing their last request. When no new requests arrive during cooldown, instances terminate to optimize resource usage. This automatic cycle ensures apps remain responsive while managing costs effectively. 
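As a rough rule of thumb, the number of instances a sustained load needs can be estimated from the request rate and the average processing time per request. The sketch below is illustrative only (the traffic figures are assumptions, and actual scaling decisions are driven by the queue metrics described above):

```python
import math

# Illustrative sizing estimate: instances needed to keep the queue from growing
requests_per_min = 10_000   # assumed sustained request rate
request_duration_s = 0.5    # assumed average processing time per request, in seconds

instances_needed = math.ceil((requests_per_min * request_duration_s) / 60)
print(instances_needed)  # 84
```

An estimate like this is a useful starting point for choosing the `min_replicas` and `max_replicas` values described in the next section.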
+ +## Scaling Configuration + +The `cerebrium.toml` file controls scaling behavior through several key parameters: + +```toml +[cerebrium.scaling] +min_replicas = 0 # Minimum running instances +max_replicas = 3 # Maximum concurrent instances +cooldown = 60 # Cooldown period in seconds +``` + +### Minimum Instances +The `min_replicas` parameter defines how many instances remain active at all times. Setting this to 1 or higher maintains warm instances ready for immediate response, eliminating cold starts but increasing costs. This configuration suits apps that require consistent response times or need to meet specific SLA requirements. + +### Maximum Instances +The `max_replicas` parameter sets an upper limit on concurrent instances, controlling costs and protecting backend systems. When traffic increases, new instances start automatically up to this configured maximum. + +### Cooldown Period +After processing a request, instances remain available for the duration specified by `cooldown`. Each new request resets this timer. A longer cooldown period helps handle bursty traffic patterns but increases instance running time and cost. + +## Processing Multiple Requests + +Apps can process multiple requests simultaneously using Cerebrium's batching and concurrency features. The system offers native support for frameworks with built-in batching capabilities and enables custom implementations through the [custom runtime](cerebrium/container-images/defining-container-images#custom-runtimes) feature. For detailed information about handling multiple requests efficiently, see our [Batching & Concurrency Guide](/cerebrium/scaling/batching-concurrency). + +## Instance Management + +Cerebrium ensures reliability through automatic instance health management. The system restarts instances that encounter issues, quickly starts new instances to maintain processing capacity, and monitors instance health continuously. + +Apps requiring maximum reliability often combine several scaling features: + +```toml +[cerebrium.scaling] +min_replicas = 2 # Maintain redundant instances +cooldown = 600 # Extended warm period +max_replicas = 10 # Room for traffic spikes +response_grace_period = 1200 # Clean shutdown time +``` + +The `response_grace_period` parameter provides time for instances to complete active requests during shutdown. The system first sends a SIGTERM signal, waits for the specified grace period, then issues a SIGKILL command if the instance hasn't stopped. + +Performance metrics available through the dashboard help monitor scaling behavior: +- Request processing times +- Active instance count +- Cold start frequency +- Resource usage patterns + +The system status and platform-wide metrics remain accessible through our [status page](https://status.cerebrium.ai), where Cerebrium maintains 99.9% uptime. \ No newline at end of file diff --git a/cerebrium/storage/managing-files.mdx b/cerebrium/storage/managing-files.mdx new file mode 100644 index 00000000..73edd54a --- /dev/null +++ b/cerebrium/storage/managing-files.mdx @@ -0,0 +1,77 @@ +--- +title: "Managing Files" +--- + + +Cerebrium offers file management through a 50GB persistent volume that's available to all applications in a project. This storage mounts at `/persistent-storage` and helps store model weights and files efficiently across deployments. 
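A common pattern is to download large model weights once and then reuse them from the volume on every later run, and from other apps in the same project. Below is a minimal sketch of that pattern, using the Segment Anything checkpoint as an illustration; adapt the URL and paths to your own files:

```python
import os
import urllib.request

WEIGHTS_URL = "https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth"
CACHE_PATH = "/persistent-storage/segment-anything/sam_vit_h_4b8939.pth"

# Download the checkpoint only if it is not already cached on the volume
if not os.path.exists(CACHE_PATH):
    os.makedirs(os.path.dirname(CACHE_PATH), exist_ok=True)
    urllib.request.urlretrieve(WEIGHTS_URL, CACHE_PATH)

# Later cold starts, and other apps in the project, skip the download and
# load the file straight from the volume
```

The sections below cover how to include files in a deployment and how to manage the volume from the CLI.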
+ +## Including Files in Deployments + +The `cerebrium.toml` configuration file controls which files become part of the application: + +```toml +[cerebrium] +include = [ + "src/*.py", # Python files in src + "config/*.json", # JSON files in config + "requirements.txt" # Specific files +] + +exclude = [ + "tests/*", # Skip test files + "*.log" # Skip logs +] +``` + +Files included in deployments must be under 2GB each, with deployments working best for files under 1GB. Larger files should use persistent storage instead. + +## Managing Persistent Storage + +The CLI provides three commands for working with persistent storage: + +1. Upload files with `cerebrium cp`: +```bash +# Upload to root directory +cerebrium cp src_file_name.txt + +# Upload to specific location +cerebrium cp src_file_name.txt dest_file_name.txt + +# Upload entire directory +cerebrium cp dir_name sub_folder/ +``` + +2. List files with `cerebrium ls`: +```bash +# View root contents +cerebrium ls + +# View specific folder +cerebrium ls sub_folder/ +``` + +3. Remove files with `cerebrium rm`: +```bash +# Remove a file +cerebrium rm file_name.txt + +# Remove a directory +cerebrium rm sub_folder/ +``` + +## Using Stored Files + +Here's how to work with files in persistent storage: + +```python +import os +import torch + +# Load a model from persistent storage +file_path = "/persistent-storage/segment-anything/sam_vit_h_4b8939.pth" +model = torch.jit.load(file_path) +``` + + + Should you require additional storage capacity, please reach out to us through [support](mailto:support@cerebrium.ai). + \ No newline at end of file diff --git a/mint.json b/mint.json index 3b4d1fa8..8bb11173 100644 --- a/mint.json +++ b/mint.json @@ -74,7 +74,10 @@ "navigation": [ { "group": "Getting Started", - "pages": ["cerebrium/getting-started/introduction"] + "pages": [ + "cerebrium/getting-started/introduction", + "cerebrium/getting-started/collaborating" + ] }, { "group": "Container Images", @@ -89,8 +92,11 @@ ] }, { - "group": "Data Storage", - "pages": ["cerebrium/data-sharing-storage/persistent-storage"] + "group": "Scaling apps", + "pages": [ + "cerebrium/scaling/scaling-apps", + "cerebrium/scaling/batching-concurrency" + ] }, { "group": "Deployments", @@ -104,32 +110,29 @@ "cerebrium/endpoints/streaming", "cerebrium/endpoints/websockets", "cerebrium/endpoints/webhook", - "cerebrium/endpoints/async" + "cerebrium/endpoints/async", + "cerebrium/endpoints/custom-web-servers" ] }, { - "group": "Integrations", - "pages": ["cerebrium/integrations/vercel"] - }, - { - "group": "Other concepts", + "group": "Storage", "pages": [ - "cerebrium/misc/faster-model-loading", - "cerebrium/misc/using-secrets" + "cerebrium/storage/managing-files" ] }, { - "group": "Prebuilt Models", - "pages": ["cerebrium/prebuilt-models/introduction"] + "group": "Integrations", + "pages": ["cerebrium/integrations/vercel"] }, { - "group": "FAQs and Tips", + "group": "Other concepts", "pages": [ "security", - "reliability-scalability", - "available-hardware", + "cerebrium/other-topics/using-secrets", + "cerebrium/other-topics/faster-model-loading", "calculating-cost", - "cerebrium/faqs-and-help/fast-deployments-dos-and-donts" + "cerebrium/other-topics/fast-deployments-dos-and-donts", + "cerebrium/other-topics/prebuilt-models" ] }, { @@ -142,7 +145,7 @@ "pages": ["v4/examples/mistral-vllm"] }, { - "group": "large Language models", + "group": "Large Language Models", "pages": [ "v4/examples/openai-compatible-endpoint-vllm", "v4/examples/streaming-falcon-7B" diff --git 
a/reliability-scalability.mdx b/reliability-scalability.mdx deleted file mode 100644 index 8607e382..00000000 --- a/reliability-scalability.mdx +++ /dev/null @@ -1,32 +0,0 @@ ---- -title: "Reliability and Scalability" -description: "" ---- - -As engineers, we know how important it is for platforms providing infrastructure to remain reliable and scalable when running our clients' workloads. This is something that we don't take lightly at Cerebrium, and so we have implemented several processes and tooling to manage this as effectively as possible. - -## How does Cerebrium achieve reliability and scalability? - -- **Automatic scaling:** - -Cerebrium automatically scales instances based on the number of events in the queue, and the time events are in the queue. -This ensures that Cerebrium can handle even the most demanding workloads. If one of either of these two conditions are met, additional workers are spun up in < 3 seconds -to handle the volume. Once workers fall below a certain utilization level, we start decreasing the number of workers. - -In terms of the scale we are able to handle, Cerebrium has customers running at 120 transactions per second, but we can do more than that :) - -- **Fault tolerance and High availability:** - -If an instance heads into a bad state due to memory or processing issues, Cerebrium automatically restarts a new instance to handle the incoming load. -This ensures that Cerebrium can continue to process events without user intervention. If you would like to be notified via email of any problems with your model -you can toggle a switch in the top right corner of your model page - its right above your model stats. - -- **Monitoring:** - -Cerebrium is monitored 24/7 and has a globally distributed team which allows us to quickly identify and fix any problems that may arise at any time during the day. -Regardless of the severity of the incident, we hope to get things fixed as quickly as possible. Customers can monitor our uptime on our status page -here: https://status.cerebrium.ai. We strive for an uptime greater than 99.99%. - -## How can I get help if Cerebrium is not working? - -If Cerebrium is not working, you can contact our support team [here](https://cerebrium.ai/contact) or message on our Discord or Slack communities. We will work with you to resolve the issue as quickly as possible. From 161fc4dac24065ba0244762449d81531b420cc35 Mon Sep 17 00:00:00 2001 From: kylegani Date: Thu, 12 Dec 2024 10:58:20 +0000 Subject: [PATCH 2/6] Prettified Code! --- cerebrium/endpoints/custom-web-servers.mdx | 8 ++++++-- cerebrium/getting-started/collaborating.mdx | 2 +- cerebrium/scaling/batching-concurrency.mdx | 9 ++++++--- cerebrium/scaling/scaling-apps.mdx | 6 +++++- cerebrium/storage/managing-files.mdx | 9 ++++++--- mint.json | 4 +--- 6 files changed, 25 insertions(+), 13 deletions(-) diff --git a/cerebrium/endpoints/custom-web-servers.mdx b/cerebrium/endpoints/custom-web-servers.mdx index c3690123..9726e539 100644 --- a/cerebrium/endpoints/custom-web-servers.mdx +++ b/cerebrium/endpoints/custom-web-servers.mdx @@ -39,12 +39,16 @@ fastapi = "latest" ``` The configuration requires three key parameters: + - `entrypoint`: The command that starts your server - `port`: The port your server listens on - `healthcheck_endpoint`: The endpoint that confirms server health -For ASGI applications like FastAPI, include the appropriate server package (like `uvicorn`) in your dependencies. 
After deployment, your endpoints become available at `https://api.cortex.cerebrium.ai/v4/{project-id}/{app-name}/your/endpoint`. + For ASGI applications like FastAPI, include the appropriate server package + (like `uvicorn`) in your dependencies. After deployment, your endpoints become + available at `https://api.cortex.cerebrium.ai/v4/{project - id}/{app - name} + /your/endpoint`. -Our [FastAPI Server Example](https://github.com/CerebriumAI/examples) provides a complete implementation. \ No newline at end of file +Our [FastAPI Server Example](https://github.com/CerebriumAI/examples) provides a complete implementation. diff --git a/cerebrium/getting-started/collaborating.mdx b/cerebrium/getting-started/collaborating.mdx index a319d806..fb4c4958 100644 --- a/cerebrium/getting-started/collaborating.mdx +++ b/cerebrium/getting-started/collaborating.mdx @@ -33,4 +33,4 @@ The Users table displays member details, including names, email addresses, roles 4. Adjust roles as team needs change 5. Resend invitations when needed -Once members accept their invitations, they gain immediate access based on their assigned roles and can access their authorised project(s) from the dashboards. \ No newline at end of file +Once members accept their invitations, they gain immediate access based on their assigned roles and can access their authorised project(s) from the dashboards. diff --git a/cerebrium/scaling/batching-concurrency.mdx b/cerebrium/scaling/batching-concurrency.mdx index a378e9d6..a157b041 100644 --- a/cerebrium/scaling/batching-concurrency.mdx +++ b/cerebrium/scaling/batching-concurrency.mdx @@ -45,7 +45,9 @@ xformers = "latest" When multiple requests arrive, vLLM automatically combines them into optimal batch sizes and processes them together, maximizing GPU utilization through its internal batching functionality. - Check out the complete [vLLM batching example](https://github.com/CerebriumAI/examples/tree/master/10-batching/3-vllm-batching-gpu) for more information. + Check out the complete [vLLM batching + example](https://github.com/CerebriumAI/examples/tree/master/10-batching/3-vllm-batching-gpu) + for more information. ### Custom Batching @@ -66,10 +68,11 @@ fastapi = "latest" ``` - Check out the complete [Litserve example](https://github.com/CerebriumAI/examples/tree/master/10-batching/2-litserve-batching-gpu) for more information. + Check out the complete [Litserve + example](https://github.com/CerebriumAI/examples/tree/master/10-batching/2-litserve-batching-gpu) + for more information. Custom batching provides complete control over request grouping and processing, particularly valuable for frameworks without native batching support or applications with specific processing requirements. The [Container Images Guide](/cerebrium/container-images/defining-container-images#custom-runtimes) provides detailed implementation instructions. Together, batching and concurrency create an efficient request processing system. Concurrency enables parallel request handling, while batching optimizes how these concurrent requests are processed, leading to better resource utilization and application performance. - diff --git a/cerebrium/scaling/scaling-apps.mdx b/cerebrium/scaling/scaling-apps.mdx index cfb23135..fee17ab6 100644 --- a/cerebrium/scaling/scaling-apps.mdx +++ b/cerebrium/scaling/scaling-apps.mdx @@ -25,12 +25,15 @@ cooldown = 60 # Cooldown period in seconds ``` ### Minimum Instances + The `min_replicas` parameter defines how many instances remain active at all times. 
Setting this to 1 or higher maintains warm instances ready for immediate response, eliminating cold starts but increasing costs. This configuration suits apps that require consistent response times or need to meet specific SLA requirements. ### Maximum Instances + The `max_replicas` parameter sets an upper limit on concurrent instances, controlling costs and protecting backend systems. When traffic increases, new instances start automatically up to this configured maximum. ### Cooldown Period + After processing a request, instances remain available for the duration specified by `cooldown`. Each new request resets this timer. A longer cooldown period helps handle bursty traffic patterns but increases instance running time and cost. ## Processing Multiple Requests @@ -54,9 +57,10 @@ response_grace_period = 1200 # Clean shutdown time The `response_grace_period` parameter provides time for instances to complete active requests during shutdown. The system first sends a SIGTERM signal, waits for the specified grace period, then issues a SIGKILL command if the instance hasn't stopped. Performance metrics available through the dashboard help monitor scaling behavior: + - Request processing times - Active instance count - Cold start frequency - Resource usage patterns -The system status and platform-wide metrics remain accessible through our [status page](https://status.cerebrium.ai), where Cerebrium maintains 99.9% uptime. \ No newline at end of file +The system status and platform-wide metrics remain accessible through our [status page](https://status.cerebrium.ai), where Cerebrium maintains 99.9% uptime. diff --git a/cerebrium/storage/managing-files.mdx b/cerebrium/storage/managing-files.mdx index 73edd54a..4582164a 100644 --- a/cerebrium/storage/managing-files.mdx +++ b/cerebrium/storage/managing-files.mdx @@ -2,7 +2,6 @@ title: "Managing Files" --- - Cerebrium offers file management through a 50GB persistent volume that's available to all applications in a project. This storage mounts at `/persistent-storage` and helps store model weights and files efficiently across deployments. ## Including Files in Deployments @@ -30,6 +29,7 @@ Files included in deployments must be under 2GB each, with deployments working b The CLI provides three commands for working with persistent storage: 1. Upload files with `cerebrium cp`: + ```bash # Upload to root directory cerebrium cp src_file_name.txt @@ -42,6 +42,7 @@ cerebrium cp dir_name sub_folder/ ``` 2. List files with `cerebrium ls`: + ```bash # View root contents cerebrium ls @@ -51,6 +52,7 @@ cerebrium ls sub_folder/ ``` 3. Remove files with `cerebrium rm`: + ```bash # Remove a file cerebrium rm file_name.txt @@ -73,5 +75,6 @@ model = torch.jit.load(file_path) ``` - Should you require additional storage capacity, please reach out to us through [support](mailto:support@cerebrium.ai). - \ No newline at end of file + Should you require additional storage capacity, please reach out to us through + [support](mailto:support@cerebrium.ai). 
+ diff --git a/mint.json b/mint.json index 8bb11173..04471133 100644 --- a/mint.json +++ b/mint.json @@ -116,9 +116,7 @@ }, { "group": "Storage", - "pages": [ - "cerebrium/storage/managing-files" - ] + "pages": ["cerebrium/storage/managing-files"] }, { "group": "Integrations", From 3e923d3e2c18b13442ccfbd42d2b2bf992832dc7 Mon Sep 17 00:00:00 2001 From: Kyle Gani Date: Thu, 12 Dec 2024 20:27:26 +0200 Subject: [PATCH 3/6] Updated: Based on feedback --- cerebrium/endpoints/async.mdx | 2 +- cerebrium/endpoints/custom-web-servers.mdx | 6 +++--- cerebrium/scaling/batching-concurrency.mdx | 2 +- cerebrium/scaling/scaling-apps.mdx | 4 ++++ 4 files changed, 9 insertions(+), 5 deletions(-) diff --git a/cerebrium/endpoints/async.mdx b/cerebrium/endpoints/async.mdx index ec5bd233..f92637c7 100644 --- a/cerebrium/endpoints/async.mdx +++ b/cerebrium/endpoints/async.mdx @@ -1,5 +1,5 @@ --- -title: "Asynchronous Execution" +title: "Async requests" description: "Execute calls to a Cerebrium app to be run asynchronously" --- diff --git a/cerebrium/endpoints/custom-web-servers.mdx b/cerebrium/endpoints/custom-web-servers.mdx index 9726e539..4acc3f26 100644 --- a/cerebrium/endpoints/custom-web-servers.mdx +++ b/cerebrium/endpoints/custom-web-servers.mdx @@ -12,13 +12,13 @@ Here's a simple FastAPI server implementation that shows how custom servers work ```python from fastapi import FastAPI -server = FastAPI() +app = FastAPI() -@server.post("/hello") +@app.post("/hello") def hello(): return {"message": "Hello Cerebrium!"} -@server.get("/health") +@app.get("/health") def health(): return "Ok" ``` diff --git a/cerebrium/scaling/batching-concurrency.mdx b/cerebrium/scaling/batching-concurrency.mdx index a157b041..35862be6 100644 --- a/cerebrium/scaling/batching-concurrency.mdx +++ b/cerebrium/scaling/batching-concurrency.mdx @@ -18,7 +18,7 @@ Modern GPUs excel at parallel processing, making concurrent request handling par ## Understanding Batching -Batching determines how concurrent requests are processed together within an instance. While concurrency controls the number of simultaneous requests, batching manages how these requests are grouped and executed. +Batching determines how concurrent requests are processed together within an instance. While concurrency controls the number of simultaneous requests, batching manages how these requests are grouped and executed (The default concurrency is 1 request per container). Cerebrium supports two approaches to request batching. diff --git a/cerebrium/scaling/scaling-apps.mdx b/cerebrium/scaling/scaling-apps.mdx index fee17ab6..5908818d 100644 --- a/cerebrium/scaling/scaling-apps.mdx +++ b/cerebrium/scaling/scaling-apps.mdx @@ -11,6 +11,10 @@ The scaling system monitors two key metrics to make scaling decisions: The **number of requests** currently waiting for processing in the queue indicates immediate demand. Additionally, the system tracks **how long each request has waited in the queue**. When either of these metrics exceeds their thresholds, new instances start within 3 seconds to handle the increased load. + + Scaling is also configurable based on the expected traffic of an application. See below for more information. + + As traffic decreases, instances enter a cooldown period after processing their last request. When no new requests arrive during cooldown, instances terminate to optimize resource usage. This automatic cycle ensures apps remain responsive while managing costs effectively. 
## Scaling Configuration From 83cb5a4ebfb0d4e2fe6720b668df74cd349b29b0 Mon Sep 17 00:00:00 2001 From: kylegani Date: Thu, 12 Dec 2024 18:28:06 +0000 Subject: [PATCH 4/6] Prettified Code! --- cerebrium/scaling/scaling-apps.mdx | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/cerebrium/scaling/scaling-apps.mdx b/cerebrium/scaling/scaling-apps.mdx index 5908818d..9e029717 100644 --- a/cerebrium/scaling/scaling-apps.mdx +++ b/cerebrium/scaling/scaling-apps.mdx @@ -12,7 +12,8 @@ The scaling system monitors two key metrics to make scaling decisions: The **number of requests** currently waiting for processing in the queue indicates immediate demand. Additionally, the system tracks **how long each request has waited in the queue**. When either of these metrics exceeds their thresholds, new instances start within 3 seconds to handle the increased load. - Scaling is also configurable based on the expected traffic of an application. See below for more information. + Scaling is also configurable based on the expected traffic of an application. + See below for more information. As traffic decreases, instances enter a cooldown period after processing their last request. When no new requests arrive during cooldown, instances terminate to optimize resource usage. This automatic cycle ensures apps remain responsive while managing costs effectively. From def77e47799f724388fd372bf2c437c9e21cf55d Mon Sep 17 00:00:00 2001 From: Kyle Gani Date: Thu, 12 Dec 2024 20:29:38 +0200 Subject: [PATCH 5/6] Removed: Prebuilt models page --- cerebrium/other-topics/prebuilt-models.mdx | 20 -------------------- mint.json | 1 - 2 files changed, 21 deletions(-) delete mode 100644 cerebrium/other-topics/prebuilt-models.mdx diff --git a/cerebrium/other-topics/prebuilt-models.mdx b/cerebrium/other-topics/prebuilt-models.mdx deleted file mode 100644 index ab820ef9..00000000 --- a/cerebrium/other-topics/prebuilt-models.mdx +++ /dev/null @@ -1,20 +0,0 @@ ---- -title: "Deploying Prebuilt Models" -description: "Cerebrium provides prebuilt models that you can deploy to an API seamlessly" ---- - -Cerebrium and its community keep a library of popular pre-built models that you can deploy using one click. If you would like any pre-built models added you can: - -- [Submit PR on GitHub](https://github.com/CerebriumAI/cerebrium-prebuilts) All our prebuilt models live here so submit a PR or one you would like to contribute to the community. Instructions in the README :) -- [Contact](mailto:support@cerebrium.ai) the Cerebrium team and we will see what we can do - -You can navigate to the [Cerebrium Prebuilts GitHub](https://github.com/CerebriumAI/cerebrium-prebuilts) where you can find the source code for each of the models. You can then clone these -repositories as a starting point. - -Each model's folder is a cortex deployment that can be deployed using the `cerebrium deploy` command. Navigate to the folder of the model you would like to deploy and run the command. - -```bash -cerebrium deploy <> -``` - -Check out the available models through your Cerebrium dashboard or by reading our docs in the **Prebuilt Models** tab! 
diff --git a/mint.json b/mint.json index 04471133..1c96e3d0 100644 --- a/mint.json +++ b/mint.json @@ -130,7 +130,6 @@ "cerebrium/other-topics/faster-model-loading", "calculating-cost", "cerebrium/other-topics/fast-deployments-dos-and-donts", - "cerebrium/other-topics/prebuilt-models" ] }, { From cbc1cbfe32df6d99068b6f3488397624876fb91c Mon Sep 17 00:00:00 2001 From: kylegani Date: Thu, 12 Dec 2024 18:30:32 +0000 Subject: [PATCH 6/6] Prettified Code! --- mint.json | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mint.json b/mint.json index 1c96e3d0..fe3daa7f 100644 --- a/mint.json +++ b/mint.json @@ -129,7 +129,7 @@ "cerebrium/other-topics/using-secrets", "cerebrium/other-topics/faster-model-loading", "calculating-cost", - "cerebrium/other-topics/fast-deployments-dos-and-donts", + "cerebrium/other-topics/fast-deployments-dos-and-donts" ] }, {