Merge branch 'master' into hotfix/disable_deployment_confirmation
# Conflicts:
#	cerebrium/environments/initial-setup.mdx
#	examples/logo-controlnet.mdx
#	v4/examples/sdxl.mdx
#	v4/examples/transcribe-whisper.mdx
jonoirwinrsa committed Oct 25, 2024
2 parents a50e2b6 + af39641 commit 5a6c5b2
Showing 66 changed files with 3,768 additions and 1,117 deletions.
4 changes: 2 additions & 2 deletions README.md
@@ -2,7 +2,7 @@

Welcome to Cerebrium's documentation hub, currently available at [docs.cerebrium.ai](https://docs.cerebrium.ai)

Cerebrium is an AWS Sagemaker alternative providing all the features you need to quickly build an ML product.
Cerebrium is an AWS SageMaker alternative providing all the features you need to quickly build an ML product.

### 🚀 Setup

@@ -26,7 +26,7 @@ yarn installed already run `npm install --global yarn` in your terminal.

### 😎 Publishing Changes

Changes will be deployed to production automatically after pushing to the default (`master`) branch.
Changes are deployed to production automatically after pushing to the default (`master`) branch.

You can also preview changes using PRs, which generates a preview link of the docs.

42 changes: 20 additions & 22 deletions available-hardware.mdx
@@ -10,28 +10,26 @@ This page lists the hardware that is currently available on the platform. If you

# Hardware

## GPU's
## GPUs

We have the following graphics cards available on the platform:
| Name | Cerebrium Name | VRAM | Minimum Plan | Max fp32 Model Params | Max fp16 Model Params
| --------------------------------------------------------------------------------------------------- | :------: |------ | :-------------------: | :-------------------: | :------------------: |
| [NVIDIA H100](https://www.nvidia.com/en-us/data-center/h100/) | Special Request | 80GB | Enterprise | 18B | 36B
| [NVIDIA A100](https://www.nvidia.com/en-us/data-center/a100/) | Special Request | 80GB | Standard | 18B | 36B
| [NVIDIA A100](https://www.nvidia.com/en-us/data-center/a100/) | AMPERE_A100 | 40GB | Standard | 9B | 18B
| [NVIDIA RTX A6000](https://www.nvidia.com/en-us/design-visualization/rtx-a6000/) | AMPERE_A6000 | 48GB | Hobby | 10B | 21B
| [NVIDIA RTX A5000](https://www.nvidia.com/en-us/design-visualization/rtx-a5000/) | AMPERE_A5000 | 24GB | Hobby | 5B | 10B
| [NVIDIA RTX A4000](https://www.nvidia.com/en-us/design-visualization/rtx-a4000/) | AMPERE_A4000 | 16GB | Hobby | 3B | 7B
| [NVIDIA Quadro RTX 5000](https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/quadro-product-literature/quadro-rtx-5000-data-sheet-us-nvidia-704120-r4-web.pdf) | TURING_5000 | 16GB | Hobby | 3B | 7B
| [NVIDIA Quadro RTX 4000](https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/quadro-product-literature/quadro-rtx-4000-datasheet-us-nvidia-1060942-r2-web.pdf) | TURING_4000 | 8GB | Hobby | 1B | 3B

_NOTE: The maximum model sizes are calculated as a guideline, assuming that the model is the only thing loaded into VRAM. Longer inputs will result in a smaller maximum model size. Your mileage may vary._

These GPUs can be selected using the `--gpu` flag when deploying your model on Cortex or can be specified in your `cerebrium.toml`.
| Name | Cerebrium Name | VRAM | Minimum Plan | Provider
| --------------------------------------------------------------------------------------------------- | :------: |------ | :-------------------: | :-------------------: |
| [NVIDIA H100](https://www.nvidia.com/en-us/data-center/h100/) | Special Request | 80GB | Enterprise | [AWS]
| [NVIDIA A100](https://www.nvidia.com/en-us/data-center/a100/) | Special Request | 80GB | Enterprise | [AWS]
| [NVIDIA A100_80GB](https://www.nvidia.com/en-us/data-center/a100/) | AMPERE_A100 | 80GB | Enterprise | [AWS]
| [NVIDIA A100_40GB](https://www.nvidia.com/en-us/data-center/a100/) | AMPERE_A100 | 40GB | Enterprise | [AWS]
| [NVIDIA A10](https://www.nvidia.com/en-us/data-center/a100/) | AMPERE_A10 | 24GB | Hobby | [AWS]
| [NVIDIA L4](https://www.nvidia.com/en-us/data-center/l4/) | ADA_L4 | 24GB | Hobby | [AWS]
| [NVIDIA L40s](https://www.nvidia.com/en-us/data-center/l40s/) | ADA_L40 | 48GB | Hobby | [AWS]
| [NVIDIA T4](https://www.nvidia.com/en-us/data-center/tesla-t4/) | TURING_T4 | 16GB | Hobby | [AWS]
| [AWS INFERENTIA](https://aws.amazon.com/machine-learning/inferentia/) | INF2 | 32GB | Hobby | [AWS]
| [AWS TRAINIUM](https://aws.amazon.com/machine-learning/trainium/) | TRN1 | 32GB | Hobby | [AWS]


These GPUs can be selected using the `--gpu` flag when deploying your app on Cortex or can be specified in your `cerebrium.toml`.
For more help with deciding which GPU you require, see this section [here](#choosing-a-gpu).

_Due to the global shortage of GPUs at the moment, we may not always have the Enterprise edition of your GPU available. In this case, we will deploy to the Workstation edition of the GPU._
_These are the same GPUs, and it will not affect the performance of your model in any way._

## CPUs

We select the CPU based on your choice of hardware, choosing the best available options so you can get the performance you need.
@@ -42,13 +40,13 @@ You can choose the number of CPU cores you require for your deployment. If you d

We let you select the amount of memory you require for your deployment.
All the memory you request is dedicated to your deployment and is not shared with any other deployments, ensuring that you get the performance you need.
This is the amount of memory that is available to your code when it is running and you should choose an adequate amount for your model to be loaded into VRAM if you are deploying onto a GPU.
This is the amount of memory that is available to your code when it is running, and you should choose an adequate amount for your model to be loaded into VRAM if you are deploying onto a GPU.
Once again, you only pay for what you need!

## Storage

We provide you with a persistent storage volume attached to your deployment.
You can use this storage volume to store any data that you need to persist between deployments. Accessing your persistent storage is covered in depth for [cortex here](./cerebrium/data-sharing-storage/persistent-storage).
You can use this storage volume to store any data that you need to persist between deployments. Accessing your persistent storage is covered in depth for [cortex here](/cerebrium/data-sharing-storage/persistent-storage).

The storage volume is backed by high-performance SSDs so that you can get the best performance possible.
Pricing for storage is based on the amount of storage you use and is charged per GB per month.
@@ -60,10 +58,10 @@ On one hand, you want the best performance possible, but on the other hand, you

## Choosing a GPU

Choosing a GPU can be a complicated task of calculating VRAM usage based on the number of parameters you have as well as the length of your inputs. Additionally, some variables are dependent on your inputs to your model which will affect the VRAM usage substantially. For example, with LLMs and transformer-based architectures, you need to factor in attention processes as well as any memory-heavy positional encoding that may be happening which can increase VRAM usage exponentially for some methods. Similarly, for CNNs, you need to look at the number of filters you are using as well as the size of your inputs.
Choosing a GPU can be a complicated task of calculating VRAM usage based on the number of parameters you have as well as the length of your inputs. Additionally, some variables are dependent on your inputs to your app, which will affect the VRAM usage substantially. For example, with LLMs and transformer-based architectures, you need to factor in attention processes as well as any memory-heavy positional encoding that may be happening, which can increase VRAM usage exponentially for some methods. Similarly, for CNNs, you need to look at the number of filters you are using as well as the size of your inputs.

As a rule of thumb, the easiest way is to choose the GPU that has at least 1.5x the minimum amount of VRAM that your model requires.
This approach is conservative and will ensure that your model will fit on the GPU you choose even if you have longer inputs than you expect. However, it is just a rule of thumb and you should test the VRAM usage of your model to ensure that it will fit on the GPU you choose.
This approach is conservative and will ensure that your model will fit on the GPU you choose even if you have longer inputs than you expect. However, it's just a rule of thumb, and you should test the VRAM usage of your model to ensure that it will fit on the GPU you choose.
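
For a quick sanity check, here is a minimal sketch of that rule of thumb (an illustrative approximation only: it counts weights as parameters times bytes per parameter and adds the 1.5x headroom; it is not an official Cerebrium calculator):

```python
def estimate_vram_gb(num_params: float, bytes_per_param: int = 2, headroom: float = 1.5) -> float:
    """Rough weights-only VRAM estimate (fp16 = 2 bytes/param, fp32 = 4 bytes/param)."""
    weight_gb = num_params * bytes_per_param / 1e9
    return weight_gb * headroom  # headroom for activations, KV caches and longer inputs

# Example: a 7B-parameter model in fp16 needs ~14GB for the weights alone,
# so ~21GB with headroom, which points at a 24GB card such as AMPERE_A10 or ADA_L4.
print(round(estimate_vram_gb(7e9, bytes_per_param=2), 1))
```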

You can calculate the VRAM usage of your model by using the following formula:

12 changes: 6 additions & 6 deletions calculating-cost.mdx
@@ -9,19 +9,19 @@ view the pricing of various compute on our [pricing page](https://www.cerebrium.

When you deploy a model, there are two processes we charge you for:

1. We charge you for the build process where we set up your model environment. In this step, we set up a Python environment according to your parameters before downloading and installing the required apt packages, Conda and Python packages as well as any model files you require.
1. We charge you for the build process where we set up your app environment. In this step, we set up a Python environment according to your parameters before downloading and installing the required apt packages, Conda and Python packages as well as any model files you require.
You are only charged for a build if we need to rebuild your environment, i.e. you have run a `build` or `deploy` command and have changed your requirements, parameters or code. Note that we cache each of the steps in a build so subsequent builds will cost substantially less than the first.

2. The model runtime. This is the amount of time it takes your code to run from start to finish on each request. There are 3 costs to consider here:
2. The app runtime. This is the amount of time it takes your code to run from start to finish on each request. There are 3 costs to consider here:

- <u>Cold start</u>: This is the amount of time it takes to spin up a server(s),
load your environment, connect storage etc. This is part of the Cerebrium
service and something we are working on every day to get as low as possible.
<b>We do not charge you for this!</b>
- <u>Model initialization</u>: This part of your code is outside of the predict
function and only runs when your model incurs a cold start. You are charged
for the amount of time it takes for this code to run. Typically this is
loading a model into GPU RAM.
function and only runs when your app incurs a cold start. You are charged for
the amount of time it takes for this code to run. Typically this is loading a
model into GPU RAM.
- <u>Predict runtime</u>: This is the code stored in your predict function and
runs every time a request hits your endpoint.

@@ -34,7 +34,7 @@ The model you wish to deploy requires:
- 20GB Memory: 20 \* $0.00000659 per second
- 10 GB persistent storage: 10 \* $0.3 per month

In our situation, your model works on the first deployment and so you incur only one build process of 2 minutes. Additionally, let's say that the model has 10 cold starts a day with an average initialization of 2 seconds and lastly and average runtime (predict) of 2 seconds. Let us calculate your
In our situation, your app works on the first deployment and so you incur only one build process of 2 minutes. Additionally, let's say that the app has 10 cold starts a day with an average initialization of 2 seconds and, lastly, an average runtime (predict) of 2 seconds. Let us calculate your
expected cost at month end with you expecting to do 100 000 model inferences.
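
As a rough sketch of that arithmetic (using only the memory and storage rates quoted above over a 30-day month; the GPU and CPU per-second rates, which are also charged, are left out, and it assumes the build minutes are billed at the same memory rate):

```python
MEMORY_GB = 20
MEMORY_RATE_PER_GB_SECOND = 0.00000659  # $ per GB per second (from the example above)
STORAGE_GB = 10
STORAGE_RATE_PER_GB_MONTH = 0.30        # $ per GB per month

build_seconds = 2 * 60            # one build of 2 minutes
init_seconds = 10 * 2 * 30        # 10 cold starts/day * 2s initialization * 30 days
predict_seconds = 100_000 * 2     # 100 000 inferences * 2s predict each

billable_seconds = build_seconds + init_seconds + predict_seconds
memory_cost = MEMORY_GB * MEMORY_RATE_PER_GB_SECOND * billable_seconds
storage_cost = STORAGE_GB * STORAGE_RATE_PER_GB_MONTH

# Roughly $26.45 for memory plus $3.00 for storage; GPU/CPU charges come on top.
print(f"memory: ${memory_cost:.2f}, storage: ${storage_cost:.2f}")
```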

108 changes: 77 additions & 31 deletions cerebrium/data-sharing-storage/persistent-storage.mdx
@@ -1,53 +1,99 @@
---
title: "Persistent Storage"
title: "Persistent Volumes"
---

Cerebrium gives to access to persistent storage to store model weights, files and much more. This storage volume persists across your project, meaning that if
you refer to model weights or a file created in a different deployment, you will be able to access it!
Cerebrium gives you access to persistent volumes to store model weights and files.
This volume persists across your project, meaning that if
you refer to model weights or files created in a different app (but in the same project), you're able to access them.

This allows you to load in model weights more efficiently as well as reduce the size of your deployment container images. Currently,
the volume can be accessed through `/persistent-storage` in your container instance, should you wish to access it directly and store other artifacts.
This allows model weights to be loaded more efficiently, as well as reducing the size of your app container image.

While you have full access to this drive, we recommend that you only store files in directories other than `/persistent-storage/cache`, as this and its subdirectories
are used by Cerebrium to store your models. As a simple example, suppose you have an external SAM model that you want to use in your custom deployment. You can download it to the cache
as such:
### How it works

```python
import os
import torch
Every Cerebrium Project comes with a 50GB volume by default. This volume is mounted on all apps as `/persistent-storage`.

file_path = "/persistent-storage/segment-anything/model.pt"
# Check if the file already exists, if not download it
if not os.path.exists("/persistent-storage/segment-anything/"):
    response = requests.get("https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth")
    with open(file_path, "wb") as f:
        f.write(response.content)
### Uploading files

# Load the model
model = torch.jit.load(file_path)
... # Continue with your initialization
```
To upload files to your persistent volume, you can use the `cerebrium cp local_path dest_path` command. This command copies files from your local machine to the specified destination path in the volume. The `dest_path` is optional; if not provided, the files will be uploaded to the root of the persistent volume.

Now, in subsequent deployments, the model will load from the cache rather than download it again.
```bash
Usage: cerebrium cp [OPTIONS] LOCAL_PATH REMOTE_PATH (Optional)

## Increasing your Persistent Storage Size
Copy contents to persistent volume.

<Note>Once increased, your persistent storage size cannot be decreased.</Note>
Options:
-h, --help Show this message and exit.

By default, your account is given 50GB of persistent storage to start with. However, if you find you need more (for example, you get an error saying `disk quote exceeded`) then you can increase your allocation using the following steps:
Examples:
# Copy a single file
cerebrium cp src_file_name.txt # copies to /src_file_name.txt

1. Check your current persistent storage allocation by running:
cerebrium cp src_file_name.txt dest_file_name.txt # copies to /dest_file_name.txt

# Copy a directory
cerebrium cp dir_name # copies to the root directory
cerebrium cp dir_name sub_folder/ # copies to sub_folder/
```

### Listing files

To list the files on your persistent volume, you can use the `cerebrium ls [remote_path]` command. This command lists all files and directories within the specified `remote_path`. If no `remote_path` is provided, it lists the contents of the root directory of the persistent volume.

```bash
cerebrium storage --get-capacity
Usage: cerebrium ls [OPTIONS] REMOTE_PATH (Optional)

List contents of persistent volume.

Options:
-h, --help Show this message and exit.

Examples:
# List all files in the root directory
cerebrium ls

# List all files in a specific folder
cerebrium ls sub_folder/
```

This will return your current persistent storage allocation in GB.
### Deleting files

2. To increase your persistent storage allocation run:
To delete files or directories from your persistent volume, use the `cerebrium rm remote_path` command. This command removes the specified file or directory from the persistent volume. Be careful, as this operation is irreversible.

```bash
cerebrium storage --increase-in-gb <number of GB to increase by>
Usage: cerebrium rm [OPTIONS] REMOTE_PATH

Remove a file or directory from persistent volume.

Options:
-h, --help Show this message and exit.

Examples:
# Remove a specific file
cerebrium rm /file_name.txt

# Remove a directory and all its contents
cerebrium rm /sub_folder/
```

### Real world example

```bash
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
cerebrium cp sam_vit_h_4b8939.pth segment-anything/sam_vit_h_4b8939.pth
```

As a simple example, suppose you have an external SAM model that you want to use in your custom deployment. You can download it to a cache directory on your persistent volume as shown above, and then load it as such:

```python
import os
import torch

file_path = "/persistent-storage/segment-anything/sam_vit_h_4b8939.pth"

# Load the model
model = torch.jit.load(file_path)
... # Continue with your initialization
```

This will return a confirmation message and your new persistent storage allocation in GB if successful.
Now, in later inference requests, the model loads from the persistent volume instead of downloading again.
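
If you'd rather fetch the checkpoint from inside your app than copy it up with `cerebrium cp`, a minimal download-once sketch (assuming the `requests` package is installed in your environment) could look like this:

```python
import os

import requests
import torch

CHECKPOINT_URL = "https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth"
CHECKPOINT_PATH = "/persistent-storage/segment-anything/sam_vit_h_4b8939.pth"

# Download the checkpoint only if it is not already on the persistent volume.
if not os.path.exists(CHECKPOINT_PATH):
    os.makedirs(os.path.dirname(CHECKPOINT_PATH), exist_ok=True)
    response = requests.get(CHECKPOINT_URL, timeout=600)
    response.raise_for_status()
    with open(CHECKPOINT_PATH, "wb") as f:
        f.write(response.content)

# Later cold starts skip the download and read straight from the volume.
# The SAM checkpoint is a plain state_dict; build your model and call load_state_dict as needed.
state_dict = torch.load(CHECKPOINT_PATH, map_location="cpu")
```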
37 changes: 0 additions & 37 deletions cerebrium/deployments/async-functions.mdx

This file was deleted.
