
Commit

Merge branch 'feature/rework'
jonoirwinrsa committed Nov 10, 2023
2 parents 1bfac82 + de0561e commit 0253324
Showing 22 changed files with 82 additions and 80 deletions.
6 changes: 4 additions & 2 deletions README.md
@@ -6,7 +6,8 @@ Cerebrium is an AWS Sagemaker alternative providing all the features you need to

### 🚀 Setup

-Install the [Mintlify CLI](https://www.npmjs.com/package/mintlify) to preview the documentation changes locally. To install, use the following command
+Install the [Mintlify CLI](https://www.npmjs.com/package/mintlify) to preview the documentation changes locally. To
+install, use the following command

```
npm i mintlify -g
@@ -20,7 +21,8 @@ Run the following command at the root of your Mintlify application to preview ch
mintlify dev
```

-Note - `mintlify dev` requires `yarn` and it's recommended you install it as a global installation. If you don't have yarn installed already run `npm install --global yarn` in your terminal.
+Note - `mintlify dev` requires `yarn` and it's recommended you install it as a global installation. If you don't have
+yarn installed already run `npm install --global yarn` in your terminal.

### 😎 Publishing Changes

18 changes: 9 additions & 9 deletions available-hardware.mdx
@@ -4,7 +4,7 @@ description: "A list of hardware that is available on Cerebrium's platform."
---

The Cerebrium platform allows you to quickly and easily deploy machine learning workloads on a variety of different hardware.
-We take care of all the hard work so you don't have to. Everything from the hardware drivers to the scaling of your deployments is managed by us so that you can focus on what matters most: your use case.
+We take care of all the hard work so you don't have to. We manage everything from the hardware drivers to the scaling of your deployments so that you can focus on what matters most: your use case.

This page lists the hardware that is currently available on the platform. If you would like us to support additional hardware options on the platform please reach out to [Support](mailto:[email protected])

@@ -30,7 +30,7 @@ These GPUs can be selected using the `--hardware` flag when deploying your model
For more help with deciding which GPU you require, see this section [here](#choosing-a-gpu).

_Due to the global shortage of GPUs at the moment, we may not always have the Enterprise edition of your GPU available. In this case, we will deploy to the Workstation edition of the GPU._
-_These are the same GPUs and it will not affect the performance of your model in any way._
+_These are the same GPUs, and it will not affect the performance of your model in any way._

## CPUs

@@ -47,7 +47,7 @@ Once again, you only pay for what you need!

## Storage

-We provide you with a persistent storage volume that is attached to your deployment.
+We provide you with a persistent storage volume attached to your deployment.
You can use this storage volume to store any data that you need to persist between deployments. Accessing your persistent storage is covered in depth for [cortex here](./cerebrium/data-sharing-storage/persistent-storage).

The storage volume is backed by high-performance SSDs so that you can get the best performance possible
@@ -56,7 +56,7 @@ Pricing for storage is based on the amount of storage you use and is charged per
# Determine your Hardware Requirements

Deciding which hardware you require for your deployment can be a daunting task.
-On one hand, you want the best performance possible but on the other hand, you don't want to pay for more resources than you need.
+On one hand, you want the best performance possible, but on the other hand, you don't want to pay for more resources than you need.

## Choosing a GPU

@@ -71,20 +71,20 @@ You can calculate the VRAM usage of your model by using the following formula:
modelVRAM = numParams x numBytesPerDataType
```

-For example, if you have a model that is 7B parameters and you decide to use 32-bit Floating point precision, you can calculate the VRAM usage as follows:
+For example, if you have a model that is 7B parameters, and you decide to use 32-bit Floating point precision, you can calculate the VRAM usage as follows:

```python
modelVRAM = 7B x 4 = 28GB
```

When you include the 1.5x multiplier from our rule of thumb, this means that you should choose a GPU with at least ~40GB of VRAM to ensure that your model will fit on the hardware you choose.

-Alternatively, if you were happy with the slight precision penalty of using quantisation, your model would have required 7GB of VRAM for 8-bit quantisation. So you could have chosen a GPU with 16GB of VRAM. This is the approach we recommend especially with large models (>20B parameters) as the precision penalty is minimal and your cost savings are substantial.
+Alternatively, if you were happy with the slight precision penalty of using quantisation, your model would have required 7GB of VRAM for 8-bit quantisation. So you could have chosen a GPU with 16GB of VRAM. This is the approach we recommend, especially with large models (>20B parameters) as the precision penalty is minimal and your cost savings are substantial.

<Note>
-  Pro tip: The precision loss from quantisation is negligible in comparison to
-  the performance gains you get from the larger model that can fit on the same
-  hardware.
+Pro tip: The precision loss from quantisation is negligible in comparison to
+the performance gains you get from the larger model that can fit on the same
+hardware.
</Note>
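
As a quick illustration of the formula and the 1.5x head-room rule of thumb from the text above, here is a rough sketch; the helper function and its name are the editor's assumptions, not part of the docs:

```python
def estimate_vram_gb(num_params_billions: float, bytes_per_param: int, headroom: float = 1.5) -> float:
    """Rough VRAM estimate in GB: parameters x bytes per parameter, plus head-room."""
    return num_params_billions * bytes_per_param * headroom

# 7B parameters at 32-bit precision (4 bytes): ~28 GB raw, ~42 GB with the 1.5x rule of thumb
print(estimate_vram_gb(7, 4))  # 42.0
# The same model quantised to 8-bit (1 byte): ~7 GB raw, ~10.5 GB with head-room
print(estimate_vram_gb(7, 1))  # 10.5
```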

## Setting your number of CPU Cores
4 changes: 2 additions & 2 deletions cerebrium/data-sharing-storage/persistent-storage.mdx
@@ -2,8 +2,8 @@
title: "Persistent Storage"
---

-Cerebrium gives to access to persistent storage in order to store model weights, files and much more. This storage volume persists across your project, meaning that if
-you refer to model weights or a file that was created in a different deployment, you will be able to access it!
+Cerebrium gives to access to persistent storage to store model weights, files and much more. This storage volume persists across your project, meaning that if
+you refer to model weights or a file created in a different deployment, you will be able to access it!

This allows you to load in model weights more efficiently as well as reduce the size of your deployment container images. Currently,
the volume can be accessed through `/persistent-storage` in your container instance, should you wish to access it directly and store other artifacts.
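
As a small, hypothetical example of using that path from your own code (the file names and directory layout below are placeholders chosen by the editor):

```python
import os
import torch

WEIGHTS_PATH = "/persistent-storage/my-model/weights.pt"  # hypothetical location on the persistent volume

def save_weights(model: torch.nn.Module) -> None:
    # Written once; later deployments in the same project can read it back
    os.makedirs(os.path.dirname(WEIGHTS_PATH), exist_ok=True)
    torch.save(model.state_dict(), WEIGHTS_PATH)

def load_weights(model: torch.nn.Module) -> torch.nn.Module:
    model.load_state_dict(torch.load(WEIGHTS_PATH))
    return model
```
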
4 changes: 2 additions & 2 deletions cerebrium/endpoints/rest-api.mdx
@@ -27,5 +27,5 @@ Responses then take the form:
}
```
-All responses if successful, will return a 200 on success and a 500 on error. If you would like to return custom status codes based on certain functionality
-such as 422, 404 etc just return the json parameter **status_code** from your \*main.py\*\*.
+All responses, if successful, will return a 200 on success and a 500 on error. If you would like to return custom status codes based on certain functionality
+such as 422, 404 etc, return the json parameter **status_code** from your \*main.py\*\*.
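
A minimal sketch of what returning a custom status code from your main.py might look like, assuming the request body carries a "prompt" field (that field is the editor's assumption):

```python
def predict(item, run_id, logger):
    if "prompt" not in item:
        logger.error("Missing 'prompt' in request body")
        # The "status_code" key in the returned JSON sets the HTTP status, per the behaviour described above
        return {"status_code": 422, "message": "prompt is required"}
    return {"result": f"Received prompt for run {run_id}"}
```
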
8 changes: 4 additions & 4 deletions cerebrium/environments/custom-images.mdx
@@ -5,14 +5,14 @@ description: Specify your versions, dependencies and packages to use

By default, Cerebrium models are executed in Python 3.9 unless the Python version specified by you in your **config.yaml** is different. However, Cerebrium only supports version 3.9 and above.

-Traditionally, when working with Python you will need access to Apt packages, Pip packages and Conda packages and so we replicate this functionality as if you were developing locally.
-When creating your Cortex project you can contain the following files
+Traditionally, when working with Python, you will need access to Apt packages, Pip packages and Conda packages, and so we replicate this functionality as if you were developing locally.
+When creating your Cortex project, you can contain the following files

- **requirements.txt** - This is where you define your Pip packages.
- **pkglist.txt** - This is where you can define Linux packages you would like to install. We run the apt-install command for items here.
- **conda_pkglist.txt** - This is where you can define Conda packages you would like to install if you prefer using it for some libraries over pip. You can use both conda and pip in conjunction.

-Each package must be represented on a new line just as you would locally. All the files above are optional however, have to contain these file names specifically.
+Each package must be represented on a new line just as you would locally. All the files above are optional, however, have to contain these file names specifically.

-Typically specifying verions for packages leads to faster builds however, if you ever find you would like to change version numbers or find your library versions aren't
+Typically, specifying versions for packages leads to faster builds however, if you ever find you would like to change version numbers or find your library versions aren't
updating, please add the following flag to your deploy command: `cerebrium deploy model-name --force-rebuild`
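
Purely as an illustration of the one-package-per-line format described above, the three optional files might look something like this (the package names are placeholders, not requirements):

```
# requirements.txt — one pip package per line
torch
transformers==4.30.2

# pkglist.txt — one apt package per line
git
ffmpeg

# conda_pkglist.txt — one conda package per line
cudatoolkit
```
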
2 changes: 1 addition & 1 deletion cerebrium/environments/initial-setup.mdx
@@ -44,7 +44,7 @@ The parameters for your config file are the same as those which you would use as
| `cpu` | The number of CPU cores to use | int | 2 |
| `memory` | The amount of Memory to use in GB | int | 14.5 |
| `log_level` | Log level for the deployment | string | INFO |
-| `include` | Local iles to include in the deployment | string | '[./*, main.py, requirements.txt, pkglist.txt, conda_pkglist.txt]' |
+| `include` | Local files to include in the deployment | string | '[./*, main.py, requirements.txt, pkglist.txt, conda_pkglist.txt]' |
| `exclude` | Local Files to exclude from the deployment | string | '[./.*, ./__*]' |
| `disable_animation` | Whether to disable the animation in the logs. | boolean | false |
| `python_version` | The Python version you would like to run | float | 3.9 |
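
Putting the example values from this table together, a config.yaml might look roughly like the sketch below; the values mirror the table's examples and are illustrative only:

```yaml
cpu: 2
memory: 14.5
log_level: INFO
include: '[./*, main.py, requirements.txt, pkglist.txt, conda_pkglist.txt]'
exclude: '[./.*, ./__*]'
disable_animation: false
python_version: 3.9
```
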
4 changes: 2 additions & 2 deletions cerebrium/environments/model-scaling.mdx
@@ -40,9 +40,9 @@ curl --location 'https://rest-api.cerebrium.ai/update-model-scaling' \
}'
```
-Just replace the values for `minReplicaCount`, `maxReplicaCount` and `cooldownPeriodSeconds` with your desired values. All values are optional so if you don't want to update your max replicas, just leave it out of the request body. Also make sure that `name` matches the name you gave your model when you deployed it!
+Replace the values for `minReplicaCount`, `maxReplicaCount` and `cooldownPeriodSeconds` with your desired values. All values are optional so if you don't want to update your max replicas, just leave it out of the request body. Also make sure that `name` matches the name you gave your model when you deployed it!
-You'll recieve the following confirmation response if successful:
+You'll receive the following confirmation response if successful:
```
{
2 changes: 1 addition & 1 deletion cerebrium/environments/warm-models.mdx
@@ -3,7 +3,7 @@ title: "Keep models warm"
description: "Based on traffic and implementation you might want to keep instances running"
---

-While we strive to lower cold start times and improve model loading onto the GPU, you may prefer to keep your instances warm and waiting in order to handle incoming requests.
+While we strive to lower cold start times and improve model loading onto the GPU, you may prefer to keep your instances warm and waiting to handle incoming requests.
There are two ways to do this based on your use case:

1. Set min replicas to 1 or more.
8 changes: 4 additions & 4 deletions cerebrium/faqs-and-help/fast-deployments-dos-and-donts.mdx
@@ -21,14 +21,14 @@ In this brief guide, we delve into the strategies, best practices, and pitfalls
### Deploying your code

- **Do** check (and lint) your code before deploying.
-  - Spelling mistakes and simple errors are easy to avoid and are one of the biggest handbrakes slowing down your development cycle.
+  - Spelling mistakes and simple errors are easy to avoid and are one of the biggest handbrakes slowing down your development cycle.
- **Do**, if possible, upload unchanging files to cloud storage.
-  Downloading your weights from these servers in your main.py (instead of uploading them from your local computer) leverages the faster internet speeds of these servers, yielding faster deployments.
+  Downloading your weights from these servers in your main.py (instead of uploading them from your local computer) leverages the faster internet speeds of these servers, yielding faster deployments.

- **Do** make use of the `--include` and `--exclude` flags to upload only the files you need.

- **Do** as much of your prerequisite setup once, in the body of the `main.py` outside of your predict function.
-  - Where possible, use make use of global variables to prevent re-computing variables every time you call inference.
+  - Where possible, use make use of global variables to prevent re-computing variables every time you call inference.

### Downloading files and setting up models

@@ -38,6 +38,6 @@

## For faster inference:

-- **Do** only use your inference calls for inference. As simple as it sounds, do all the setup outside of the predict function so that it runs once during the build process and is not repeated for every inference.
+- **Do** only use your inference calls for inference. As simple as it sounds, do all the setup outside the predict function so that it runs once during the build process and is not repeated for every inference.
- **Do** ensure you take advantage of resources if available. Ensure that your model is running on the GPU if available with:
`device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')`
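
To make the "set up once, infer many times" advice concrete, here is a rough sketch; the model choice, framework, and request fields are the editor's assumptions:

```python
import torch
from transformers import pipeline

# Runs once when the container starts, not on every request
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
generator = pipeline(
    "text-generation",
    model="gpt2",  # placeholder model
    device=0 if device.type == "cuda" else -1,
)

def predict(item, run_id, logger):
    # Only inference happens here; the pipeline above is reused across calls
    return {"result": generator(item["prompt"], max_new_tokens=50)[0]["generated_text"]}
```
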
2 changes: 1 addition & 1 deletion cerebrium/getting-started/introduction.mdx
@@ -33,7 +33,7 @@ You can send us feedback requests at [[email protected]](mailto:support@cereb
- Define pip/conda container environments in code
- Secrets manager
- One-click deploys
-- Persistant Storage
+- Persistent Storage

<b>All of this in just a few lines of code!</b>

8 changes: 4 additions & 4 deletions cerebrium/getting-started/quickstart.mdx
@@ -42,9 +42,9 @@ You need to define a function with the name **predict** which receives 3 params:
- **run_id**: This is a unique identifier for the user request if you want to use it to track predictions through another system
- **logger**: Cerebrium supports logging via the logger (we also support "print()" statements) however, using the logger will format your logs nicer. It contains
the 3 states across most loggers:
-  - logger.info
-  - logger.debug
-  - logger.error
+  - logger.info
+  - logger.debug
+  - logger.error
As long as your **main.py** contains the above you can write any other Python code. Import classes, add other functions etc.
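
A minimal sketch of such a main.py might look like the following; the body and return value are the editor's assumptions, not Cerebrium's reference implementation:

```python
# main.py
def predict(item, run_id, logger):
    logger.info(f"Handling request {run_id}")

    prompt = item.get("prompt", "")  # assumes the request JSON carries a "prompt" field
    if not prompt:
        logger.error("No prompt supplied")
        return {"status_code": 422, "message": "prompt is required"}

    logger.debug("Running inference")
    result = prompt.upper()  # placeholder for real model inference
    return {"result": result}
```
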
@@ -56,7 +56,7 @@ Then navigate to where your model code (specifically your `main.py`) is located
cerebrium deploy my-first-model
```
-Volia! Your app should start building and you should see logs of the deployment process. It shouldn't take longer than a minute - easy peasy!
+Voila! Your app should start building and you should see logs of the deployment process. It shouldn't take longer than a minute - easy peasy!
### View model statistics and logs
8 changes: 4 additions & 4 deletions cerebrium/misc/faster-model-loading.mdx
@@ -12,9 +12,9 @@ While we've optimised the underlying hardware to load models as fast as possible
## Tensorizer (recommended)

[Tensorizer](https://github.com/coreweave/tensorizer) is a library that allows you to load your model from storage into GPU memory in a single step.
-While initially built to fetch models from S3, it can be used to load models from file as well and so, can be used to load models from Cerebrium's persistent storage which features a near 2GB/s read speed.
-In the case of large models of 20B+ parameters, we've observed a **30-50%** decrease in model loading time which further increases with larger models.
-For more information on the underlying methods, take a look at their github page [here](https://github.com/coreweave/tensorizer).
+While initially built to fetch models from S3, it can be used to load models from file as well and so, can be used to load models from Cerebrium's persistent storage, which features a near 2GB/s read speed.
+In the case of large models (20B+ parameters), we've observed a **30–50%** decrease in model loading time which further increases with larger models.
+For more information on the underlying methods, take a look at their GitHub page [here](https://github.com/coreweave/tensorizer).

In this section below, we'll show you how to use **Tensorizer** to load your model from storage straight into GPU memory in a single step.

@@ -90,5 +90,5 @@ def deserialise_saved_model(model_path, model_id, plaid=True):
```

Note that your model does not need to be a transformers or even a huggingface model.
-If you have a diffusers, scikit learn or even a custom pytorch model, you can still use **Tensorizer** to load your model from storage into GPU memory in a single step.
+If you have a diffusers, scikit-learn or even a custom pytorch model, you can still use **Tensorizer** to load your model from storage into GPU memory in a single step.
The only requirement to obtain the speedup from deserialisation is that you can initialise an empty model. The Deserialiser object will then restore the weights into the empty model.
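
As a rough sketch of the empty-model-plus-deserialiser pattern described above (the paths, model name, and function name are assumptions, and the exact tensorizer API may differ between versions):

```python
import torch
from tensorizer import TensorDeserializer
from tensorizer.utils import no_init_or_tensor
from transformers import AutoConfig, AutoModelForCausalLM

def load_tensorized_model(
    model_path="/persistent-storage/model.tensors",  # assumed location on the persistent volume
    model_id="EleutherAI/gpt-j-6B",                  # placeholder model id
):
    # Build an *empty* model (no weight initialisation) from its config only
    config = AutoConfig.from_pretrained(model_id)
    model = no_init_or_tensor(lambda: AutoModelForCausalLM.from_config(config))

    # Stream the serialised tensors from storage straight into the empty model
    deserializer = TensorDeserializer(model_path, plaid_mode=True)
    deserializer.load_into_module(model)
    deserializer.close()

    return model.to("cuda" if torch.cuda.is_available() else "cpu")
```
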
2 changes: 1 addition & 1 deletion cerebrium/prebuilt-models/introduction.mdx
@@ -11,7 +11,7 @@ Cerebrium and its community keep a library of popular pre-built models that you
You can deploy prebuilt models via Cerebrium by using a simple one-click deploy from your dashboard by navigating to the <b>Prebuilt tab</b>. Otherwise, if
you would like to read through the source code, you can navigate to the [Cerebrium Prebuilts Github](https://github.com/CerebriumAI/cerebrium-prebuilts) where you can find the source code for each of the models.

-Each model's folder is a cortex deployment that can be deployed using the `cerebrium deploy` command. Simply navigate to the folder of the model you would like to deploy and run the command.
+Each model's folder is a cortex deployment that can be deployed using the `cerebrium deploy` command. Navigate to the folder of the model you would like to deploy and run the command.

```bash
cerebrium deploy <<your-name-for-your-model>>