Hardware to GPU

Katsie011 committed Dec 7, 2023
1 parent 34ef00a commit 4d06747
Showing 9 changed files with 13 additions and 14 deletions.
10 changes: 5 additions & 5 deletions available-hardware.mdx
@@ -60,10 +60,10 @@ On one hand, you want the best performance possible, but on the other hand, you

## Choosing a GPU

Choosing hardware can be a complicated task of calculating VRAM usage based on the number of parameters you have as well as the length of your inputs. Additionally, some variables are dependent on your inputs to your model which will affect the VRAM usage substantially. For example, with LLMs and transformer-based architectures, you need to factor in attention processes as well as any memory-heavy positional encoding that may be happening which can increase VRAM usage exponentially for some methods. Similarly, for CNNs, you need to look at the number of filters you are using as well as the size of your inputs.
Choosing a GPU can be a complicated task of calculating VRAM usage based on the number of parameters you have as well as the length of your inputs. Additionally, some variables are dependent on your inputs to your model which will affect the VRAM usage substantially. For example, with LLMs and transformer-based architectures, you need to factor in attention processes as well as any memory-heavy positional encoding that may be happening which can increase VRAM usage exponentially for some methods. Similarly, for CNNs, you need to look at the number of filters you are using as well as the size of your inputs.

As a rule of thumb, the easiest way is to choose the hardware that has at least 1.5x the minimum amount of VRAM that your model requires.
This approach is conservative and will ensure that your model will fit on the hardware you choose even if you have longer inputs than you expect. However, it is just a rule of thumb and you should test the VRAM usage of your model to ensure that it will fit on the hardware you choose.
As a rule of thumb, the easiest way is to choose the GPU that has at least 1.5x the minimum amount of VRAM that your model requires.
This approach is conservative and will ensure that your model will fit on the GPU you choose even if you have longer inputs than you expect. However, it is just a rule of thumb and you should test the VRAM usage of your model to ensure that it will fit on the GPU you choose.

You can calculate the VRAM usage of your model by using the following formula:

@@ -77,14 +77,14 @@ For example, if you have a model that is 7B parameters, and you decide to use 32-bit precision, the calculation works out to:

```
modelVRAM = 7B x 4 = 28GB
```

When you include the 1.5x multiplier from our rule of thumb, this means that you should choose a GPU with at least ~40GB of VRAM to ensure that your model will fit on the hardware you choose.
When you include the 1.5x multiplier from our rule of thumb, this means that you should choose a GPU with at least ~40GB of VRAM to ensure that your model will fit on the GPU you choose.

Alternatively, if you were happy with the slight precision penalty of using quantisation, your model would have required 7GB of VRAM for 8-bit quantisation. So you could have chosen a GPU with 16GB of VRAM. This is the approach we recommend, especially with large models (>20B parameters) as the precision penalty is minimal and your cost savings are substantial.

<Note>
Pro tip: The precision loss from quantisation is negligible in comparison to
the performance gains you get from the larger model that can fit on the same
hardware.
GPU.
</Note>
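
To make the arithmetic above concrete, here is a small Python sketch of the same back-of-the-envelope calculation: parameter count times bytes per parameter, scaled by the 1.5x rule-of-thumb headroom. The helper function and the byte counts per precision are illustrative only, not part of any Cerebrium API.

```python
# Back-of-the-envelope VRAM estimate for model weights only (illustrative, not a Cerebrium API).
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def estimate_vram_gb(num_params: float, precision: str = "fp32", headroom: float = 1.5) -> float:
    """Weight memory in GB, scaled by the 1.5x rule-of-thumb headroom."""
    weights_gb = num_params * BYTES_PER_PARAM[precision] / 1e9
    return weights_gb * headroom

print(estimate_vram_gb(7e9, "fp32"))  # 7B params at 32-bit: 28GB of weights, ~42GB with headroom
print(estimate_vram_gb(7e9, "int8"))  # the same model at 8-bit: 7GB of weights, ~10.5GB with headroom
```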

## Setting your number of CPU Cores
3 changes: 1 addition & 2 deletions cerebrium/getting-started/quickstart.mdx
@@ -11,8 +11,7 @@ cerebrium init first-project

Currently, our implementation has five components:

- **main.py** - This is where your Python code lives. This is mandatory to include.

- **main.py** - This is where your Python code lives. This is mandatory to include.
- **cerebrium.toml** - This is where you define all the configurations around your model such as the hardware you use, scaling parameters, deployment config, build parameters, etc. Check [here](../environments/initial-setup) for a full list.

Every main.py you deploy needs the following mandatory layout:
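
The mandatory layout itself sits in the collapsed part of this diff. Purely as a hedged sketch of what a minimal main.py might contain — the predict(item, run_id, logger) signature matches the examples shown later in this commit, while the input field and return value are illustrative assumptions:

```python
# Illustrative sketch only; the real mandatory layout is collapsed in this diff.
# The predict(item, run_id, logger) signature matches the examples later in this commit;
# the request field and return value are placeholders.

def predict(item, run_id, logger):
    logger.info(f"Received request {run_id}")
    prompt = item.get("prompt", "")        # assumed input field, for illustration
    return {"result": f"Echo: {prompt}"}   # placeholder output in place of real model inference
```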
2 changes: 1 addition & 1 deletion examples/langchain.mdx
@@ -146,7 +146,7 @@ We then integrate Langchain with a Cerebrium-deployed endpoint to answer questions

## Deploy

Your cerebrium.toml file is where you can set your compute/environment. Please make sure that the hardware you specify is an AMPERE_A5000, and that you have enough memory (RAM) on your instance to run the models. Your cerebrium.toml file should look like:
Your cerebrium.toml file is where you can set your compute/environment. Please make sure that the GPU you specify is an AMPERE_A5000, and that you have enough memory (RAM) on your instance to run the models. Your cerebrium.toml file should look like:

```toml

2 changes: 1 addition & 1 deletion examples/logo-controlnet.mdx
@@ -121,7 +121,7 @@ def predict(item, run_id, logger):

## Deploy

Your cerebrium.toml file is where you can set your compute/environment. Please make sure that the hardware you specify is an AMPERE_A5000 and that you have enough memory (RAM) on your instance to run the models. Your cerebrium.toml file should look like:
Your cerebrium.toml file is where you can set your compute/environment. Please make sure that the GPU you specify is an AMPERE_A5000 and that you have enough memory (RAM) on your instance to run the models. Your cerebrium.toml file should look like:

```toml

2 changes: 1 addition & 1 deletion examples/mistral-vllm.mdx
@@ -83,7 +83,7 @@ The implementation in our **predict** function is pretty straight forward in tha

## Deploy

Your cerebrium.toml file is where you can set your compute/environment. Please make sure that the hardware you specify is an AMPERE_A5000, and that you have enough memory (RAM) on your instance to run the models. Your cerebrium.toml file should look like:
Your cerebrium.toml file is where you can set your compute/environment. Please make sure that the GPU you specify is an AMPERE_A5000, and that you have enough memory (RAM) on your instance to run the models. Your cerebrium.toml file should look like:

```toml

2 changes: 1 addition & 1 deletion examples/sdxl.mdx
@@ -102,7 +102,7 @@ def predict(item, run_id, logger):

## Deploy

Your cerebrium.toml file is where you can set your compute/environment. Please make sure that the hardware you specify is an AMPERE_A5000 and that you have enough memory (RAM) on your instance to run the models. Your cerebrium.toml file should look like:
Your cerebrium.toml file is where you can set your compute/environment. Please make sure that the GPU you specify is an AMPERE_A5000 and that you have enough memory (RAM) on your instance to run the models. Your cerebrium.toml file should look like:

```toml

2 changes: 1 addition & 1 deletion examples/segment_anything.mdx
@@ -194,7 +194,7 @@ In the above code we do a few things:
We can then deploy our model to an AMPERE_A5000 instance with the following line of code

```bash
cerebrium deploy segment-anything --hardware AMPERE_A5000 --api-key private-XXXXXXXXXXXXX
cerebrium deploy segment-anything --gpu AMPERE_A5000 --api-key private-XXXXXXXXXXXXX
```

After a few minutes, your model should be deployed and an endpoint should be returned. Let us create a CURL request to see the response
2 changes: 1 addition & 1 deletion examples/streaming-falcon-7B.mdx
@@ -117,7 +117,7 @@ importantly, we use the **yield** keyword to return output from our model as its
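
The surrounding code is collapsed in this diff, but as a generic illustration of the streaming pattern described above — yielding partial output as it is produced instead of returning everything at once — here is a minimal sketch, where the token list is a stand-in for real model output:

```python
# Minimal illustration of yield-based streaming; the token list stands in for real model output.
def predict(item, run_id, logger):
    tokens = ["Streaming ", "tokens ", "one ", "at ", "a ", "time."]
    for token in tokens:
        yield token  # each chunk can be sent to the client as soon as it is produced
```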

## Deploy

Your cerebrium.toml file is where you can set your compute/environment. Please make sure that the hardware you specify is an AMPERE_A5000 and that you have enough memory (RAM) on your instance to run the models. Your cerebrium.toml file should look like:
Your cerebrium.toml file is where you can set your compute/environment. Please make sure that the GPU you specify is an AMPERE_A5000 and that you have enough memory (RAM) on your instance to run the models. Your cerebrium.toml file should look like:

```toml

2 changes: 1 addition & 1 deletion examples/transcribe-whisper.mdx
@@ -121,7 +121,7 @@ In our predict function, which only runs on inference requests, we simply create

## Deploy

Your cerebrium.toml file is where you can set your compute/environment. Please make sure that the hardware you specify is an AMPERE_A5000 and that you have enough memory (RAM) on your instance to run the models. Your cerebrium.toml file should look like:
Your cerebrium.toml file is where you can set your compute/environment. Please make sure that the GPU you specify is an AMPERE_A5000 and that you have enough memory (RAM) on your instance to run the models. Your cerebrium.toml file should look like:

```toml

