diff --git a/available-hardware.mdx b/available-hardware.mdx
index 30901857..e22ebba7 100644
--- a/available-hardware.mdx
+++ b/available-hardware.mdx
@@ -60,10 +60,10 @@ On one hand, you want the best performance possible, but on the other hand, you

 ## Choosing a GPU

-Choosing hardware can be a complicated task of calculating VRAM usage based on the number of parameters you have as well as the length of your inputs. Additionally, some variables are dependent on your inputs to your model which will affect the VRAM usage substantially. For example, with LLMs and transformer-based architectures, you need to factor in attention processes as well as any memory-heavy positional encoding that may be happening which can increase VRAM usage exponentially for some methods. Similarly, for CNNs, you need to look at the number of filters you are using as well as the size of your inputs.
+Choosing a GPU can be a complicated task of calculating VRAM usage based on the number of parameters you have as well as the length of your inputs. Additionally, some variables are dependent on your inputs to your model which will affect the VRAM usage substantially. For example, with LLMs and transformer-based architectures, you need to factor in attention processes as well as any memory-heavy positional encoding that may be happening which can increase VRAM usage exponentially for some methods. Similarly, for CNNs, you need to look at the number of filters you are using as well as the size of your inputs.

-As a rule of thumb, the easiest way is to choose the hardware that has at least 1.5x the minimum amount of VRAM that your model requires.
-This approach is conservative and will ensure that your model will fit on the hardware you choose even if you have longer inputs than you expect. However, it is just a rule of thumb and you should test the VRAM usage of your model to ensure that it will fit on the hardware you choose.
+As a rule of thumb, the easiest way is to choose the GPU that has at least 1.5x the minimum amount of VRAM that your model requires.
+This approach is conservative and will ensure that your model will fit on the GPU you choose even if you have longer inputs than you expect. However, it is just a rule of thumb and you should test the VRAM usage of your model to ensure that it will fit on the GPU you choose.

 You can calculate the VRAM usage of your model by using the following formula:

@@ -77,14 +77,14 @@ For example, if you have a model that is 7B parameters, and you decide to use 32
 modelVRAM = 7B x 4 = 28GB
 ```

-When you include the 1.5x multiplier from our rule of thumb, this means that you should choose a GPU with at least ~40GB of VRAM to ensure that your model will fit on the hardware you choose.
+When you include the 1.5x multiplier from our rule of thumb, this means that you should choose a GPU with at least ~40GB of VRAM to ensure that your model will fit on the GPU you choose.

 Alternatively, if you were happy with the slight precision penalty of using quantisation, your model would have required 7GB of VRAM for 8-bit quantisation. So you could have chosen a GPU with 16GB of VRAM. This is the approach we recommend, especially with large models (>20B parameters) as the precision penalty is minimal and your cost savings are substantial.

 Pro tip: The precision loss from quantisation is negligible in comparison to the performance gains you get from the larger model that can fit on the same
- hardware.
+ GPU.
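+
+If you prefer to script this check, the short Python sketch below applies the same arithmetic (parameter count times bytes per parameter, plus the 1.5x headroom). The `estimate_vram_gb` helper is purely illustrative and is not part of any Cerebrium tooling; adjust `bytes_per_param` for your precision (4 for 32-bit, 2 for 16-bit, 1 for 8-bit quantisation).
+
+```python
+def estimate_vram_gb(params_billion: float, bytes_per_param: float, headroom: float = 1.5) -> float:
+    """Rough VRAM estimate (GB) for holding the model weights, with safety headroom."""
+    weights_gb = params_billion * bytes_per_param  # e.g. 7B params x 4 bytes = 28 GB
+    return weights_gb * headroom
+
+# 7B model at 32-bit precision (4 bytes/param): 28 GB x 1.5 = 42 GB, in line with the ~40GB guidance above
+print(estimate_vram_gb(7, 4))  # 42.0
+
+# 7B model with 8-bit quantisation (1 byte/param): 7 GB x 1.5 = 10.5 GB, so a 16GB GPU is comfortable
+print(estimate_vram_gb(7, 1))  # 10.5
+```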
 ## Setting your number of CPU Cores
diff --git a/cerebrium/getting-started/quickstart.mdx b/cerebrium/getting-started/quickstart.mdx
index cfa4c611..9afe4e8d 100644
--- a/cerebrium/getting-started/quickstart.mdx
+++ b/cerebrium/getting-started/quickstart.mdx
@@ -11,8 +11,7 @@ cerebrium init first-project
 Currently, our implementation has five components:

-- **main.py** - This is where your Python code lives. This is mandatory to include.
-
+- **main.py** - This is where your Python code lives. This is mandatory to include.
 - **cerebrium.toml** - This is where you define all the configurations around your model such as the hardware you use, scaling parameters, deployment config, build parameters, etc. Check [here](../environments/initial-setup) for a full list

 Every main.py you deploy needs the following mandatory layout:
diff --git a/examples/langchain.mdx b/examples/langchain.mdx
index 17564cdb..4e5fd382 100644
--- a/examples/langchain.mdx
+++ b/examples/langchain.mdx
@@ -146,7 +146,7 @@ We then integrate Langchain with a Cerebrium deployed endpoint to answer questio
 ## Deploy

-Your cerebrium.toml file is where you can set your compute/environment. Please make sure that the hardware you specify is a AMPERE_A5000, and that you have enough memory (RAM) on your instance to run the models. You cerebrium.toml file should look like:
+Your cerebrium.toml file is where you can set your compute/environment. Please make sure that the GPU you specify is an AMPERE_A5000, and that you have enough memory (RAM) on your instance to run the models. Your cerebrium.toml file should look like:

 ```toml
diff --git a/examples/logo-controlnet.mdx b/examples/logo-controlnet.mdx
index 1c02f949..3d1d72f1 100644
--- a/examples/logo-controlnet.mdx
+++ b/examples/logo-controlnet.mdx
@@ -121,7 +121,7 @@ def predict(item, run_id, logger):
 ## Deploy

-Your cerebrium.toml file is where you can set your compute/environment. Please make sure that the hardware you specify is a AMPERE_A5000 and that you have enough memory (RAM) on your instance to run the models. You cerebrium.toml file should look like:
+Your cerebrium.toml file is where you can set your compute/environment. Please make sure that the GPU you specify is an AMPERE_A5000 and that you have enough memory (RAM) on your instance to run the models. Your cerebrium.toml file should look like:

 ```toml
diff --git a/examples/mistral-vllm.mdx b/examples/mistral-vllm.mdx
index 262df77a..e87d6b2f 100644
--- a/examples/mistral-vllm.mdx
+++ b/examples/mistral-vllm.mdx
@@ -83,7 +83,7 @@ The implementation in our **predict** function is pretty straight forward in tha
 ## Deploy

-Your cerebrium.toml file is where you can set your compute/environment. Please make sure that the hardware you specify is an AMPERE_A5000, and that you have enough memory (RAM) on your instance to run the models. Your cerebrium.toml file should look like:
+Your cerebrium.toml file is where you can set your compute/environment. Please make sure that the GPU you specify is an AMPERE_A5000, and that you have enough memory (RAM) on your instance to run the models. Your cerebrium.toml file should look like:

 ```toml
diff --git a/examples/sdxl.mdx b/examples/sdxl.mdx
index ff293ca2..062a4b57 100644
--- a/examples/sdxl.mdx
+++ b/examples/sdxl.mdx
@@ -102,7 +102,7 @@ def predict(item, run_id, logger):
 ## Deploy

-Your cerebrium.toml file is where you can set your compute/environment. Please make sure that the hardware you specify is a AMPERE_A5000 and that you have enough memory (RAM) on your instance to run the models. You cerebrium.toml file should look like:
+Your cerebrium.toml file is where you can set your compute/environment. Please make sure that the GPU you specify is an AMPERE_A5000 and that you have enough memory (RAM) on your instance to run the models. Your cerebrium.toml file should look like:

 ```toml
diff --git a/examples/segment_anything.mdx b/examples/segment_anything.mdx
index c385302a..ca143af1 100644
--- a/examples/segment_anything.mdx
+++ b/examples/segment_anything.mdx
@@ -194,7 +194,7 @@ In the above code we do a few things:
 We can then deploy our model to an AMPERE_A5000 instance with the following line of code

 ```bash
-cerebrium deploy segment-anything --hardware AMPERE_A5000 --api-key private-XXXXXXXXXXXXX
+cerebrium deploy segment-anything --gpu AMPERE_A5000 --api-key private-XXXXXXXXXXXXX
 ```

 After a few minutes, your model should be deployed and an endpoint should be returned. Let us create a CURL request to see the response
diff --git a/examples/streaming-falcon-7B.mdx b/examples/streaming-falcon-7B.mdx
index 7147d6e9..42ae31c1 100644
--- a/examples/streaming-falcon-7B.mdx
+++ b/examples/streaming-falcon-7B.mdx
@@ -117,7 +117,7 @@ importantly, we use the **yield** keyword to return output from our model as its
 ## Deploy

-Your cerebrium.toml file is where you can set your compute/environment. Please make sure that the hardware you specify is a AMPERE_A5000 and that you have enough memory (RAM) on your instance to run the models. You cerebrium.toml file should look like:
+Your cerebrium.toml file is where you can set your compute/environment. Please make sure that the GPU you specify is an AMPERE_A5000 and that you have enough memory (RAM) on your instance to run the models. Your cerebrium.toml file should look like:

 ```toml
diff --git a/examples/transcribe-whisper.mdx b/examples/transcribe-whisper.mdx
index f9af07b2..72024771 100644
--- a/examples/transcribe-whisper.mdx
+++ b/examples/transcribe-whisper.mdx
@@ -121,7 +121,7 @@ In our predict function, which only runs on inference requests, we simply create
 ## Deploy

-Your cerebrium.toml file is where you can set your compute/environment. Please make sure that the hardware you specify is a AMPERE_A5000 and that you have enough memory (RAM) on your instance to run the models. You cerebrium.toml file should look like:
+Your cerebrium.toml file is where you can set your compute/environment. Please make sure that the GPU you specify is an AMPERE_A5000 and that you have enough memory (RAM) on your instance to run the models. Your cerebrium.toml file should look like:

 ```toml