Hardware to GPU

Katsie011 committed Dec 7, 2023
1 parent 34ef00a commit 4d06747
Showing 9 changed files with 13 additions and 14 deletions.
10 changes: 5 additions & 5 deletions available-hardware.mdx
@@ -60,10 +60,10 @@ On one hand, you want the best performance possible, but on the other hand, you

## Choosing a GPU

Choosing hardware can be a complicated task of calculating VRAM usage based on the number of parameters you have as well as the length of your inputs. Additionally, some variables are dependent on your inputs to your model which will affect the VRAM usage substantially. For example, with LLMs and transformer-based architectures, you need to factor in attention processes as well as any memory-heavy positional encoding that may be happening which can increase VRAM usage exponentially for some methods. Similarly, for CNNs, you need to look at the number of filters you are using as well as the size of your inputs.
Choosing a GPU can be a complicated task of calculating VRAM usage based on the number of parameters you have as well as the length of your inputs. Additionally, some variables are dependent on your inputs to your model which will affect the VRAM usage substantially. For example, with LLMs and transformer-based architectures, you need to factor in attention processes as well as any memory-heavy positional encoding that may be happening which can increase VRAM usage exponentially for some methods. Similarly, for CNNs, you need to look at the number of filters you are using as well as the size of your inputs.

As a rule of thumb, the easiest way is to choose the hardware that has at least 1.5x the minimum amount of VRAM that your model requires.
This approach is conservative and will ensure that your model will fit on the hardware you choose even if you have longer inputs than you expect. However, it is just a rule of thumb and you should test the VRAM usage of your model to ensure that it will fit on the hardware you choose.
As a rule of thumb, the easiest way is to choose the GPU that has at least 1.5x the minimum amount of VRAM that your model requires.
This approach is conservative and will ensure that your model will fit on the GPU you choose even if you have longer inputs than you expect. However, it is just a rule of thumb and you should test the VRAM usage of your model to ensure that it will fit on the GPU you choose.

You can calculate the VRAM usage of your model by using the following formula:

@@ -77,14 +77,14 @@ For example, if you have a model that is 7B parameters, and you decide to use 32-bit precision, the calculation works out to:

```
modelVRAM = 7B x 4 = 28GB
```

When you include the 1.5x multiplier from our rule of thumb, this means that you should choose a GPU with at least ~40GB of VRAM to ensure that your model will fit on the hardware you choose.
When you include the 1.5x multiplier from our rule of thumb, this means that you should choose a GPU with at least ~40GB of VRAM to ensure that your model will fit on the GPU you choose.

Alternatively, if you were happy with the slight precision penalty of using quantisation, your model would have required 7GB of VRAM for 8-bit quantisation. So you could have chosen a GPU with 16GB of VRAM. This is the approach we recommend, especially with large models (>20B parameters) as the precision penalty is minimal and your cost savings are substantial.

<Note>
Pro tip: The precision loss from quantisation is negligible in comparison to
the performance gains you get from the larger model that can fit on the same
hardware.
GPU.
</Note>
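
To make the arithmetic above concrete, here is a small Python sketch of the same back-of-the-envelope calculation: parameter count times bytes per parameter, scaled by the 1.5x rule-of-thumb headroom. The helper function and the byte counts per precision are illustrative only, not part of any Cerebrium API.

```python
# Back-of-the-envelope VRAM estimate for model weights only (illustrative, not a Cerebrium API).
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def estimate_vram_gb(num_params: float, precision: str = "fp32", headroom: float = 1.5) -> float:
    """Weight memory in GB, scaled by the 1.5x rule-of-thumb headroom."""
    weights_gb = num_params * BYTES_PER_PARAM[precision] / 1e9
    return weights_gb * headroom

print(estimate_vram_gb(7e9, "fp32"))  # 7B params at 32-bit: 28GB of weights, ~42GB with headroom
print(estimate_vram_gb(7e9, "int8"))  # the same model at 8-bit: 7GB of weights, ~10.5GB with headroom
```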

## Setting your number of CPU Cores
3 changes: 1 addition & 2 deletions cerebrium/getting-started/quickstart.mdx
@@ -11,8 +11,7 @@ cerebrium init first-project

Currently, our implementation has five components:

- **main.py** - This is where your Python code lives. This is mandatory to include.

- **main.py** - This is where your Python code lives. This is mandatory to include.
- **cerebrium.toml** - This is where you define all the configurations around your model such as the hardware you use, scaling parameters, deployment config, build parameters, etc. Check [here](../environments/initial-setup) for a full list.

Every main.py you deploy needs the following mandatory layout:
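
The mandatory layout itself sits in the collapsed part of this diff. Purely as a hedged sketch of what a minimal main.py might contain — the predict(item, run_id, logger) signature matches the examples shown later in this commit, while the input field and return value are illustrative assumptions:

```python
# Illustrative sketch only; the real mandatory layout is collapsed in this diff.
# The predict(item, run_id, logger) signature matches the examples later in this commit;
# the request field and return value are placeholders.

def predict(item, run_id, logger):
    logger.info(f"Received request {run_id}")
    prompt = item.get("prompt", "")        # assumed input field, for illustration
    return {"result": f"Echo: {prompt}"}   # placeholder output in place of real model inference
```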
2 changes: 1 addition & 1 deletion examples/langchain.mdx
@@ -146,7 +146,7 @@ We then integrate Langchain with a Cerebrium-deployed endpoint to answer questions

## Deploy

Your cerebrium.toml file is where you can set your compute/environment. Please make sure that the hardware you specify is an AMPERE_A5000, and that you have enough memory (RAM) on your instance to run the models. Your cerebrium.toml file should look like:
Your cerebrium.toml file is where you can set your compute/environment. Please make sure that the GPU you specify is an AMPERE_A5000, and that you have enough memory (RAM) on your instance to run the models. Your cerebrium.toml file should look like:

```toml

2 changes: 1 addition & 1 deletion examples/logo-controlnet.mdx
@@ -121,7 +121,7 @@ def predict(item, run_id, logger):

## Deploy

Your cerebrium.toml file is where you can set your compute/environment. Please make sure that the hardware you specify is an AMPERE_A5000 and that you have enough memory (RAM) on your instance to run the models. Your cerebrium.toml file should look like:
Your cerebrium.toml file is where you can set your compute/environment. Please make sure that the GPU you specify is an AMPERE_A5000 and that you have enough memory (RAM) on your instance to run the models. Your cerebrium.toml file should look like:

```toml

2 changes: 1 addition & 1 deletion examples/mistral-vllm.mdx
@@ -83,7 +83,7 @@ The implementation in our **predict** function is pretty straight forward in tha

## Deploy

Your cerebrium.toml file is where you can set your compute/environment. Please make sure that the hardware you specify is an AMPERE_A5000, and that you have enough memory (RAM) on your instance to run the models. Your cerebrium.toml file should look like:
Your cerebrium.toml file is where you can set your compute/environment. Please make sure that the GPU you specify is an AMPERE_A5000, and that you have enough memory (RAM) on your instance to run the models. Your cerebrium.toml file should look like:

```toml

2 changes: 1 addition & 1 deletion examples/sdxl.mdx
@@ -102,7 +102,7 @@ def predict(item, run_id, logger):

## Deploy

Your cerebrium.toml file is where you can set your compute/environment. Please make sure that the hardware you specify is an AMPERE_A5000 and that you have enough memory (RAM) on your instance to run the models. Your cerebrium.toml file should look like:
Your cerebrium.toml file is where you can set your compute/environment. Please make sure that the GPU you specify is an AMPERE_A5000 and that you have enough memory (RAM) on your instance to run the models. Your cerebrium.toml file should look like:

```toml

2 changes: 1 addition & 1 deletion examples/segment_anything.mdx
@@ -194,7 +194,7 @@ In the above code we do a few things:
We can then deploy our model to an AMPERE_A5000 instance with the following line of code

```bash
cerebrium deploy segment-anything --hardware AMPERE_A5000 --api-key private-XXXXXXXXXXXXX
cerebrium deploy segment-anything --gpu AMPERE_A5000 --api-key private-XXXXXXXXXXXXX
```

After a few minutes, your model should be deployed and an endpoint should be returned. Let us create a CURL request to see the response
2 changes: 1 addition & 1 deletion examples/streaming-falcon-7B.mdx
@@ -117,7 +117,7 @@ importantly, we use the **yield** keyword to return output from our model as its
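
The surrounding code is collapsed in this diff, but as a generic illustration of the streaming pattern described above — yielding partial output as it is produced instead of returning everything at once — here is a minimal sketch, where the token list is a stand-in for real model output:

```python
# Minimal illustration of yield-based streaming; the token list stands in for real model output.
def predict(item, run_id, logger):
    tokens = ["Streaming ", "tokens ", "one ", "at ", "a ", "time."]
    for token in tokens:
        yield token  # each chunk can be sent to the client as soon as it is produced
```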

## Deploy

Your cerebrium.toml file is where you can set your compute/environment. Please make sure that the hardware you specify is an AMPERE_A5000 and that you have enough memory (RAM) on your instance to run the models. Your cerebrium.toml file should look like:
Your cerebrium.toml file is where you can set your compute/environment. Please make sure that the GPU you specify is an AMPERE_A5000 and that you have enough memory (RAM) on your instance to run the models. Your cerebrium.toml file should look like:

```toml

2 changes: 1 addition & 1 deletion examples/transcribe-whisper.mdx
@@ -121,7 +121,7 @@ In our predict function, which only runs on inference requests, we simply create

## Deploy

Your cerebrium.toml file is where you can set your compute/environment. Please make sure that the hardware you specify is an AMPERE_A5000 and that you have enough memory (RAM) on your instance to run the models. Your cerebrium.toml file should look like:
Your cerebrium.toml file is where you can set your compute/environment. Please make sure that the GPU you specify is an AMPERE_A5000 and that you have enough memory (RAM) on your instance to run the models. Your cerebrium.toml file should look like:

```toml

