From de0561e22bf0ff3938ed653b4913ab6d6f8a5775 Mon Sep 17 00:00:00 2001 From: Jonathan Irwin Date: Thu, 9 Nov 2023 22:03:01 -0500 Subject: [PATCH] fix: spelling, grammar and linting --- README.md | 6 ++++-- available-hardware.mdx | 18 ++++++++--------- .../persistent-storage.mdx | 4 ++-- cerebrium/endpoints/rest-api.mdx | 4 ++-- cerebrium/environments/custom-images.mdx | 8 ++++---- cerebrium/environments/initial-setup.mdx | 2 +- cerebrium/environments/model-scaling.mdx | 4 ++-- cerebrium/environments/warm-models.mdx | 2 +- .../fast-deployments-dos-and-donts.mdx | 8 ++++---- cerebrium/getting-started/introduction.mdx | 2 +- cerebrium/getting-started/quickstart.mdx | 8 ++++---- cerebrium/misc/faster-model-loading.mdx | 8 ++++---- cerebrium/prebuilt-models/introduction.mdx | 2 +- changelog.mdx | 20 +++++++++---------- examples/langchain.mdx | 4 ++-- examples/logo-controlnet.mdx | 14 ++++++------- examples/mistral-vllm.mdx | 6 +++--- examples/sdxl.mdx | 6 +++--- examples/segment_anything.mdx | 8 ++++---- examples/streaming-falcon-7B.mdx | 10 +++++----- examples/transcribe-whisper.mdx | 6 +++--- reliability-scalability.mdx | 12 +++++------ 22 files changed, 82 insertions(+), 80 deletions(-) diff --git a/README.md b/README.md index f3f82bcd..00985438 100644 --- a/README.md +++ b/README.md @@ -6,7 +6,8 @@ Cerebrium is an AWS Sagemaker alternative providing all the features you need to ### 🚀 Setup -Install the [Mintlify CLI](https://www.npmjs.com/package/mintlify) to preview the documentation changes locally. To install, use the following command +Install the [Mintlify CLI](https://www.npmjs.com/package/mintlify) to preview the documentation changes locally. To +install, use the following command ``` npm i mintlify -g @@ -20,7 +21,8 @@ Run the following command at the root of your Mintlify application to preview ch mintlify dev ``` -Note - `mintlify dev` requires `yarn` and it's recommended you install it as a global installation. If you don't have yarn installed already run `npm install --global yarn` in your terminal. +Note - `mintlify dev` requires `yarn` and it's recommended you install it as a global installation. If you don't have +yarn installed already run `npm install --global yarn` in your terminal. ### 😎 Publishing Changes diff --git a/available-hardware.mdx b/available-hardware.mdx index e90afe48..cf42d301 100644 --- a/available-hardware.mdx +++ b/available-hardware.mdx @@ -4,7 +4,7 @@ description: "A list of hardware that is available on Cerebrium's platform." --- The Cerebrium platform allows you to quickly and easily deploy machine learning workloads on a variety of different hardware. -We take care of all the hard work so you don't have to. Everything from the hardware drivers to the scaling of your deployments is managed by us so that you can focus on what matters most: your use case. +We take care of all the hard work so you don't have to. We manage everything from the hardware drivers to the scaling of your deployments so that you can focus on what matters most: your use case. This page lists the hardware that is currently available on the platform. If you would like us to support additional hardware options on the platform please reach out to [Support](mailto:support@cerebrium.ai) @@ -30,7 +30,7 @@ These GPUs can be selected using the `--hardware` flag when deploying your model For more help with deciding which GPU you require, see this section [here](#choosing-a-gpu). 
_Due to the global shortage of GPUs at the moment, we may not always have the Enterprise edition of your GPU available. In this case, we will deploy to the Workstation edition of the GPU._ -_These are the same GPUs and it will not affect the performance of your model in any way._ +_These are the same GPUs, and it will not affect the performance of your model in any way._ ## CPUs @@ -47,7 +47,7 @@ Once again, you only pay for what you need! ## Storage -We provide you with a persistent storage volume that is attached to your deployment. +We provide you with a persistent storage volume attached to your deployment. You can use this storage volume to store any data that you need to persist between deployments. Accessing your persistent storage is covered in depth for [cortex here](./cerebrium/data-sharing-storage/persistent-storage). The storage volume is backed by high-performance SSDs so that you can get the best performance possible @@ -56,7 +56,7 @@ Pricing for storage is based on the amount of storage you use and is charged per # Determine your Hardware Requirements Deciding which hardware you require for your deployment can be a daunting task. -On one hand, you want the best performance possible but on the other hand, you don't want to pay for more resources than you need. +On one hand, you want the best performance possible, but on the other hand, you don't want to pay for more resources than you need. ## Choosing a GPU @@ -71,7 +71,7 @@ You can calculate the VRAM usage of your model by using the following formula: modelVRAM = numParams x numBytesPerDataType ``` -For example, if you have a model that is 7B parameters and you decide to use 32-bit Floating point precision, you can calculate the VRAM usage as follows: +For example, if you have a model that is 7B parameters, and you decide to use 32-bit Floating point precision, you can calculate the VRAM usage as follows: ```python modelVRAM = 7B x 4 = 28GB @@ -79,12 +79,12 @@ modelVRAM = 7B x 4 = 28GB When you include the 1.5x multiplier from our rule of thumb, this means that you should choose a GPU with at least ~40GB of VRAM to ensure that your model will fit on the hardware you choose. -Alternatively, if you were happy with the slight precision penalty of using quantisation, your model would have required 7GB of VRAM for 8-bit quantisation. So you could have chosen a GPU with 16GB of VRAM. This is the approach we recommend especially with large models (>20B parameters) as the precision penalty is minimal and your cost savings are substantial. +Alternatively, if you were happy with the slight precision penalty of using quantisation, your model would have required 7GB of VRAM for 8-bit quantisation. So you could have chosen a GPU with 16GB of VRAM. This is the approach we recommend, especially with large models (>20B parameters) as the precision penalty is minimal and your cost savings are substantial. - Pro tip: The precision loss from quantisation is negligible in comparison to - the performance gains you get from the larger model that can fit on the same - hardware. + Pro tip: The precision loss from quantisation is negligible in comparison to + the performance gains you get from the larger model that can fit on the same + hardware. 
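To make the rule of thumb above concrete, here is a small helper (an illustrative sketch, not part of the Cerebrium SDK or CLI) that applies the `numParams x numBytesPerDataType` formula together with the 1.5x headroom multiplier:

```python
def estimate_vram_gb(num_params_billions: float, bytes_per_param: int, headroom: float = 1.5) -> float:
    """Rough VRAM estimate (in GB) for serving a model.

    bytes_per_param: 4 for 32-bit floats, 2 for 16-bit, 1 for 8-bit quantisation.
    headroom: the ~1.5x rule of thumb that leaves room for activations and overhead.
    """
    return num_params_billions * bytes_per_param * headroom


print(estimate_vram_gb(7, 4))  # ~42.0 GB -> choose a GPU with at least ~40GB of VRAM
print(estimate_vram_gb(7, 1))  # ~10.5 GB -> a 16GB GPU is more than enough
```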
## Setting your number of CPU Cores diff --git a/cerebrium/data-sharing-storage/persistent-storage.mdx b/cerebrium/data-sharing-storage/persistent-storage.mdx index a057b92e..408b9cf1 100644 --- a/cerebrium/data-sharing-storage/persistent-storage.mdx +++ b/cerebrium/data-sharing-storage/persistent-storage.mdx @@ -2,8 +2,8 @@ title: "Persistent Storage" --- -Cerebrium gives to access to persistent storage in order to store model weights, files and much more. This storage volume persists across your project, meaning that if -you refer to model weights or a file that was created in a different deployment, you will be able to access it! +Cerebrium gives to access to persistent storage to store model weights, files and much more. This storage volume persists across your project, meaning that if +you refer to model weights or a file created in a different deployment, you will be able to access it! This allows you to load in model weights more efficiently as well as reduce the size of your deployment container images. Currently, the volume can be accessed through `/persistent-storage` in your container instance, should you wish to access it directly and store other artifacts. diff --git a/cerebrium/endpoints/rest-api.mdx b/cerebrium/endpoints/rest-api.mdx index 60bb796b..cee59c70 100644 --- a/cerebrium/endpoints/rest-api.mdx +++ b/cerebrium/endpoints/rest-api.mdx @@ -27,5 +27,5 @@ Responses then take the form: } ``` -All responses if successful, will return a 200 on success and a 500 on error. If you would like to return custom status codes based on certain functionality -such as 422, 404 etc just return the json parameter **status_code** from your \*main.py\*\*. +All responses, if successful, will return a 200 on success and a 500 on error. If you would like to return custom status codes based on certain functionality +such as 422, 404 etc, return the json parameter **status_code** from your \*main.py\*\*. diff --git a/cerebrium/environments/custom-images.mdx b/cerebrium/environments/custom-images.mdx index d6f757f4..6532b15c 100644 --- a/cerebrium/environments/custom-images.mdx +++ b/cerebrium/environments/custom-images.mdx @@ -5,14 +5,14 @@ description: Specify your versions, dependencies and packages to use By default, Cerebrium models are executed in Python 3.9 unless the Python version specified by you in your **config.yaml** is different. However, Cerebrium only supports version 3.9 and above. -Traditionally, when working with Python you will need access to Apt packages, Pip packages and Conda packages and so we replicate this functionality as if you were developing locally. -When creating your Cortex project you can contain the following files +Traditionally, when working with Python, you will need access to Apt packages, Pip packages and Conda packages, and so we replicate this functionality as if you were developing locally. +When creating your Cortex project, you can contain the following files - **requirements.txt** - This is where you define your Pip packages. - **pkglist.txt** - This is where you can define Linux packages you would like to install. We run the apt-install command for items here. - **conda_pkglist.txt** - This is where you can define Conda packages you would like to install if you prefer using it for some libraries over pip. You can use both conda and pip in conjunction. -Each package must be represented on a new line just as you would locally. All the files above are optional however, have to contain these file names specifically. 
+Each package must be represented on a new line just as you would locally. All the files above are optional, however, have to contain these file names specifically. -Typically specifying verions for packages leads to faster builds however, if you ever find you would like to change version numbers or find your library versions aren't +Typically, specifying versions for packages leads to faster builds however, if you ever find you would like to change version numbers or find your library versions aren't updating, please add the following flag to your deploy command: `cerebrium deploy model-name --force-rebuild` diff --git a/cerebrium/environments/initial-setup.mdx b/cerebrium/environments/initial-setup.mdx index 71e1f719..75ba567b 100644 --- a/cerebrium/environments/initial-setup.mdx +++ b/cerebrium/environments/initial-setup.mdx @@ -44,7 +44,7 @@ The parameters for your config file are the same as those which you would use as | `cpu` | The number of CPU cores to use | int | 2 | | `memory` | The amount of Memory to use in GB | int | 14.5 | | `log_level` | Log level for the deployment | string | INFO | -| `include` | Local iles to include in the deployment | string | '[./*, main.py, requirements.txt, pkglist.txt, conda_pkglist.txt]' | +| `include` | Local files to include in the deployment | string | '[./*, main.py, requirements.txt, pkglist.txt, conda_pkglist.txt]' | | `exclude` | Local Files to exclude from the deployment | string | '[./.*, ./__*]' | | `disable_animation` | Whether to disable the animation in the logs. | boolean | false | | `python_version` | The Python version you would like to run | float | 3.9 | diff --git a/cerebrium/environments/model-scaling.mdx b/cerebrium/environments/model-scaling.mdx index 80196e62..c2b9b049 100644 --- a/cerebrium/environments/model-scaling.mdx +++ b/cerebrium/environments/model-scaling.mdx @@ -40,9 +40,9 @@ curl --location 'https://rest-api.cerebrium.ai/update-model-scaling' \ }' ``` -Just replace the values for `minReplicaCount`, `maxReplicaCount` and `cooldownPeriodSeconds` with your desired values. All values are optional so if you don't want to update your max replicas, just leave it out of the request body. Also make sure that `name` matches the name you gave your model when you deployed it! +Replace the values for `minReplicaCount`, `maxReplicaCount` and `cooldownPeriodSeconds` with your desired values. All values are optional so if you don't want to update your max replicas, just leave it out of the request body. Also make sure that `name` matches the name you gave your model when you deployed it! -You'll recieve the following confirmation response if successful: +You'll receive the following confirmation response if successful: ``` { diff --git a/cerebrium/environments/warm-models.mdx b/cerebrium/environments/warm-models.mdx index 8553a187..de69a49b 100644 --- a/cerebrium/environments/warm-models.mdx +++ b/cerebrium/environments/warm-models.mdx @@ -3,7 +3,7 @@ title: "Keep models warm" description: "Based on traffic and implementation you might want to keep instances running" --- -While we strive to lower cold start times and improve model loading onto the GPU, you may prefer to keep your instances warm and waiting in order to handle incoming requests. +While we strive to lower cold start times and improve model loading onto the GPU, you may prefer to keep your instances warm and waiting to handle incoming requests. There are two ways to do this based on your use case: 1. Set min replicas to 1 or more. 
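If you prefer to update scaling from Python rather than curl, the sketch below mirrors the `update-model-scaling` request shown earlier and sets `minReplicaCount` to 1 to keep a warm instance running. The API key, model name, and exact auth header layout are placeholders and assumptions - check them against the curl example and your dashboard before use:

```python
import requests

API_KEY = "<your-private-api-key>"  # placeholder - taken from your Cerebrium dashboard

payload = {
    "name": "my-first-model",      # must match the name you gave your model at deploy time
    "minReplicaCount": 1,          # keep at least one instance warm
    "cooldownPeriodSeconds": 300,  # optional - omit any field you don't want to change
}

response = requests.post(
    "https://rest-api.cerebrium.ai/update-model-scaling",
    headers={"Authorization": API_KEY},  # header name assumed to match the curl example above
    json=payload,
    timeout=30,
)
print(response.status_code, response.json())
```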
diff --git a/cerebrium/faqs-and-help/fast-deployments-dos-and-donts.mdx b/cerebrium/faqs-and-help/fast-deployments-dos-and-donts.mdx index db66a9b3..747d887c 100644 --- a/cerebrium/faqs-and-help/fast-deployments-dos-and-donts.mdx +++ b/cerebrium/faqs-and-help/fast-deployments-dos-and-donts.mdx @@ -21,14 +21,14 @@ In this brief guide, we delve into the strategies, best practices, and pitfalls ### Deploying your code - **Do** check (and lint) your code before deploying. - - Spelling mistakes and simple errors are easy to avoid and are one of the biggest handbrakes slowing down your development cycle. +- Spelling mistakes and simple errors are easy to avoid and are one of the biggest handbrakes slowing down your development cycle. - **Do**, if possible, upload unchanging files to cloud storage. - Downloading your weights from these servers in your main.py (instead of uploading them from your local computer) leverages the faster internet speeds of these servers, yielding faster deployments. + Downloading your weights from these servers in your main.py (instead of uploading them from your local computer) leverages the faster internet speeds of these servers, yielding faster deployments. - **Do** make use of the `--include` and `--exclude` flags to upload only the files you need. - **Do** as much of your prerequisite setup once, in the body of the `main.py` outside of your predict function. - - Where possible, use make use of global variables to prevent re-computing variables every time you call inference. +- Where possible, use make use of global variables to prevent re-computing variables every time you call inference. ### Downloading files and setting up models @@ -38,6 +38,6 @@ In this brief guide, we delve into the strategies, best practices, and pitfalls ## For faster inference: -- **Do** only use your inference calls for inference. As simple as it sounds, do all the setup outside of the predict function so that it runs once during the build process and is not repeated for every inference. +- **Do** only use your inference calls for inference. As simple as it sounds, do all the setup outside the predict function so that it runs once during the build process and is not repeated for every inference. - **Do** ensure you take advantage of resources if available. Ensure that your model is running on the GPU if available with: `device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')` diff --git a/cerebrium/getting-started/introduction.mdx b/cerebrium/getting-started/introduction.mdx index 0160d5d2..653d4d8e 100644 --- a/cerebrium/getting-started/introduction.mdx +++ b/cerebrium/getting-started/introduction.mdx @@ -33,7 +33,7 @@ You can send us feedback requests at [support@cerebrium.ai](mailto:support@cereb - Define pip/conda container environments in code - Secrets manager - One-click deploys -- Persistant Storage +- Persistent Storage All of this in just a few lines of code! diff --git a/cerebrium/getting-started/quickstart.mdx b/cerebrium/getting-started/quickstart.mdx index 5c298f23..172e88af 100644 --- a/cerebrium/getting-started/quickstart.mdx +++ b/cerebrium/getting-started/quickstart.mdx @@ -42,9 +42,9 @@ You need to define a function with the name **predict** which receives 3 params: - **run_id**: This is a unique identifier for the user request if you want to use it to track predictions through another system - **logger**: Cerebrium supports logging via the logger (we also support "print()" statements) however, using the logger will format your logs nicer. 
It contains the 3 states across most loggers: - - logger.info - - logger.debug - - logger.error +- logger.info +- logger.debug +- logger.error As long as your **main.py** contains the above you can write any other Python code. Import classes, add other functions etc. @@ -56,7 +56,7 @@ Then navigate to where your model code (specifically your `main.py`) is located cerebrium deploy my-first-model ``` -Volia! Your app should start building and you should see logs of the deployment process. It shouldn't take longer than a minute - easy peasy! +Voila! Your app should start building and you should see logs of the deployment process. It shouldn't take longer than a minute - easy peasy! ### View model statistics and logs diff --git a/cerebrium/misc/faster-model-loading.mdx b/cerebrium/misc/faster-model-loading.mdx index 83fa2b90..d37ba32e 100644 --- a/cerebrium/misc/faster-model-loading.mdx +++ b/cerebrium/misc/faster-model-loading.mdx @@ -12,9 +12,9 @@ While we've optimised the underlying hardware to load models as fast as possible ## Tensorizer (recommended) [Tensorizer](https://github.com/coreweave/tensorizer) is a library that allows you to load your model from storage into GPU memory in a single step. -While initially built to fetch models from S3, it can be used to load models from file as well and so, can be used to load models from Cerebrium's persistent storage which features a near 2GB/s read speed. -In the case of large models of 20B+ parameters, we've observed a **30-50%** decrease in model loading time which further increases with larger models. -For more information on the underlying methods, take a look at their github page [here](https://github.com/coreweave/tensorizer). +While initially built to fetch models from S3, it can be used to load models from file as well and so, can be used to load models from Cerebrium's persistent storage, which features a near 2GB/s read speed. +In the case of large models (20B+ parameters), we've observed a **30–50%** decrease in model loading time which further increases with larger models. +For more information on the underlying methods, take a look at their GitHub page [here](https://github.com/coreweave/tensorizer). In this section below, we'll show you how to use **Tensorizer** to load your model from storage straight into GPU memory in a single step. @@ -90,5 +90,5 @@ def deserialise_saved_model(model_path, model_id, plaid=True): ``` Note that your model does not need to be a transformers or even a huggingface model. -If you have a diffusers, scikit learn or even a custom pytorch model, you can still use **Tensorizer** to load your model from storage into GPU memory in a single step. +If you have a diffusers, scikit-learn or even a custom pytorch model, you can still use **Tensorizer** to load your model from storage into GPU memory in a single step. The only requirement to obtain the speedup from deserialisation is that you can initialise an empty model. The Deserialiser object will then restore the weights into the empty model. diff --git a/cerebrium/prebuilt-models/introduction.mdx b/cerebrium/prebuilt-models/introduction.mdx index e7cb0734..1d4ceb84 100644 --- a/cerebrium/prebuilt-models/introduction.mdx +++ b/cerebrium/prebuilt-models/introduction.mdx @@ -11,7 +11,7 @@ Cerebrium and its community keep a library of popular pre-built models that you You can deploy prebuilt models via Cerebrium by using a simple one-click deploy from your dashboard by navigating to the Prebuilt tab. 
Otherwise, if you would like to read through the source code, you can navigate to the [Cerebrium Prebuilts Github](https://github.com/CerebriumAI/cerebrium-prebuilts) where you can find the source code for each of the models. -Each model's folder is a cortex deployment that can be deployed using the `cerebrium deploy` command. Simply navigate to the folder of the model you would like to deploy and run the command. +Each model's folder is a cortex deployment that can be deployed using the `cerebrium deploy` command. Navigate to the folder of the model you would like to deploy and run the command. ```bash cerebrium deploy <> diff --git a/changelog.mdx b/changelog.mdx index 972ae549..fab5dbfd 100644 --- a/changelog.mdx +++ b/changelog.mdx @@ -44,11 +44,11 @@ description: "We release features and fixes regularly! See some of the changes w - Myriad of dependency updates. - ONNX Models can now be used in multi-model flows (this is an experimental feature). - Bug fixes: - - PyTorch Models now detach data from the GPU correctly when running Conduits. - - Fixed ONNX models input data being malformed. - - Fix CUDA driver not being loaded correctly in the Conduit runtime. +- PyTorch Models now detach data from the GPU correctly when running Conduits. +- Fixed ONNX models input data being malformed. +- Fix CUDA driver not being loaded correctly in the Conduit runtime. - LLM updates - - FLAN-T5 and Stable Diffusion no longer suffer from timeouts and severe cold starts. +- FLAN-T5 and Stable Diffusion no longer suffer from timeouts and severe cold starts. ## 0.4.1 @@ -62,14 +62,14 @@ description: "We release features and fixes regularly! See some of the changes w on feedback. We are also working on adding support for other monitoring tools, with Arize being on track to be released in Q1 2023. - Fixed a bug where the client was unable to deploy SKLearn models, XGB Regressor models, and ONNX models. - Reworked the response signature for a model. Now returns a JSON object with 3 fields: - - `result`: The data returned by the deployed Conduit. - - `run_id`: The ID of the run. - - `prediction_ids`: The prediction IDs of each prediction made. Used to track/update the predictions logged to monitoring tools. +- `result`: The data returned by the deployed Conduit. +- `run_id`: The ID of the run. +- `prediction_ids`: The prediction IDs of each prediction made. Used to track/update the predictions logged to monitoring tools. - Added pre-built deployment for the following LLMs: - - whisper-medium - - dreambooth +- whisper-medium +- dreambooth - Added webhook support for the following LLMs: - - dreambooth +- dreambooth ## 0.3.1 diff --git a/examples/langchain.mdx b/examples/langchain.mdx index 66ef21fe..1ec93318 100644 --- a/examples/langchain.mdx +++ b/examples/langchain.mdx @@ -18,7 +18,7 @@ First we create our project: cerebrium init langchain-QA ``` -We need certain Python packages in order to implement this project. Lets add those to our **_requirements.txt_** file: +We need certain Python packages to implement this project. Let's add those to our **_requirements.txt_** file: ``` pytube # For audio downloading @@ -100,7 +100,7 @@ def predict(item, run_id, logger): ### Langchain Implementation -Below, we will implement [Langchain](https://python.langchain.com/en/latest/index.html) to use a vectorstore, where we will store all our video segments above, with an LLM, locally hosted on Cerebrium, in order to generate answers. 
+Below, we will implement [Langchain](https://python.langchain.com/en/latest/index.html) to use a vectorstore, where we will store all our video segments above, with an LLM, locally hosted on Cerebrium, to generate answers. ```python from langchain.embeddings.openai import OpenAIEmbeddings diff --git a/examples/logo-controlnet.mdx b/examples/logo-controlnet.mdx index fcf3974d..d5ecf012 100644 --- a/examples/logo-controlnet.mdx +++ b/examples/logo-controlnet.mdx @@ -3,8 +3,8 @@ title: "ControlNet Generated Logo" description: "Generate a custom Logo using ControlNet" --- -In this tutorial, we will be using ControlNet Canny and SDXL, to alter an images of the HuggingFace logo in order to make it more appealing to the end user. SDXL is the -Stable Diffusion model released by Stability AI for high resolution image generation. ControlNet allows you to provide a image and to replace parts of the image keeping the +In this tutorial, we will be using ControlNet Canny and SDXL, to alter the images of the HuggingFace logo to make it more appealing to the end user. SDXL is the +Stable Diffusion model released by Stability AI for high-resolution image generation. ControlNet allows you to provide a image and to replace parts of the image keeping the original outlines of the image. To see the final implementation, you can view it [here](https://github.com/CerebriumAI/examples/tree/master/9-logo-controlnet) @@ -14,7 +14,7 @@ To see the final implementation, you can view it [here](https://github.com/Cereb It is important to think of the way you develop models using Cerebrium should be identical to developing on a virtual machine or Google Colab - so converting this should be very easy! Please make sure you have the Cerebrium package installed and have logged in. If not, please take a look at our docs [here](https://docs.cerebrium.ai/cerebrium/getting-started/installation) -First we create our project: +First, we create our project: ``` cerebrium init controlnet-logo @@ -62,8 +62,8 @@ Above, we import all the various Python libraries we require as well as use Pyda ## Instantiate model -Below we load in our ControlNet and SDXL models. This will download during your deployment however in subsequent deploys or inference requests it will be automatically cached in your persistent storage for subsequent use. You can read more about persistent storage [here]() -We do this outside of our **predict** function since we only want this code to run on a cold start (ie: on startup). If the container is already warm, we just want it to do inference and it will execute just the **predict** function. +Below, we load in our ControlNet and SDXL models. This will be downloaded during your deployment, however, in subsequent deploys or inference requests it will be automatically cached in your persistent storage for subsequent use. You can read more about persistent storage [here]() +We do this outside our **predict** function since we only want this code to run on a cold start (ie: on startup). If the container is already warm, we just want it to do inference and it will execute just the **predict** function. ```python controlnet = ControlNetModel.from_pretrained( @@ -82,7 +82,7 @@ pipe.enable_model_cpu_offload() ## Predict Function -Below we simply get the parameters from our request and pass it to the ControlNet model in order to generate the image(s). 
You will notice we convert the images to base64, this is so we can return it directly instead of writing the files to an S3 bucket - the return of the predict function needs to be JSON serializable. +Below we simply get the parameters from our request and pass it to the ControlNet model to generate the image(s). You will notice we convert the images to base64, this is so we can return it directly instead of writing the files to an S3 bucket - the return of the predict function needs to be JSON serializable. ```python def predict(item, run_id, logger): @@ -135,7 +135,7 @@ cooldown: 60 disable_animation: false ``` -To deploy the model use the following command: +To deploy the model, use the following command: ```bash cerebrium deploy controlnet-logo diff --git a/examples/mistral-vllm.mdx b/examples/mistral-vllm.mdx index ee0f0347..4b7e3d7b 100644 --- a/examples/mistral-vllm.mdx +++ b/examples/mistral-vllm.mdx @@ -18,7 +18,7 @@ First we create our project: cerebrium init mistral-vllm ``` -We need certain Python packages in order to implement this project. Lets add those to our **_requirements.txt_** file: +We need certain Python packages to implement this project. Lets add those to our **_requirements.txt_** file: ``` sentencepiece @@ -75,14 +75,14 @@ def predict(item, run_id, logger): ``` -We load the model in outside of the predict function. The reason for this is that the API request will run the predict function every time and we don't want to load our model in every request as that takes time. +We load the model outside the predict function. The reason for this is that the API request will run the predict function every time, and we don't want to load our model in every request as that takes time. The code outside the predict function will run on model startup ie: when the model is cold. The implementation in our **predict** function is pretty straight forward in that we pass input parameters from our request into the model and then generate outputs that we return to the user. ## Deploy -Your config.yaml file is where you can set your compute/environment. Please make sure that the hardware you specify is a AMPERE_A5000 and that you have enough memory (RAM) on your instance to run the models. You config.yaml file should look like: +Your config.yaml file is where you can set your compute/environment. Please make sure that the hardware you specify is an AMPERE_A5000, and that you have enough memory (RAM) on your instance to run the models. Your config.yaml file should look like: ``` %YAML 1.2 diff --git a/examples/sdxl.mdx b/examples/sdxl.mdx index 747b8cd4..65dc9b03 100644 --- a/examples/sdxl.mdx +++ b/examples/sdxl.mdx @@ -58,8 +58,8 @@ Above, we import all the various Python libraries we require as well as use Pyda ## Instantiate model -Below we load in our SDXL model. This will download during your deployment however in subsequent deploys or inference requests it will be automatically cached in your persistent storage for subsequent use. You can read more about persistent storage [here]() -We do this outside of our **predict** function since we only want this code to run on a cold start (ie: on startup). If the container is already warm, we just want it to do inference and it will execute just the **predict** function. +Below, we load in our SDXL model. This will be downloaded during your deployment, however, in subsequent deploys or inference requests it will be automatically cached in your persistent storage for subsequent use. 
You can read more about persistent storage [here]() +We do this outside our **predict** function since we only want this code to run on a cold start (ie: on startup). If the container is already warm, we just want it to do inference and it will execute just the **predict** function. ```python pipe = StableDiffusionXLImg2ImgPipeline.from_pretrained( @@ -70,7 +70,7 @@ pipe = pipe.to("cuda") ## Predict Function -Below we simply get the parameters from our request and pass it to the SDXL model in order to generate the image(s). You will notice we convert the images to base64, this is so we can return it directly instead of writing the files to an S3 bucket - the return of the predict function needs to be JSON serializable. +Below we simply get the parameters from our request and pass it to the SDXL model to generate the image(s). You will notice we convert the images to base64, this is so we can return it directly instead of writing the files to an S3 bucket - the return of the predict function needs to be JSON serializable. ```python def predict(item, run_id, logger): diff --git a/examples/segment_anything.mdx b/examples/segment_anything.mdx index 77d46077..dc5c639d 100644 --- a/examples/segment_anything.mdx +++ b/examples/segment_anything.mdx @@ -21,7 +21,7 @@ cerebrium init segment-anything It is important to think of the way you develop models using Cerebrium should be identical to developing on a virtual machine or Google Colab. -We need certain Python packages in order to implement this project. Lets add those to our **_requirements.txt_** file: +We need certain Python packages to implement this project. Lets add those to our **_requirements.txt_** file: ``` @@ -86,7 +86,7 @@ With the SAM model, you are able to set parameters about how precise or how deta clothes on an image, we don't want the segmentation to be that granular. You can look from Meta's documentation the different parameters you can set. In the documentation, we need to select a model type and a checkpoint. Meta recommends three — We will just use the default. I then download the ViT-H SAM model -checkpoint. You can simply add this checkpoint file to the same directory as your **main.py** and when you deploy your model, Cerebrium will automatically upload it. +checkpoint. You can add this checkpoint file to the same directory as your **main.py** and when you deploy your model, Cerebrium will automatically upload it. I then would like the user to send in a few parameters for the model via API, and so I need to define what the request object would look like: @@ -101,8 +101,8 @@ class Item(BaseModel): ``` -Pydantic is a data validation library and BaseModel is where Cerebrium keeps some default parameters like "webhook_url" that allows a user to send in a webhook url, -and we will call it when the job has finished processing — this is useful for long-running tasks. Do not worry about that functionality for this tutorial. The reason +Pydantic is a data validation library and BaseModel is where Cerebrium keeps some default parameters like "webhook_url" that allows a user to send in a webhook url. +We will call it when the job has finished processing — this is useful for long-running tasks. Do not worry about that functionality for this tutorial. The reason the user is sending in an image or a file url is giving the user a choice to send in a base64 encoded image or a publicly accessible file_url we can download. 
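To make the either/or input explicit, here is a minimal sketch of how the two fields could be resolved before running the model. The helper name and error handling are illustrative and not part of the tutorial code; only the `image` and `file_url` fields come from the request object above:

```python
import base64

import requests


def load_image_bytes(item) -> bytes:
    """Return raw image bytes from whichever field the caller supplied.

    `item.file_url` is a publicly accessible URL we can download, while
    `item.image` is a base64-encoded image sent directly in the request.
    """
    if item.file_url:
        return requests.get(item.file_url, timeout=30).content
    if item.image:
        return base64.b64decode(item.image)
    raise ValueError("Provide either a base64 `image` or a `file_url`.")
```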
## Identifying and classifying objects diff --git a/examples/streaming-falcon-7B.mdx b/examples/streaming-falcon-7B.mdx index 2c41e64d..42868cff 100644 --- a/examples/streaming-falcon-7B.mdx +++ b/examples/streaming-falcon-7B.mdx @@ -3,7 +3,7 @@ title: "Streaming Output - Falcon 7B" description: "Stream outputs live from Falcon 7B using SSE" --- -In this tutorial, we will show you how to implement streaming in order to return results to your users as soon as possible with the use of SSE. +In this tutorial, we will show you how to implement streaming to return results to your users as soon as possible with the use of SSE. To see the final implementation, you can view it [here](https://github.com/CerebriumAI/examples/tree/master/7-streaming-endpoint) @@ -74,12 +74,12 @@ model = AutoModelForCausalLM.from_pretrained( ) ``` -In the above we simply import the required packages and instantiate the tokenizer and model. We do this outside the **predict** function so we don't load our model weights +In the above, we simply import the required packages and instantiate the tokenizer and model. We do this outside the **predict** function, so we don't load our model weights onto the GPU with every request but rather only on model startup. ## Streaming Implementation -Below we define our predict function which will be responsible for our logic in order to stream results back from our endpoint. +Below, we define our predict function, which will be responsible for our logic to stream results back from our endpoint. ```python def predict(item, run_id, logger): @@ -111,8 +111,8 @@ def predict(item, run_id, logger): ``` -Above we receive our inputs from the request item we defined. We then implement a TextIteratorStreamer in order to stream output from the model as its ready. Lastly and most -importantly we use the **yield** keyword in order to return output from our model as its generated. +Above, we receive our inputs from the request item we defined. We then implement a TextIteratorStreamer to stream output from the model as it's ready. Lastly and most +importantly, we use the **yield** keyword to return output from our model as its generated. ## Deploy diff --git a/examples/transcribe-whisper.mdx b/examples/transcribe-whisper.mdx index 05a167c2..dac52473 100644 --- a/examples/transcribe-whisper.mdx +++ b/examples/transcribe-whisper.mdx @@ -86,12 +86,12 @@ class Item(BaseModel): ``` Above, we use Pydantic as our data validation library. Due to the way that we have defined the Base Model, "audio" and "file_url" are optional parameters but we must do a check to make sure we are given the one or the other. The webhook_endpoint parameter is something Cerebrium automatically includes in every request and can be used for long running requests. -Currently Cerebrium has a max timeout of 3 minutes for each inference request. For long audio files (2 hours) which take a couple minutes to process it would be best to use a **webhook_endpoint** which is a url we will make a **POST** request to with the results of your function. +Currently, Cerebrium has a max timeout of 3 minutes for each inference request. For long audio files (2 hours) which take a couple minutes to process it would be best to use a **webhook_endpoint** which is a url we will make a **POST** request to with the results of your function. ## Setup Model and inference -Below we import the required packages and load in our Whisper model. 
This will download during your deployment however in subsequent deploys or inference requests it will be automatically cached in your persistent storage for subsequent use. You can read more about persistent storage [here]() -We do this outside of our **predict** function since we only want this code to run on a cold start (ie: on startup). If the container is already warm, we just want it to do inference and it will execute just the **predict** function. +Below, we import the required packages and load in our Whisper model. This will download during your deployment however in subsequent deploys or inference requests it will be automatically cached in your persistent storage for subsequent use. You can read more about persistent storage [here]() +We do this outside our **predict** function since we only want this code to run on a cold start (ie: on startup). If the container is already warm, we just want it to do inference and it will execute just the **predict** function. ```python from huggingface_hub import hf_hub_download diff --git a/reliability-scalability.mdx b/reliability-scalability.mdx index 5b22f2ca..8607e382 100644 --- a/reliability-scalability.mdx +++ b/reliability-scalability.mdx @@ -3,21 +3,21 @@ title: "Reliability and Scalability" description: "" --- -As engineers we know how important it is for platforms providing infrastructure to remain reliable and scalable when running our clients workloads. This is something that we don't take lightly at Cerebrium and so we have implemented several processes and tooling in order to manage this as effectively as possible. +As engineers, we know how important it is for platforms providing infrastructure to remain reliable and scalable when running our clients' workloads. This is something that we don't take lightly at Cerebrium, and so we have implemented several processes and tooling to manage this as effectively as possible. ## How does Cerebrium achieve reliability and scalability? - **Automatic scaling:** -Cerebrium automatically scales instances based on the number of events in the queue and the time events are in the queue. -This ensures that Cerebrium can handle even the most demanding workloads. If one of either these two conditions are met, additional workers are spun up in < 3 seconds -in order to handle the volume. Once workers fall below a certain utilization level, we start decreasing the number of workers. +Cerebrium automatically scales instances based on the number of events in the queue, and the time events are in the queue. +This ensures that Cerebrium can handle even the most demanding workloads. If one of either of these two conditions are met, additional workers are spun up in < 3 seconds +to handle the volume. Once workers fall below a certain utilization level, we start decreasing the number of workers. -In terms of scale we are able to handle, Cerebrium has customers running at 120 transactions per second but we can do more than that:) +In terms of the scale we are able to handle, Cerebrium has customers running at 120 transactions per second, but we can do more than that :) - **Fault tolerance and High availability:** -If an instance heads into a bad state due to memory or processing issues, Cerebrium automatically restarts a new instance to handle incoming load. +If an instance heads into a bad state due to memory or processing issues, Cerebrium automatically restarts a new instance to handle the incoming load. This ensures that Cerebrium can continue to process events without user intervention. 
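As a toy illustration only - this is not Cerebrium's actual implementation, and the thresholds are invented - the scaling behaviour described above amounts to a decision rule along these lines:

```python
def desired_worker_change(queue_length: int, oldest_wait_seconds: float, utilization: float) -> int:
    """Toy sketch of the scaling rule described in this section.

    Scale up when the queue is backing up or events have been waiting too long;
    scale down once workers fall below a utilization threshold.
    """
    QUEUE_THRESHOLD = 10     # illustrative values, not Cerebrium's real settings
    MAX_WAIT_SECONDS = 1.0
    MIN_UTILIZATION = 0.3

    if queue_length > QUEUE_THRESHOLD or oldest_wait_seconds > MAX_WAIT_SECONDS:
        return 1   # spin up an additional worker (happens in < 3 seconds)
    if utilization < MIN_UTILIZATION:
        return -1  # start decreasing the number of workers
    return 0
```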
If you would like to be notified via email of any problems with your model, you can toggle a switch in the top right corner of your model page - it's right above your model stats.