Loading large Hugging Face models with constrained resources using accelerate

This document describes how to serve large Hugging Face (HF) models with limited resources using accelerate. The feature is enabled by passing low_cpu_mem_usage=True when the model is loaded. The model is first created on the meta device (with empty weights), and the state dict is then loaded into it (shard by shard in the case of a sharded checkpoint).
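
As a rough illustration of the loading option described above, the sketch below passes low_cpu_mem_usage=True to from_pretrained. The helper names (build_load_kwargs, load_model) and the device_map="auto" setting are assumptions made for this example and are not taken from the handler used here. Both transformers and accelerate must be installed for load_model to run.

```python
def build_load_kwargs(low_cpu_mem_usage=True, device_map="auto"):
    """Collect keyword arguments for from_pretrained (hypothetical helper)."""
    # low_cpu_mem_usage=True asks transformers to build the model with empty
    # weights first and then load the checkpoint into it shard by shard.
    kwargs = {"low_cpu_mem_usage": low_cpu_mem_usage}
    # device_map="auto" is an assumption for this sketch; it lets accelerate
    # decide where each layer is placed. Pass None to leave it unset.
    if device_map is not None:
        kwargs["device_map"] = device_map
    return kwargs


def load_model(model_path, **overrides):
    """Load a causal language model from a local path or Hub repo id."""
    # Imported lazily so the module can be imported without transformers.
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(
        model_path, **build_load_kwargs(**overrides)
    )

# Example call (loads the full model, so it is left commented out):
# model = load_model("bigscience/bloom-7b1")
```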

Step 1: Download model

Log in to the Hugging Face Hub with your token by running the command below:

huggingface-cli login

Paste the token generated from the Hugging Face Hub, then run the download script:

python --model_name bigscience/bloom-7b1

The script prints the path where the model was downloaded.


The downloaded model is around 14GB.
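
One way a download step like this could be written is with snapshot_download from the huggingface_hub package. This is only a sketch and not necessarily what the example's script does. The function names, the --cache_dir flag, and its default of "model" (chosen to match the path shown in Step 2) are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line; --model_name mirrors the flag used above."""
    parser = argparse.ArgumentParser(description="Download a model snapshot")
    parser.add_argument("--model_name", required=True,
                        help="Hub repo id, e.g. bigscience/bloom-7b1")
    # Assumed flag: directory that will hold the downloaded snapshot.
    parser.add_argument("--cache_dir", default="model")
    return parser.parse_args(argv)


def download(model_name, cache_dir="model"):
    """Download every file of the repo and return the local snapshot path."""
    # Imported lazily so argument parsing works without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=model_name, cache_dir=cache_dir)

# Example call (needs network access and a prior `huggingface-cli login`):
# args = parse_args()
# print(download(args.model_name, args.cache_dir))
```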

Step 2: Compress downloaded model

NOTE: Install the zip CLI tool first.

Navigate to the path printed by the script above. In this example it is:

cd model/models--bigscience-bloom-7b1/snapshots/5546055f03398095e385d7dc625e636cc8910bf2/
zip -r /home/ubuntu/serve/examples/Huggingface_Largemodels// *
cd -
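
If the zip CLI is not available, the same compression step can be approximated with Python's standard zipfile module. The sketch below is an alternative to the commands above, not part of the original example. The function name zip_directory_contents is hypothetical, and the archive path is left as a parameter.

```python
import zipfile
from pathlib import Path


def zip_directory_contents(src_dir, zip_path):
    """Archive every file under src_dir, storing paths relative to src_dir."""
    src = Path(src_dir)
    target = Path(zip_path).resolve()
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(src.rglob("*")):
            # is_file() follows symlinks, so linked files in a snapshot
            # directory are stored with their real contents.
            if not path.is_file():
                continue
            # Do not add the archive to itself if it lives inside src_dir.
            if path.resolve() == target:
                continue
            zf.write(path, path.relative_to(src).as_posix())
    return zip_path

# Example call, with both paths supplied by the caller:
# zip_directory_contents(snapshot_dir, archive_path)
```

Storing paths relative to the snapshot directory mirrors running zip from inside that directory, so the archive has no leading folder.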

Step 3: Generate MAR file

Navigate back up to the Huggingface_Largemodels directory.

torch-model-archiver --model-name bloom --version 1.0 --handler --extra-files,setup_config.json -r requirements.txt

Note: Modifying setup_config.json

Step 4: Add the MAR file to the model store

mkdir model_store
mv bloom.mar model_store

Step 5: Start torchserve

Update and start torchserve

torchserve --start --ncs --ts-config --disable-token-auth --enable-model-api

Step 6: Run inference

curl -v "http://localhost:8080/predictions/bloom" -T sample_text.txt
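
The same request can be sent from Python using only the standard library. The sketch below mirrors the curl command above, which uploads the file with PUT. The function names and the timeout value are assumptions made for illustration.

```python
import urllib.request


def build_prediction_request(text, model_name="bloom",
                             host="http://localhost:8080"):
    """Build a PUT request for the predictions endpoint (like curl -T)."""
    url = f"{host}/predictions/{model_name}"
    return urllib.request.Request(url, data=text.encode("utf-8"), method="PUT")


def predict(text, timeout=300, **kwargs):
    """Send the prompt to a running TorchServe instance and return the reply."""
    request = build_prediction_request(text, **kwargs)
    # Generation with a large model can be slow, hence the generous timeout.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")

# Example call (requires the server started in the previous step):
# with open("sample_text.txt", encoding="utf-8") as f:
#     print(predict(f.read()))
```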