This document describes serving large Hugging Face models with limited resources using accelerate. This option can be activated with low_cpu_mem_usage=True. The model is first created on the meta device (with empty weights), and the state dict is then loaded into it (shard by shard in the case of a sharded checkpoint).
Log in to the Hugging Face Hub with your token by running the command below:
huggingface-cli login
When prompted, paste the token generated on the Hugging Face Hub. Then download the model:
python Download_model.py --model_name bigscience/bloom-7b1
The script prints the path where the model is downloaded, as shown below:
model/models--bigscience-bloom-7b1/snapshots/5546055f03398095e385d7dc625e636cc8910bf2/
The downloaded model is around 14GB.
NOTE: Install the zip CLI tool before the next step.
Navigate to the path printed by the script above. In this example it is:
cd model/models--bigscience-bloom-7b1/snapshots/5546055f03398095e385d7dc625e636cc8910bf2/
zip -r /home/ubuntu/serve/examples/Huggingface_Largemodels/model.zip *
cd -
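If the zip CLI is unavailable, the same archive can be produced with Python's standard-library zipfile module. This is a stdlib-only sketch; zip_snapshot is a hypothetical helper name, and the archive entries are made relative to the snapshot directory to match running zip -r inside it.

```python
# Zip the contents of a model snapshot directory into a single archive,
# storing paths relative to the snapshot root (like `zip -r dest.zip *`).
import os
import zipfile

def zip_snapshot(src_dir: str, dest_zip: str) -> None:
    with zipfile.ZipFile(dest_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(src_dir):
            for name in files:
                path = os.path.join(root, name)
                # arcname strips the src_dir prefix so the archive has no
                # leading directory component.
                zf.write(path, arcname=os.path.relpath(path, src_dir))
```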
Navigate up to the Huggingface_Largemodels directory.
torch-model-archiver --model-name bloom --version 1.0 --handler custom_handler.py --extra-files model.zip,setup_config.json -r requirements.txt
Note: Modifying setup_config.json
- Enable low_cpu_mem_usage to use accelerate (recommended).
- max_memory in setup_config.json is the maximum size of each shard.
- Refer to https://huggingface.co/docs/transformers/main_classes/model#large-model-loading
mkdir model_store
mv bloom.mar model_store
Update config.properties and start TorchServe:
torchserve --start --ncs --ts-config config.properties --disable-token-auth --enable-model-api
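For reference, a minimal config.properties for this example might look like the following; all keys are standard TorchServe configuration options, but the exact values here are assumptions for this setup.

```properties
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
model_store=model_store
load_models=bloom.mar
# Install the per-model pip dependencies bundled via requirements.txt in the .mar
install_py_dep_per_model=true
```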
curl -v "http://localhost:8080/predictions/bloom" -T sample_text.txt
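The same request can be made from Python. This is a small sketch assuming the requests package and a running TorchServe instance serving the bloom model on localhost; predict is a hypothetical helper name.

```python
# POST the contents of a text file to the TorchServe inference endpoint
# and return the generated reply as a string.
import requests

def predict(text_file: str,
            url: str = "http://localhost:8080/predictions/bloom") -> str:
    with open(text_file, "rb") as f:
        payload = f.read()
    return requests.post(url, data=payload).text

# Example usage (requires TorchServe to be running):
# print(predict("sample_text.txt"))
```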