-
Hi @AMasetti, the reason this happens is that when you load it back this way, you load it in fp32 instead of in 4-bit quantized form, which is the only version that will fit on a single T4 for inference. You can fix this by modifying your script as follows:

import os

from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

os.environ['HUGGING_FACE_HUB_TOKEN'] = '<your_hf_token>'

# Same 4-bit (NF4) quantization settings used for QLoRA training.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype="float16",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

# Load the base model in 4-bit, then attach the trained LoRA adapter on top.
config = PeftConfig.from_pretrained("arnavgrg/codealpaca-qlora")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", quantization_config=bnb_config)
model = PeftModel.from_pretrained(model, "arnavgrg/codealpaca-qlora")

Note that you may need to install the following packages:
!pip install transformers --quiet
!pip install bitsandbytes --quiet
!pip install accelerate --quiet
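
Once the adapter is loaded, a short generation call like the sketch below should run within the T4's memory. The tokenizer name, prompt, and generation settings here are assumptions for illustration, not something taken from the training colab.

from transformers import AutoTokenizer

# Assumption: the adapter was trained against the base Llama-2-7b tokenizer.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

prompt = "Write a Python function that reverses a string."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate with the 4-bit base model plus the LoRA adapter loaded above.
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))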
-
After following the colab for training the Llama 2 7B model, I'm left with the adapter_model.bin file. I'd like to load that back into Google Colab for further testing, but in a new session. The provided code crashes the Colab.