Working with 8bit and 4bit quantized models #19
Hi @jordancole21,
The idea there is that we duplicate the model before overriding its functions to inject Unlimiformer.
However, if you manage to get the following to work and submit a PR, that would be great:
I'm not sure if that should be the exact code, but that's the idea. Can you try and let me know if it works?
Thanks,
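(The snippet originally suggested in this comment was not preserved in the quote above. A minimal sketch of the duplication it refers to is below; the first two lines are what `convert_model` currently does, as visible in the traceback at the bottom of this issue, and the deep copy is one guessed alternative, not the maintainer's verbatim suggestion. `model` is assumed to be the already-loaded seq2seq model.)

```python
import copy
from transformers import AutoModelForSeq2SeqLM

# Current behavior of convert_model: rebuild an empty model from the config,
# then copy the weights over from the loaded model.
model_clone = AutoModelForSeq2SeqLM.from_config(model.config)
model_clone.load_state_dict(model.state_dict())

# One guessed alternative (illustrative only): duplicate the loaded model object
# directly, so whatever settings it was loaded with travel along with the copy.
model_clone = copy.deepcopy(model)
```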
Sorry, I checked and my previous suggestion doesn't work. Do you have any idea of how to duplicate the model object, including its quantization settings? If not, I'd recommend just using
Best,
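(A hedged aside, not from the thread: one blunt way to get a second copy of a quantized model with its quantization settings intact is to load the checkpoint a second time with the same quantization config, instead of copying tensors. It costs a second load and extra memory, but it sidesteps the state_dict shape mismatch. The checkpoint name below is only an example.)

```python
from transformers import AutoModelForSeq2SeqLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(load_in_4bit=True)

# Load the same checkpoint twice with identical quantization settings;
# both objects end up quantized the same way.
model = AutoModelForSeq2SeqLM.from_pretrained(
    "lmsys/fastchat-t5-3b-v1.0",
    quantization_config=quant_config,
    device_map="auto",
)
model_clone = AutoModelForSeq2SeqLM.from_pretrained(
    "lmsys/fastchat-t5-3b-v1.0",
    quantization_config=quant_config,
    device_map="auto",
)
```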
Thank you so much for the quick reply! Ok, so I tried that, and it looks like it gets through the `model = Unlimiformer.convert_model(model)` code without any issue, but when I run inference with the model (lmsys/fastchat-t5-3b-v1.0) it gives me CUDA errors, even though it's only a 3-billion-parameter model running in 4-bit on an A100-40G.

Edit: For context, I'm passing about 8k tokens into the input. And these are my Unlimiformer and generate arguments:

Unlimiformer:

Generate:

Any ideas on why I would still be getting a CUDA memory error?
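(The exact Unlimiformer and generate argument values from this comment were not preserved. A hedged sketch of the flow being described, with illustrative values only, is below; it assumes `src/unlimiformer.py` is importable, and `convert_model` may take additional Unlimiformer-specific keyword arguments not shown here.)

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from unlimiformer import Unlimiformer

model_name = "lmsys/fastchat-t5-3b-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 4-bit load on a single GPU (illustrative; the exact loading arguments used
# in this thread were not preserved).
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name,
    load_in_4bit=True,
    device_map="auto",
)

# Inject Unlimiformer into the loaded model.
model = Unlimiformer.convert_model(model)

long_document = "..."  # placeholder for the ~8k-token input text
inputs = tokenizer(long_document, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```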
What kind of CUDA errors? Out of memory?
Yes, sorry, just out-of-memory errors:
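(A hedged aside, not from the thread: when an out-of-memory error like this appears, plain PyTorch utilities can show how much of the A100's 40 GB is actually in use at the point of failure, which helps distinguish a genuinely oversized allocation from fragmentation.)

```python
import torch

# Generic CUDA memory readout; nothing here is Unlimiformer-specific.
print(f"allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"reserved:  {torch.cuda.memory_reserved() / 1e9:.2f} GB")
print(torch.cuda.memory_summary(abbreviated=True))
```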
Hi @jordancole21,
I'm not sure whether the 4-bit is the problem or it's something else.
Oops, sorry, I should have sent the full thing earlier. Here's the full traceback:

I'm thinking it may be failing at the encoding of the input? lol, this is all still a little over my head honestly
I can't tell which step it is in, because
Hm, yeah, even when I tried just the 4-bit model without Unlimiformer it also ran into CUDA memory issues, and when I tried the model with full weights + Unlimiformer it still gave me a CUDA memory error. But it does seem to work without 4-bit on this smaller model with longformer: MBZUAI/LaMini-T5-738M.

Also, if it helps, this is the Colab notebook I'm working in: https://colab.research.google.com/drive/1U1Pt6-htLzQ5gQdMBl3ZMkDXi9phzsnO?usp=sharing
Thanks @jordancole21,
Best,
Hey! Great work on this project! I got it to work on a couple of T5 instruction-tuned models from Hugging Face. I was just curious, has anyone been able to get the code to work with quantized models? Currently, when I set `load_in_4bit=True`, I get this error:
```
Traceback (most recent call last):
  in <cell line: 1>:1

  /content/unlimiformer/src/unlimiformer.py:707 in convert_model
      704     @classmethod
      705     def convert_model(cls, model, *args, **kwargs):
      706         model_clone = AutoModelForSeq2SeqLM.from_config(model.config)
    ❱ 707         model_clone.load_state_dict(model.state_dict())
      708         type_to_class = {
      709             BartModel: UnlimiformerBART,
      710             BartForConditionalGeneration: UnlimiformerBART,

  /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:2041 in load_state_dict
     2038                         ', '.join('"{}"'.format(k) for k in missing_keys)))
     2039
     2040         if len(error_msgs) > 0:
   ❱ 2041             raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
     2042                                 self.__class__.__name__, "\n\t".join(error_msgs)))
     2043         return _IncompatibleKeys(missing_keys, unexpected_keys)
     2044

RuntimeError: Error(s) in loading state_dict for T5ForConditionalGeneration:
    size mismatch for encoder.block.0.layer.0.SelfAttention.q.weight: copying a param with shape torch.Size([524288, 1]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
    size mismatch for encoder.block.0.layer.0.SelfAttention.k.weight: copying a param with shape torch.Size([524288, 1]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
```
Does anyone have any solutions to this?
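(A hedged aside on why the shapes disagree: when a model is loaded with `load_in_4bit=True`, bitsandbytes stores each linear weight as a packed tensor, so `model.state_dict()` no longer matches the full-precision layout that `from_config` + `load_state_dict` expects inside `convert_model`. The snippet below, with an illustrative checkpoint name, just prints the two shapes side by side.)

```python
from transformers import AutoConfig, AutoModelForSeq2SeqLM

name = "lmsys/fastchat-t5-3b-v1.0"  # illustrative; any T5-style checkpoint behaves the same

# The quantized model holds packed bitsandbytes weights; the fresh model built
# from the config holds ordinary full-precision weights.
quantized = AutoModelForSeq2SeqLM.from_pretrained(name, load_in_4bit=True, device_map="auto")
fresh = AutoModelForSeq2SeqLM.from_config(AutoConfig.from_pretrained(name))

key = "encoder.block.0.layer.0.SelfAttention.q.weight"
print(quantized.state_dict()[key].shape)  # packed 1-D bitsandbytes layout
print(fresh.state_dict()[key].shape)      # 2-D full-precision layout
```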