RuntimeError: element 0 of tensors.. OpenCLIP model #2200

Open

EngEmmanuel opened this issue Nov 5, 2024 · 4 comments

Comments

@EngEmmanuel

System Info

peft = 0.13.2
python = 3.12.7
transformers = 4.45.2

Who can help?

@sayakpaul

I am using inject_adapter_in_model(...) to fine-tune an OpenCLIP model with LoRA layers. Fine-tuning works as expected when I modify Linear() layers and other supported types. However, the model I am currently training has an attention module called "out_proj" whose layer type is NonDynamicallyQuantizableLinear(Linear). I may be mistaken, but from my understanding of the source code for NonDynamicallyQuantizableLinear (https://github.com/pytorch/pytorch/blob/main/torch/nn/modules/linear.py#L136), I should be able to treat it as an ordinary torch.nn.Linear layer for my purposes. However, I always get the following error: "RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn". The LoRA layers themselves are added as expected:

  (transformer): Transformer(
    (resblocks): ModuleList(
      (0-11): 12 x ResidualAttentionBlock(
        (ln_1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (attn): MultiheadAttention(
          (out_proj): lora.Linear(
            (base_layer): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
            (lora_dropout): ModuleDict(
              (default): Dropout(p=0.1, inplace=False)
            )
            (lora_A): ModuleDict(
              (default): Linear(in_features=512, out_features=32, bias=False)
            )
            (lora_B): ModuleDict(
              (default): Linear(in_features=32, out_features=512, bias=False)
            )
            (lora_embedding_A): ParameterDict()
            (lora_embedding_B): ParameterDict()
            (lora_magnitude_vector): ModuleDict()
          )
        )
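
As a quick sanity check on the assumption that NonDynamicallyQuantizableLinear behaves like a plain nn.Linear (a standalone sketch, separate from my training code):

    import torch.nn as nn
    from torch.nn.modules.linear import NonDynamicallyQuantizableLinear

    # The subclass only exists to dodge a quantization scripting edge case;
    # it defines no forward of its own.
    print(issubclass(NonDynamicallyQuantizableLinear, nn.Linear))        # True
    print(NonDynamicallyQuantizableLinear.forward is nn.Linear.forward)  # True
    layer = NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
    print(layer.weight.shape)                                            # torch.Size([512, 512])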

The layers are successfully added when I target out_proj via target_modules, and also if I use register_custom_modules with the mapping torch.nn.modules.linear.NonDynamicallyQuantizableLinear -> peft.tuners.lora.layer.Linear. However, neither case trains. Furthermore, the model does train when I include any other layer, e.g. a fully-connected one of type torch.nn.Linear. Results for different target_modules values:

  • [out_proj] doesn't train
  • [fc1] trains
  • [out_proj, fc1] trains (I have to conclude that the attention layers aren't actually being trained in this case, so this config is effectively equivalent to the one immediately above)

Any idea why this may be the case? Your help would be truly appreciated.
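
For reference, a rough connectivity check along these lines illustrates the failure (a sketch; train_loader here is just a placeholder for a DataLoader):

    # List which parameters are trainable, run one forward pass, and check
    # whether the output is attached to an autograd graph at all.
    lora_model.train()
    for name, p in lora_model.named_parameters():
        if p.requires_grad:
            print("trainable:", name)

    images, tokenized_texts = next(iter(train_loader))
    image_features, text_features, scale_exp = lora_model(images, tokenized_texts)
    # If this prints None, no trainable parameter was used in the forward pass,
    # which is exactly what triggers "element 0 of tensors does not require grad".
    print("grad_fn:", image_features.grad_fn)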

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder
  • My own task or dataset (give details below)

Reproduction

Train step:

def train_step(model, batch, loss_fn, device, trainer_cfg):
    images, tokenized_texts = batch
    images, tokenized_texts = images.to(device), tokenized_texts.to(device)

    # Forward pass: Get embeddings for images and texts # (bs, 512), (bs, 512), scalar
    image_features, text_features, scale_exp = model(images, tokenized_texts)

    # Compute logits as dot products between image and text features # (bs, bs)
    logits_per_image = (image_features @ text_features.T) / scale_exp
    logits_per_text = logits_per_image.T

    # Create labels (diagonal is the correct match)
    labels = torch.arange(images.shape[0], device=device)

    # Compute loss (bs,)
    loss = (loss_fn(logits_per_image, labels) +
            loss_fn(logits_per_text, labels)) / 2

    # If gradient accumulation is used, normalize the loss
    accumulation_steps = trainer_cfg.get('accumulation_steps', 1)
    if accumulation_steps > 1:
        loss = loss / accumulation_steps

    # Backward pass
    loss.backward()

    # Apply gradient clipping if necessary
    max_grad_norm = trainer_cfg.get('max_grad_norm', None)
    if max_grad_norm is not None:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)

    return loss.item()
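
For context, train_step is driven by an outer loop roughly like the following (a simplified sketch; the optimizer choice and learning rate here are illustrative, not the exact training code):

    import torch

    optimizer = torch.optim.AdamW(
        (p for p in model.parameters() if p.requires_grad), lr=1e-4
    )
    accumulation_steps = trainer_cfg.get('accumulation_steps', 1)

    model.train()
    for step, batch in enumerate(train_loader, start=1):
        # train_step() already calls loss.backward(), so the loop only steps
        # the optimizer (respecting gradient accumulation) and zeroes the grads.
        loss = train_step(model, batch, loss_fn, device, trainer_cfg)
        if step % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()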

Model structure near a layer of interest:
('transformer.resblocks.11.attn', <class 'torch.nn.modules.activation.MultiheadAttention'>)
('transformer.resblocks.11.attn.out_proj', <class 'torch.nn.modules.linear.NonDynamicallyQuantizableLinear'>)
('transformer.resblocks.11.ls_1', <class 'torch.nn.modules.linear.Identity'>)

Injection code:

    model_path = cfg.training.model_name
    # Load pretrained model, tokenizer, and image processor
    model, preprocess_train, preprocess_val = create_model_and_transforms(model_path)
    tokenizer = get_tokenizer(model_path)

    print("Before adapting..")
    total_params, trainable_params, trainable_percent = count_parameters(model)

    lora_config = LoraConfig(**cfg.lora_config.kwargs)
    # kwargs:
    #    target_modules: ["out_proj"]
    #    r: 32  # Rank of the LoRA low-rank matrices
    #    lora_alpha: 32  # Scaling factor for LoRA updates
    #    lora_dropout: 0.1  # Dropout for LoRA layers to avoid overfitting
    #    bias: 'none'  # Whether to use bias in LoRA layers ['none', 'all', 'lora_only']

    lora_model = inject_adapter_in_model(lora_config, model)
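
count_parameters is a small helper that tallies total vs. trainable parameters; something along these lines:

    def count_parameters(model):
        # Total parameters vs. those with requires_grad=True (i.e. the LoRA weights)
        total = sum(p.numel() for p in model.parameters())
        trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
        return total, trainable, 100.0 * trainable / total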

Expected behavior

I would expect it to begin training. Here are the first few print-outs of a typical run:

[2024-11-05 20:01:55,364][utils.loggers][INFO] - Total Parameters: 154,406,529
[2024-11-05 20:01:55,366][utils.loggers][INFO] - Trainable Parameters: 2,887,680 (1.87%)
[2024-11-05 20:01:55,371][utils.loggers][INFO] - Patience scaled to 10 validation steps
Epoch 1/105: 0it [00:00, ?it/s]c:\Users\spet4299\Anaconda3\envs\tee_clip_env\Lib\site-packages\torch\nn\functional.py:5476: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ..\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:263.)
  attn_output = scaled_dot_product_attention(q, k, v, attn_mask, dropout_p, is_causal)
[2024-11-05 20:01:57,719][utils.loggers][INFO] - Epoch 1 | step 1 | Train Loss = 2.7726
Epoch 1/105: 1it [00:02,  2.34s/it][2024-11-05 20:01:58,755][utils.loggers][INFO] - Epoch 1 | step 2 | Train Loss = 2.7726
Epoch 1/105: 2it [00:03,  1.57s/it][2024-11-05 20:01:59,659][utils.loggers][INFO] - Epoch 1 | step 3 | Train Loss = 2.7726
@EngEmmanuel
Author

Update: I've found some discussion of a similar issue.

#761 discusses this, and #1324 aims to fix it.

@BenjaminBossan
Member

Indeed, for MultiheadAttention, we have to jump through some hoops to make it work. Hopefully that PR can be merged soon, but there might still be some edge cases we haven't accounted for. If you can give that branch a try and report back if it worked for you, that would help us determine if the branch works correctly.

@EngEmmanuel
Author

EngEmmanuel commented Nov 7, 2024

Thanks for your reply and your work on this library. Below are the results I am seeing for different inputs to the target_modules arg:

main branch:
"out_proj": the RuntimeError from the original problem;
len([m for m in peft_model.modules() if isinstance(m, PeftMha)]) equates to 0

"attn": ValueError: Target module MultiheadAttention( (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True) ) is not supported. Currently, only the following modules are supported: `torch.nn.Linear`, `torch.nn.Embedding`, `torch.nn.Conv2d`, `transformers.pytorch_utils.Conv1D`.

your branch:
"out_proj": the RuntimeError from the original problem;
len([m for m in peft_model.modules() if isinstance(m, PeftMha)]) equates to 12

"attn": Trains! (Thank you)

Q1) So that I understand the consequences for my purposes, could I please clarify a few things?
Looking at my model print in the original issue, am I right in thinking that "out_proj" corresponds to the W_o matrix and "attn" corresponds to W_{q,k,v,o}? If so, the above result would mean I basically have to apply LoRA to all the attention weight matrices, instead of having the option to select just the query matrix (W_q), for example?
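
For concreteness, inspecting a plain nn.MultiheadAttention of the same width shows how the projections are laid out (a generic check, not my actual model):

    import torch.nn as nn

    mha = nn.MultiheadAttention(embed_dim=512, num_heads=8)
    # W_q, W_k and W_v are packed into a single (3*512, 512) parameter...
    print(mha.in_proj_weight.shape)      # torch.Size([1536, 512])
    # ...while W_o lives in out_proj, a NonDynamicallyQuantizableLinear
    print(type(mha.out_proj).__name__)   # NonDynamicallyQuantizableLinear
    print(mha.out_proj.weight.shape)     # torch.Size([512, 512])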

Q2) What's the main difference between inject_adapter_in_model(...) and get_peft_model(...)?
I just want a model that has extra LoRA layers added and that I can then immediately start training. Is there a perk to using one or the other for my purpose?

@BenjaminBossan
Member

To your questions:

  1. Yes, when the module packs all the attention projections into a single parameter, we cannot currently choose to target just the query projection, for instance. Typically, this does not decrease performance, though.
  2. inject_adapter_in_model is a low level API that adds the PEFT layers to the model but leaves the model as is otherwise. For get_peft_model, you get a PeftModel instance back that wraps the original model. This PeftModel has a bunch of convenience methods you will most likely want to use at a later point, like merging the layers. If you're sure you don't need those, you can use inject_adapter_in_model instead but I would recommend get_peft_model.
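
In code, the two entry points look roughly like this (a sketch; target_modules=["attn"] assumes the MultiheadAttention support from the branch discussed above):

    from peft import LoraConfig, get_peft_model, inject_adapter_in_model

    # model: the OpenCLIP model created earlier
    lora_config = LoraConfig(target_modules=["attn"], r=32, lora_alpha=32, lora_dropout=0.1)

    use_wrapper = True  # pick one of the two entry points for a given model

    if use_wrapper:
        # get_peft_model returns a PeftModel wrapping the original model, with
        # convenience methods such as save_pretrained(), merge_and_unload()
        # and print_trainable_parameters().
        model = get_peft_model(model, lora_config)
        model.print_trainable_parameters()
    else:
        # inject_adapter_in_model adds the LoRA layers in place and returns the
        # otherwise unchanged model -- no wrapper, no convenience methods.
        model = inject_adapter_in_model(lora_config, model)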
