
LoRA + DeBERTa: loading model gives erratic, non-deterministic results #2171

Open
2 of 4 tasks
jchook opened this issue Oct 22, 2024 · 6 comments


jchook commented Oct 22, 2024

System Info

A100 Colab

Issue

Here is a Colab repro to demonstrate my issue: https://colab.research.google.com/drive/1Z5FL1QDePY0j-8XL0o1V9o27_MbU-02E?usp=sharing

It's very possible that I am doing something fundamentally wrong or using the peft framework incorrectly. However, I tried to copy most of this code from the official LoRA sequence classification example.

On this toy IMDB sentiment classification example using LoRA + DeBERTa base, I get consistent ~100% test accuracy when I do any of the following:

  • Measure accuracy on the train set during training
  • Measure accuracy on the test set immediately after training an epoch
  • Fully fine-tune (FFT) the base model without the PEFT framework

However, I get erratic results when saving (per-epoch) and loading a trained LoRA adapter from disk.

This occurs without any modifications to the code, both within the same Python session and across fresh Python sessions. So, I have ruled out residual Python state in RAM, etc.
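
For context, the adapter is saved once per epoch. A minimal sketch of that setup follows, assuming the structure of the official LoRA sequence classification example and a microsoft/deberta-v3-base checkpoint; the hyperparameters and the save path pattern are illustrative (the path is inferred from the load path ./output/lora_epoch_1 used below), not copied from the Colab:

from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSequenceClassification

num_epochs = 3
base = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-base", num_labels=2
)
peft_config = LoraConfig(task_type=TaskType.SEQ_CLS, r=8, lora_alpha=16, lora_dropout=0.1)
model = get_peft_model(base, peft_config)

for epoch in range(num_epochs):
    # ... train for one epoch ...
    # Persist only the LoRA adapter for this epoch
    model.save_pretrained(f"./output/lora_epoch_{epoch}")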

The results appear to shift randomly between the following outcomes:

  • very high accuracy, close to 100%
  • unreasonably low accuracy, close to 0% (i.e. inverted labels)
  • something in between
{'accuracy': 0.3}
{'accuracy': 0.0}
{'accuracy': 0.979}
{'accuracy': 1.0}
{'accuracy': 0.303}
{'accuracy': 0.812}
{'accuracy': 0.591}
{'accuracy': 1.0}

Who can help?

@BenjaminBossan

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder
  • My own task or dataset (give details below)

Reproduction

Here is a Colab repro to demonstrate my issue: https://colab.research.google.com/drive/1Z5FL1QDePY0j-8XL0o1V9o27_MbU-02E?usp=sharing

In the Colab, after running the other cells to install dependencies, load the IMDB dataset, and train a LoRA classifier, I run this test cell repeatedly and see erratic results:

import evaluate
import torch
from peft import PeftModel, PeftConfig
from tqdm import tqdm
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# `device` and `eval_dataloader` are defined in earlier cells of the notebook

peft_model_id = "./output/lora_epoch_1"
config = PeftConfig.from_pretrained(peft_model_id)
inference_model = AutoModelForSequenceClassification.from_pretrained(config.base_model_name_or_path)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

# Load the Lora model
inference_model = PeftModel.from_pretrained(inference_model, peft_model_id)

# Reset the metric
metric = evaluate.load("accuracy")

inference_model.to(device)
inference_model.eval()
for step, batch in enumerate(tqdm(eval_dataloader)):
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = inference_model(**batch)
    predictions = outputs.logits.argmax(dim=-1)
    references = batch["labels"]
    metric.add_batch(
        predictions=predictions,
        references=references,
    )

eval_metric = metric.compute()
print(eval_metric)

I tried switching to PeftModelForSequenceClassification but that did not resolve the issue.

Again, this issue also occurs for me outside of Colab, on my local machine, when running Python scripts that fully exit after each execution (edit: here is a repro).

This issue also occurs when I try to fix the random seeds for numpy, torch, etc.
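
For reference, the seed fixing I mean is a block along these lines (an illustrative sketch, not necessarily the exact calls in the repro):

import random

import numpy as np
import torch

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)
# Optionally force deterministic cuDNN behavior as well
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False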

Expected behavior

I would expect relatively similar test accuracy each time I run the test routine.

@jchook jchook changed the title LoRA for CLS_SEQ - loading from file gives erratic, non-deterministic results LoRA for SEQ_CLS - loading from file gives erratic, non-deterministic results Oct 22, 2024
@jchook jchook changed the title LoRA for SEQ_CLS - loading from file gives erratic, non-deterministic results LoRA - loading from file gives erratic, non-deterministic results Oct 22, 2024
@JINO-ROHIT (Contributor)

Hi @jchook, I think I know what the issue is. I debugged your code, and it seems like reusing the inference_model variable twice is causing the problem. Can you try this? It works for me, lmk.

import evaluate
import torch
from peft import PeftModelForSequenceClassification, PeftConfig
from tqdm import tqdm
from transformers import AutoModelForSequenceClassification, AutoTokenizer

peft_model_id = "./output/lora_epoch_1"
config = PeftConfig.from_pretrained(peft_model_id)
base_model = AutoModelForSequenceClassification.from_pretrained(config.base_model_name_or_path)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

# Load the Lora model
inference_model = PeftModelForSequenceClassification.from_pretrained(base_model, peft_model_id)

# Reset the metric
metric = evaluate.load("accuracy")

inference_model.to(device)
inference_model.eval()
for step, batch in enumerate(tqdm(eval_dataloader)):
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = inference_model(**batch)
    predictions = outputs.logits.argmax(dim=-1)
    references = batch["labels"]
    metric.add_batch(
        predictions=predictions,
        references=references,
    )

eval_metric = metric.compute()
print(eval_metric)


jchook commented Oct 22, 2024

Thanks for the assistance.

I don't believe this fixes the issue.

Starting from a fresh Colab session and using your updated test loop, here is the output of my first 10 consecutive executions of the test cell:

{'accuracy': 1.0}
{'accuracy': 1.0}
{'accuracy': 0.0}
{'accuracy': 1.0}
{'accuracy': 0.041}
{'accuracy': 0.736}
{'accuracy': 0.0}
{'accuracy': 0.033}
{'accuracy': 0.0}
{'accuracy': 0.04}

I can reproduce this using normal .py scripts on my local machine, where the Python interpreter fully exits between executions, so I also don't believe it has to do with lingering Colab session state.

@JINO-ROHIT (Contributor)

Hmmm @jchook, that's weird. Can you show me how you're running your evaluations for the n executions?

I get consistent results for each run; you can find the example here: https://colab.research.google.com/drive/1bwhOStf2PZ7CfYP7x9hrticJ_rmUWkdN?usp=sharing

jchook added a commit to jchook/peft-issue-2171 that referenced this issue Oct 22, 2024
See the issue here:
huggingface/peft#2171

Output of first 15 runs:
{'accuracy': 1.0}
{'accuracy': 0.0}
{'accuracy': 1.0}
{'accuracy': 1.0}
{'accuracy': 0.02}
{'accuracy': 1.0}
{'accuracy': 0.0}
{'accuracy': 1.0}
{'accuracy': 1.0}
{'accuracy': 0.0}
{'accuracy': 0.0}
{'accuracy': 0.0}
{'accuracy': 1.0}
{'accuracy': 0.0}
{'accuracy': 0.0}

jchook commented Oct 22, 2024

@JINO-ROHIT I very much appreciate the investigation.

Can you show me how you're running your evaluations for the n executions?

The key is to ensure the models are reloaded on each execution. You can manually re-execute the entire test cell, or modify your code to load the LoRA adapter and the base model from within your run loop.

See code changes here

Faster execution for the sake of time

First, to make the test code run faster, I reduced the test set to only 100 examples in the dataloader config:

- dataset['test'] = dataset['test'].select(range(1000))
+ dataset['test'] = dataset['test'].select(range(100))

Ensure your run loop re-loads the LoRA from disk each time

The bug appears to have something to do with loading the adapter from disk and applying it to the base model. Also, make sure n is sufficiently large, as I sometimes see {'accuracy': 1.0} for the first ~5 executions.

import torch
from peft import PeftModelForSequenceClassification, PeftConfig
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import evaluate
from tqdm import tqdm

n = 10  # Number of times to run the evaluation
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

peft_model_id = "./output/lora_epoch_1"
results = []

# Loop to run the evaluation `n` times
for run in range(n):
    config = PeftConfig.from_pretrained(peft_model_id)
    base_model = AutoModelForSequenceClassification.from_pretrained(config.base_model_name_or_path)
    tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

    inference_model = PeftModelForSequenceClassification.from_pretrained(base_model, peft_model_id)

    inference_model.to(device)
    inference_model.eval()

    # Reset the metric for each run
    metric = evaluate.load("accuracy")
    print(f"Run {run + 1}/{n}")

    for step, batch in enumerate(tqdm(eval_dataloader)):
        batch = {k: v.to(device) for k, v in batch.items()}
        with torch.no_grad():
            outputs = inference_model(**batch)
        predictions = outputs.logits.argmax(dim=-1)
        references = batch["labels"]
        metric.add_batch(
            predictions=predictions,
            references=references,
        )

    eval_metric = metric.compute()
    print(f"Evaluation result for run {run + 1}: {eval_metric}")
    results.append(eval_metric)
  
print("Final evaluation results:")
for result in results:
  print(result)

After ensuring that the LoRA is re-loaded on each execution and setting n=10, here are my results:

Final evaluation results:
{'accuracy': 0.0}
{'accuracy': 0.86}
{'accuracy': 1.0}
{'accuracy': 1.0}
{'accuracy': 0.0}
{'accuracy': 1.0}
{'accuracy': 1.0}
{'accuracy': 1.0}
{'accuracy': 1.0}
{'accuracy': 1.0}

I'm able to consistently reproduce this issue on Colab and my local environment. Here is a reproduction of the issue using normal python scripts, ensuring python fully exits between each invocation of test.py: https://github.com/jchook/peft-issue-2171

You can see the results of the first 100 test.py invocations there.
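
One way to drive runs like that while guaranteeing a fresh interpreter per execution is a small wrapper along these lines (an illustrative sketch, not necessarily how the linked repo invokes test.py):

import subprocess
import sys

# Launch test.py in a fresh Python process each time so no in-process state
# (loaded weights, RNG state, caches) can carry over between evaluations.
for i in range(100):
    result = subprocess.run(
        [sys.executable, "test.py"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Assumes test.py prints the accuracy dict as its last line of output.
    print(f"Run {i + 1}: {result.stdout.strip().splitlines()[-1]}")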

@JINO-ROHIT (Contributor)

Ahhh, I see. Not sure why then; the other bits of the code seem mostly okay. Did pushing to the Hub and loading from there help?

I'll try and see if something else works; meanwhile, we can wait for @BenjaminBossan.


jchook commented Oct 22, 2024

Did pushing to the Hub and loading from there help?

Apparently saving/loading from the Hub does not fix the issue either. See this repro on Colab.

Final results:
{'accuracy': 0.0}
{'accuracy': 0.525}
{'accuracy': 0.0}
{'accuracy': 0.99}
{'accuracy': 0.0}
{'accuracy': 0.9}
{'accuracy': 1.0}
{'accuracy': 1.0}
{'accuracy': 0.73}
{'accuracy': 1.0}

However, changing the base model from DeBERTa V3 to RoBERTa seems to resolve the issue, so the problem may be DeBERTa-specific somehow (the checkpoint swap is sketched after the results below).

Final results:
{'accuracy': 1.0}
{'accuracy': 1.0}
{'accuracy': 1.0}
{'accuracy': 1.0}
{'accuracy': 1.0}
{'accuracy': 1.0}
{'accuracy': 1.0}
{'accuracy': 1.0}
{'accuracy': 1.0}
{'accuracy': 1.0}
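
For reference, the only change in that experiment is the base checkpoint passed to the training and evaluation cells. A minimal sketch of the swap follows; both checkpoint names are assumptions about the exact variants used in the Colab:

from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumed checkpoint names; the Colab may use slightly different variants.
# model_name = "microsoft/deberta-v3-base"  # erratic accuracy after reloading the adapter
model_name = "roberta-base"                 # stable accuracy after reloading the adapter

base_model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(model_name)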

@jchook jchook changed the title LoRA - loading from file gives erratic, non-deterministic results LoRA + DeBERTA: loading model gives erratic, non-deterministic results Oct 22, 2024
@jchook jchook changed the title LoRA + DeBERTA: loading model gives erratic, non-deterministic results LoRA + DeBERTa: loading model gives erratic, non-deterministic results Oct 22, 2024