
LoRA + DeBERTa: loading model gives erratic, non-deterministic results #2171

Open
2 of 4 tasks
jchook opened this issue Oct 22, 2024 · 6 comments


jchook commented Oct 22, 2024

System Info

A100 Colab

Issue

Here is a Colab repro to demonstrate my issue: https://colab.research.google.com/drive/1Z5FL1QDePY0j-8XL0o1V9o27_MbU-02E?usp=sharing

It's very possible that I am doing something fundamentally wrong or using the peft framework incorrectly. However, I tried to copy most of this code from the official LoRA sequence classification example.

On this toy IMDB sentiment classification example using LoRA + DeBERTa base, I get consistent ~100% test accuracy when I do any of the following:

  • Measure accuracy on the train set during training
  • Measure accuracy on the test set immediately after training an epoch
  • Fully fine-tune (FFT) the base model without the PEFT framework

However, I get erratic results when saving (per-epoch) and loading a trained LoRA adapter from disk.

This occurs without any modifications to the code, both within the same Python session and across fresh Python sessions. So, I have ruled out residual Python state in RAM, etc.
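
For context, the adapter is saved once per epoch. A minimal sketch of that setup follows, assuming the structure of the official LoRA sequence classification example and a microsoft/deberta-v3-base checkpoint; the hyperparameters and the save path pattern are illustrative (the path is inferred from the load path ./output/lora_epoch_1 used below), not copied from the Colab:

from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSequenceClassification

num_epochs = 3
base = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-base", num_labels=2
)
peft_config = LoraConfig(task_type=TaskType.SEQ_CLS, r=8, lora_alpha=16, lora_dropout=0.1)
model = get_peft_model(base, peft_config)

for epoch in range(num_epochs):
    # ... train for one epoch ...
    # Persist only the LoRA adapter for this epoch
    model.save_pretrained(f"./output/lora_epoch_{epoch}")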

The results appear to shift randomly between the following outcomes:

  • very high accuracy, close to 100%
  • unreasonably low accuracy, close to 0% (i.e. inverted labels)
  • something in between
{'accuracy': 0.3}
{'accuracy': 0.0}
{'accuracy': 0.979}
{'accuracy': 1.0}
{'accuracy': 0.303}
{'accuracy': 0.812}
{'accuracy': 0.591}
{'accuracy': 1.0}

Who can help?

@BenjaminBossan

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder
  • My own task or dataset (give details below)

Reproduction

Here is a Colab repro to demonstrate my issue: https://colab.research.google.com/drive/1Z5FL1QDePY0j-8XL0o1V9o27_MbU-02E?usp=sharing

In the Colab, after running the other cells to install dependencies, load the IMDB dataset, and train a LoRA classifier, I run this test cell repeatedly and see erratic results:

import evaluate
import torch
from peft import PeftModel, PeftConfig
from tqdm import tqdm
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# `device` and `eval_dataloader` are defined in earlier cells of the notebook

peft_model_id = "./output/lora_epoch_1"
config = PeftConfig.from_pretrained(peft_model_id)
inference_model = AutoModelForSequenceClassification.from_pretrained(config.base_model_name_or_path)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

# Load the Lora model
inference_model = PeftModel.from_pretrained(inference_model, peft_model_id)

# Reset the metric
metric = evaluate.load("accuracy")

inference_model.to(device)
inference_model.eval()
for step, batch in enumerate(tqdm(eval_dataloader)):
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = inference_model(**batch)
    predictions = outputs.logits.argmax(dim=-1)
    references = batch["labels"]
    metric.add_batch(
        predictions=predictions,
        references=references,
    )

eval_metric = metric.compute()
print(eval_metric)

I tried switching to PeftModelForSequenceClassification but that did not resolve the issue.

Again, this issue also occurs for me outside of Colab, on my local machine, when running Python scripts that fully exit after each execution (edit: here is a repro).

This issue also occurs when I try to fix the random seeds for numpy, torch, etc.
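
For reference, the seed fixing I mean is a block along these lines (an illustrative sketch, not necessarily the exact calls in the repro):

import random

import numpy as np
import torch

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)
# Optionally force deterministic cuDNN behavior as well
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False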

Expected behavior

I would expect relatively similar test accuracy each time I run the test routine.

@jchook jchook changed the title LoRA for CLS_SEQ - loading from file gives erratic, non-deterministic results LoRA for SEQ_CLS - loading from file gives erratic, non-deterministic results Oct 22, 2024
@jchook jchook changed the title LoRA for SEQ_CLS - loading from file gives erratic, non-deterministic results LoRA - loading from file gives erratic, non-deterministic results Oct 22, 2024
@JINO-ROHIT (Contributor)

Hi @jchook, I think I know what the issue is. I debugged your code, and it seems like reusing the inference_model variable twice is causing the problem. Can you try this? It works for me, lmk.

import evaluate
import torch
from peft import PeftModelForSequenceClassification, PeftConfig
from tqdm import tqdm
from transformers import AutoModelForSequenceClassification, AutoTokenizer

peft_model_id = "./output/lora_epoch_1"
config = PeftConfig.from_pretrained(peft_model_id)
base_model = AutoModelForSequenceClassification.from_pretrained(config.base_model_name_or_path)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

# Load the Lora model
inference_model = PeftModelForSequenceClassification.from_pretrained(base_model, peft_model_id)

# Reset the metric
metric = evaluate.load("accuracy")

inference_model.to(device)
inference_model.eval()
for step, batch in enumerate(tqdm(eval_dataloader)):
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = inference_model(**batch)
    predictions = outputs.logits.argmax(dim=-1)
    references = batch["labels"]
    metric.add_batch(
        predictions=predictions,
        references=references,
    )

eval_metric = metric.compute()
print(eval_metric)


jchook commented Oct 22, 2024

Thanks for the assistance.

I don't believe this fixes the issue.

Starting from a fresh Colab session and using your updated test loop, here is the output of my first 10 consecutive executions of the test cell:

{'accuracy': 1.0}
{'accuracy': 1.0}
{'accuracy': 0.0}
{'accuracy': 1.0}
{'accuracy': 0.041}
{'accuracy': 0.736}
{'accuracy': 0.0}
{'accuracy': 0.033}
{'accuracy': 0.0}
{'accuracy': 0.04}

I can reproduce this using normal .py scripts on my local machine, where the Python interpreter fully exits between executions, so I also don't believe it has to do with lingering Colab session state.

@JINO-ROHIT (Contributor)

Hmmm @jchook, that's weird. Can you show me how you're running your evaluations for the n executions?

I get consistent results for each run; you can find the example here: https://colab.research.google.com/drive/1bwhOStf2PZ7CfYP7x9hrticJ_rmUWkdN?usp=sharing

jchook added a commit to jchook/peft-issue-2171 that referenced this issue Oct 22, 2024
See the issue here:
huggingface/peft#2171

Output of first 15 runs:
{'accuracy': 1.0}
{'accuracy': 0.0}
{'accuracy': 1.0}
{'accuracy': 1.0}
{'accuracy': 0.02}
{'accuracy': 1.0}
{'accuracy': 0.0}
{'accuracy': 1.0}
{'accuracy': 1.0}
{'accuracy': 0.0}
{'accuracy': 0.0}
{'accuracy': 0.0}
{'accuracy': 1.0}
{'accuracy': 0.0}
{'accuracy': 0.0}

jchook commented Oct 22, 2024

@JINO-ROHIT I very much appreciate the investigation.

Can you show me how you're running your evaluations for the n executions?

The key is to ensure the models are reloaded on each execution. You can manually re-execute the entire test cell, or modify your code to load the LoRA adapter and the base model from within your run loop.

See code changes here

Faster execution for the sake of time

First, to make the test code run faster, I reduced the test set to only 100 examples in the dataloader config:

- dataset['test'] = dataset['test'].select(range(1000))
+ dataset['test'] = dataset['test'].select(range(100))

Ensure your run loop re-loads the LoRA from disk each time

The bug appears to have something to do with loading the adapter from disk and applying it to the base model. Also, make sure n is sufficiently large, as I sometimes see {'accuracy': 1.0} for the first ~5 executions.

import torch
from peft import PeftModelForSequenceClassification, PeftConfig
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import evaluate
from tqdm import tqdm

n = 10  # Number of times to run the evaluation
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

peft_model_id = "./output/lora_epoch_1"
results = []

# Loop to run the evaluation `n` times
for run in range(n):
    config = PeftConfig.from_pretrained(peft_model_id)
    base_model = AutoModelForSequenceClassification.from_pretrained(config.base_model_name_or_path)
    tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

    inference_model = PeftModelForSequenceClassification.from_pretrained(base_model, peft_model_id)

    inference_model.to(device)
    inference_model.eval()

    # Reset the metric for each run
    metric = evaluate.load("accuracy")
    print(f"Run {run + 1}/{n}")

    for step, batch in enumerate(tqdm(eval_dataloader)):
        batch = {k: v.to(device) for k, v in batch.items()}
        with torch.no_grad():
            outputs = inference_model(**batch)
        predictions = outputs.logits.argmax(dim=-1)
        references = batch["labels"]
        metric.add_batch(
            predictions=predictions,
            references=references,
        )

    eval_metric = metric.compute()
    print(f"Evaluation result for run {run + 1}: {eval_metric}")
    results.append(eval_metric)
  
print("Final evaluation results:")
for result in results:
  print(result)

After ensuring that the LoRA is re-loaded on each execution and setting n=10, here are my results:

Final evaluation results:
{'accuracy': 0.0}
{'accuracy': 0.86}
{'accuracy': 1.0}
{'accuracy': 1.0}
{'accuracy': 0.0}
{'accuracy': 1.0}
{'accuracy': 1.0}
{'accuracy': 1.0}
{'accuracy': 1.0}
{'accuracy': 1.0}

I'm able to consistently reproduce this issue on Colab and my local environment. Here is a reproduction of the issue using normal python scripts, ensuring python fully exits between each invocation of test.py: https://github.com/jchook/peft-issue-2171

You can see the results of the first 100 test.py invocations there.
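
One way to drive runs like that while guaranteeing a fresh interpreter per execution is a small wrapper along these lines (an illustrative sketch, not necessarily how the linked repo invokes test.py):

import subprocess
import sys

# Launch test.py in a fresh Python process each time so no in-process state
# (loaded weights, RNG state, caches) can carry over between evaluations.
for i in range(100):
    result = subprocess.run(
        [sys.executable, "test.py"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Assumes test.py prints the accuracy dict as its last line of output.
    print(f"Run {i + 1}: {result.stdout.strip().splitlines()[-1]}")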

@JINO-ROHIT (Contributor)

Ahhh, I see. Not sure why then; the other bits of the code seem mostly okay. Did pushing to the Hub and loading from there help?

I'll try and see if something else works; meanwhile, we can wait for @BenjaminBossan.


jchook commented Oct 22, 2024

Did pushing to the Hub and loading from there help?

Apparently saving/loading from the Hub does not fix the issue either. See this repro on Colab.

Final results:
{'accuracy': 0.0}
{'accuracy': 0.525}
{'accuracy': 0.0}
{'accuracy': 0.99}
{'accuracy': 0.0}
{'accuracy': 0.9}
{'accuracy': 1.0}
{'accuracy': 1.0}
{'accuracy': 0.73}
{'accuracy': 1.0}

However, changing the base model from DeBERTa V3 to RoBERTa seems to resolve the issue, so the problem may be DeBERTa-specific somehow (the checkpoint swap is sketched after the results below).

Final results:
{'accuracy': 1.0}
{'accuracy': 1.0}
{'accuracy': 1.0}
{'accuracy': 1.0}
{'accuracy': 1.0}
{'accuracy': 1.0}
{'accuracy': 1.0}
{'accuracy': 1.0}
{'accuracy': 1.0}
{'accuracy': 1.0}
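
For reference, the only change in that experiment is the base checkpoint passed to the training and evaluation cells. A minimal sketch of the swap follows; both checkpoint names are assumptions about the exact variants used in the Colab:

from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumed checkpoint names; the Colab may use slightly different variants.
# model_name = "microsoft/deberta-v3-base"  # erratic accuracy after reloading the adapter
model_name = "roberta-base"                 # stable accuracy after reloading the adapter

base_model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(model_name)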

@jchook jchook changed the title LoRA - loading from file gives erratic, non-deterministic results LoRA + DeBERTA: loading model gives erratic, non-deterministic results Oct 22, 2024
@jchook jchook changed the title LoRA + DeBERTA: loading model gives erratic, non-deterministic results LoRA + DeBERTa: loading model gives erratic, non-deterministic results Oct 22, 2024