Low performance on MPS backend #2041
Thanks for reporting this issue. I don't have a Mac to try to reproduce it, so I cannot really help you here. Honestly, I don't know much about MPS in general or how well it is supported by PyTorch. Still, maybe you could provide some further information, and maybe other users who see this issue can give further advice.
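In case it helps triage, here is a minimal sketch (standard PyTorch and stdlib calls only, nothing PEFT-specific) of the environment details that are usually relevant for MPS reports:

```python
# Sketch: collect the environment details that help diagnose MPS issues.
# All calls below are standard PyTorch / stdlib APIs.
import platform
import torch

print("python       :", platform.python_version())
print("macOS        :", platform.mac_ver()[0])
print("torch        :", torch.__version__)
print("mps built    :", torch.backends.mps.is_built())
print("mps available:", torch.backends.mps.is_available())
```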
PS: Please don't ping "saya", they're not related to this project.
Honestly, it's awful. MPS support is there in some sense, but the list of missing primitives is huge and it doesn't seem to be getting any shorter. So if this lib relies on PyTorch for the MPS backend, that's bad luck for me. Anyway, I raised this one as a starting point, because as far as I understand it, this lib leverages the accelerate lib, which is something like a backend-managing layer for all the different GPU-related stuff, and if so it's quite clear that the pain point comes from there. Am I right about that? PS: sorry for the "saya" mention, it was a GH completion failure 😅
So what you're saying is that the slowness stems from the lackluster support of MPS in PyTorch, and since PEFT uses PyTorch, slow MPS performance is expected. Is that right? If there are specific operations in PEFT that could be replaced with alternatives that are more efficient for MPS, let us know. Apart from that, I don't think there is much we can do.
I'd say not quite. PEFT uses a few functions from accelerate, e.g. for moving tensors on and off devices if the base model requires it, but apart from that it is pretty much independent of accelerate. Also, accelerate is not so much a "backend-managing layer for the different GPU-related stuff", but more so a library providing seamless integration of (mostly) training features, like parallelization, dealing with large models, mixed precision, etc. Managing devices is just a "side effect" of dealing with those. If you use PEFT with
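For context, here is a minimal sketch of wrapping a model with LoRA and placing it on MPS explicitly, without relying on accelerate for device placement; the base model and LoRA hyperparameters are purely illustrative, not a recommendation:

```python
# Sketch: LoRA via PEFT with explicit MPS device placement.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

# "gpt2" is only a placeholder base model; PEFT knows its default LoRA target modules.
model = AutoModelForCausalLM.from_pretrained("gpt2")
config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, config)
model.to(device)  # explicit device placement, no accelerate involved
model.print_trainable_parameters()
```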
Yeah, I'm gonna dig through this sooner rather than later to figure out which operations fall back to CPU in PEFT. Thank you for the thorough overview of the PEFT stack, it will help with the next step of debugging. I'm not sure how long-open issues are treated here; I'd keep it open so it stays an eyesore for me, but if it's annoying for you, feel free to close it and I'll open a new one later.
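One possible way to spot those CPU fallbacks is sketched below. It assumes the script is run with PYTORCH_ENABLE_MPS_FALLBACK=1 so unsupported ops warn and fall back instead of raising; the warning-text matching is an assumption about the current message format and may need adjusting:

```python
# Sketch: list which aten ops fell back to CPU during one forward/backward pass.
# Run with:  PYTORCH_ENABLE_MPS_FALLBACK=1 python this_script.py
import warnings
import torch

device = torch.device("mps")
model = torch.nn.Linear(16, 16).to(device)  # stand-in for your PEFT-wrapped model
x = torch.randn(4, 16, device=device)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    loss = model(x).sum()
    loss.backward()

# PyTorch warns (once per op) when an op is not supported on MPS and falls back to CPU.
fallbacks = {str(w.message) for w in caught if "fall back" in str(w.message)}
for msg in sorted(fallbacks):
    print(msg)
```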
All right. We can keep this open for the time being, maybe it helps get some eyes on the topic.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
C'mon dude, stop pushing me, I'm doing what I can.
System Info
Who can help?
Information
Tasks
examples folder
Reproduction
Expected behavior
This pipeline utilises the GPU at 10-15% while the CPU sits at 30-50%.
The MLX framework, with roughly the same LoRA training setup on the same model, utilises the GPU two to three times more.
Such low utilisation makes training noticeably slower than the MLX run.
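A rough way to quantify the slowdown, instead of comparing utilisation numbers, would be to time a few optimizer steps. This sketch assumes a generic model/optimizer/batch from the LoRA setup above and a PyTorch version that provides torch.mps.synchronize():

```python
# Sketch: measure seconds per training step on MPS.
# `model`, `optimizer`, and `batch` are placeholders for the actual LoRA training setup.
import time
import torch

def time_steps(model, optimizer, batch, n_steps=20):
    # Warm-up step so one-time kernel compilation doesn't skew the measurement.
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    torch.mps.synchronize()  # wait for queued MPS work to finish

    start = time.perf_counter()
    for _ in range(n_steps):
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    torch.mps.synchronize()
    elapsed = time.perf_counter() - start
    print(f"{elapsed / n_steps:.3f} s/step over {n_steps} steps")
```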