Training output reports incorrect num examples when using DDP #683

Open · 2 of 4 tasks · Labels: bug (Something isn't working), Stale

syl-taylor-aws opened this issue Aug 24, 2024 · 1 comment

System Info

AWS EC2 instance: trn1.32xlarge
OS: Ubuntu 22.04.4 LTS

Platform:

- Platform: Linux-6.5.0-1023-aws-x86_64-with-glibc2.35
- Python version: 3.10.12

Python packages:

- `optimum-neuron` version: 0.0.24
- `neuron-sdk` version: 2.19.1
- `optimum` version: 1.20.0
- `transformers` version: 4.41.1
- `huggingface_hub` version: 0.24.5
- `torch` version: 2.1.2+cu121
- `aws-neuronx-runtime-discovery` version: 2.9
- `libneuronxla` version: 2.0.2335
- `neuronx-cc` version: 2.14.227.0+2d4f85be
- `neuronx-distributed` version: 0.8.0
- `neuronx-hwm` version: NA
- `torch-neuronx` version: 2.1.2.2.1.0
- `torch-xla` version: 2.1.2
- `transformers-neuronx` version: 0.10.0.21

Neuron Driver:
aws-neuronx-collectives/unknown,now 2.21.46.0-69b77134b amd64 [installed]
aws-neuronx-dkms/unknown,now 2.17.17.0 amd64 [installed]
aws-neuronx-runtime-lib/unknown,now 2.21.41.0-fb1705f5f amd64 [installed]
aws-neuronx-tools/unknown,now 2.18.3.0 amd64 [installed]

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction (minimal, reproducible, runnable)

I can't share the project code, which uses a dataset of type `Dataset` with a length of 56403, but I wrote a simpler case that reproduces the same issue.

Command: torchrun --nproc_per_node=2 issue.py

Code (issue.py)
import torch
from transformers import RobertaForCausalLM
from optimum.neuron import NeuronTrainer as Trainer
from optimum.neuron import NeuronTrainingArguments as TrainingArguments


class CustomDataset(torch.utils.data.Dataset):
    """Synthetic dataset of random 512-token sequences (RoBERTa vocab size: 50265)."""

    def __getitem__(self, index):
        return {
            "input_ids": torch.randint(0, 50265, (512,)),
            "labels": torch.randint(0, 50265, (512,)),
        }

    def __len__(self):
        # Same length as the original (private) dataset.
        return 56403

dataset = CustomDataset()

model = RobertaForCausalLM.from_pretrained("roberta-base")

training_args = TrainingArguments(output_dir="./model", max_steps=100)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset
)

trainer.train() # note the output line: "[INFO|trainers.py:<num>] <timestamp> >>   Num examples = <number>"
# the issue is at https://github.com/huggingface/optimum-neuron/blob/v0.0.24/optimum/neuron/trainers.py#L700
# currently "self.num_examples(train_dataloader)" returns 28208
# it should perhaps be "self.num_examples(train_dataloader._loader)", which returns 56403 (expected)

When calling trainer.train(), we get the output:

[INFO|trainers.py:] <timestamp> >> ***** Running training *****
[INFO|trainers.py:] <timestamp> >>   Num examples = 28,208
...

Num examples should be 56403 (the dataset length), but a different number is reported when using DDP: 28208 in this test.

"Num examples" is calculated by Trainer's num_examples() in https://github.com/huggingface/transformers/blob/v4.41.1/src/transformers/trainer.py#L1408 which is called by https://github.com/huggingface/optimum-neuron/blob/v0.0.24/optimum/neuron/trainers.py#L700 .

The issue doesn't occur when training without DDP: the dataloader is then a plain <torch.utils.data.dataloader.DataLoader> and num_examples() returns the expected number.

With DDP, the dataloader is a <torch_xla.distributed.parallel_loader.MpDeviceLoader>, and accessing dataloader.dataset raises AttributeError("'MpDeviceLoader' object has no attribute 'dataset'"). num_examples() therefore takes its estimation branch at https://github.com/huggingface/transformers/blob/v4.41.1/src/transformers/trainer.py#L1420 and returns an unexpected number. However, dataloader._loader is the underlying <torch.utils.data.dataloader.DataLoader>, and len(dataloader._loader.dataset) is 56403. Perhaps we should call self.num_examples(train_dataloader._loader)?
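
A minimal sketch of that suggestion (hypothetical, untested), assuming MpDeviceLoader reliably keeps the wrapped DataLoader on its _loader attribute:

import torch_xla.distributed.parallel_loader as pl

# In optimum/neuron/trainers.py, before logging "Num examples":
loader = train_dataloader
if isinstance(loader, pl.MpDeviceLoader):
    loader = loader._loader  # unwrap to the underlying torch.utils.data.DataLoader
num_examples = self.num_examples(loader)  # returns 56403, as expected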

Expected behavior

"Num examples" is not reported correctly when training with DDP on AWS Trainium/Inferentia instances. In the reproducible code, it should be 56403 (len of dataset), but it returns 28208 based on an exception occurring in num_examples() in the transformers package.

For additional reference, on an EC2 p4d instance (NVIDIA A100 GPUs), when using DDP with the Trainer from the transformers package, the dataloader is an <accelerate.data_loader.DataLoaderShard> and "Num examples" is reported as expected: 56403.

syl-taylor-aws added the bug label on Aug 24, 2024

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions (bot) added the Stale label on Oct 14, 2024