save_state() and load_state() do not work correctly with multi-gpu with shuffle=True in dataloader #3158

isayoften · 2024-10-11T15:21:15Z

System Info

- `Accelerate` version: 1.0.0
- Platform: Linux-5.15.154+-x86_64-with-glibc2.35
- `accelerate` bash location: /opt/conda/bin/accelerate
- Python version: 3.10.14
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.4.0 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- PyTorch MUSA available: False
- System RAM: 31.36 GB
- GPU type: Tesla T4
- `Accelerate` default config:
	Not found

Information

- The official example scripts
- My own modified scripts

Tasks

- One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
- My own task or dataset (give details below)

Reproduction

https://www.kaggle.com/code/amanattheedge/demonstration

Expected behavior

Maybe I'm doing something wrong, but I expected save_state() and load_state() to save and restore the RNG states, so that after resuming, the dataloader reproduces the same shuffled order the data had in the epoch where the checkpoint was taken.
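
To make the expectation concrete, here is a minimal pure-Python sketch of the behavior I have in mind. It is not Accelerate's implementation and does not call Accelerate or PyTorch. The helper `epoch_batches`, the `checkpoint` dict, and the sizes are hypothetical names and values chosen for illustration. The assumption is that an epoch's shuffle is fully determined by the RNG state at the start of that epoch, so restoring that state and skipping the batches already consumed should give back the same remaining batches.

```python
import random


def epoch_batches(rng, num_samples, batch_size):
    """Shuffle sample indices with `rng` and split them into batches."""
    indices = list(range(num_samples))
    rng.shuffle(indices)
    return [indices[i:i + batch_size] for i in range(0, num_samples, batch_size)]


rng = random.Random(1234)

# Epoch 0 runs to completion and advances the RNG.
epoch_batches(rng, num_samples=12, batch_size=4)

# Epoch 1: record the RNG state before shuffling, then "stop" after one batch.
checkpoint = {"rng_state": rng.getstate(), "batches_seen": 1}
original = epoch_batches(rng, num_samples=12, batch_size=4)

# Resume in a fresh process (simulated here with a fresh RNG object):
# restore the saved state, redo the shuffle, and skip the consumed batches.
resumed_rng = random.Random()
resumed_rng.setstate(checkpoint["rng_state"])
resumed = epoch_batches(resumed_rng, num_samples=12, batch_size=4)
resumed = resumed[checkpoint["batches_seen"]:]

# The resumed run sees exactly the batches the original run had left.
assert resumed == original[checkpoint["batches_seen"]:]
```

In the multi-GPU case I would expect the same thing to hold on every process, with each one restoring its own RNG state.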
