Data shuffle with streaming #6326

Closed
1 task done
JonghwanMun opened this issue Dec 13, 2024 · 1 comment
Labels
solved This problem has been already solved

Comments

@JonghwanMun

Reminder

  • I have read the README and searched the existing issues.

System Info

No response

Reproduction

I wonder how data shuffling works when the streaming option is enabled.
As I understand it, shuffling is applied within each buffer.

Suppose I have 10,000 samples in total and set buffer_size to 1,000.

  • Consider buffer1 (samples 1-1000), buffer2 (samples 1001-2000), buffer3 (samples 2001-3000), ...
  • Is the order of the buffers then shuffled as well, or not?

Expected behavior

The buffer order is shuffled.

Others

No response

@github-actions github-actions bot added the pending This problem is yet to be addressed label Dec 13, 2024
@hiyouga
Owner

hiyouga commented Dec 14, 2024

You can refer to the documentation for details:
https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.IterableDataset.shuffle
https://huggingface.co/docs/datasets/en/stream#shuffle
If your dataset has multiple shards, the order of the shards is shuffled as well.
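
To make the behaviour concrete, here is a minimal pure-Python sketch of the two steps the linked documentation describes: the shard order is shuffled, and a buffer is filled with buffer_size elements from which samples are drawn at random and replaced by incoming ones. This is a simplified model, not the library's actual code; the names `buffer_shuffle` and `shuffled_stream` and the toy in-memory shards are assumptions made for illustration.

```python
import random
from typing import Iterable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def buffer_shuffle(stream: Iterable[T], buffer_size: int, rng: random.Random) -> Iterator[T]:
    """Approximate shuffle of a stream using a fixed-size buffer (hypothetical helper)."""
    buffer: List[T] = []
    for item in stream:
        if len(buffer) < buffer_size:
            # Fill the buffer before yielding anything.
            buffer.append(item)
            continue
        # Buffer is full: emit a random element and put the new item in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Stream exhausted: emit whatever is left in random order.
    rng.shuffle(buffer)
    yield from buffer


def shuffled_stream(shards: Sequence[Sequence[T]], buffer_size: int, seed: int) -> Iterator[T]:
    """Shuffle the shard order, then buffer-shuffle the concatenated stream (hypothetical helper)."""
    rng = random.Random(seed)
    shard_order = list(range(len(shards)))
    rng.shuffle(shard_order)

    def read_shards() -> Iterator[T]:
        for i in shard_order:
            yield from shards[i]

    yield from buffer_shuffle(read_shards(), buffer_size, rng)


# 10,000 samples split into 10 toy shards of 1,000 samples each (assumed layout).
shards = [list(range(start, start + 1000)) for start in range(0, 10000, 1000)]
multi_shard = list(shuffled_stream(shards, buffer_size=1000, seed=0))

# The same data as a single shard: only the sliding buffer provides randomness.
single_shard = list(shuffled_stream([list(range(10000))], buffer_size=1000, seed=0))
```

Under this model there are no disjoint buffers (1-1000, 1001-2000, ...) whose order gets permuted. The buffer slides over the stream, so with a single shard the k-th output can only come from roughly the first 1,000 + k samples. Mixing across distant parts of the data comes from the shard-order shuffle, or from a larger buffer_size.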

@hiyouga hiyouga closed this as completed Dec 14, 2024
@hiyouga hiyouga added solved This problem has been already solved and removed pending This problem is yet to be addressed labels Dec 14, 2024