Slow iteration for iterable dataset with numpy formatting for array data #7206

alex-hh opened this issue Oct 8, 2024 · 1 comment
alex-hh commented Oct 8, 2024

Describe the bug

When working with large arrays, calling with_format("numpy") on an iterable dataset and then applying map makes iteration significantly slower.

Steps to reproduce the bug

import numpy as np
import time
from datasets import Dataset, Features, Array3D

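# Two variable-length float32 array columns, each example of shape (n, 10, 10)
features = Features(**{"array0": Array3D((None, 10, 10), dtype="float32"), "array1": Array3D((None, 10, 10), dtype="float32")})
# 50 examples per column, alternating between n=2000 and n=1000
dataset = Dataset.from_dict({f"array{i}": [np.zeros((x, 10, 10), dtype=np.float32) for x in [2000, 1000] * 25] for i in range(2)}, features=features)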

Then

ds = dataset.to_iterable_dataset()
ds = ds.with_format("numpy").map(lambda x: x)
t0 = time.time()
for ex in ds:
    pass
t1 = time.time()
print(t1-t0)

takes 27 s, whereas

ds = dataset.to_iterable_dataset()
ds = ds.with_format("numpy")  # formatting only, no map
t0 = time.time()
for ex in ds:
    pass
t1 = time.time()
print(t1 - t0)

takes ~1s
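
To keep the two measurements above comparable, the timing loop can be factored into one helper. This is only a minimal sketch: `time_iteration` is a hypothetical name, not part of `datasets`, and the `range(50)` call is a stand-in for the `ds` objects built above.

```python
import time
from typing import Iterable, Tuple


def time_iteration(iterable: Iterable) -> Tuple[float, int]:
    """Iterate once over `iterable`; return (elapsed seconds, item count)."""
    count = 0
    start = time.perf_counter()
    for _ in iterable:
        count += 1
    elapsed = time.perf_counter() - start
    return elapsed, count


if __name__ == "__main__":
    # Stand-in iterable; in this report it would be the IterableDataset `ds`.
    elapsed, count = time_iteration(range(50))
    print(f"{count} examples in {elapsed:.3f} s")
```

Passing each variant of `ds` (with and without `map`) to the same helper also confirms that both runs see the same number of examples.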

Expected behavior

Map should not introduce a slowdown when formatting is enabled.

Environment info

3.0.2

@tux-type

The snippet below easily eats up 32 GB of RAM. Leaving it running for a while froze a laptop with 16 GB.

from datasets import load_dataset

dataset = load_dataset("Voxel51/OxfordFlowers102", data_dir="data").with_format("numpy")
processed_dataset = dataset.map(lambda x: x)

Similar problems occur if using a real transform function in .map().
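
To put a number on the memory growth, the peak resident set size of the process can be read from the standard library. This is a rough sketch, assuming a Unix-like system (the `resource` module is not available on Windows); `peak_rss_mib` is a hypothetical helper name and the list allocation is a stand-in for the `map` call.

```python
import resource
import sys


def peak_rss_mib() -> float:
    """Peak resident set size of the current process, in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


if __name__ == "__main__":
    before = peak_rss_mib()
    # Stand-in workload; replace with the dataset.map(...) call being measured.
    data = [bytes(1024) for _ in range(10_000)]
    after = peak_rss_mib()
    print(f"peak RSS: {before:.1f} MiB -> {after:.1f} MiB")
```

Because this is a high-water mark, it never decreases, so it shows how far memory climbed rather than current usage.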
