Description
Describe the bug
I found a serious bug in datasets/arrow_dataset.py.
Description of the bug
Dataset.map crashes because the internal writer is still None when the map function returns None for the first few examples and a dictionary (or pa.Table / DataFrame) for later examples. The writer is initialized only when i == 0 (or i[0] == 0 in batched mode), but update_data is determined lazily after processing the first example/batch, so a first non-None return at a later index reaches the write call with no writer.
Steps to reproduce
from datasets import Dataset

ds = Dataset.from_dict({"x": [1, 2, 3]})

def fn(example, idx):
    if idx < 2:
        return None
    return {"x": [example["x"] * 10]}

list(ds.map(fn, with_indices=True))
Expected behavior
- The function should work regardless of when update_data becomes True.
- Writer should be initialized the first time a non-None return occurs, not tied to the first index.
Environment info
- datasets version:
- Python version: 3.12
- OS:
Suggested fix
Replace if i == 0 / if i[0] == 0 checks with if writer is None when initializing the writer.
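To illustrate the difference between the two checks, here is a minimal toy model of the write loop. It is only a sketch and not the actual datasets code: ToyWriter, run_map, and the lazy_init flag are hypothetical names made up for this example, and the loop assumes update_data is decided per example, as described above.

```python
class ToyWriter:
    """Hypothetical stand-in for the internal writer; it just collects rows."""

    def __init__(self):
        self.rows = []

    def write(self, row):
        self.rows.append(row)


def run_map(examples, fn, lazy_init):
    """Toy model of the map write loop (not the real datasets implementation).

    lazy_init=False mimics the `if i == 0` check.
    lazy_init=True mimics the suggested `if writer is None` check.
    """
    writer = None
    for i, example in enumerate(examples):
        out = fn(example, i)
        update_data = out is not None  # assumption: decided per example
        if not update_data:
            continue
        should_init = (writer is None) if lazy_init else (i == 0)
        if should_init:
            writer = ToyWriter()
        # With the index-based check, writer is still None here whenever
        # the first non-None output arrives at i > 0.
        writer.write(out)
    return [] if writer is None else writer.rows


def fn(example, idx):
    # None for the first two examples, a dict afterwards
    return None if idx < 2 else {"x": example["x"] * 10}


examples = [{"x": 1}, {"x": 2}, {"x": 3}]
```

In this toy model, run_map(examples, fn, lazy_init=True) writes the single non-None row, while run_map(examples, fn, lazy_init=False) raises AttributeError on the first write, which mirrors the failure reported here.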
Steps to reproduce the bug
from datasets import Dataset

# Create a minimal dataset
ds = Dataset.from_dict({"x": [1, 2, 3]})

# Define a map function that returns None for first examples, dict later
def fn(example, idx):
    if idx < 2:
        return None
    return {"x": [example["x"] * 10]}

# Apply map with indices
list(ds.map(fn, with_indices=True))
Expected: function executes without errors.
Observed: crashes with AttributeError: 'NoneType' object has no attribute 'write' because the internal writer has not been initialized when the first non-None return happens at an index i > 0.
This is minimal and clearly demonstrates the exact failure condition (None early, dict later).
Expected behavior
The Dataset.map function should handle map functions that return None for some examples and a dictionary (or pa.Table / DataFrame) for later examples. In this case, the internal writer should be initialized when the first non-None value is returned, so that the dataset can be updated without crashing. The code should run successfully for all examples and return the updated dataset.
Environment info
- python3.12
- datasets==3.6.0 [but the latest version still has this problem]
- transformers==4.55.2