Dataset.map crashes when first examples return None and later examples return dict — writer not initialized #7990

@meta-program

Describe the bug

I found a serious bug in datasets/arrow_dataset.py.

Description of the bug
Dataset.map crashes because writer is None when the map function returns None for the first few examples and a dictionary (or pa.Table / DataFrame) for later examples. The internal writer is initialized only when i == 0 (or i[0] == 0 in batched mode), but update_data is determined lazily, after the first example/batch has been processed. If update_data first becomes True at a later index, the initialization branch has already been skipped and the writer is still None when the first write happens.

Steps to reproduce

from datasets import Dataset

ds = Dataset.from_dict({"x": [1, 2, 3]})

def fn(example, idx):
    if idx < 2:
        return None
    return {"x": [example["x"] * 10]}

list(ds.map(fn, with_indices=True))

Expected behavior

  • The function should work regardless of when update_data becomes True.
  • The writer should be initialized the first time a non-None value is returned, rather than being tied to the first index.

Environment info

  • datasets version:
  • Python version: 3.12
  • OS:

Suggested fix
When initializing the writer, replace the if i == 0 / if i[0] == 0 checks with if writer is None.
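
To illustrate the suggested change without depending on the library internals, here is a minimal standalone sketch. ToyWriter, map_index_gated and map_lazy_init are hypothetical names made up for this illustration; they are not the actual code in arrow_dataset.py, and the loop is only a simplified model of the pattern described above. It models when the writer gets created, not how rows that returned None should be stored.

```python
from typing import Callable, Optional


class ToyWriter:
    """Hypothetical stand-in for the internal writer; it only collects rows."""

    def __init__(self) -> None:
        self.rows: list[dict] = []

    def write(self, row: dict) -> None:
        self.rows.append(row)


MapFn = Callable[[dict, int], Optional[dict]]


def map_index_gated(examples: list[dict], fn: MapFn) -> list[dict]:
    """Simplified model of the reported pattern: writer created only at i == 0."""
    writer: Optional[ToyWriter] = None
    for i, example in enumerate(examples):
        output = fn(example, i)
        update_data = output is not None  # decided lazily from the return value
        if update_data:
            if i == 0:
                writer = ToyWriter()
            # Fails with AttributeError if the first non-None output has i > 0.
            writer.write(output)
    return writer.rows if writer is not None else examples


def map_lazy_init(examples: list[dict], fn: MapFn) -> list[dict]:
    """Same loop, but the writer is created on the first non-None output."""
    writer: Optional[ToyWriter] = None
    for i, example in enumerate(examples):
        output = fn(example, i)
        if output is not None:
            if writer is None:  # the suggested check
                writer = ToyWriter()
            writer.write(output)
    return writer.rows if writer is not None else examples


def fn(example: dict, idx: int) -> Optional[dict]:
    # None for the first two examples, a dict afterwards (as in the report).
    if idx < 2:
        return None
    return {"x": example["x"] * 10}


if __name__ == "__main__":
    examples = [{"x": 1}, {"x": 2}, {"x": 3}]
    try:
        map_index_gated(examples, fn)
    except AttributeError as err:
        print("index-gated version failed:", err)
    print("lazy-init version wrote:", map_lazy_init(examples, fn))
```

In this toy model the index-gated loop fails in the same way as the report (a write on a None writer), while the writer is None check succeeds no matter which index produces the first non-None output.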


Steps to reproduce the bug

from datasets import Dataset

# Create a minimal dataset
ds = Dataset.from_dict({"x": [1, 2, 3]})

# Define a map function that returns None for first examples, dict later
def fn(example, idx):
    if idx < 2:
        return None
    return {"x": [example["x"] * 10]}

# Apply map with indices
list(ds.map(fn, with_indices=True))

Expected: the function executes without errors.
Observed: it crashes with AttributeError: 'NoneType' object has no attribute 'write', because the internal writer has not been initialized when the first non-None return happens at i > 0.


Expected behavior


Dataset.map should handle map functions that return None for some examples and a dictionary (or pa.Table / DataFrame) for later examples. In that case, the internal writer should be initialized when the first non-None value is returned, so that the dataset can be updated without crashing. The call should run successfully over all examples and return the updated dataset.


Environment info

  • Python 3.12
  • datasets==3.6.0 (the latest version still has this problem)
  • transformers==4.55.2
