
Dataset.map() causes cache miss/fingerprint change when closure captures self containing non-deterministic state. #7986

@Cloud0310

Description


Describe the bug

When using .map() with a function defined inside a class method (i.e., a closure), if that function captures self to access a configuration variable (e.g., self.foo), the fingerprint mechanism serializes the entire class instance state.

If the class instance contains any non-deterministic state (such as random seeds, loggers, or distinct object IDs—in my case, PyTorch Lightning's LightningDataModule), the fingerprint changes on every run, rendering the cache useless.

While this may be intended behavior for dill, it is a significant "gotcha" for users migrating code into classes, as unrelated state changes cause massive re-processing overhead.
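The mechanism is plain Python closure semantics: a function that references `self` keeps the whole instance alive in a closure cell, and that is what dill then serializes. A minimal stdlib-only sketch (no `datasets` required; the class name here is illustrative, not the real LightningDataModule):

```python
import uuid

class Module:
    def __init__(self):
        self.foo = 32
        self.hidden_state = uuid.uuid4()  # differs on every instantiation

    def make_fn(self):
        # Referencing self.foo captures *self*, not just the int 32
        def preprocess(batch):
            _ = self.foo
            return batch
        return preprocess

fn = Module().make_fn()
# The closure cell holds the entire Module instance, hidden_state included,
# so any serializer that walks the closure sees non-deterministic state.
captured = fn.__closure__[0].cell_contents
print(type(captured).__name__)            # Module
print(hasattr(captured, "hidden_state"))  # True
```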

Real-world "cache explosion" screenshot caused by the fingerprint mismatch: (screenshot attached)

Steps to reproduce the bug

Minimal reproduction code block:

import datasets
import uuid

# Prevent logging spam
datasets.logging.set_verbosity_error()

class ReproduceIssue:
    def __init__(self):
        # This is the variable we actually care about in the map function
        self.foo = 32
        
        # This simulates "dirty" internal state often found in framework classes 
        # (e.g., unique IDs, pointers to loggers, thread locks, or random seeds)
        self.hidden_state = uuid.uuid4()
        
        self.dataset = datasets.Dataset.from_dict({"strokes": [1, 2, 3]})

    def setup(self):
        # Closure captures 'self' to access 'self.foo'
        def preprocess(batch):
            # Accessing self binds the function to the specific instance state
            _ = self.foo 
            return {"foo": batch["strokes"]}

        return self.dataset.map(preprocess, batched=True)

print("--- Run 1 ---")
inst1 = ReproduceIssue()
ds1 = inst1.setup()
print(f"Fingerprint 1: {ds1._fingerprint}")

print("\n--- Run 2 (New Instance) ---")
inst2 = ReproduceIssue()
ds2 = inst2.setup()
print(f"Fingerprint 2: {ds2._fingerprint}")

if ds1._fingerprint != ds2._fingerprint:
    print("\n❌ ISSUE REPRODUCED: Fingerprints differ (Cache Miss).")
else:
    print("\n✅ Fingerprints match.")

Result:

--- Run 1 ---
Mapping: 100%|██████████| 3/3 [00:00<00:00, 2025.26 examples/s]
Fingerprint 1: 1ce6104f9e97912a

--- Run 2 (New Instance) ---
Mapping: 100%|██████████| 3/3 [00:00<00:00, 2300.77 examples/s]
Fingerprint 2: c0fc011ff86ea571

❌ ISSUE REPRODUCED: Fingerprints differ (Cache Miss).

Expected behavior

The fingerprint should ideally depend only on the function's bytecode and the values of the variables it actually accesses (here, self.foo), rather than on the state of the entire self object.
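Until the fingerprinting is narrowed, a workaround consistent with the behavior described above is to copy the needed values into locals before defining the closure, so the closure captures a deterministic int rather than `self`. A hedged stdlib-only sketch (in the real code the same `preprocess` would be passed to `self.dataset.map(...)`):

```python
import uuid

class Module:
    def __init__(self):
        self.foo = 32
        self.hidden_state = uuid.uuid4()  # non-deterministic per instance

    def make_fn(self):
        foo = self.foo  # bind the value; the closure no longer references self
        def preprocess(batch):
            _ = foo
            return batch
        return preprocess

# Two fresh instances now yield closures with identical captured state:
a = Module().make_fn().__closure__[0].cell_contents
b = Module().make_fn().__closure__[0].cell_contents
print(a, b)  # 32 32
```

Alternatively, `Dataset.map` accepts `fn_kwargs`; as far as I can tell those are hashed into the fingerprint, so passing the configuration as `fn_kwargs={"foo": self.foo}` and taking `foo` as a function parameter avoids the capture entirely.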

Environment info

datasets version: 4.5.0; platform: any; Python version: 3.13.
This was encountered while subclassing PyTorch Lightning's LightningDataModule; these objects inherently contain internal state that differs per instance.
