Description
Describe the bug
When .map() is called with a closure defined inside a class method, and that closure captures self to read a configuration value (e.g., self.foo), the fingerprint mechanism serializes the entire state of the class instance.
If the instance contains any non-deterministic state (such as random seeds, loggers, or distinct object IDs; in my case it is a PyTorch Lightning LightningDataModule), the fingerprint changes on every run, rendering the cache useless.
While this may be intended behavior for dill, it is a significant "gotcha" for users migrating code into classes, as unrelated state changes cause massive re-processing overhead.
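To make the capture concrete, here is a minimal standard-library sketch of what each kind of closure actually holds on to. The class `Holder` and the helper `captured` are hypothetical names used only for illustration; nothing here calls into datasets or dill, it only inspects Python's own closure cells.

```python
import uuid


class Holder:
    """Hypothetical stand-in for a framework class with per-instance state."""

    def __init__(self):
        self.foo = 32
        self.hidden_state = uuid.uuid4()  # differs for every instance

    def closure_over_self(self):
        def preprocess(batch):
            _ = self.foo  # the free variable is `self`, i.e. the whole instance
            return batch
        return preprocess

    def closure_over_value(self):
        foo = self.foo  # copy the needed value into a local first

        def preprocess(batch):
            _ = foo  # the free variable is the plain int `foo`
            return batch
        return preprocess


def captured(func):
    """Map each free-variable name to the object stored in its closure cell."""
    names = func.__code__.co_freevars
    values = (cell.cell_contents for cell in func.__closure__)
    return dict(zip(names, values))


holder = Holder()
over_self = captured(holder.closure_over_self())
over_value = captured(holder.closure_over_value())

print(over_self)   # one entry, "self", pointing at the Holder instance
print(over_value)  # one entry, "foo", holding the int
```

Any serializer that walks the closure cells of the first function has to deal with the whole instance, including `hidden_state`, while the second function exposes only the integer.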
Real world "cache explosion" screenshot caused by the fingerprint mismatch:

Steps to reproduce the bug
Minimal reproduction code block:
import datasets
import uuid

# Prevent logging spam
datasets.logging.set_verbosity_error()


class ReproduceIssue:
    def __init__(self):
        # This is the variable we actually care about in the map function
        self.foo = 32
        # This simulates "dirty" internal state often found in framework classes
        # (e.g., unique IDs, pointers to loggers, thread locks, or random seeds)
        self.hidden_state = uuid.uuid4()
        self.dataset = datasets.Dataset.from_dict({"strokes": [1, 2, 3]})

    def setup(self):
        # Closure captures 'self' to access 'self.foo'
        def preprocess(batch):
            # Accessing self binds the function to the specific instance state
            _ = self.foo
            # Read the column that actually exists in the dataset ("strokes")
            return {"foo": batch["strokes"]}

        return self.dataset.map(preprocess, batched=True)


print("--- Run 1 ---")
inst1 = ReproduceIssue()
ds1 = inst1.setup()
print(f"Fingerprint 1: {ds1._fingerprint}")

print("\n--- Run 2 (New Instance) ---")
inst2 = ReproduceIssue()
ds2 = inst2.setup()
print(f"Fingerprint 2: {ds2._fingerprint}")

if ds1._fingerprint != ds2._fingerprint:
    print("\n❌ ISSUE REPRODUCED: Fingerprints differ (Cache Miss).")
else:
    print("\n✅ Fingerprints match.")
Result:
--- Run 1 ---
Mapping: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 2025.26 examples/s]
Fingerprint 1: 1ce6104f9e97912a
--- Run 2 (New Instance) ---
Mapping: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 2300.77 examples/s]
Fingerprint 2: c0fc011ff86ea571
--- Result ---
❌ CACHE MISS: Fingerprints are different!
Expected behavior
The fingerprint should ideally depend only on the bytecode of the function and the values of the variables actually accessed (self.foo), rather than the state of the whole object self.
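As a rough illustration of the difference this would make, the sketch below uses a toy fingerprint built from hashlib. It is an assumption-laden stand-in and not how datasets or dill compute hashes: `toy_fingerprint` and `Module` are hypothetical names, and the only point is that whatever the closure captures ends up in the hash. Copying self.foo into a local before defining the closure keeps the instance out of the captured state, which is one possible workaround; I have not verified it against datasets 4.5.0.

```python
import hashlib
import uuid


def toy_fingerprint(func):
    """Toy stand-in for a fingerprint: hash bytecode plus captured values.

    NOT the datasets algorithm; it only mimics the idea that captured
    objects contribute their full state to the hash.
    """
    hasher = hashlib.sha256(func.__code__.co_code)
    for cell in func.__closure__ or ():
        value = cell.cell_contents
        # Objects contribute their attribute dict, plain values their repr.
        state = getattr(value, "__dict__", value)
        hasher.update(repr(state).encode())
    return hasher.hexdigest()[:16]


class Module:
    """Hypothetical class with one config value and one per-instance ID."""

    def __init__(self):
        self.foo = 32
        self.hidden_state = uuid.uuid4()

    def capture_self(self):
        def preprocess(batch):
            return {"foo": [self.foo for _ in batch["strokes"]]}
        return preprocess

    def capture_value(self):
        foo = self.foo  # bind only the value the function needs

        def preprocess(batch):
            return {"foo": [foo for _ in batch["strokes"]]}
        return preprocess


a, b = Module(), Module()
self_fps = (toy_fingerprint(a.capture_self()), toy_fingerprint(b.capture_self()))
value_fps = (toy_fingerprint(a.capture_value()), toy_fingerprint(b.capture_value()))

print("capturing self: ", self_fps[0] == self_fps[1])
print("capturing value:", value_fps[0] == value_fps[1])
```

In this toy model the two instances only agree when the closure captures the value rather than self. Dataset.map also accepts a new_fingerprint argument for cases where a caller wants to supply the fingerprint explicitly.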
Environment info
datasets version: 4.5.0, platform: any, python version: 3.13.
This was encountered while subclassing PyTorch Lightning's LightningDataModule. Such objects inherently contain internal state that differs per instance.