
Dataset.map() causes cache miss/fingerprint change when closure captures self containing non-deterministic state. #7986

@Cloud0310

Description


Describe the bug

When using .map() with a function defined inside a class method (i.e., a closure), if that function captures self to access a configuration variable (e.g., self.foo), the fingerprint mechanism serializes the entire class instance state.

If the class instance contains any non-deterministic state (such as random seeds, loggers, or distinct object IDs—in my case, PyTorch Lightning's LightningDataModule), the fingerprint changes on every run, rendering the cache useless.

While this may be intended behavior for dill, it is a significant "gotcha" for users migrating code into classes, as unrelated state changes cause massive re-processing overhead.
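The mechanism is plain Python closure semantics: a function that references `self` keeps the whole instance alive in a closure cell, and that is what dill then serializes. A minimal stdlib-only sketch (no `datasets` required; the class name here is illustrative, not the real LightningDataModule):

```python
import uuid

class Module:
    def __init__(self):
        self.foo = 32
        self.hidden_state = uuid.uuid4()  # differs on every instantiation

    def make_fn(self):
        # Referencing self.foo captures *self*, not just the int 32
        def preprocess(batch):
            _ = self.foo
            return batch
        return preprocess

fn = Module().make_fn()
# The closure cell holds the entire Module instance, hidden_state included,
# so any serializer that walks the closure sees non-deterministic state.
captured = fn.__closure__[0].cell_contents
print(type(captured).__name__)            # Module
print(hasattr(captured, "hidden_state"))  # True
```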

Real-world "cache explosion" screenshot caused by the fingerprint mismatch: (screenshot attached)

Steps to reproduce the bug

Minimal reproduction code block:

import datasets
import uuid

# Prevent logging spam
datasets.logging.set_verbosity_error()

class ReproduceIssue:
    def __init__(self):
        # This is the variable we actually care about in the map function
        self.foo = 32
        
        # This simulates "dirty" internal state often found in framework classes 
        # (e.g., unique IDs, pointers to loggers, thread locks, or random seeds)
        self.hidden_state = uuid.uuid4()
        
        self.dataset = datasets.Dataset.from_dict({"strokes": [1, 2, 3]})

    def setup(self):
        # Closure captures 'self' to access 'self.foo'
        def preprocess(batch):
            # Accessing self binds the function to the specific instance state
            _ = self.foo 
            return {"foo": batch["strokes"]}

        return self.dataset.map(preprocess, batched=True)

print("--- Run 1 ---")
inst1 = ReproduceIssue()
ds1 = inst1.setup()
print(f"Fingerprint 1: {ds1._fingerprint}")

print("\n--- Run 2 (New Instance) ---")
inst2 = ReproduceIssue()
ds2 = inst2.setup()
print(f"Fingerprint 2: {ds2._fingerprint}")

if ds1._fingerprint != ds2._fingerprint:
    print("\n❌ ISSUE REPRODUCED: Fingerprints differ (Cache Miss).")
else:
    print("\n✅ Fingerprints match.")

Result:

--- Run 1 ---
Mapping: 100%|██████████| 3/3 [00:00<00:00, 2025.26 examples/s]
Fingerprint 1: 1ce6104f9e97912a

--- Run 2 (New Instance) ---
Mapping: 100%|██████████| 3/3 [00:00<00:00, 2300.77 examples/s]
Fingerprint 2: c0fc011ff86ea571

❌ ISSUE REPRODUCED: Fingerprints differ (Cache Miss).

Expected behavior

The fingerprint should ideally depend only on the function's bytecode and the values of the variables it actually accesses (here, self.foo), rather than on the state of the entire self object.
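Until the fingerprinting is narrowed, a workaround consistent with the behavior described above is to copy the needed values into locals before defining the closure, so the closure captures a deterministic int rather than `self`. A hedged stdlib-only sketch (in the real code the same `preprocess` would be passed to `self.dataset.map(...)`):

```python
import uuid

class Module:
    def __init__(self):
        self.foo = 32
        self.hidden_state = uuid.uuid4()  # non-deterministic per instance

    def make_fn(self):
        foo = self.foo  # bind the value; the closure no longer references self
        def preprocess(batch):
            _ = foo
            return batch
        return preprocess

# Two fresh instances now yield closures with identical captured state:
a = Module().make_fn().__closure__[0].cell_contents
b = Module().make_fn().__closure__[0].cell_contents
print(a, b)  # 32 32
```

Alternatively, `Dataset.map` accepts `fn_kwargs`; as far as I can tell those are hashed into the fingerprint, so passing the configuration as `fn_kwargs={"foo": self.foo}` and taking `foo` as a function parameter avoids the capture entirely.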

Environment info

datasets version: 4.5.0; platform: any; Python version: 3.13.
This was encountered while subclassing PyTorch Lightning's LightningDataModule; these objects inherently contain internal state that differs per instance.
