
Conversation

@szkarpinski
Collaborator

Category:

Refactoring (Redesign of existing code that doesn't affect functionality)

Description:

Pipeline checkpoints are simple enough that safe JSON serialization can be used instead of Pickle.
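
For context (not part of the PR), a minimal illustration of why pickle is unsafe for untrusted checkpoint data while JSON is not; the Exploit class below is hypothetical:

import json
import pickle

# A malicious "checkpoint": unpickling it executes os.system.
class Exploit:
    def __reduce__(self):
        import os
        return (os.system, ("echo pwned",))

payload = pickle.dumps(Exploit())
# pickle.loads(payload)  # would run "echo pwned" on load

# json.loads can only produce dicts, lists, strings, numbers,
# bools, and None, so a hostile checkpoint cannot execute code.
state = json.loads('{"epoch_idx": 3, "iter": 120}')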

Additional information:

Affected modules and functionalities:

Internal format of the checkpoints

Key points relevant for the review:

Tests:

  • Existing tests apply
  • New tests added
    • Python tests
    • GTests
    • Benchmark
    • Other
  • N/A

Checklist

Documentation

  • Existing documentation applies
  • Documentation updated
    • Docstring
    • Doxygen
    • RST
    • Jupyter
    • Other
  • N/A

DALI team only

Requirements

  • Implements new requirements
  • Affects existing requirements
  • N/A

REQ IDs: N/A

JIRA TASK: DALI-4546

@szkarpinski
Collaborator Author

!build

@dali-automaton
Collaborator

CI MESSAGE: [41566518]: BUILD STARTED

@greptile-apps

greptile-apps bot commented Jan 12, 2026

Greptile Summary

This PR refactors pipeline checkpointing from pickle to JSON serialization, improving security by eliminating arbitrary code execution risks. The implementation adds type-aware serialization/deserialization methods to handle numpy arrays and int64 values correctly, converting them to JSON-compatible types (lists and ints) during save and restoring them with proper dtypes during load. Both pipeline and iterator checkpoint handling now include JSONDecodeError handling that provides clear error messages when encountering old pickle-based checkpoints.

  • Replaced pickle.dumps()/pickle.loads() with json.dumps()/json.loads() in both pipeline and iterator checkpoint methods
  • Added _serialize_value() and _deserialize_value() helper methods to handle numpy type conversions
  • Updated _checkpointed_fields() to include type information for each field
  • Added proper error handling for JSONDecodeError with informative messages about version compatibility
  • Removed nosec security suppression comments that were needed for pickle
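
A minimal sketch of what such helpers might look like (the names _serialize_value and _deserialize_value come from the summary above; the bodies, the per-value type tagging, and the int64-only handling are assumptions, not the PR's actual code):

import json
import numpy as np

def _serialize_value(value):
    # Tag each value with its original type so the loader can
    # restore numpy arrays and int64 scalars with the right dtype.
    if isinstance(value, np.ndarray):
        return {"type": "ndarray", "value": value.tolist()}
    if isinstance(value, np.int64):
        return {"type": "int64", "value": int(value)}
    return {"type": "plain", "value": value}

def _deserialize_value(entry):
    # Reverse of _serialize_value.
    if entry["type"] == "ndarray":
        return np.array(entry["value"], dtype=np.int64)
    if entry["type"] == "int64":
        return np.int64(entry["value"])
    return entry["value"]

state = {"_shards_id": np.array([0, 1], dtype=np.int64), "_counter": 7}
blob = json.dumps({k: _serialize_value(v) for k, v in state.items()})
restored = {k: _deserialize_value(v) for k, v in json.loads(blob).items()}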

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The migration from pickle to JSON is well-implemented with proper type handling for numpy arrays and int64 values. The serialization/deserialization logic correctly converts between JSON-compatible types and numpy types. Error handling is appropriate for detecting incompatible checkpoints. The change is a security improvement that removes arbitrary code execution risks. Previous review concerns about type mismatches have been addressed.
  • No files require special attention

Important Files Changed

Filename | Overview
dali/python/nvidia/dali/pipeline.py | Migrated pipeline checkpoint serialization from pickle to JSON with proper error handling for decode failures
dali/python/nvidia/dali/plugin/base_iterator.py | Migrated iterator checkpoint serialization from pickle to JSON with type-aware serialization/deserialization for numpy arrays and int64 values

Sequence Diagram

sequenceDiagram
    participant User
    participant Pipeline
    participant Iterator
    participant Checkpoint
    
    Note over User,Checkpoint: Saving Checkpoint
    User->>Pipeline: checkpoint()
    Pipeline->>Iterator: _save_state()
    Iterator->>Iterator: _serialize_value() for each field
    Note over Iterator: Convert np.int64 to int<br/>Convert np.ndarray to list
    Iterator-->>Pipeline: JSON string (iterator_data)
    Pipeline->>Pipeline: json.dumps(pipeline_data)
    Note over Pipeline: Serialize epoch_idx and iter
    Pipeline->>Checkpoint: GetSerializedCheckpoint()
    Checkpoint-->>User: Serialized checkpoint
    
    Note over User,Checkpoint: Restoring Checkpoint
    User->>Pipeline: Pipeline(checkpoint=data)
    Pipeline->>Pipeline: _restore_state_from_checkpoint()
    Pipeline->>Pipeline: json.loads(pipeline_data)
    alt JSON decode success
        Pipeline->>Pipeline: Restore epoch_idx and iter
        Pipeline->>Iterator: _restore_state(iterator_data)
        Iterator->>Iterator: json.loads(iterator_data)
        alt JSON decode success
            Iterator->>Iterator: _deserialize_value() for each field
            Note over Iterator: Convert int to np.int64<br/>Convert list to np.ndarray
            Iterator->>Iterator: setattr() for each field
        else JSON decode fails
            Iterator-->>User: ValueError (version mismatch)
        end
    else JSON decode fails
        Pipeline-->>User: ValueError (version mismatch)
    end
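The decode-failure branch in the diagram could be implemented roughly as follows (a sketch only; the attribute names _epoch_idx and _iter are assumptions, not the PR's actual code):

import json

def _restore_state_from_checkpoint(self, pipeline_data):
    try:
        state = json.loads(pipeline_data)
    except json.JSONDecodeError as e:
        raise ValueError(
            "Could not parse the checkpoint: it may have been created by "
            "an older DALI version that used pickle-based checkpoints."
        ) from e
    # Restore the simple integer fields serialized by the pipeline.
    self._epoch_idx = state["epoch_idx"]
    self._iter = state["iter"]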


@greptile-apps greptile-apps bot left a comment


2 files reviewed, 2 comments


Comment on lines 272 to 275
iterator_data = json.loads(iterator_data) # nosec B301
for field in self._checkpointed_fields():
    if hasattr(self, field):
        setattr(self, field, iterator_data[field])


Even if the serialization issue in _save_state() is fixed by converting numpy arrays to lists, this restoration code has a critical type mismatch bug.

When json.loads() deserializes the data, numpy arrays will be restored as Python lists, not numpy arrays. The code then directly assigns these lists to fields that are expected to be numpy arrays (e.g., _counter_per_gpu, _shard_sizes_per_gpu, _shards_id).

This will cause failures when the code later tries to use these fields with numpy operations. For example:

  • Line 435: np.any(self._counter_per_gpu + counter > self._shard_sizes_per_gpu) - will fail if these are lists
  • Line 502: self._counter_per_gpu = self._counter_per_gpu + self._counter - numpy array addition fails with lists
  • Line 518: self._shards_id = (self._shards_id + 1) % self._shards_num - fails if _shards_id is a list

The restoration needs to convert lists back to numpy arrays with the correct dtype:

for field in self._checkpointed_fields():
    if hasattr(self, field):
        value = iterator_data[field]
        # When reader_name is used, these fields are int64 numpy arrays;
        # json.loads restores them as plain lists, so convert them back.
        if field in ["_counter_per_gpu", "_shard_sizes_per_gpu", "_shards_id"] and isinstance(value, list):
            value = np.array(value, dtype=np.int64)
        setattr(self, field, value)

@greptile-apps

greptile-apps bot commented Jan 12, 2026

Additional Comments (1)

dali/python/nvidia/dali/plugin/base_iterator.py
This json.dumps() call will fail with a TypeError when iterator state includes numpy arrays.

The _checkpointed_fields() method returns fields including _counter_per_gpu, _shard_sizes_per_gpu, and _shards_id, which are all numpy arrays (see lines 326, 344-345). These arrays are created with dtype=np.int64 and cannot be serialized by the standard json module.

Example failure scenario:

import json
import numpy as np
data = {"_shards_id": np.array([0, 1], dtype=np.int64)}
json.dumps(data)  # TypeError: Object of type ndarray is not JSON serializable

The previous pickle.dumps() implementation handled numpy arrays correctly. To use JSON, you need to:

  1. Convert numpy arrays to lists before serialization: json.dumps({field: getattr(self, field).tolist() if isinstance(getattr(self, field), np.ndarray) else getattr(self, field) ...})
  2. Convert lists back to numpy arrays with the correct dtype after deserialization in _restore_state()

This will break any code path that uses the reader_name parameter with checkpointing, as that's when these numpy array fields are populated.
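
Point 1 above, expanded into a full method (a sketch under the assumption that _save_state() iterates _checkpointed_fields() as described elsewhere in this review; not the PR's actual code):

import json
import numpy as np

def _save_state(self):
    # Convert numpy values to JSON-compatible types before dumping.
    state = {}
    for field in self._checkpointed_fields():
        value = getattr(self, field)
        if isinstance(value, np.ndarray):
            value = value.tolist()
        elif isinstance(value, np.integer):
            value = int(value)
        state[field] = value
    return json.dumps(state)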

@greptile-apps

greptile-apps bot commented Jan 12, 2026

Greptile Overview

Greptile Summary

This PR attempts to replace pickle with JSON for pipeline checkpoint serialization to improve safety. However, the implementation has a critical bug that will cause runtime failures.

Key Issues

Critical: NumPy Array Serialization Failure

The most severe issue is in base_iterator.py. When iterators are created with the reader_name parameter, several fields are stored as NumPy arrays (_counter_per_gpu, _shard_sizes_per_gpu, _shards_id). The standard JSON encoder cannot serialize NumPy arrays, causing a TypeError at runtime when attempting to checkpoint.

Example failure path:

  1. Create iterator with reader_name="Reader"
  2. NumPy arrays are initialized (lines 327, 345-346 in base_iterator.py)
  3. Call checkpoints() method
  4. json.dumps() fails with: TypeError: Object of type ndarray is not JSON serializable

Breaking Change: No Backward Compatibility

The PR provides no migration path for existing checkpoints. Checkpoints saved with pickle (previous version) cannot be loaded with the new JSON-based code, as json.loads() will fail with JSONDecodeError when encountering pickled binary data.

What Works

The changes in pipeline.py are safer because they only serialize simple integer fields (iter and epoch_idx), which JSON handles correctly.

Recommendations

  1. Must fix: Convert NumPy arrays to lists using .tolist() before JSON serialization, or use a custom JSON encoder
  2. Should add: Backward compatibility fallback to pickle.loads() for existing checkpoints
  3. Consider: Only migrate pipeline_data to JSON while keeping iterator_data with pickle, since iterator_data contains complex NumPy types
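
Recommendation 2 could look roughly like the sketch below (the helper _load_checkpoint_data is hypothetical; note that a pickle fallback reintroduces the code-execution risk for untrusted checkpoints, so it only makes sense for trusted sources):

import json
import pickle

def _load_checkpoint_data(blob):
    # Hypothetical helper: try the new JSON format first, then fall
    # back to the legacy pickle format written by older DALI versions.
    try:
        return json.loads(blob)
    except (json.JSONDecodeError, UnicodeDecodeError):
        # Unpickling untrusted data can execute arbitrary code;
        # only use this fallback for checkpoints you trust.
        return pickle.loads(blob)  # nosec B301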

Confidence Score: 0/5

  • This PR has critical bugs that will cause runtime failures and should not be merged in its current state
  • Score of 0 reflects a critical logic error that will cause TypeErrors when checkpointing with the reader_name parameter. The existing test suite uses reader_name="Reader" extensively, so this issue will be caught by tests. Additionally, the PR breaks backward compatibility without any migration strategy.
  • dali/python/nvidia/dali/plugin/base_iterator.py requires immediate attention - the numpy array serialization issue must be fixed before merge

Important Files Changed

File Analysis

Filename | Score | Overview
dali/python/nvidia/dali/plugin/base_iterator.py | 0/5 | Replaced pickle with json for serialization, but fails to handle numpy arrays, which are not JSON-serializable; will cause a TypeError at runtime
dali/python/nvidia/dali/pipeline.py | 2/5 | Replaced pickle with json for simple integer fields (epoch_idx, iter), but breaks backward compatibility with old checkpoints

Sequence Diagram

sequenceDiagram
    participant User
    participant Iterator as DaliBaseIterator
    participant Pipeline
    participant Backend as C++ Backend
    
    Note over User,Backend: Checkpointing Flow (with reader_name)
    
    User->>Iterator: checkpoints()
    Iterator->>Iterator: _save_state()
    Note over Iterator: Collects fields:<br/>_counter (int)<br/>_counter_per_gpu (numpy array)<br/>_shard_sizes_per_gpu (numpy array)<br/>_shards_id (numpy array)<br/>_size (int)
    Iterator->>Iterator: json.dumps(fields)
    Note over Iterator: ❌ TypeError!<br/>NumPy arrays not JSON serializable
    
    Note over User,Backend: Correct Flow (should be)
    
    User->>Iterator: checkpoints()
    Iterator->>Iterator: _save_state()
    Iterator->>Iterator: Convert numpy arrays to lists
    Iterator->>Iterator: json.dumps(fields)
    Iterator-->>Pipeline: iterator_data (JSON string)
    
    Pipeline->>Pipeline: _get_checkpoint(iterator_data)
    Pipeline->>Pipeline: json.dumps({"iter": ..., "epoch_idx": ...})
    Pipeline-->>Backend: ExternalContextCheckpoint
    Note over Backend: pipeline_data (JSON string)<br/>iterator_data (JSON string)
    Backend->>Backend: GetSerializedCheckpoint()
    Backend-->>User: Checkpoint bytes
    
    Note over User,Backend: Restoration Flow
    
    User->>Pipeline: Pipeline(checkpoint=...)
    Pipeline->>Backend: RestoreFromSerializedCheckpoint()
    Backend-->>Pipeline: ExternalContextCheckpoint
    Pipeline->>Pipeline: json.loads(pipeline_data)
    Pipeline->>Iterator: _restore_state(iterator_data)
    Iterator->>Iterator: json.loads(iterator_data)
    Iterator->>Iterator: Restore fields from dict


@greptile-apps greptile-apps bot left a comment


3 files reviewed, 3 comments


@greptile-apps

greptile-apps bot commented Jan 12, 2026

Additional Comments (1)

dali/python/nvidia/dali/plugin/base_iterator.py
The change from pickle.dumps() to json.dumps() will cause a TypeError when checkpointing iterators that use the reader_name parameter. The checkpointed fields include numpy arrays (_counter_per_gpu, _shard_sizes_per_gpu, _shards_id), which cannot be serialized by the standard JSON encoder.

When reader_name is set, these fields are initialized as numpy arrays (see lines 327, 345-346):

  • self._shards_id = np.array([...], dtype=np.int64)
  • self._counter_per_gpu = np.zeros(self._shards_num, dtype=np.int64)
  • self._shard_sizes_per_gpu = self._calculate_shard_sizes(...)

Calling json.dumps() on a dict containing numpy arrays will raise:

TypeError: Object of type ndarray is not JSON serializable

To fix this, you need to either:

  1. Convert numpy arrays to lists: arr.tolist() before serialization
  2. Use a custom JSON encoder that handles numpy types
  3. Keep using pickle for iterator_data (only use JSON for pipeline_data which has simple types)
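
Option 2 might be sketched with a custom encoder like this (illustrative only; an encoder fixes just the save path, and the load path still needs an explicit conversion back to numpy arrays with the right dtype):

import json
import numpy as np

class NumpyJSONEncoder(json.JSONEncoder):
    # Teach the encoder about numpy scalars and arrays.
    def default(self, obj):
        if isinstance(obj, np.ndarray):
            return obj.tolist()
        if isinstance(obj, np.integer):
            return int(obj)
        if isinstance(obj, np.floating):
            return float(obj)
        return super().default(obj)

data = {"_shards_id": np.array([0, 1], dtype=np.int64)}
blob = json.dumps(data, cls=NumpyJSONEncoder)  # '{"_shards_id": [0, 1]}'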

@stiepan stiepan self-requested a review January 12, 2026 15:31
@stiepan stiepan self-assigned this Jan 12, 2026
@szkarpinski szkarpinski marked this pull request as draft January 12, 2026 17:10
@dali-automaton
Collaborator

CI MESSAGE: [41566518]: BUILD FAILED

Signed-off-by: Szymon Karpiński <[email protected]>
Signed-off-by: Szymon Karpiński <[email protected]>
@szkarpinski
Collaborator Author

!build

@dali-automaton
Collaborator

CI MESSAGE: [41630007]: BUILD STARTED

@szkarpinski szkarpinski marked this pull request as ready for review January 13, 2026 15:21

@greptile-apps greptile-apps bot left a comment


1 file reviewed, 1 comment


@dali-automaton
Collaborator

CI MESSAGE: [41630007]: BUILD FAILED

@dali-automaton
Collaborator

CI MESSAGE: [41630007]: BUILD PASSED

@szkarpinski
Collaborator Author

!build

@dali-automaton
Collaborator

CI MESSAGE: [41813463]: BUILD STARTED

@dali-automaton
Collaborator

CI MESSAGE: [41813463]: BUILD FAILED

Signed-off-by: Szymon Karpiński <[email protected]>
@szkarpinski
Collaborator Author

!build


@dali-automaton
Collaborator

CI MESSAGE: [42047609]: BUILD STARTED

@dali-automaton
Collaborator

CI MESSAGE: [42047609]: BUILD FAILED

@dali-automaton
Collaborator

CI MESSAGE: [42047609]: BUILD PASSED
