Have the nodes ping out their dataloader state before the all-reduce. #98

- If a node leaves by crashing, we cannot exactly recover its dataloader state.
- This forces us to skip shards manually to avoid duplicates.
- Ideally, nodes could resume automatically from a remote dataloader state.
- The dataloader state is small, so sending it should add little overhead.
- We could interleave it with the all-reduce: completing the all-reduce would validate the dataloader state as the latest.
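
A minimal sketch of the idea in the last bullet, under stated assumptions: each node records its dataloader state remotely as pending before the all-reduce, and the state is promoted to latest only once the all-reduce completes. Every name here (`DataloaderState`, `RemoteStateStore`, `training_step`, the `shard_index`/`sample_offset` fields) is hypothetical, the store is an in-memory stand-in for a remote service, and `all_reduce` is a placeholder callable rather than a real collective.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple


@dataclass(frozen=True)
class DataloaderState:
    """Hypothetical minimal state: current shard and position within it."""
    shard_index: int
    sample_offset: int


@dataclass
class RemoteStateStore:
    """In-memory stand-in for a remote store (assumption: a real one is networked)."""
    pending: Dict[str, Tuple[int, DataloaderState]] = field(default_factory=dict)
    committed: Dict[str, DataloaderState] = field(default_factory=dict)

    def ping(self, node_id: str, step: int, state: DataloaderState) -> None:
        # Record the state as pending; it is not yet trusted as the latest.
        self.pending[node_id] = (step, state)

    def commit(self, node_id: str, step: int) -> None:
        # Promote the pending state only if it belongs to the finished step.
        entry = self.pending.get(node_id)
        if entry is not None and entry[0] == step:
            self.committed[node_id] = entry[1]
            del self.pending[node_id]

    def latest(self, node_id: str) -> Optional[DataloaderState]:
        # State a restarted node would resume from.
        return self.committed.get(node_id)


def training_step(
    store: RemoteStateStore,
    node_id: str,
    step: int,
    state: DataloaderState,
    all_reduce: Callable[[], None],
) -> None:
    store.ping(node_id, step, state)   # ping out state before the all-reduce
    all_reduce()                       # raises if the node fails mid-step
    store.commit(node_id, step)        # a completed all-reduce validates the state
```

If the node crashes during the all-reduce, `commit` never runs, so `latest` still returns the state from the last completed step and no manual shard skipping is needed in this sketch.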
