Have the nodes ping out their dataloader state before the all-reduce. #98

@Jackmin801

Description

  • If a node leaves by crashing, we cannot recover its dataloader state exactly.
  • This forces us to skip shards manually to avoid duplicates.
  • Ideally, nodes would resume automatically from a remotely stored dataloader state.
  • The dataloader state is small, so sending it should not add much overhead.
  • We could interleave the ping with the all-reduce: completing the all-reduce validates the dataloader state as the latest.
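
The publish-then-validate idea in the last bullet could look roughly like the sketch below. It is only an illustration under stated assumptions: `RemoteStateStore`, `run_step`, and the `shard`/`offset` fields are hypothetical names, the store is an in-memory stand-in for whatever remote service would hold the state, and the all-reduce is passed in as a plain callable rather than a real collective.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple


@dataclass
class RemoteStateStore:
    """Hypothetical in-memory stand-in for a remote dataloader-state store."""

    pending: Dict[Tuple[str, int], dict] = field(default_factory=dict)
    committed: Dict[str, Tuple[int, dict]] = field(default_factory=dict)

    def publish(self, node_id: str, step: int, state: dict) -> None:
        # Called before the all-reduce: recorded, but not yet trusted.
        self.pending[(node_id, step)] = dict(state)

    def commit(self, node_id: str, step: int) -> None:
        # Called only after the all-reduce for `step` has completed.
        state = self.pending.pop((node_id, step))
        self.committed[node_id] = (step, state)

    def latest(self, node_id: str) -> Optional[Tuple[int, dict]]:
        # The state a restarted node would resume from.
        return self.committed.get(node_id)


def run_step(
    store: RemoteStateStore,
    node_id: str,
    step: int,
    dataloader_state: dict,
    all_reduce: Callable[[], None],
) -> None:
    store.publish(node_id, step, dataloader_state)  # ping out the state
    all_reduce()  # assumed to raise if the node or the collective fails
    store.commit(node_id, step)  # success validates the state as latest


def failing_all_reduce() -> None:
    raise RuntimeError("simulated crash during all-reduce")


store = RemoteStateStore()
run_step(store, "node-0", 1, {"shard": 3, "offset": 128}, lambda: None)
try:
    run_step(store, "node-0", 2, {"shard": 3, "offset": 256}, failing_all_reduce)
except RuntimeError:
    pass

print(store.latest("node-0"))  # (1, {'shard': 3, 'offset': 128})
```

In this sketch the state published for step 2 stays pending because its all-reduce never finished, so a restarted node would resume from the step 1 state. Whether uncommitted states should be discarded or retried is left open here.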
