- If a node leaves by crashing, we cannot recover its exact dataloader state.
- This forces us to skip shards manually to avoid duplicates.
- Ideally, nodes could resume automatically from a dataloader state stored remotely.
- The dataloader state is small, so saving it remotely should add little overhead.
- We could interleave the save with the all-reduce: completing the all-reduce would then validate the saved dataloader state as the latest.
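
The last point can be sketched as a stage-then-commit pattern. This is a minimal illustration under assumptions, not an existing implementation: `RemoteStateStore`, `train_step`, and the `all_reduce` callable are hypothetical names, the store is an in-memory stand-in for a real remote service, and the state fields (`shard`, `offset`) are made up for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class RemoteStateStore:
    """Hypothetical in-memory stand-in for a remote key-value store."""
    staged: Dict[int, dict] = field(default_factory=dict)     # rank -> pending state
    committed: Dict[int, dict] = field(default_factory=dict)  # rank -> last validated state

    def stage(self, rank: int, state: dict) -> None:
        # Upload the state, but do not treat it as valid yet.
        self.staged[rank] = dict(state)

    def commit(self, rank: int) -> None:
        # Promote the staged state once the step is known to have completed.
        self.committed[rank] = self.staged.pop(rank)

    def latest(self, rank: int) -> Optional[dict]:
        # State a restarted node should resume from, or None if none was committed.
        return self.committed.get(rank)


def train_step(rank: int, step: int, loader_state: dict,
               store: RemoteStateStore, all_reduce: Callable[[], None]) -> None:
    # Stage the dataloader state before the collective runs.
    store.stage(rank, {"step": step, **loader_state})
    # Assumption: all_reduce raises if a peer crashes mid-step.
    all_reduce()
    # Only a completed all-reduce validates the staged state as the latest.
    store.commit(rank)


def failing_all_reduce() -> None:
    raise RuntimeError("peer crashed")


store = RemoteStateStore()
train_step(0, 1, {"shard": 3, "offset": 128}, store, all_reduce=lambda: None)

try:
    train_step(0, 2, {"shard": 3, "offset": 256}, store, all_reduce=failing_all_reduce)
except RuntimeError:
    pass

# The step-2 state was staged but never committed, so resume uses step 1.
resume_from = store.latest(0)
```

Because the commit happens only after the all-reduce returns, a crash mid-step leaves the previous committed state in place, and a restarted node resumes from the last step every peer finished.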