This repository was archived by the owner on Nov 3, 2023. It is now read-only.

FSDP Issues Tracker #4518

@Rebecca-Qian

Description

Tracking known issues during training with FSDP.

  • Resizing embedding dimensions fails in distributed training
    • Behavior: Throws an exception about embedding sizes being out of bounds
    • Repro: Train a model with --ddp-backend zero2 and with --special-tok-lst set
  • T5 model parallel is incompatible with the zero2 ddp-backend (possibly this also affects other HuggingFace agents?)
    • Behavior: The thread seems to hang indefinitely
    • Repro: Train a model with --t5-model-parallel and --ddp-backend zero2
  • FiD does not work with FSDP when the batch size is > 1 (see "Cannot train Seeker with batch size > 1", #4531)
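
The two repro recipes above can be sketched as command lines. This is only a rough sketch: the issue names just the flags --ddp-backend zero2, --special-tok-lst and --t5-model-parallel. The `parlai multiprocessing_train` entry point, the MODEL and TASK placeholders, the token values and the explicit `true` passed to --t5-model-parallel are all assumptions made for illustration, so substitute your own setup. The helper `repro_command` is hypothetical and only assembles the argument lists; it does not launch training.

```python
import shlex

# The only flag shared by both repros, as quoted in the issue.
ZERO2 = ["--ddp-backend", "zero2"]


def repro_command(extra_flags, model="MODEL", task="TASK"):
    """Assemble a training command line as a list of arguments.

    Assumptions (not stated in the issue): the `parlai multiprocessing_train`
    entry point and the MODEL / TASK placeholders.
    """
    base = ["parlai", "multiprocessing_train", "--model", model, "--task", task]
    return base + ZERO2 + list(extra_flags)


# Issue 1: resizing embeddings. The token values are hypothetical.
embedding_repro = repro_command(["--special-tok-lst", "TOKEN_A,TOKEN_B"])

# Issue 2: T5 model parallel. The explicit "true" value is an assumption.
t5_repro = repro_command(["--t5-model-parallel", "true"])

# Print shell-quoted commands that can be copied and adapted.
for cmd in (embedding_repro, t5_repro):
    print(shlex.join(cmd))
```

Building the commands as lists keeps the flags from the issue separate from the placeholders, which makes it easier to swap in a real model and task before running anything.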
