Speed benchmarks vs FSDP2 #3
Are the benchmarks conducted against FSDP or FSDP2? I'd like to see the speed/memory differences.

Comments
The benchmarks were conducted against FSDP1; we used an early version of PyTorch 2.3.0 (from November 2023) in our experiments.
@antony-frolov Exciting project! Maybe it would help if you published some absolute performance numbers like tokens per second? I think right now I see % speedups only. (Also, FSDP2 does extra device copies compared to FSDP1/YaFSDP, so we would not really expect FSDP2 to be faster.)
@awgu thanks! Just added absolute iteration time numbers for all the runs; hope that helps. Though, since the measurements were done in a fairly vanilla distributed training setup (mostly for ease of reproducibility), the absolute numbers might not look too convincing compared to frameworks more heavily optimized for LLM pre-training.
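For context, here is a minimal sketch of how absolute per-iteration timings and tokens-per-second numbers like these can be collected in a PyTorch distributed run. This is not the repo's actual benchmark harness: `model`, `batches`, and `optimizer` are hypothetical stand-ins for the real training setup, and the token count assumes HuggingFace-style batches with an `input_ids` field.

```python
import time
import torch
import torch.distributed as dist

def benchmark_iterations(model, batches, optimizer, warmup=5, iters=20):
    """Mean wall-clock time per training step on this rank (hypothetical helper)."""
    times = []
    for i, batch in enumerate(batches):
        if i == warmup + iters:
            break
        torch.cuda.synchronize()      # drain previously queued GPU work before timing
        start = time.perf_counter()

        loss = model(**batch).loss    # assumes a HF-style model that returns .loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        torch.cuda.synchronize()      # include this step's GPU work in the measurement
        if i >= warmup:               # discard warmup steps (caching, allocator churn)
            times.append(time.perf_counter() - start)

    mean_s = sum(times) / len(times)
    tokens = batch["input_ids"].numel()   # tokens processed per step on this device
    print(f"rank {dist.get_rank()}: {mean_s * 1e3:.1f} ms/iter, "
          f"{tokens / mean_s:,.0f} tokens/s per device")
```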
@awgu here are the traces of Llama 2 34B runs on 256 devices with a sequence length of 2048, for both FSDP and YaFSDP (these are the runs we compare in the "Advantages over FSDP" section of the README). llama-2-34b_256_2048_ya-fsdp.json
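For anyone wanting to capture comparable traces themselves, a short sketch using `torch.profiler` follows. This may not be exactly how the attached JSON was produced; `batches` and `train_step` are hypothetical placeholders for the training loop.

```python
import torch.distributed as dist
from torch.profiler import ProfilerActivity, profile, schedule

# Export a Chrome-trace JSON (viewable at chrome://tracing or ui.perfetto.dev)
# covering a few steady-state training steps on this rank.
def on_trace_ready(prof):
    prof.export_chrome_trace(f"trace_rank{dist.get_rank()}.json")

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=2, active=3),  # skip 1 step, warm up 2, record 3
    on_trace_ready=on_trace_ready,
) as prof:
    for batch in batches:     # hypothetical training loop
        train_step(batch)     # hypothetical: forward/backward/optimizer step
        prof.step()           # advance the profiler schedule each iteration
```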