[fujii] Reading notes 2024 #3

Open
okoge-kaz opened this issue Aug 15, 2024 · 5 comments

okoge-kaz commented Aug 15, 2024

Motivation: I want to build functionality similar to NCCLX, as described in Sections 3.3.3 (Collective Communication) and 3.3.4 (Reliability and Operational Challenges) of the Llama 3 paper.

@okoge-kaz okoge-kaz self-assigned this Aug 15, 2024
@okoge-kaz

On the performance side, the Llama 3 paper says the following:

The original NCCL collectives—all-gather and reduce-scatter in FSDP, and point-to-point in PP—require data chunking and staged data copy. This approach incurs several inefficiencies, including (1) requiring a large number of small control messages to be exchanged over the network to facilitate data transfer, (2) extra memory-copy operations, and (3) using extra GPU cycles for communication

I want to understand what exactly this refers to.
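To make the three inefficiencies in the quoted passage concrete, here is a toy model in plain Python. It is only a sketch and does not reflect NCCL's actual protocol: the function names, the two-control-messages-per-chunk handshake, and the copy counts are all assumptions chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class TransferStats:
    control_messages: int = 0  # small handshake messages (hypothetical count)
    memory_copies: int = 0     # extra buffer-to-buffer copies


def staged_chunked_send(payload: bytes, chunk_size: int) -> tuple[bytes, TransferStats]:
    """Toy model of a chunked transfer that goes through staging buffers.

    Assumption: each chunk costs two control messages ("ready" and "done")
    and two copies (into the sender staging buffer, out of the receiver one).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    stats = TransferStats()
    received = bytearray()
    for offset in range(0, len(payload), chunk_size):
        staging = bytes(payload[offset:offset + chunk_size])  # copy into staging
        stats.memory_copies += 1
        stats.control_messages += 2                           # per-chunk handshake
        received.extend(staging)                              # copy out of staging
        stats.memory_copies += 1
    return bytes(received), stats


def direct_send(payload: bytes) -> tuple[bytes, TransferStats]:
    """Toy model of a transfer with no staging: one control message, no extra copy."""
    return payload, TransferStats(control_messages=1, memory_copies=0)
```

In this model the overhead grows with the number of chunks, which matches points (1) and (2) of the quote; point (3), the GPU cycles spent on communication, is not modeled here.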

@okoge-kaz

Article on PyTorch 2:

https://dl.acm.org/doi/abs/10.1145/3620665.3640366

@okoge-kaz

The Llama 3 paper cites the article above, but that article does not mention PyTorch's built-in NCCL flight recorder.
https://discuss.pytorch.org/t/pytorch-nccl-flight-recorder/207410

@okoge-kaz

I have a feeling this may turn into a PyTorch reading group rather than an NCCL reading group, though...

https://github.com/pytorch/pytorch/releases

@okoge-kaz

Is the PyTorch flight recorder the feature, included since PyTorch 2.3.0(?), that tells you which rank a problem occurred on?
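As far as I can tell from the PyTorch documentation, the flight recorder is turned on through `TORCH_NCCL_*` environment variables set before the process group is created. The sketch below only builds and applies those variables. The helper name `flight_recorder_env` is hypothetical, the buffer size and dump path prefix are assumed example values, and which PyTorch version first supports each variable is not confirmed here.

```python
import os


def flight_recorder_env(buffer_size: int = 2000,
                        dump_prefix: str = "/tmp/nccl_trace_rank_") -> dict:
    """Build environment variables for the NCCL flight recorder.

    Hypothetical helper; the default values are assumptions for illustration.
    """
    return {
        # Number of collective entries kept in the ring buffer (0 disables recording).
        "TORCH_NCCL_TRACE_BUFFER_SIZE": str(buffer_size),
        # Dump the recorded trace when a collective times out.
        "TORCH_NCCL_DUMP_ON_TIMEOUT": "1",
        # Path prefix for the dump files; the rank is appended to it.
        "TORCH_NCCL_DEBUG_INFO_TEMP_FILE": dump_prefix,
    }


# Must run before torch.distributed.init_process_group(...) in each rank.
os.environ.update(flight_recorder_env())
```

With one dump file per rank, comparing the last recorded collective across ranks is how one would look for the rank that stalled.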
