
How to understand training steps in EPL: single-machine single-GPU vs. single-machine multi-GPU #30

Open

SueeH opened this issue Sep 20, 2023 · 1 comment

Comments

@SueeH
SueeH commented Sep 20, 2023

Single machine, single GPU:
Launch command: TF_CONFIG='{"cluster":{"worker":["127.0.0.1:49119"]},"task":{"type":"worker","index":0}}' CUDA_VISIBLE_DEVICES=0 bash ./scripts/train_dp.sh
[screenshot: single-GPU training log]

Single machine, two GPUs:
Launch command: TF_CONFIG='{"cluster":{"worker":["127.0.0.1:49119"]},"task":{"type":"worker","index":0}}' CUDA_VISIBLE_DEVICES=0,1 bash ./scripts/train_dp.sh
[screenshot: two-GPU training log]

I modified the code a bit: removed the last_step limit, set the dataset to repeat=10, and renamed the file from .txt to .py so it runs directly (a rough sketch of that change is below).
resnet_dp.txt
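
For reference, a minimal tf.data sketch of the modification described above; the numbers and pipeline are hypothetical stand-ins, not the actual contents of resnet_dp.py:

```python
import tensorflow as tf

# Hypothetical stand-in for the input pipeline: repeat the dataset
# 10 times and batch it, with no last_step cutoff on training.
dataset = tf.data.Dataset.range(640)  # placeholder training samples
dataset = dataset.repeat(10)          # the repeat=10 change
dataset = dataset.batch(64)           # per-GPU batch size
```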

Could you help me understand this? Did each GPU run 10 steps separately?

@SueeH SueeH changed the title from "epl single-GPU memory usage reduction and" to "How to understand training steps in EPL: single-machine single-GPU vs. single-machine multi-GPU" Sep 20, 2023
@adoda
Collaborator

adoda commented Apr 24, 2024

The batch_size you configure is the batch size per GPU, so global_batch_size = batch_size * gpu_num.
With the amount of data fixed, increasing the number of GPUs makes the number of steps per epoch decrease linearly.
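
A quick sketch of the arithmetic with hypothetical numbers (just to illustrate the scaling, not taken from the script):

```python
# Steps per epoch under data parallelism: the configured batch_size is
# per GPU, so the global batch grows with the number of GPUs while the
# dataset size stays fixed.
dataset_size = 1280  # hypothetical number of training samples
batch_size = 64      # per-GPU batch size

for gpu_num in (1, 2):
    global_batch_size = batch_size * gpu_num
    steps_per_epoch = dataset_size // global_batch_size
    print(gpu_num, "GPU(s):", steps_per_epoch, "steps per epoch")
# 1 GPU(s): 20 steps per epoch
# 2 GPU(s): 10 steps per epoch
```

In data parallelism each step is executed by every GPU on a different slice of the global batch, so the reported step count is shared across GPUs rather than run separately per GPU.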
