
How to understand training steps in EPL: single-machine single-GPU vs. single-machine multi-GPU #30

Open

SueeH opened this issue Sep 20, 2023 · 1 comment

Comments

@SueeH
SueeH commented Sep 20, 2023

Single machine, single GPU:
Launch command: TF_CONFIG='{"cluster":{"worker":["127.0.0.1:49119"]},"task":{"type":"worker","index":0}}' CUDA_VISIBLE_DEVICES=0 bash ./scripts/train_dp.sh
[screenshot: single-GPU training log]

Single machine, two GPUs:
Launch command: TF_CONFIG='{"cluster":{"worker":["127.0.0.1:49119"]},"task":{"type":"worker","index":0}}' CUDA_VISIBLE_DEVICES=0,1 bash ./scripts/train_dp.sh
[screenshot: two-GPU training log]

I modified the code a bit: removed the last_step limit, set the dataset to repeat=10, and renamed the file from .txt to .py so it runs directly (a rough sketch of that change is below).
resnet_dp.txt
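
For reference, a minimal tf.data sketch of the modification described above; the numbers and pipeline are hypothetical stand-ins, not the actual contents of resnet_dp.py:

```python
import tensorflow as tf

# Hypothetical stand-in for the input pipeline: repeat the dataset
# 10 times and batch it, with no last_step cutoff on training.
dataset = tf.data.Dataset.range(640)  # placeholder training samples
dataset = dataset.repeat(10)          # the repeat=10 change
dataset = dataset.batch(64)           # per-GPU batch size
```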

Could you help me understand this? Did each GPU run 10 steps separately?

@SueeH SueeH changed the title from "epl single-GPU memory usage reduction and" to "How to understand training steps in EPL: single-machine single-GPU vs. single-machine multi-GPU" Sep 20, 2023
@adoda
Collaborator

adoda commented Apr 24, 2024

The batch_size you configure is the batch size per GPU, so global_batch_size = batch_size * gpu_num.
With the amount of data fixed, increasing the number of GPUs makes the number of steps per epoch decrease linearly.
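
A quick sketch of the arithmetic with hypothetical numbers (just to illustrate the scaling, not taken from the script):

```python
# Steps per epoch under data parallelism: the configured batch_size is
# per GPU, so the global batch grows with the number of GPUs while the
# dataset size stays fixed.
dataset_size = 1280  # hypothetical number of training samples
batch_size = 64      # per-GPU batch size

for gpu_num in (1, 2):
    global_batch_size = batch_size * gpu_num
    steps_per_epoch = dataset_size // global_batch_size
    print(gpu_num, "GPU(s):", steps_per_epoch, "steps per epoch")
# 1 GPU(s): 20 steps per epoch
# 2 GPU(s): 10 steps per epoch
```

In data parallelism each step is executed by every GPU on a different slice of the global batch, so the reported step count is shared across GPUs rather than run separately per GPU.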
