Single machine, single GPU:
Launch command: TF_CONFIG='{"cluster":{"worker":["127.0.0.1:49119"]},"task":{"type":"worker","index":0}}' CUDA_VISIBLE_DEVICES=0 bash ./scripts/train_dp.sh

Single machine, two GPUs:
Launch command: TF_CONFIG='{"cluster":{"worker":["127.0.0.1:49119"]},"task":{"type":"worker","index":0}}' CUDA_VISIBLE_DEVICES=0,1 bash ./scripts/train_dp.sh

Code changes: removed the last_step limit, set dataset repeat=10, and renamed the .txt to .py so it can be executed. resnet_dp.txt

Question: how should this behavior be interpreted? Did each GPU run its own 10 steps?
The batch_size currently configured is the per-GPU batch size, so global_batch_size = batch_size * gpu_num. With the dataset size unchanged, increasing the number of GPUs linearly decreases the number of steps per epoch.
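The relationship above can be sketched with a few lines of arithmetic. All numbers below (dataset size, per-GPU batch size) are hypothetical, chosen only to illustrate the scaling:

```python
# Illustrates how the per-GPU batch_size relates to global_batch_size
# and steps per epoch under data parallelism.
# All concrete numbers are hypothetical, for illustration only.

def steps_per_epoch(dataset_size, per_gpu_batch_size, gpu_num):
    # Global batch = per-GPU batch * number of GPUs.
    global_batch_size = per_gpu_batch_size * gpu_num
    # Each step consumes one global batch, so steps per epoch shrink
    # linearly as GPUs are added.
    return dataset_size // global_batch_size

# Same dataset and per-GPU batch: doubling the GPUs halves the steps.
print(steps_per_epoch(12800, 64, 1))  # 200 steps with 1 GPU
print(steps_per_epoch(12800, 64, 2))  # 100 steps with 2 GPUs
```

This is why, with repeat=10 on a small dataset, the two-GPU run finishes in half as many steps as the single-GPU run: each GPU processes a different shard of every global batch, rather than each GPU running its own independent 10 steps.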