raise KeyError(key) from None KeyError: 'RANK' #48

Open
ybsu opened this issue Jul 30, 2023 · 2 comments

ybsu commented Jul 30, 2023

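I ran the training command for the VATEX part and got the error below. From what I found online, manually assigning os.environ['RANK'] gets past it, but then a later line fails with a KeyError on os.environ['WORLD_SIZE']. I suspect the problem is not that simple and I'm stuck. Could someone please help? Getting the program to run is the first step. Thanks.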

    File "src/tasks/run_caption_VidSwinBert.py", line 689, in <module>
      main(args)
    File "src/tasks/run_caption_VidSwinBert.py", line 675, in main
      args, vl_transformer, optimizer, scheduler = mixed_precision_init(args, vl_transformer)
    File "src/tasks/run_caption_VidSwinBert.py", line 105, in mixed_precision_init
      model, optimizer, _, _ = deepspeed.initialize(
    File "/home/bwang/anaconda3/envs/qysu_vc/lib/python3.8/site-packages/deepspeed/__init__.py", line 129, in initialize
      dist.init_distributed(dist_backend=dist_backend, dist_init_required=dist_init_required)
    File "/home/bwang/anaconda3/envs/qysu_vc/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 592, in init_distributed
      init_deepspeed_backend(get_accelerator().communication_backend_name(), timeout, init_method)
    File "/home/bwang/anaconda3/envs/qysu_vc/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 148, in init_deepspeed_backend
      rank = int(os.environ["RANK"])
    File "/home/bwang/anaconda3/envs/qysu_vc/lib/python3.8/os.py", line 675, in __getitem__
      raise KeyError(key) from None
    KeyError: 'RANK'
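
For a single-GPU run, the workaround described above (assigning os.environ['RANK'] by hand) can be extended to the other variables that a distributed launcher would normally export. The following is only a minimal sketch, assuming one process on one machine. The helper name `set_single_process_env` and the address and port values are placeholders chosen for illustration, not something taken from this repository or from DeepSpeed.

```python
import os

# Assumed values for one process on one machine; a real launcher
# would export these per process. Address and port are placeholders.
SINGLE_PROCESS_DEFAULTS = {
    "RANK": "0",
    "LOCAL_RANK": "0",
    "WORLD_SIZE": "1",
    "MASTER_ADDR": "127.0.0.1",
    "MASTER_PORT": "29500",
}


def set_single_process_env(env=os.environ):
    """Fill in each variable only if it is missing, so launcher values win."""
    for key, value in SINGLE_PROCESS_DEFAULTS.items():
        env.setdefault(key, value)
    return env
```

Calling `set_single_process_env()` before `deepspeed.initialize(...)` should avoid the two KeyErrors mentioned in the post. Whether the rest of the training script then runs on a single process is not something this sketch can guarantee.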

@Accept-AI

Hello! Did you manage to get it running? Was the problem solved?

a7f4123 commented Apr 17, 2024

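I'm exploring multi-GPU training. As for "RANK" and "WORLD_SIZE", all I can say is that they are two parameters required for multi-GPU training.
The source code that reads them usually looks like this: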

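    import os
    import torch

    # The function name is illustrative; the snippet was posted as a fragment.
    def init_distributed_mode(args):
        # A distributed launcher (e.g. torchrun) exports these variables.
        if 'RANK' in os.environ and 'WORLD_SIZE' in os.environ:
            args.rank = int(os.environ['RANK'])
            args.world_size = int(os.environ['WORLD_SIZE'])
            args.gpu = int(os.environ['LOCAL_RANK'])
        # Under SLURM, take the rank from the process ID instead.
        elif 'SLURM_PROCID' in os.environ:
            args.rank = int(os.environ['SLURM_PROCID'])
            args.gpu = args.rank % torch.cuda.device_count()
        # Neither is set: fall back to single-process mode.
        else:
            print('Not using distributed mode')
            args.distributed = False
            return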

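That's the only clue I can offer (laughing through tears). I only found this thread because I was searching for how to fix these two keys being missing from the environment.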
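
The check in the snippet above can also be turned into a small diagnostic that reports which variables are absent before any library raises a bare KeyError. This is a sketch under assumptions: the names `REQUIRED_VARS`, `missing_dist_vars`, and `describe_launch` are hypothetical, and the list of variables is the three read in the snippet, not an exhaustive list of what every backend needs.

```python
import os

# Variables read by the snippet above; assumed, not exhaustive.
REQUIRED_VARS = ("RANK", "WORLD_SIZE", "LOCAL_RANK")


def missing_dist_vars(env=os.environ):
    """Return the required variables absent from env, in a fixed order."""
    return [name for name in REQUIRED_VARS if name not in env]


def describe_launch(env=os.environ):
    """Summarize whether the process looks like a distributed worker."""
    missing = missing_dist_vars(env)
    if missing:
        return "not launched in distributed mode; missing: " + ", ".join(missing)
    return "rank {} of {} (local rank {})".format(
        env["RANK"], env["WORLD_SIZE"], env["LOCAL_RANK"]
    )
```

Printing `describe_launch()` at the top of a training script shows whether it was started through a launcher that exports these variables (torchrun does, for example) or invoked directly with `python`, which leaves them unset.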
