duplicated calculation in gradient #91

Hi, I tried to run the code following the Usage section of this repo:
```python
import os

import torch
import torch.backends.cudnn as cudnn

# parse_args, get_loader, get_model, synchronize and SelfSupervisedLearner
# are defined or imported elsewhere in my script (not shown here).

args = parse_args()
num_gpus = int(os.environ["WORLD_SIZE"]) if "WORLD_SIZE" in os.environ else 1
args.num_gpus = num_gpus
args.distributed = num_gpus > 1
if torch.cuda.is_available():
    cudnn.benchmark = False
    args.device = "cuda"
else:
    args.distributed = False
    args.device = "cpu"
if args.distributed:
    torch.cuda.set_device(args.local_rank)
    torch.distributed.init_process_group(backend="nccl", init_method="env://")
    synchronize()

train_loader = get_loader(args=args)

model = get_model(args)
learner = SelfSupervisedLearner(
    model,
    image_size=480,
    hidden_layer='module.avgpool',
    projection_size=256,
    projection_hidden_size=args.hidden_size,
    moving_average_decay=0.99,
)

opt = torch.optim.Adam(learner.parameters(), lr=3e-4)

# create the output directory if it does not exist yet
os.makedirs(args.model_dir, exist_ok=True)

for _ in range(args.epochs):
    for idx, images in enumerate(train_loader):
        if torch.cuda.is_available():
            images = images.cuda(non_blocking=True)
        loss = learner(images)
        opt.zero_grad()
        loss.backward()
        opt.step()
        learner.update_moving_average()  # update moving average of target encoder

# save your improved network
torch.save(model.state_dict(), './improved-net.pt')
```

However, when I run this code with distributed training, the following error message is printed repeatedly during `backward()`:

```
Traceback (most recent call last):
  File "/home/cvlab/keonhee/byol-pytorch/train_custom.py", line 131, in <module>
    loss.backward()
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [2048]] is at version 7; expected version 6 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
```

I tried replacing `detach()` with `detach().clone()` in byol_pytorch.py, but I got the same error. Even with `torch.autograd.set_detect_anomaly(True)` enabled, I could not identify the cause. Could you let me know which part of this code triggers the problem? Thanks in advance.
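
For context, this class of error can be reproduced without any distributed setup. The sketch below is a toy example, not the code path in byol_pytorch.py; the names `run`, `w`, `h`, and `y` are made up for illustration. It shows autograd rejecting a backward pass after a tensor it saved during the forward pass is changed in place, while the out-of-place equivalent goes through.

```python
import torch


def run(inplace: bool) -> str:
    """Run a tiny forward/backward pass and return the error text ("" if none)."""
    w = torch.ones(3, requires_grad=True)
    h = w * 2.0   # intermediate (non-leaf) tensor
    y = h * h     # the multiplication saves h for its backward pass

    if inplace:
        h.add_(1.0)   # bumps h's version counter after it was saved
    else:
        h = h + 1.0   # creates a new tensor; the saved h is untouched

    try:
        y.sum().backward()
    except RuntimeError as err:
        return str(err)
    return ""


if __name__ == "__main__":
    print("in-place:    ", run(inplace=True) or "no error")
    print("out-of-place:", run(inplace=False) or "no error")
```

In distributed runs, one frequently reported trigger is calling a `DistributedDataParallel`-wrapped module more than once before `backward()` while it contains BatchNorm layers, since DDP broadcasts module buffers at the start of each forward when `broadcast_buffers=True` (the default). Whether that is what happens in the script above is unverified.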
