Running with 2 GPUs on a cluster: RuntimeError: Caught RuntimeError in replica 0 on device 0

Hello, I tried modifying a ResNet50 from torchvision. The way I did it is as follows:

I modified the second-to-last layer, which is an avgpool layer, like this:

model.avgpool = nn.Sequential(some_building_block())

where “some_building_block” is a block from a different model. Its dimensions are adjusted to match the output of the final conv layer (which comes right before the avgpool layer I replace). Additionally, I adjusted the final fc layer so the dimensions match:

model.fc = nn.Linear(512, 8)
Now, when I test this locally with a small batch of 8 images, both forward() and inference work.
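
For reference, the whole setup roughly looks like this (a simplified sketch; SomeBuildingBlock is just a placeholder, since the real block comes from another model):

import torch
import torch.nn as nn
from torchvision import models

class SomeBuildingBlock(nn.Module):
    # Placeholder for the block taken from the other model. It maps the
    # 2048-channel output of ResNet50's last conv stage to 512 features,
    # so the new fc layer sees the expected input size.
    def __init__(self, in_channels=2048, out_features=512):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.reduce = nn.Linear(in_channels, out_features)

    def forward(self, x):
        x = self.pool(x).flatten(1)   # (N, 2048, H, W) -> (N, 2048)
        return self.reduce(x)         # (N, 2048) -> (N, 512)

model = models.resnet50()
model.avgpool = nn.Sequential(SomeBuildingBlock())
model.fc = nn.Linear(512, 8)

# Local smoke test with a batch of 8 images
out = model(torch.randn(8, 3, 224, 224))
print(out.shape)  # torch.Size([8, 8])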

However, when I try to run this exact same network on an external cluster where I have access to 2 GPUs, I get the following error:

Traceback (most recent call last):
  File "main.py", line 654, in <module>
    main()
  File "main.py", line 470, in main
    train(train_loader, model, criterion, optimizer, epoch, log_training, tf_writer)
  File "main.py", line 520, in train
    output = model(input_var)
  File "/home/wz052793/anaconda3/envs/det/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/wz052793/anaconda3/envs/det/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 161, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/wz052793/anaconda3/envs/det/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 171, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/wz052793/anaconda3/envs/det/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/home/wz052793/anaconda3/envs/det/lib/python3.8/site-packages/torch/_utils.py", line 428, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/wz052793/anaconda3/envs/det/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/wz052793/anaconda3/envs/det/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/rwthfs/rz/cluster/work/wz052793/temporal-shift-module/ops/models.py", line 280, in forward
    base_out = self.new_fc(base_out)
  File "/home/wz052793/anaconda3/envs/det/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/wz052793/anaconda3/envs/det/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 93, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/wz052793/anaconda3/envs/det/lib/python3.8/site-packages/torch/nn/functional.py", line 1690, in linear
    ret = torch.addmm(bias, input, weight.t())
RuntimeError: mat1 dim 1 must match mat2 dim 0

When running on the cluster, I also wrap the model in DataParallel like this:

model = torch.nn.DataParallel(model, device_ids=args.gpus).cuda()

I already looked into some similar issues and found one where they say the model is distributed over CPU and GPU, which causes the error. However, when I run the unmodified ResNet with the exact same code, I do not get this issue.
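
For what it’s worth, a quick way to check whether any parameters or buffers ended up on a different device would be something like this (assuming model is the network from above):

print({p.device for p in model.parameters()})  # should be a single device
print({b.device for b in model.buffers()})     # e.g. {device(type='cuda', index=0)}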

One of the biggest pains is that I cannot properly debug this code, since I have to submit it as a “batch job” and wait until it runs. That’s why I hope you guys can help me.

The error message points towards a shape mismatch:

RuntimeError: mat1 dim 1 must match mat2 dim 0

so you could add print statements to the model’s forward method and check which operation fails and what the shapes of the tensors involved are.
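
Something like this could work (illustrated on a plain torchvision ResNet50 forward, since I don’t know your exact model; adapt it to wherever your forward is defined):

import torch
from torchvision import models

model = models.resnet50()  # replace with your modified model

def debug_forward(m, x):
    # Mirrors torchvision's ResNet forward and prints shapes after each stage.
    x = m.conv1(x); x = m.bn1(x); x = m.relu(x); x = m.maxpool(x)
    x = m.layer1(x); x = m.layer2(x); x = m.layer3(x); x = m.layer4(x)
    print("after layer4 :", x.shape)
    x = m.avgpool(x)
    print("after avgpool:", x.shape)
    x = torch.flatten(x, 1)
    print("before fc    :", x.shape, "| fc expects in_features =", m.fc.in_features)
    return m.fc(x)

debug_forward(model, torch.randn(2, 3, 224, 224))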

That was indeed the issue. Everything works now.

Thanks a lot, ptrblck!