Running with 2 GPUs on a cluster: RuntimeError: Caught RuntimeError in replica 0 on device 0

Hello, I tried modifying a ResNet50 from torchvision. The way I did it is as follows:

I modified the second-to-last layer, which is an avgpool layer, like this:

model.avgpool = nn.Sequential(some_building_block())

where “some_building_block” is a block from a different model. Its dimensions are adjusted to match the output of the final conv layer (which comes right before the avgpool layer I replace). Additionally, I adjusted the final fc layer so the dimensions match:

model.fc = nn.Linear(512, 8)
Now, when I test this locally with a small batch of 8 images, both forward() and inference work.
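
For reference, the whole setup roughly looks like this (a simplified sketch; SomeBuildingBlock is just a placeholder, since the real block comes from another model):

import torch
import torch.nn as nn
from torchvision import models

class SomeBuildingBlock(nn.Module):
    # Placeholder for the block taken from the other model. It maps the
    # 2048-channel output of ResNet50's last conv stage to 512 features,
    # so the new fc layer sees the expected input size.
    def __init__(self, in_channels=2048, out_features=512):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.reduce = nn.Linear(in_channels, out_features)

    def forward(self, x):
        x = self.pool(x).flatten(1)   # (N, 2048, H, W) -> (N, 2048)
        return self.reduce(x)         # (N, 2048) -> (N, 512)

model = models.resnet50()
model.avgpool = nn.Sequential(SomeBuildingBlock())
model.fc = nn.Linear(512, 8)

# Local smoke test with a batch of 8 images
out = model(torch.randn(8, 3, 224, 224))
print(out.shape)  # torch.Size([8, 8])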

However, when I try to run this exact same network on an external cluster where I have access to 2 GPUs, I get the following error:

Traceback (most recent call last):
  File "main.py", line 654, in <module>
    main()
  File "main.py", line 470, in main
    train(train_loader, model, criterion, optimizer, epoch, log_training, tf_writer)
  File "main.py", line 520, in train
    output = model(input_var)
  File "/home/wz052793/anaconda3/envs/det/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/wz052793/anaconda3/envs/det/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 161, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/wz052793/anaconda3/envs/det/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 171, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/wz052793/anaconda3/envs/det/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/home/wz052793/anaconda3/envs/det/lib/python3.8/site-packages/torch/_utils.py", line 428, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/wz052793/anaconda3/envs/det/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/wz052793/anaconda3/envs/det/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/rwthfs/rz/cluster/work/wz052793/temporal-shift-module/ops/models.py", line 280, in forward
    base_out = self.new_fc(base_out)
  File "/home/wz052793/anaconda3/envs/det/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/wz052793/anaconda3/envs/det/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 93, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/wz052793/anaconda3/envs/det/lib/python3.8/site-packages/torch/nn/functional.py", line 1690, in linear
    ret = torch.addmm(bias, input, weight.t())
RuntimeError: mat1 dim 1 must match mat2 dim 0

When running on the cluster, I also wrap the model in DataParallel like this:

model = torch.nn.DataParallel(model, device_ids=args.gpus).cuda()

I already looked into some similar issues and found one where they say the model is distributed over CPU and GPU, which causes the error. However, when I run the unmodified ResNet with the exact same code, I do not get this issue.
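
For what it’s worth, a quick way to check whether any parameters or buffers ended up on a different device would be something like this (assuming model is the network from above):

print({p.device for p in model.parameters()})  # should be a single device
print({b.device for b in model.buffers()})     # e.g. {device(type='cuda', index=0)}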

One of the biggest pains is that I cannot properly debug this code, since I have to submit it as a “batch job” and wait until it runs. That’s why I hope you guys can help me.

The error message points towards a shape mismatch:

RuntimeError: mat1 dim 1 must match mat2 dim 0

so you could add print statements to the model’s forward method and check which operation fails and what the shapes of the tensors involved are.
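
Something like this could work (illustrated on a plain torchvision ResNet50 forward, since I don’t know your exact model; adapt it to wherever your forward is defined):

import torch
from torchvision import models

model = models.resnet50()  # replace with your modified model

def debug_forward(m, x):
    # Mirrors torchvision's ResNet forward and prints shapes after each stage.
    x = m.conv1(x); x = m.bn1(x); x = m.relu(x); x = m.maxpool(x)
    x = m.layer1(x); x = m.layer2(x); x = m.layer3(x); x = m.layer4(x)
    print("after layer4 :", x.shape)
    x = m.avgpool(x)
    print("after avgpool:", x.shape)
    x = torch.flatten(x, 1)
    print("before fc    :", x.shape, "| fc expects in_features =", m.fc.in_features)
    return m.fc(x)

debug_forward(model, torch.randn(2, 3, 224, 224))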

That was indeed the issue. Everything works now.

Thanks a lot, ptrblck!