Hello, I tried modifying a ResNet50 from torchvision. The way I did it is as follows:
I replaced the second-to-last layer, which is the avgpool layer, like this:
model.avgpool = nn.Sequential(some_building_block())
Here, “some_building_block” is a block taken from a different model. Its dimensions are adjusted to match the output of the final conv layer (which comes right before the avgpool layer that I replace). Additionally, I adjusted the final fc layer so the dimensions line up as well:
model.fc = nn.Linear(512, 8)
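For reference, this is roughly what the whole modification looks like. The building block here is only a stand-in with the right shapes; the real one comes from the other model, taking the 2048-channel output of ResNet50's last conv stage down to 512 features:

import torch.nn as nn
import torchvision.models as models

def some_building_block():
    # Stand-in with the right shapes; the real block comes from the other model.
    return nn.Sequential(
        nn.Conv2d(2048, 512, kernel_size=1),   # 2048 channels out of layer4 -> 512
        nn.AdaptiveAvgPool2d((1, 1)),          # collapse spatial dims like the old avgpool
    )

model = models.resnet50(pretrained=True)
model.avgpool = nn.Sequential(some_building_block())
model.fc = nn.Linear(512, 8)   # 8 output classes in my case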
Now, when I test this locally with a small batch of 8 images, both forward() and inference work.
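Concretely, the local test is nothing more than this (continuing from the snippet above):

import torch

model.eval()
with torch.no_grad():
    out = model(torch.randn(8, 3, 224, 224))   # batch of 8 dummy images
print(out.shape)   # torch.Size([8, 8]) locally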
However, when I try to run the exact same network on an external cluster, where I have access to 2 GPUs, I get the following error:
Traceback (most recent call last):
  File "main.py", line 654, in <module>
    main()
  File "main.py", line 470, in main
    train(train_loader, model, criterion, optimizer, epoch, log_training, tf_writer)
  File "main.py", line 520, in train
    output = model(input_var)
  File "/home/wz052793/anaconda3/envs/det/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/wz052793/anaconda3/envs/det/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 161, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/wz052793/anaconda3/envs/det/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 171, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/wz052793/anaconda3/envs/det/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/home/wz052793/anaconda3/envs/det/lib/python3.8/site-packages/torch/_utils.py", line 428, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/wz052793/anaconda3/envs/det/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/wz052793/anaconda3/envs/det/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/rwthfs/rz/cluster/work/wz052793/temporal-shift-module/ops/models.py", line 280, in forward
    base_out = self.new_fc(base_out)
  File "/home/wz052793/anaconda3/envs/det/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/wz052793/anaconda3/envs/det/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 93, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/wz052793/anaconda3/envs/det/lib/python3.8/site-packages/torch/nn/functional.py", line 1690, in linear
    ret = torch.addmm(bias, input, weight.t())
RuntimeError: mat1 dim 1 must match mat2 dim 0
When running on the cluster, I also wrap the model in DataParallel like this:
model = torch.nn.DataParallel(model, device_ids=args.gpus).cuda()
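From what I understand, DataParallel scatters the input along dim 0, so with 2 GPUs each replica should only see 4 of my 8 images. To find out what actually reaches the final linear layer on the cluster, my idea is to log its input shape with a forward pre-hook, roughly like this (I am not sure the hook survives the per-forward replication, but it is cheap to try):

def log_fc_input(module, inputs):
    # inputs is the tuple of positional arguments passed to the layer
    print('fc input:', inputs[0].shape, 'on', inputs[0].device)

# model is the DataParallel wrapper here, so the layer sits under .module
model.module.fc.register_forward_pre_hook(log_fc_input)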
I already looked into some similar issues and found one where they say the model being distributed over both CPU and GPU causes this error. However, when I run the unmodified ResNet with the exact same code, I do not get this issue.
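To rule that out on my side anyway, I plan to print every parameter that did not end up on a GPU after wrapping:

for name, param in model.named_parameters():
    if param.device.type != 'cuda':
        print('still on CPU:', name, param.device)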
One of the biggest pains is that I cannot properly debug this code, since I have to submit it to the cluster as a “batch job” and wait until it runs. The only check I can still do locally is sketched below.
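Since each replica only sees half the batch, the idea is to push the per-replica batch size through the unwrapped model on the CPU; if anything inside the custom block hard-codes the batch size (a view(8, -1) or similar), this should already fail locally:

# Mimic what each of the 2 GPUs receives: a batch of 8 scattered into batches of 4
with torch.no_grad():
    out = model(torch.randn(4, 3, 224, 224))
print(out.shape)   # should be torch.Size([4, 8]) if nothing assumes a batch of 8

That’s why I hope you guys can help me.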