Hello,
My model has a time-series structure, so I’ve placed each time step on a different GPU, since a single GPU does not have enough memory.
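For context, the forward pass looks roughly like this (a simplified sketch with made-up module names, not my actual code):

import torch
import torch.nn as nn

class TimeSplitModel(nn.Module):
    def __init__(self, n_steps, dim):
        super().__init__()
        # one sub-network per time step, each placed on its own GPU
        self.steps = nn.ModuleList(
            nn.Linear(dim, dim).cuda(t) for t in range(n_steps)
        )

    def forward(self, x, dev):
        outputs = []
        for t, step in enumerate(self.steps):
            # run time step t on GPU t
            outputs.append(step(x[:, t].cuda(t)).unsqueeze(1))
        # gather every per-step output on one device
        return torch.cat(outputs, 1).cuda(dev)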
But I got an error when calling loss.backward():
File "/home/xxx/.local/lib/python3.7/site-packages/torch/autograd/__init__.py", line 99, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: Function CatBackward returned an invalid gradient at index 1 - expected device cuda:2 but got cuda:0
Could you try enabling anomaly detection mode to see which of the cat calls is problematic, please?
When you have it, make sure that all the inputs are properly on the same GPU.
The inputs and the labels (for computing the loss) are on device cuda:3.
I’ve enabled anomaly detection by adding this line:
torch.autograd.set_detect_anomaly(True)
The warning appeared as below:
Warning: Traceback of forward call that caused the error:
File "train.py", line 77, in <module>
main()
File "train.py", line 70, in main
trainer.train()
File "../../utils/trainer.py", line 186, in train
self.train_iteration()
File "../../utils/trainer.py", line 162, in train_iteration
output = self.model(train_data)
File "/home/xxx/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "../../models/model.py", line 64, in forward
return torch.cat(outputs, 1).cuda(dev)
(print_stack at /pytorch/torch/csrc/autograd/python_anomaly_mode.cpp:57)
You should make sure that all the inputs are on the same device before giving them to cat. I am surprised cat did not raise an error during the forward…
Sorry, I’ve tried to move all of the inputs onto the same device
by changing
return torch.cat(outputs, 1).cuda(dev)
to
return torch.cat([o.cuda(dev) for o in outputs], 1).cuda(dev)
All of the tensors are now on cuda:3 before the cat.
But it raises a new error. Is it necessary to compute the loss separately for each output (since each one is on a separate device)?
RuntimeError: expected device cuda:3 but got device cuda:2
and below is the new anomaly detection warning:
Warning: Traceback of forward call that caused the error:
File "train.py", line 77, in <module>
main()
File "train.py", line 70, in main
trainer.train()
File "../../utils/trainer.py", line 186, in train
self.train_iteration()
File "../../utils/trainer.py", line 162, in train_iteration
output = self.model(train_data)
File "/home/xxx/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "../../models/mrunet/model.py", line 58, in forward
x = self.outConv(x).cuda(dev)
File "/home/xxx/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/xxx/.local/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 345, in forward
return self.conv2d_forward(input, self.weight)
File "/home/xxx/.local/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 342, in conv2d_forward
self.padding, self.dilation, self.groups)
(print_stack at /pytorch/torch/csrc/autograd/python_anomaly_mode.cpp:57)
Well, when you concatenate the Tensors, you end up with a single Tensor. That single Tensor can only be on a single device, so you already compute the loss on a single device, no?
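For example, something like this (a sketch; criterion stands for whatever loss function you use, and I’m assuming the labels may start out on another device):

output = model(train_data)          # a single Tensor on a single device
labels = labels.to(output.device)   # bring the labels to that same device
loss = criterion(output, labels)    # one loss, computed on one device
loss.backward()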
@ptrblck do we automatically send Tensors across devices in torch.cat? That sounds wrong, no?
I think this is the case. Here the output tensor will be initialized with the device (options) of the first tensor.
The device check is here and dispatches to parallel_cat.
If different devices are found (or any other condition returns false), the copy_ should push the tensors to the first device in these lines of code.
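So, if I read it correctly, a minimal case like this would go through the forward silently (b is implicitly copied to a’s device) and only fail in the backward, which matches the error above:

a = torch.randn(2, 3, device='cuda:0', requires_grad=True)
b = torch.randn(2, 3, device='cuda:1', requires_grad=True)
out = torch.cat([a, b], 0)   # no error: b is copied to cuda:0 (a's device)
out.sum().backward()         # CatBackward hands b a gradient on cuda:0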
Well, the issue is that the backward formula has not been updated to reflect this… You can see here that it assumes that all inputs are on the same device.
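In the meantime, the reliable pattern is to make the copies explicit, as you did in your second attempt, so each copy is recorded as a regular autograd op and the gradient is routed back to the tensor’s original device by the backward of .to():

# explicit per-tensor copies instead of relying on cat's implicit ones
out = torch.cat([o.to(dev) for o in outputs], 1)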