Hello,
My model has a time-series structure, so I’ve placed each time step on a different GPU, since a single GPU does not have enough memory.
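For context, the forward pass looks roughly like this (a simplified sketch with made-up module names, not my actual code):

import torch
import torch.nn as nn

class TimeSplitModel(nn.Module):
    def __init__(self, n_steps, dim):
        super().__init__()
        # one sub-network per time step, each placed on its own GPU
        self.steps = nn.ModuleList(
            nn.Linear(dim, dim).cuda(t) for t in range(n_steps)
        )

    def forward(self, x, dev):
        outputs = []
        for t, step in enumerate(self.steps):
            # run time step t on GPU t
            outputs.append(step(x[:, t].cuda(t)).unsqueeze(1))
        # gather every per-step output on one device
        return torch.cat(outputs, 1).cuda(dev)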
But I got an error when calling loss.backward():
File "/home/xxx/.local/lib/python3.7/site-packages/torch/autograd/__init__.py", line 99, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: Function CatBackward returned an invalid gradient at index 1 - expected device cuda:2 but got cuda:0
Could you try enabling anomaly detection mode to see which of the cat calls is problematic, please?
When you have it, make sure that all the inputs are properly on the same GPU.
The inputs and the labels (for computing the loss) are on device cuda:3.
I’ve enabled anomaly detection by adding this line:
torch.autograd.set_detect_anomaly(True)
The warning appeared as below:
Warning: Traceback of forward call that caused the error:
File "train.py", line 77, in <module>
main()
File "train.py", line 70, in main
trainer.train()
File "../../utils/trainer.py", line 186, in train
self.train_iteration()
File "../../utils/trainer.py", line 162, in train_iteration
output = self.model(train_data)
File "/home/xxx/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "../../models/model.py", line 64, in forward
return torch.cat(outputs, 1).cuda(dev)
(print_stack at /pytorch/torch/csrc/autograd/python_anomaly_mode.cpp:57)
You should make sure that all the inputs are on the same device before giving them to cat. I am surprised cat did not raise an error during the forward…
Sorry, I’ve tried to move all of the inputs onto the same device
by changing
return torch.cat(outputs, 1).cuda(dev)
to
return torch.cat([o.cuda(dev) for o in outputs], 1).cuda(dev)
All of the tensors are now on cuda:3 before the cat.
But it raises a new error. Is it necessary to compute the loss separately for each output (since each one is on a separate device)?
RuntimeError: expected device cuda:3 but got device cuda:2
and below is the new anomaly detection warning:
Warning: Traceback of forward call that caused the error:
File "train.py", line 77, in <module>
main()
File "train.py", line 70, in main
trainer.train()
File "../../utils/trainer.py", line 186, in train
self.train_iteration()
File "../../utils/trainer.py", line 162, in train_iteration
output = self.model(train_data)
File "/home/xxx/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "../../models/mrunet/model.py", line 58, in forward
x = self.outConv(x).cuda(dev)
File "/home/xxx/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/xxx/.local/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 345, in forward
return self.conv2d_forward(input, self.weight)
File "/home/xxx/.local/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 342, in conv2d_forward
self.padding, self.dilation, self.groups)
(print_stack at /pytorch/torch/csrc/autograd/python_anomaly_mode.cpp:57)
Well, when you concatenate the Tensors, you end up with a single Tensor. That single Tensor can only be on a single device, so you already compute the loss on a single device, no?
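For example, something like this (a sketch; criterion stands for whatever loss function you use, and I’m assuming the labels may start out on another device):

output = model(train_data)          # a single Tensor on a single device
labels = labels.to(output.device)   # bring the labels to that same device
loss = criterion(output, labels)    # one loss, computed on one device
loss.backward()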
@ptrblck do we automatically send Tensors across devices in torch.cat? That sounds wrong, no?
I think this is the case. Here the output tensor will be initialized with the device (options) of the first tensor.
The device check is here and dispatches to parallel_cat.
If different devices are found (or any other condition returns false), the copy_ should push the tensors to the first device in these lines of code.
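So, if I read it correctly, a minimal case like this would go through the forward silently (b is implicitly copied to a’s device) and only fail in the backward, which matches the error above:

a = torch.randn(2, 3, device='cuda:0', requires_grad=True)
b = torch.randn(2, 3, device='cuda:1', requires_grad=True)
out = torch.cat([a, b], 0)   # no error: b is copied to cuda:0 (a's device)
out.sum().backward()         # CatBackward hands b a gradient on cuda:0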
Well, the issue is that the backward formula has not been updated to reflect this… You can see here that it assumes that all inputs are on the same device.
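In the meantime, the reliable pattern is to make the copies explicit, as you did in your second attempt, so each copy is recorded as a regular autograd op and the gradient is routed back to the tensor’s original device by the backward of .to():

# explicit per-tensor copies instead of relying on cat's implicit ones
out = torch.cat([o.to(dev) for o in outputs], 1)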