Gradient mismatch for complex networks using nn.DataParallel

Dear all,

I am currently building a network that takes complex numbers as input, processes them with complex layers, and returns complex numbers. I noticed that passing the input to the model causes it to automatically be viewed as 2-channel real values, as if torch.view_as_real() had been applied. Since I handle the data in the complex view, the first thing I do in my network's forward() method is convert it back to the complex view.
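To illustrate, this is roughly what the start of my forward() looks like (ComplexNet is just a placeholder for my actual architecture):

import torch
import torch.nn as nn

class ComplexNet(nn.Module):
    # placeholder for my actual network; only the view handling is shown
    def forward(self, x):
        # after DataParallel's scatter, x arrives as a real tensor with a
        # trailing dimension of size 2 (the torch.view_as_real layout)
        if not torch.is_complex(x):
            x = torch.view_as_complex(x.contiguous())
        # ... complex layers process x from here on ...
        return x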

When I train on a single GPU, everything works fine and I get the expected behavior.
However, when I try to train on multiple GPUs, I get the following error message in the backward pass:

Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Function GatherBackward returned an invalid gradient at index 0 - got [2, 3, 128, 128, 2] but expected shape compatible with [2, 3, 128, 128]

To me it looks like, as soon as I train on multiple GPUs, the gradients end up in the shape of the 2-channel real view, while my inputs and weights are in the complex view.
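For illustration, the shape in the error is exactly the real view of my complex input shape:

import torch

x = torch.randn(2, 3, 128, 128, dtype=torch.cfloat)
print(x.shape)                      # torch.Size([2, 3, 128, 128])
print(torch.view_as_real(x).shape)  # torch.Size([2, 3, 128, 128, 2])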

I am currently using DataParallel as follows:

import os
import torch

if 'CUDA_VISIBLE_DEVICES' not in os.environ:
    os.environ['CUDA_VISIBLE_DEVICES'] = f'{args.gpus}'
model = torch.nn.DataParallel(model)

with PyTorch 1.12.1 (py3.9_cuda10.2_cudnn7.6.5_0).

Maybe someone can shed some light on why this happens and how to fix it.

Best regards,
Niklas

I’m not sure why the error is raised, but note that nn.DataParallel is in “maintenance mode” as of this RFC and should eventually be removed.
We recommend using DistributedDataParallel for better performance and support.
Could you try running your code with DDP and check whether the same error is raised?

Dear Piotr,
thank you very much for the advice. Using DDP instead of DP actually solved the issue.
Unfortunately, I never found out what the actual problem with DP is. But in case someone else runs into this, I want to refer to this medium article, which, along with the official PyTorch documentation, helped me understand and set up DDP.
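For anyone landing here later, this is a minimal sketch of the DDP setup pattern I ended up with (the MASTER_ADDR/MASTER_PORT values and the nn.Linear stand-in are just for illustration; use your own model and launcher settings):

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def train(rank, world_size):
    # one process per GPU; address and port here are example values
    os.environ.setdefault('MASTER_ADDR', 'localhost')
    os.environ.setdefault('MASTER_PORT', '12355')
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(8, 8).to(rank)  # stand-in for the actual model
    model = DDP(model, device_ids=[rank])
    # ... the usual training loop, with a DistributedSampler on the DataLoader ...

    dist.destroy_process_group()

if __name__ == '__main__':
    world_size = torch.cuda.device_count()
    mp.spawn(train, args=(world_size,), nprocs=world_size)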

Best regards,
Niklas