Dear all,
I am currently building a network that takes complex numbers as input, processes them with complex layers, and returns complex numbers. I have already noticed that passing the input to the model causes it to be viewed automatically as 2-channel real values, just as if torch.view_as_real() had been applied. Since I handle the data in the complex view, the first thing I do in the forward() method of my network is to convert it back to the complex view.
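For reference, that conversion at the top of forward() looks roughly like this (a minimal sketch; ComplexNet and the layer bodies are placeholders, not my actual model):

    import torch
    import torch.nn as nn

    class ComplexNet(nn.Module):
        def forward(self, x):
            # x may arrive as a real tensor of shape [..., 2]
            # (the 2-channel real view); convert it back first.
            # view_as_complex() needs the last dim to have size 2
            # and stride 1, which contiguous() ensures here.
            if not torch.is_complex(x):
                x = torch.view_as_complex(x.contiguous())
            # ... complex layers operate on x from here on ...
            return x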
When I train on a single GPU, everything works fine and I get the expected behavior.
However, as soon as I try to train on multiple GPUs, I get the following error message in the backward pass:
    Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
    RuntimeError: Function GatherBackward returned an invalid gradient at index 0 - got [2, 3, 128, 128, 2] but expected shape compatible with [2, 3, 128, 128]
To me it looks like, as soon as I train on multiple GPUs, the gradients arrive in the shape of the 2-channel real view ([2, 3, 128, 128, 2]), while my inputs and weights are in the complex view ([2, 3, 128, 128]).
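The two shapes in the error message match exactly what view_as_real() does to a complex tensor of my input size:

    import torch

    x = torch.randn(2, 3, 128, 128, dtype=torch.cfloat)
    print(x.shape)                        # torch.Size([2, 3, 128, 128])
    print(torch.view_as_real(x).shape)    # torch.Size([2, 3, 128, 128, 2])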
I am currently using DataParallel as follows:

    import os
    import torch

    if 'CUDA_VISIBLE_DEVICES' not in os.environ:
        os.environ['CUDA_VISIBLE_DEVICES'] = f'{args.gpus}'
    model = torch.nn.DataParallel(model)
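For completeness, the failing pattern boils down to something like this (a simplified sketch reusing the ComplexNet placeholder from above; my actual loss function and data pipeline are omitted):

    import torch

    model = torch.nn.DataParallel(ComplexNet()).cuda()
    x = torch.randn(2, 3, 128, 128, dtype=torch.cfloat, device='cuda')

    out = model(x)             # forward works; outputs are gathered fine
    loss = out.abs().mean()    # some real-valued scalar loss
    loss.backward()            # RuntimeError: Function GatherBackward
                               # returned an invalid gradient at index 0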
My environment is PyTorch 1.12.1 (py3.9_cuda10.2_cudnn7.6.5_0).
Maybe someone can shed some light on why this happens and how to fix it.
Best regards,
Niklas