Gradient mismatch for complex networks using nn.DataParallel

Dear all,

I am currently building a network that takes complex numbers as input, processes them with complex-valued layers, and returns complex numbers. I noticed that passing the input to the model causes it to automatically be viewed as 2-channel real values, as if torch.view_as_real() had been applied. Since I handle the data in the complex view, the first thing I do in the forward() method of my network is to transform it back to the complex view.
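Roughly, the conversion at the top of forward() looks like this (the module body is just a placeholder, not my actual layers):

import torch
import torch.nn as nn

class ComplexNet(nn.Module):
    def forward(self, x):
        # The scattered input arrives as a real tensor with a trailing
        # dimension of size 2 (the torch.view_as_real() layout), so convert
        # it back to a complex tensor before the complex-valued layers.
        if not torch.is_complex(x):
            x = torch.view_as_complex(x.contiguous())
        # ... complex-valued layers would operate on x here ...
        return x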

When I train on a single GPU, everything works fine and I get the expected behavior.
However, when I try to train on multiple GPUs, I get the following error in the backward pass.

Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Function GatherBackward returned an invalid gradient at index 0 - got [2, 3, 128, 128, 2] but expected shape compatible with [2, 3, 128, 128]

To me it looks like, as soon as I train on multiple GPUs, the gradients come back in the shape of the 2-channel real view, while my inputs and weights are in the complex view.

I am currently using DataParallel like this:

import os
import torch

if 'CUDA_VISIBLE_DEVICES' not in os.environ:
    os.environ['CUDA_VISIBLE_DEVICES'] = f'{args.gpus}'  # e.g. "0,1"
model = torch.nn.DataParallel(model)

with PyTorch 1.12.1 (py3.9_cuda10.2_cudnn7.6.5_0).

Maybe someone can shed some light on why this happens and how to fix it.

Best regards,
Niklas

I’m not sure why the error is raised, but note that nn.DataParallel is in “maintenance mode” as of this RFC and should eventually be removed.
We recommend using DistributedDataParallel instead for better performance and support.
Could you try running your code with DDP and see if the same error is raised?

Dear Piotr,
thank you very much for the advice. Using DDP instead of DP actually solved the issue.
Sadly, I did not find out what the actual problem with DP is. But in case someone else runs into this, I want to point to this Medium article, which, along with the official PyTorch documentation, helped me understand and set up DDP.
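For reference, here is a minimal single-node DDP setup along the lines of what I ended up with (the model name is a placeholder, and the master address/port and spawn entry point are just the illustrative values from the official tutorial):

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def train(rank, world_size):
    os.environ.setdefault('MASTER_ADDR', 'localhost')
    os.environ.setdefault('MASTER_PORT', '12355')
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = MyComplexModel().cuda(rank)   # placeholder for your model
    model = DDP(model, device_ids=[rank])

    # ... build a DataLoader with a DistributedSampler and run the usual training loop ...

    dist.destroy_process_group()

if __name__ == '__main__':
    world_size = torch.cuda.device_count()
    mp.spawn(train, args=(world_size,), nprocs=world_size)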

Best regards,
Niklas

Hi,
I’m running PyTorch 2.0.1 with CUDA 12.1 and am getting the same problem. I get the following exception on GPU:1; on GPU:0 it runs without issues.

RuntimeError: Function GatherBackward returned an invalid gradient at index 0 - got [5, 512, 901, 2] but expected shape compatible with [5, 512, 901]
> /work/miniconda3/envs/sayso_dev/lib/python3.9/site-packages/torch/autograd/__init__.py(200)backward()
    198     # some Python versions print out the first line of a multi-line function
    199     # calls in the traceback and some print out the last line
--> 200     Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    201         tensors, grad_tensors_, retain_graph, create_graph, inputs,
    202         allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass

@Stanislaw_Raczynski Do you use DataParallel or Distributed Data Parallel?

Sorry, I wasn’t clear. I was using DataParallel, as in the initial post. I am now trying to move to DistributedDataParallel, per ptrblck’s suggestion.

The error above points to a complex tensor (result of torch.stft()) within my model.
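In case it helps anyone who has to stay on DataParallel for now, one workaround I am considering (not something I have verified fixes this particular error) is to keep the tensor that crosses the replica boundary in the real view, so the gather and GatherBackward only ever deal with real tensors, and to convert back to complex outside the wrapper. A rough sketch, with an illustrative n_fft and input size:

import torch
import torch.nn as nn

class StftBlock(nn.Module):
    def __init__(self, n_fft=1024):
        super().__init__()
        self.n_fft = n_fft
        self.register_buffer('window', torch.hann_window(n_fft))

    def forward(self, x):
        spec = torch.stft(x, n_fft=self.n_fft, window=self.window,
                          return_complex=True)
        # Return the real [..., 2] view so DataParallel gathers a real tensor.
        return torch.view_as_real(spec)

model = nn.DataParallel(StftBlock().cuda())
x = torch.randn(5, 16000, device='cuda')
out = torch.view_as_complex(model(x).contiguous())  # back to complex outside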