Gradients differ across GPUs with DDP

Hi,

What are possible reasons for gradients to differ across GPUs after a backward() call when using DistributedDataParallel (DDP)? If I understood correctly, wrapping the model with DDP in the main worker should take care of averaging the gradients and synchronizing them across GPUs?
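For context, my setup follows the usual pattern, roughly like this (a simplified sketch with placeholder names, not my actual code):

    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main_worker(rank, world_size, model, loader):
        # one process per GPU; assumes MASTER_ADDR/MASTER_PORT are already set
        dist.init_process_group("nccl", rank=rank, world_size=world_size)
        torch.cuda.set_device(rank)

        ddp_model = DDP(model.to(rank), device_ids=[rank])
        criterion = torch.nn.MSELoss()
        optimizer = torch.optim.SGD(ddp_model.parameters(), lr=1e-3)

        for inputs, targets in loader:
            inputs, targets = inputs.to(rank), targets.to(rank)
            optimizer.zero_grad()
            loss = criterion(ddp_model(inputs), targets)
            loss.backward()   # I expect gradients to be identical across ranks here
            optimizer.step()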

Thank you

If you enable the no_sync context manager, you turn off the communication that averages gradients.

See: DistributedDataParallel — PyTorch 1.8.1 documentation
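For reference, gradient reduction is skipped inside the context and only happens on the first backward() after leaving it, roughly like this (sketch):

    # Sketch: no allreduce inside no_sync(); grads just accumulate locally.
    with ddp_model.no_sync():
        ddp_model(batch_a).sum().backward()

    # The allreduce fires on this backward(), syncing the accumulated grads.
    ddp_model(batch_b).sum().backward()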

Sorry, maybe I was unclear. My problem is that the gradients are not being averaged across GPUs, and I am searching for possible reasons why this is happening.

Are all the GPUs on the same machine, and they have the same type?

Do you mean the gradients are not synced even at the end of training?

Can you share some code to reproduce this?

Yes, the GPUs are on the same computing node, and all GPUs are of the same type. What I mean by not synced is that if I look at a parameter’s .grad after the backward call in the main worker, it differs on each GPU. Which essentially means each GPU is training separately and not really syncing the gradients at any point.
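Roughly, the check I do looks like this (a simplified sketch, not my actual code; model stands in for the DDP-wrapped model):

    import torch
    import torch.distributed as dist

    # Gather each rank's gradient norm and compare them on rank 0.
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        local_norm = p.grad.norm().reshape(1)
        norms = [torch.zeros_like(local_norm) for _ in range(dist.get_world_size())]
        dist.all_gather(norms, local_norm)
        if dist.get_rank() == 0 and not all(torch.allclose(norms[0], n) for n in norms):
            print(f"grad mismatch for {name}: {[n.item() for n in norms]}")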

I’m afraid I cannot post the entire code and a minimal example might not capture the problem. That’s why I’m only asking for possible reasons in general.

My model does not contain a forward method; can this confuse DDP?

My model does not contain a forward method; can this confuse DDP?

I am confused. Do you mean that you have only used built-in layers such as:

    model = nn.Sequential(
        nn.Conv2d(i, j, k),
        nn.ReLU(),
        ....
    )

If so, you still call forward of each module implicitly, and the allreduce during the backward pass should be triggered.

Without investigating the source code, I cannot find any other reason for unsynced gradients if you didn’t use the no_sync context manager. You should verify whether allreduce is ever invoked. A few ideas:

  1. You can try torch.profiler and check if there is any allreduce operator in your GPU traces (see the sketch after this list).
  2. Alternatively, you can register a PowerSGD DDP comm hook and check if there are any logs about PowerSGD stats.
  3. I’m not sure whether using the slower DataParallel instead of DistributedDataParallel would bypass this.
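To make idea 1 concrete, something along these lines should show whether an allreduce ever runs (a sketch; ddp_model and inputs stand in for your own objects):

    import torch
    from torch.profiler import profile, ProfilerActivity

    # Profile one forward/backward step of the DDP-wrapped model.
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        ddp_model(inputs).sum().backward()

    # If DDP is syncing gradients, an allreduce entry (e.g. nccl:all_reduce)
    # should show up in this table.
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))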

I used a function outside of the model class to perform the forward pass. It seems DDP requires a forward method defined on the nn.Module, which is not stated in the documentation.
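Roughly, the change looks like this (simplified with made-up layer names, not my real model):

    import torch.nn as nn

    # Before: the forward pass lived in a free function, so calls bypassed
    # DDP's own forward() and its gradient-sync hooks were never prepared.
    def run_model(model, x):
        return model.head(model.encoder(x))

    # After: the same computation sits in forward(), so calling ddp_model(x)
    # goes through DDP and the allreduce is triggered during backward().
    class MyModel(nn.Module):
        def __init__(self):
            super().__init__()
            self.encoder = nn.Linear(16, 32)
            self.head = nn.Linear(32, 1)

        def forward(self, x):
            return self.head(self.encoder(x))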

After changing my code to include the forward method, the error is the following:

RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the 'forward' function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. 2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple 'checkpoint' functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases yet.

The code works fine on a single GPU. So I do not quite understand the error, since backpropagating twice through the same graph without retain_graph=True would not be possible anyway.
