Huge loss with DataParallel

I’m training a pretty standard WGAN-GP on MNIST, and I’m trying out much larger batch sizes since that seems to be the standard wisdom now. When I parallelize across multiple GPUs I get enormous losses compared to using just one GPU. If I initialize my networks with nn.DataParallel(model, [0]), I get pretty normal behavior:

ITER 0: D cost 1.200, G cost 2.729
ITER 200: D cost -3.931, G cost 2.298

But if I use nn.DataParallel(model, [0, 1, 2]) to run across more GPUs, I get absurd numbers:

ITER 200: D cost 112856899584.0, G cost 456269.437

I’ve used DataParallel successfully before with classifiers and the tutorial DCGAN, so I have no idea what the issue is here. I’m not parallelizing the loss or doing anything beyond initializing the networks this way. Are there caveats with DataParallel (like there are with FP16) that I’m not aware of?

I can post a full code example, but it’s a hundred lines or so and I’d rather not start off the post like that.
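For context, the relevant setup looks roughly like this (a minimal sketch; Generator and Discriminator stand in for my actual MNIST networks, and the training loop is omitted):

import torch.nn as nn

netG = Generator().cuda()        # placeholder generator
netD = Discriminator().cuda()    # placeholder critic

# single GPU: losses look normal
# netG = nn.DataParallel(netG, [0])
# netD = nn.DataParallel(netD, [0])

# multiple GPUs: losses explode
netG = nn.DataParallel(netG, [0, 1, 2])
netD = nn.DataParallel(netD, [0, 1, 2])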


The DataParallel module is pretty straightforward: it splits the input up into N chunks (typically across the first dimension), runs the same forward pass on N replicas of your model, and gathers the output back into a single tensor (across the same dimension as the input was split). Gradients are always accumulated in the source model (not the replicas).
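For example, here is a minimal sketch with a toy module: a batch of 96 rows fed through a DataParallel wrapper over 3 GPUs is scattered into three chunks of 32 along dim 0, the replicas run the forward pass in parallel, and the outputs are gathered back onto the source device:

import torch
import torch.nn as nn

model = nn.Linear(10, 1).cuda(0)                   # source model lives on GPU 0
dp = nn.DataParallel(model, device_ids=[0, 1, 2])  # replicas are created on each forward pass

x = torch.randn(96, 10, device="cuda:0")
y = dp(x)                        # each replica sees a 32-row chunk of x
print(y.shape)                   # torch.Size([96, 1]), gathered on cuda:0

y.sum().backward()
print(model.weight.grad.shape)   # gradients end up on the source model: torch.Size([1, 10])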

It looks like updates to buffers in these replicas don’t propagate back to the source model, as the model replicas are tossed after every forward pass. Perhaps this is a starting point for your investigation?

@pietern Even if the gradients didn’t accumulate properly, it should at worst look as if I only have one GPU. I’ve verified that everything works well with up to 4 GPUs on PyTorch 0.4.1, but I can’t seem to nail down the actual problem in 1.0.

I’m not even sure how to diagnose this, but I’ve been able to replicate this behavior with DataParallel on two other popular WGAN repos from GitHub.


I had the same problem with WGAN-GP. For some reason, the gradient penalty increases quickly until it reaches inf.

This also happens for other GANs, e.g. Self-Supervised GAN. The behavior of the loss is also completely different when training with multiple GPUs.


@Hung_Nguyen Right.
Even with the gradient penalty removed, the loss still increases endlessly.
The official(?) Wasserstein GAN code doesn’t suffer from this weird behavior with parallelization, so that could be a starting point. It uses nn.parallel.data_parallel, the functional version of nn.DataParallel, but I don’t know if there’s a meaningful difference there (the two call styles are sketched below).
Converting any WGAN-GP repo into a regular WGAN mitigates the behavior somewhat, but weight clipping has its drawbacks.
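For reference, the two call styles look roughly like this (a sketch; netD, x, and the device ids are placeholders):

import torch.nn as nn

# module wrapper: wrap once, then call like a normal module
netD_dp = nn.DataParallel(netD, device_ids=[0, 1, 2])
out = netD_dp(x)

# functional form used in the original WGAN code: scatter/replicate/gather per call
out = nn.parallel.data_parallel(netD, x, device_ids=[0, 1, 2])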

Currently also experiencing this issue. The gradient penalty eventually climbs into the millions before blowing up completely. @neale, I have tried to reproduce this issue with various popular WGAN-GP repos as well, and they also suffer from it.

PyTorch 1.0.1 on CUDA 10.1.

@neale With 0.4 working and 1.0 not working, this is clearly a regression, but we haven’t significantly modified the data parallel wrapper (if at all) between these versions. Can you check whether the regression happened in 1.0.0 or 1.0.1? I have created https://github.com/pytorch/pytorch/issues/19024 to track this.

@neale Could you also try reproducing with the nightly build? There have been some changes recently related to CUDA stream synchronization that may have fixed this, per @mrshenli.

Running into this problem as well: using a WGAN-GP, training works perfectly on 1 GPU, but the loss explodes when running on multiple GPUs.

Using CUDA 9.2, PyTorch 1.0.1. Working on installing the nightly build to see if there is any difference.

@pietern Sorry this took quite some time.
I can confirm that the issue persists in version 1.1.


It seems like the issue is #16433.

A workaround is to calculate the gradient penalty directly (rather than calling a helper function to do so) and call backward in the same scope.

For example, the following will explode on CUDA with multiple GPUs:

gp = calc_grad_penalty(network, real_target, fake_target)  # penalty computed inside a helper, in another scope
gp.backward(retain_graph=True)

While the following does not:

# compute the gradient penalty inline, e.g. with torch.autograd.grad(...)
# followed by the usual norm/penalty arithmetic, in this same scope
gp.backward(retain_graph=True)
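For completeness, here is a minimal sketch of the inline version, following the usual WGAN-GP recipe (netD, real, fake, and lambda_gp are placeholders for your own critic, batches, and penalty weight):

import torch

# interpolate between real and fake samples
alpha = torch.rand(real.size(0), 1, 1, 1, device=real.device)
interp = (alpha * real + (1 - alpha) * fake).requires_grad_(True)

d_interp = netD(interp)

# gradient of the critic output w.r.t. the interpolated inputs
grads = torch.autograd.grad(
    outputs=d_interp,
    inputs=interp,
    grad_outputs=torch.ones_like(d_interp),
    create_graph=True,
    retain_graph=True,
)[0]

grad_norm = grads.view(grads.size(0), -1).norm(2, dim=1)
gp = lambda_gp * ((grad_norm - 1) ** 2).mean()

# backward is called in the same scope where the penalty was computed
gp.backward(retain_graph=True)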

Thanks, @aluo-x.

It is also the same as #16532 and there has been an attempt at a fix. The problem lies somewhere deep in the guts of autograd. This has surfaced a couple of times and there should be a fix soon (and it should be included in the next stable release).

@pietern You mentioned that gradient accumulation only occurs in the source model, but according to https://github.com/pytorch/pytorch/blob/master/torch/nn/parallel/data_parallel.py: “In the forward pass, the module is replicated on each device, and each replica handles a portion of the input. During the backwards pass, gradients from each replica are summed into the original module.” This implies that gradients are accumulated in the leaf nodes of each of the replicas.

FYI, excellent debugging from @mrshenli and @ezyang deep in the guts of autograd led to https://github.com/pytorch/pytorch/pull/22983, which was merged yesterday. Please give the latest nightly builds a try to see if it fixes the issue.
