Possible Bug in nn.DataParallel

When my network contains a multi-branch structure and I put it on multiple GPUs with nn.DataParallel, the network does not train and never converges. Below is the full code to reproduce the problem:

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.nn.utils import spectral_norm

class Discriminator(nn.Module):

	def __init__(self, ndf=64, c_dim=20, nc=3):
		super(Discriminator, self).__init__()
		
		self.block3 = nn.Sequential(
			spectral_norm(nn.Conv2d(nc, ndf, 4, 2, 1)), nn.BatchNorm2d(ndf*1), nn.ReLU()) 
		self.block2 = nn.Sequential(
			spectral_norm(nn.Conv2d(ndf, ndf*2, 4, 2, 1)), nn.BatchNorm2d(ndf*2), nn.ReLU()) 
		self.block1 = nn.Sequential(
			spectral_norm(nn.Conv2d(ndf*2, ndf*4, 4, 2, 1)), nn.BatchNorm2d(ndf*4), nn.ReLU()) 
		self.block0 = nn.Sequential(
			spectral_norm(nn.Conv2d(ndf*4, ndf*8, 4, 2, 1)), nn.BatchNorm2d(ndf*8), nn.ReLU()) 
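		# Two output heads branch off the shared feature map "feat":
		# "rf" gives the real/fake score and "c" gives the class logits.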
	
		self.c = nn.Sequential(
			spectral_norm(nn.Conv2d(ndf*8, c_dim, 4, 1, 0)),
			)
		self.rf = spectral_norm(nn.Conv2d(ndf*8, 1, 4, 1, 0))

	def forward(self, x):
		feat = self.block3(x)
		feat = self.block2(feat)
		feat = self.block1(feat)
		feat = self.block0(feat)

		rf = self.rf(feat).view(-1)
		c = self.c(feat)
		return rf, c

netD = Discriminator().cuda()

netD = nn.DataParallel(netD)

opt_D = optim.Adam(netD.parameters(), lr=0.0001, betas=(0.5, 0.99))


g_image = torch.randn(64, 3, 64, 64).cuda()
real_image = torch.randn(64, 3, 64, 64).cuda()

for itx in range(100):
	netD.zero_grad()
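	# Two separate forward passes (fake, then real), followed by a hinge loss on the real/fake scores.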

	pred_f, _ = netD(g_image.detach())
	pred_r, _ = netD(real_image)
	d_loss = F.relu(1 - pred_r).mean() + F.relu(1 + pred_f).mean()

	d_loss.backward()
	opt_D.step()

	print(d_loss.item())

If you comment out the line

netD = nn.DataParallel(netD)

the model trains without any problem. For convenience, the code prints the loss value at every iteration, so you can see whether or not it converges.

As you can see, in my model the feature tensor "feat" goes through two different layers at the end, and this branching is what causes the "cannot converge under nn.DataParallel" issue. A quick way to check this is sketched below.
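One way to test the claim (my own suggestion, not part of the original run): keep everything else identical but return only a single head from forward, so that just one layer consumes feat. If the branching really is the trigger, this control variant should converge even under nn.DataParallel.

	def forward(self, x):
		feat = self.block3(x)
		feat = self.block2(feat)
		feat = self.block1(feat)
		feat = self.block0(feat)
		# Single-head control: drop the class branch so only one layer reads "feat".
		return self.rf(feat).view(-1)

The training loop then unpacks a single output, e.g. pred_f = netD(g_image.detach()) and pred_r = netD(real_image).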


Have you solved this issue or found a workaround?

There were a few bugs with spectral_norm and DataParallel that were fixed in 1.1. It should work correctly now.
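If it still misbehaves, it may be worth double-checking which build is actually in use (just a sanity check, in case an older installation is being picked up):

import torch
print(torch.__version__)          # the spectral_norm/DataParallel fixes landed in 1.1
print(torch.version.cuda)         # CUDA version the wheel was built against
print(torch.cuda.device_count())  # number of GPUs DataParallel would split across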

Unfortunately, it is not solved even though I am on the latest PyTorch version.

Hi. I am on the latest version of PyTorch and the latest CUDA drivers, yet the problem still exists.
My environment is fairly common (Ubuntu, 2080 Ti, etc.), and the code I provided to reproduce the bug is quick to run: it takes less than five seconds to copy-paste into a terminal and see the result. One can also remove all the spectral_norm calls from the provided code and see that the problem is still there.
I actually think this is a fairly serious bug, since I cannot take advantage of multi-GPU training when my network is designed in this multi-branch way. I am also curious why it has not attracted more attention; that makes me suspect the issue is on my side. If it is, could you please help me verify it by running the provided code and letting me know whether it works fine on your end?
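For reference, here is a stripped-down sketch of what I mean: spectral_norm removed entirely, with only the shared trunk and the two heads kept (this is a simplification of the code above, not a separately verified script). If the branching is the trigger, this should show the same non-convergence with the DataParallel wrapper in place.

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

# Minimal two-headed discriminator: a shared conv trunk whose feature map
# feeds two separate output layers, with no spectral_norm involved.
class TwoHeadD(nn.Module):
	def __init__(self, ndf=64, c_dim=20, nc=3):
		super().__init__()
		self.trunk = nn.Sequential(
			nn.Conv2d(nc, ndf, 4, 2, 1), nn.BatchNorm2d(ndf), nn.ReLU(),
			nn.Conv2d(ndf, ndf*2, 4, 2, 1), nn.BatchNorm2d(ndf*2), nn.ReLU(),
			nn.Conv2d(ndf*2, ndf*4, 4, 2, 1), nn.BatchNorm2d(ndf*4), nn.ReLU(),
			nn.Conv2d(ndf*4, ndf*8, 4, 2, 1), nn.BatchNorm2d(ndf*8), nn.ReLU())
		self.rf = nn.Conv2d(ndf*8, 1, 4, 1, 0)      # real/fake head
		self.c = nn.Conv2d(ndf*8, c_dim, 4, 1, 0)   # class head

	def forward(self, x):
		feat = self.trunk(x)
		return self.rf(feat).view(-1), self.c(feat)

netD = nn.DataParallel(TwoHeadD().cuda())
opt_D = optim.Adam(netD.parameters(), lr=0.0001, betas=(0.5, 0.99))

g_image = torch.randn(64, 3, 64, 64).cuda()
real_image = torch.randn(64, 3, 64, 64).cuda()

for itx in range(100):
	netD.zero_grad()
	pred_f, _ = netD(g_image.detach())
	pred_r, _ = netD(real_image)
	d_loss = F.relu(1 - pred_r).mean() + F.relu(1 + pred_f).mean()
	d_loss.backward()
	opt_D.step()
	print(d_loss.item())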

Yes, I have the same issue. My network reaches 80% training accuracy on a single GPU after 15 epochs, but only 28% training accuracy on 4 GPUs after the same number of epochs.