Need help with Data Parallel issue with WGAN

I have finally found what causes my WGAN script to fail, but I have no idea how to fix it. I have 2 GPUs, and I had success running a vanilla GAN on them before. The error I keep getting is:

AssertionError: given chunk sizes don't sum up to the tensor's size (sum(chunk_sizes) == 128, but expected 1)

A quick Google search shows it comes from the Scatter function in torch.cuda.comm.

The docstring describes chunk_sizes as:

(Iterable[int], optional): sizes of chunks to be placed on each device. It should match ``devices`` in length and sum to ``tensor.size(dim)``. If not specified, the tensor will be divided into equal chunks.
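If I read that right, with my batch of 128 on 2 GPUs the forward pass should have recorded chunk sizes of presumably [64, 64]. Just to convince myself, here is the arithmetic behind the assertion spelled out (only an illustration, not my actual code):

import torch

chunk_sizes = [64, 64]    # what I assume DataParallel recorded for 2 GPUs; sums to my batch size of 128
grad = torch.ones(1)      # a stand-in for my [1]-shaped tensor that backward() receives
print(sum(chunk_sizes) == grad.size(0))   # False: 128 != 1, which is exactly what the AssertionError reports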

I have no idea where to go from here, though. 128 is my batch size and I have 2 GPUs, so is 2 the right length here? The place it fails is where I call backward( one ) on the output of netD, where one has torch.Size([1]) and netD is the discriminator. Everything works correctly if I just use a single GPU, so obviously I am doing something wrong when distributing my tensors across these 2 GPUs. Can someone please help me? My data parallel setup code is from the tutorial, please see below! Tell me what other part of my code you want to see and I will add it to this post. I really want to use 2 GPUs.

netG = Generator() 
netD = Discriminator()

netD.apply(weight_init)
netG.apply(weight_init)

# if torch.cuda.device_count() > 1:
#     print("We are going to use", torch.cuda.device_count(), "GPUs!")
#     netG = nn.DataParallel(netG)
#     netD = nn.DataParallel(netD)

# Commented out the block above because of the bug
    
if torch.cuda.is_available():
    netG.cuda()
    netD.cuda()
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-18-fc6a16ec7b3c> in <module>()
     42         #print('2',netD(real_).size())
     43         output=netD( real_ )
---> 44         output.backward(one)
     45         ## train netd with fake img
     46         fake_pic = netG( noise_ ).detach()

~\Anaconda3\lib\site-packages\torch\autograd\variable.py in backward(self, gradient, retain_graph, create_graph, retain_variables)
    165                 Variable.
    166         """
--> 167         torch.autograd.backward(self, gradient, retain_graph, create_graph, retain_variables)
    168 
    169     def register_hook(self, hook):

~\Anaconda3\lib\site-packages\torch\autograd\__init__.py in backward(variables, grad_variables, retain_graph, create_graph, retain_variables)
     97 
     98     Variable._execution_engine.run_backward(
---> 99         variables, grad_variables, retain_graph)
    100 
    101 

~\Anaconda3\lib\site-packages\torch\autograd\function.py in apply(self, *args)
     89 
     90     def apply(self, *args):
---> 91         return self._forward_cls.backward(self, *args)
     92 
     93 

~\Anaconda3\lib\site-packages\torch\nn\parallel\_functions.py in backward(ctx, grad_output)
     57     @staticmethod
     58     def backward(ctx, grad_output):
---> 59         return (None, None) + Scatter.apply(ctx.input_gpus, ctx.input_sizes, ctx.dim, grad_output)
     60 
     61 

~\Anaconda3\lib\site-packages\torch\nn\parallel\_functions.py in forward(ctx, target_gpus, chunk_sizes, dim, input)
     72             # Perform CPU to GPU copies in a background stream
     73             streams = [_get_stream(device) for device in ctx.target_gpus]
---> 74         outputs = comm.scatter(input, ctx.target_gpus, ctx.chunk_sizes, ctx.dim, streams)
     75         # Synchronize with the copy stream
     76         if streams is not None:

~\Anaconda3\lib\site-packages\torch\cuda\comm.py in scatter(tensor, devices, chunk_sizes, dim, streams)
    176         assert sum(chunk_sizes) == tensor.size(dim), "given chunk sizes " \
    177             "don't sum up to the tensor's size (sum(chunk_sizes) == {}, but " \
--> 178             "expected {})".format(sum(chunk_sizes), tensor.size(dim))
    179         assert min(chunk_sizes) > 0, "got a negative chunk_size"
    180         chunks = [tensor.narrow(dim, start - size, size)

AssertionError: given chunk sizes don't sum up to the tensor's size (sum(chunk_sizes) == 128, but expected 1)
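
To spell out what happens at the failing line, the relevant bit of my training loop boils down to the following (paraphrased from the traceback above; the exact way I build one is from memory, so treat it as approximate):

one = torch.FloatTensor([1]).cuda()   # torch.Size([1]); exact construction is approximate

# train netD with real images
output = netD( real_ )                # real_ is a batch of 128 images
output.backward(one)                  # <- fails here once netD is wrapped in nn.DataParallel

# train netD with fake images
fake_pic = netG( noise_ ).detach()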

Could you print the shape of output before calling backward?

The input shape is torch.Size([128, 1, 64, 64]) and the output of netD(input) is torch.Size([128, 1, 1, 1]). I just tried data parallelism on the WGAN code I shadowed from, with my own custom train loader, and 2 GPUs work there. I guess I just need to look at all the other changes I made, which aren't that different apart from the layers' activation functions. I might just redownload PyTorch, because this is currently the pre-official Windows version.
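
Writing the shapes down: netD(real_) is [128, 1, 1, 1] while one is [1], so when DataParallel scatters that [1]-shaped gradient back it cannot match chunks that sum to 128. If I understand it right, the gradient passed to backward has to line up with the output of the DataParallel module, for example by reducing the output to a single value first. A rough sketch of what I mean (not my actual code, and errD_real is just a name I made up here):

errD_real = netD( real_ ).mean()   # reduce [128, 1, 1, 1] to a single value after DataParallel gathers
errD_real.backward(one)            # a [1]-shaped gradient now matches what gets scattered back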

Edit: Never mind, it also doesn't work with the code I shadowed from. I might as well just reinstall the official Windows version of PyTorch…

Edit 2: I got it working, but I'm not 100% sure how… is it normal to see GPU 2's usage at 0% while both GPU 1's and GPU 2's clocks and memory are utilized? I had to explicitly tell PyTorch to put both netD and netG on the same GPU. At first I thought this was the same as using 1 GPU, but when everything was running and I double-checked with EVGA's PrecisionX, I noticed both cards were active. Does PyTorch merge the 2 cards into one or something, so that the other card appears to be idle?

netG = nn.DataParallel(netG, device_ids=[0])
netD = nn.DataParallel(netD, device_ids=[0])
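
For comparison, my understanding is that device_ids=[0] restricts DataParallel to the first card only, and that actually splitting the batch across both cards would be the call below, which is the path that still hits the assertion for me:

netG = nn.DataParallel(netG, device_ids=[0, 1])
netD = nn.DataParallel(netD, device_ids=[0, 1])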