I have finally found what causes my WGAN script to not run, but I have no idea how to fix it. I have 2 GPUs and I had success running a vanilla GAN before. The error I keep getting is:
AssertionError: given chunk sizes don't sum up to the tensor's size (sum(chunk_sizes) == 128, but expected 1)
A quick Google search shows it comes from the Scatter function in cuda.comm.
The docstring says that chunk_sizes is:
(Iterable[int], optional): sizes of chunks to be placed on each device. It should match ``devices`` in length and sum to ``tensor.size(dim)``. If not specified, the tensor will be divided into equal chunks.
I have no idea where to go from here. 128 is my batch size and I have 2 GPUs, so is 2 the length? It fails where I call output.backward(one), where output = netD(real_), one has size torch.Size([1]), and netD is the discriminator. Everything works correctly if I use a single GPU. Obviously I am doing something wrong when distributing my tensors across these 2 GPUs. Can someone please help me? My data parallel code is from the tutorial; please see below. Tell me what part of my code you want to see and I will add it to this post. I really want to use 2 GPUs.
netG = Generator()
netD = Discriminator()
netD.apply(weight_init)
netG.apply(weight_init)
# if torch.cuda.device_count() > 1:
#     print("We are going to use", torch.cuda.device_count(), "GPUs!")
#     netG = nn.DataParallel(netG)
#     netD = nn.DataParallel(netD)
# Commented the above out because of bug
if torch.cuda.is_available():
    netG.cuda()
    netD.cuda()
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-18-fc6a16ec7b3c> in <module>()
42 #print('2',netD(real_).size())
43 output=netD( real_ )
---> 44 output.backward(one)
45 ## train netd with fake img
46 fake_pic = netG( noise_ ).detach()
~\Anaconda3\lib\site-packages\torch\autograd\variable.py in backward(self, gradient, retain_graph, create_graph, retain_variables)
165 Variable.
166 """
--> 167 torch.autograd.backward(self, gradient, retain_graph, create_graph, retain_variables)
168
169 def register_hook(self, hook):
~\Anaconda3\lib\site-packages\torch\autograd\__init__.py in backward(variables, grad_variables, retain_graph, create_graph, retain_variables)
97
98 Variable._execution_engine.run_backward(
---> 99 variables, grad_variables, retain_graph)
100
101
~\Anaconda3\lib\site-packages\torch\autograd\function.py in apply(self, *args)
89
90 def apply(self, *args):
---> 91 return self._forward_cls.backward(self, *args)
92
93
~\Anaconda3\lib\site-packages\torch\nn\parallel\_functions.py in backward(ctx, grad_output)
57 @staticmethod
58 def backward(ctx, grad_output):
---> 59 return (None, None) + Scatter.apply(ctx.input_gpus, ctx.input_sizes, ctx.dim, grad_output)
60
61
~\Anaconda3\lib\site-packages\torch\nn\parallel\_functions.py in forward(ctx, target_gpus, chunk_sizes, dim, input)
72 # Perform CPU to GPU copies in a background stream
73 streams = [_get_stream(device) for device in ctx.target_gpus]
---> 74 outputs = comm.scatter(input, ctx.target_gpus, ctx.chunk_sizes, ctx.dim, streams)
75 # Synchronize with the copy stream
76 if streams is not None:
~\Anaconda3\lib\site-packages\torch\cuda\comm.py in scatter(tensor, devices, chunk_sizes, dim, streams)
176 assert sum(chunk_sizes) == tensor.size(dim), "given chunk sizes " \
177 "don't sum up to the tensor's size (sum(chunk_sizes) == {}, but " \
--> 178 "expected {})".format(sum(chunk_sizes), tensor.size(dim))
179 assert min(chunk_sizes) > 0, "got a negative chunk_size"
180 chunks = [tensor.narrow(dim, start - size, size)
AssertionError: given chunk sizes don't sum up to the tensor's size (sum(chunk_sizes) == 128, but expected 1)
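
To make the numbers in the assertion concrete, here is a minimal pure-Python sketch of the size check that fails in the traceback. It is not the real torch code: split_into_chunks is a hypothetical stand-in that works on plain lists, and the [64, 64] split is my assumption about how a batch of 128 would be divided across 2 GPUs.

```python
def split_into_chunks(values, chunk_sizes):
    """Hypothetical stand-in for the size check seen in the traceback."""
    assert sum(chunk_sizes) == len(values), (
        "given chunk sizes don't sum up to the tensor's size "
        "(sum(chunk_sizes) == {}, but expected {})".format(
            sum(chunk_sizes), len(values))
    )
    chunks, start = [], 0
    for size in chunk_sizes:
        chunks.append(values[start:start + size])
        start += size
    return chunks


batch_size = 128
chunk_sizes = [64, 64]  # assumption: even split of the batch over 2 GPUs

# A gradient with one entry per sample splits cleanly into two chunks.
per_sample_grad = [1.0] * batch_size
chunks = split_into_chunks(per_sample_grad, chunk_sizes)
print([len(c) for c in chunks])

# A gradient of size 1 (like `one`) trips the same assertion.
one = [1.0]
try:
    split_into_chunks(one, chunk_sizes)
    error_message = None
except AssertionError as exc:
    error_message = str(exc)
print(error_message)
```

If this reading is right, the gradient passed to backward is being split back across the GPUs using the per-GPU output sizes, so it would need one entry per sample (the same size as output), or output would need to be reduced to a single value before calling backward. I have not verified either option on my setup.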