NaN loss for GAN

I am working on a GAN with a novel architecture: the GAN is trying to block out a single square of fixed size within the image. It outputs one-hot vectors for the x- and y-coordinates of the square's top-left corner, which are then turned into an image-sized filter representing the whole square by adding each one-hot vector to itself at various offsets. The (slightly messy) code for doing so is below.

# start with the zero-offset copy, padded with zeros at the end so that
# every shifted copy below has the same length
# s_bs is the size of the 'blocking square'
full_zeros = Variable(torch.zeros(batch_sz, self.s_bs))
ohv_y_full = torch.cat([ohv_y, full_zeros], dim=1)
ohv_x_full = torch.cat([ohv_x, full_zeros], dim=1)
# add each shifted copy (offset i) by padding with i zeros at the front
# and s_bs - i zeros at the back
for i in range(1, self.s_bs):
	trailing_zeros = Variable(torch.zeros(batch_sz, self.s_bs - i))
	leading_zeros = Variable(torch.zeros(batch_sz, i))
	ohv_y_full += torch.cat([leading_zeros, ohv_y, trailing_zeros], dim=1)
	ohv_x_full += torch.cat([leading_zeros, ohv_x, trailing_zeros], dim=1)

Basically for s_bs=3 this will do:
[0.1, 1, 0.2, 0, 0, 0] →
[0.1, 1.1, 1.3, 1.2, 0.2, 0, 0, 0, 0]

(going from a vector of size s to size s + s_bs)
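For reference, here is the same shift-and-add as a standalone snippet (plain tensors instead of my class attributes), checked against that example:

import torch

s_bs = 3                                              # size of the blocking square
ohv = torch.tensor([[0.1, 1.0, 0.2, 0.0, 0.0, 0.0]])  # shape (batch, s)
batch_sz, s = ohv.shape

# zero-offset copy, padded so every shifted copy has the same length
full = torch.cat([ohv, torch.zeros(batch_sz, s_bs)], dim=1)
for i in range(1, s_bs):
    full += torch.cat([torch.zeros(batch_sz, i), ohv,
                       torch.zeros(batch_sz, s_bs - i)], dim=1)

print(full)  # tensor([[0.1000, 1.1000, 1.3000, 1.2000, 0.2000, 0.0000, 0.0000, 0.0000, 0.0000]])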

I am using cross-entropy loss. To do gradient ascent instead of descent, I reverse the gradients after the backward pass:

for group in optimizer.param_groups:
	for p in group['params']:
		p.grad = -1 * p.grad
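For context, the flip sits between the backward pass and the optimizer step. A minimal standalone version of that pattern (toy model and data here, not my actual networks) is:

import torch
import torch.nn as nn

model = nn.Linear(8, 4)                          # toy stand-in for the real network
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
x, target = torch.randn(16, 8), torch.randint(0, 4, (16,))

optimizer.zero_grad()
loss = criterion(model(x), target)
loss.backward()

# flip the sign of every gradient so optimizer.step() ascends the loss
for group in optimizer.param_groups:
    for p in group['params']:
        p.grad = -1 * p.grad

optimizer.step()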

However, as soon as I do this, my loss becomes NaN unless I use a learning rate of about 1e-10, which is obviously too low, and then the model never converges. If I comment out the gradient reversal, everything works just fine. Likewise, if I reverse the gradients but use s_bs=1, it also works fine. Apart from the step that turns the one-hot vectors into a square, my GAN setup is pretty straightforward. Does anyone have any insight into why this setup gives me an unstable loss? Or does anyone have alternative suggestions for how to do that step? Ideally I would use something like np.convolve, but there is no PyTorch version of that with autograd support.

Edit: Here is the computation graph for ohv_y_full:

Thanks!!

Hi,

I’m not sure I understand enough about what you describe to comment on the NaNs.

Would the convolution functions (torch.nn.functional.conv1d / conv2d) not work for this?

Best regards

Thomas

So like a Conv1D with weights of 1 applied to all the shifts of the vector?
i.e.
s_bs = 3
[0, 1, 0, 0, 0]
get all shifts ->
[0, 1, 0, 0, 0, 0, 0] [0, 0, 1, 0, 0, 0, 0] [0, 0, 0, 1, 0, 0, 0]
Conv1D with fixed weights ->
[0, 1, 1, 1, 0, 0, 0]

I was worried that fixing the weights of a Conv1D layer would be messier, but I can try that. Wouldn’t it functionally do exactly the same thing as what I did, though?

Well, you can either use a layer and fix its parameters, or use the functional interface where you pass the weights yourself. I did a small demo of a discrete wavelet transform using the latter. Maybe you can take bits of that for inspiration.
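For example, a minimal sketch of the functional route with a fixed kernel of ones (assuming your one-hot vectors have shape (batch, s) and the square size is s_bs) could look like this:

import torch
import torch.nn.functional as F

def spread(ohv, s_bs):
    # ohv: (batch, s) soft one-hot vector -> (batch, s + s_bs - 1) spread vector
    kernel = torch.ones(1, 1, s_bs)        # fixed weights, nothing learnable
    # padding=s_bs - 1 gives the "full" convolution; the kernel is symmetric,
    # so it does not matter that conv1d is really a cross-correlation
    return F.conv1d(ohv.unsqueeze(1), kernel, padding=s_bs - 1).squeeze(1)

ohv = torch.tensor([[0.0, 1.0, 0.0, 0.0, 0.0]], requires_grad=True)
print(spread(ohv, 3))   # tensor([[0., 1., 1., 1., 0., 0., 0.]], grad_fn=...)

Gradients flow back to ohv through conv1d, so no extra autograd work is needed.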

Best regards

Thomas