Erratic GAN loss behaviour; fails completely with LSGAN and WGAN-GP loss functions

I’m investigating the use of a Wasserstein GAN with gradient penalty (WGAN-GP) in PyTorch. I’m heavily borrowing from Caogang’s implementation, but I’m using the discriminator and generator losses from this implementation instead, because I get "Invalid gradient at index 0 - expected shape [] but got [1]" if I try to call .backward() with the one and mone args used in the Caogang implementation.
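For context, as far as I can tell that error comes from passing a gradient argument of shape [1] to a 0-dim (scalar) loss. A minimal sketch of the two backward() styles, with a toy linear layer standing in for D:

import torch
import torch.nn as nn

netD = nn.Linear(8, 1)                    # toy stand-in for the critic
real = torch.randn(4, 8)

D_real = netD(real).mean()                # 0-dim (scalar) tensor

# Caogang-style: backward with an explicit gradient argument.
mone = torch.tensor(-1.0)                 # 0-dim; torch.FloatTensor([1]) has shape [1],
D_real.backward(mone)                     # which is what triggers the shape-mismatch error

# Scalar-loss alternative (what I switched to): fold the sign into the loss
# and just call (-D_real).backward() with no gradient argument.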

I’m training on a dataset of 400k 64x64 images, and I have gotten a normal WGAN (with weight clipping) to work, i.e. it produces passable images after 25 epochs, despite the fact that the D and G losses (which I calculate using torch.mean(D_real) etc.) both hover around 3 for all epochs. However, in the WGAN-GP version the generator loss increases dramatically: it starts at ~24, then climbs rapidly to 6000 (!) by only the 6th epoch, while the discriminator loss starts at -7, drops to -5000, and is back up to +50 (!) by the 6th epoch. The WGAN-GP and LSGAN versions of my GAN both completely fail to produce passable images even after 25 epochs. I use nn.MSELoss() for the LSGAN version.
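For reference, the gradient penalty term added to the critic loss follows the standard WGAN-GP recipe, roughly this sketch (the function name gradient_penalty and lambda_gp=10 are illustrative; D’s first output is used as the raw critic score):

import torch
import torch.autograd as autograd

def gradient_penalty(netD, real, fake, lambda_gp=10.0):
    # Interpolate between real and fake samples
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    interp = (eps * real + (1.0 - eps) * fake).requires_grad_(True)
    d_interp = netD(interp)[0]                     # first output = critic score
    grads = autograd.grad(outputs=d_interp, inputs=interp,
                          grad_outputs=torch.ones_like(d_interp),
                          create_graph=True, retain_graph=True)[0]
    grads = grads.view(grads.size(0), -1)
    # Penalise deviation of the per-sample gradient norm from 1
    return lambda_gp * ((grads.norm(2, dim=1) - 1.0) ** 2).mean()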

I don’t use any tricks like one-sided label smoothing, and I train with the default learning rates from the LSGAN and WGAN-GP papers. I use the Adam optimizer and train the discriminator 5 times for every generator update in my WGANs (roughly as in the sketch below). Why does this erratic loss behaviour happen, and why does the normal weight-clipping WGAN still ‘work’ while WGAN-GP and LSGAN completely fail?
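The update schedule looks roughly like this (a sketch, not my exact code: the Adam settings shown are the WGAN-GP paper defaults of lr=1e-4 and betas=(0, 0.9), gradient_penalty is the sketch above, and netD/netG/z_noise are the networks and noise dimension defined below):

optimizerD = torch.optim.Adam(netD.parameters(), lr=1e-4, betas=(0.0, 0.9))
optimizerG = torch.optim.Adam(netG.parameters(), lr=1e-4, betas=(0.0, 0.9))

for i, (real, _) in enumerate(dataloader):
    # critic update: minimise D(fake) - D(real) + gradient penalty
    optimizerD.zero_grad()
    fake = netG(torch.randn(real.size(0), z_noise, 1, 1)).detach()
    d_loss = netD(fake)[0].mean() - netD(real)[0].mean() \
             + gradient_penalty(netD, real, fake)
    d_loss.backward()
    optimizerD.step()

    # generator update once every 5 critic updates: minimise -D(G(z))
    if i % 5 == 0:
        optimizerG.zero_grad()
        g_loss = -netD(netG(torch.randn(real.size(0), z_noise, 1, 1)))[0].mean()
        g_loss.backward()
        optimizerG.step()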

This happens when using LSGAN or WGAN-GP irrespective of the structure, whether both G and D are normal DCGANs or when using the modified DCGAN below, the Creative Adversarial Network (CAN), which requires that D be able to classify images and that G generate ambiguous images. It does this through an additional K-way classification loss, for which I’m using nn.CrossEntropyLoss, added to D_loss (see the sketch below).
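Concretely, the combined discriminator loss is put together along these lines (a sketch; names like y_labels and noise are placeholders for the style labels and latent batch):

import torch
import torch.nn as nn

rf_criterion = nn.BCELoss()             # real/fake loss (DCGAN version)
cls_criterion = nn.CrossEntropyLoss()   # K-way style classification loss

real_out, style = netD(real)            # D returns (real/fake score, style output)
d_real_loss = rf_criterion(real_out, torch.ones_like(real_out))
d_style_loss = cls_criterion(style, y_labels)

fake_out, _ = netD(netG(noise).detach())
d_fake_loss = rf_criterion(fake_out, torch.zeros_like(fake_out))

D_loss = d_real_loss + d_fake_loss + d_style_loss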

I also get erratic loss behavior (G and D losses not steadily decreasing, but going up and down) in ‘normal’ DCGAN versions of my GAN, with nn.BCELoss and the following D and G networks:

class Can64Discriminator(nn.Module):
    def __init__(self, channels, y_dim, num_disc_filters):
        super(Can64Discriminator, self).__init__()
        self.ngpu = 1
        # Shared convolutional feature extractor
        self.conv = nn.Sequential(
            nn.Conv2d(channels, num_disc_filters // 2, 4, 2, 1, bias=False),
            nn.LeakyReLU(0.2, inplace=True),

            nn.Conv2d(num_disc_filters // 2, num_disc_filters, 4, 2, 1, bias=False),
            nn.BatchNorm2d(num_disc_filters),
            nn.LeakyReLU(0.2, inplace=True),

            nn.Conv2d(num_disc_filters, num_disc_filters * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(num_disc_filters * 2),
            nn.LeakyReLU(0.2, inplace=True),

            nn.Conv2d(num_disc_filters * 2, num_disc_filters * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(num_disc_filters * 4),
            nn.LeakyReLU(0.2, inplace=True),

            nn.Conv2d(num_disc_filters * 4, num_disc_filters * 8, 4, 1, 0, bias=False),
            nn.BatchNorm2d(num_disc_filters * 8),
            nn.LeakyReLU(0.2, inplace=True),
        )

        # Real/fake head
        self.real_fake_head = nn.Linear(num_disc_filters * 8, 1)

        # no bn and lrelu needed
        self.sig = nn.Sigmoid()

        # K-way style classification head
        self.fc = nn.Sequential()
        self.fc.add_module("linear_layer{0}".format(num_disc_filters * 16), nn.Linear(num_disc_filters * 8, num_disc_filters * 16))
        self.fc.add_module("linear_layer{0}".format(num_disc_filters * 8), nn.Linear(num_disc_filters * 16, num_disc_filters * 8))
        self.fc.add_module("linear_layer{0}".format(num_disc_filters), nn.Linear(num_disc_filters * 8, y_dim))
        self.fc.add_module('softmax', nn.Softmax(dim=1))

    def forward(self, inp):
        x = self.conv(inp)
        x = x.view(x.size(0), -1)
        real_out = self.sig(self.real_fake_head(x))
        real_out = real_out.view(-1, 1).squeeze(1)
        style = self.fc(x)
        return real_out, style

class Can64Generator(nn.Module):
    def __init__(self, z_noise, channels, num_gen_filters):
        super(Can64Generator, self).__init__()
        self.ngpu = 1
        self.main = nn.Sequential(
            nn.ConvTranspose2d(z_noise, num_gen_filters * 16, 4, 1, 0, bias=False),
            nn.BatchNorm2d(num_gen_filters * 16),
            nn.ReLU(True),
            nn.ConvTranspose2d(num_gen_filters * 16, num_gen_filters * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(num_gen_filters * 4),
            nn.ReLU(True),
            nn.ConvTranspose2d(num_gen_filters * 4, num_gen_filters * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(num_gen_filters * 2),
            nn.ReLU(True),
            nn.ConvTranspose2d(num_gen_filters * 2, num_gen_filters, 4, 2, 1, bias=False),
            nn.BatchNorm2d(num_gen_filters),
            nn.ReLU(True),
            nn.ConvTranspose2d(num_gen_filters, 3, 4, 2, 1, bias=False),
            nn.Tanh()
        )

    def forward(self, inp):
        output = self.main(inp)
        return output

What could be causing this? I’d like to make as minimal a change as possible, since I want to compare the loss functions alone. Any help would be greatly appreciated.

Thanks in advance!

The fc of your discriminator looks a bit strange, as it is missing activation functions.
Basically you have a multi-layer linear transformation. Is this on purpose?

I’m not really familiar with LSGAN and WGANGP, but what kind of loss function are you currently using?
The softmax in the last layer of D looks a bit strange, as you would usually use raw logits or nn.LogSoftmax for nn.CrossEntropyLoss or nn.NLLLoss, respectively.

Yes, the fc is a bit strange, but the paper I’m referencing doesn’t mention any activation functions in those layers. Will definitely experiment with adding activation functions and seeing if that makes a difference.

I’m currently using nn.BCELoss for my primary GAN loss (i.e. the real vs. fake loss), and nn.CrossEntropyLoss for the additional multi-class classification loss. The LSGAN version uses nn.MSELoss instead of nn.BCELoss, but that’s the only meaningful difference between it and a standard (e.g. DC)GAN; see the sketch below.
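So the swap is literally just this (a toy sketch, with random numbers standing in for D’s sigmoid output and the usual 1/0 real/fake targets):

import torch
import torch.nn as nn

real_out = torch.rand(16)                        # D's sigmoid output for a batch of real images
real_labels = torch.ones_like(real_out)

bce_loss = nn.BCELoss()(real_out, real_labels)   # standard DCGAN real/fake loss
lsgan_loss = nn.MSELoss()(real_out, real_labels) # LSGAN: same targets, squared error instead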

Hmm, I’ll investigate using nn.LogSoftmax and see the results; out of interest, why is nn.LogSoftmax usually used?

Thanks again for the help! :smile:

That would be strange, as without any non-linearities in between, the stacked linear layers collapse to what is effectively a single linear transformation.

Sorry for not being clear enough.
For a basic multi-class classification task you would have two options:

  • raw logits (i.e. no non-linearity at the end of your model, just the linear layer outputs) + nn.CrossEntropyLoss
  • nn.LogSoftmax as the last layer + nn.NLLLoss.

Both options are equivalent, as nn.CrossEntropyLoss calls nn.LogSoftmax + nn.NLLLoss internally.
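A quick sanity check of the equivalence:

import torch
import torch.nn as nn

logits = torch.randn(4, 10)              # raw model outputs, no softmax applied
targets = torch.randint(0, 10, (4,))

ce = nn.CrossEntropyLoss()(logits, targets)
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), targets)
print(torch.allclose(ce, nll))           # True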


Yeah, it is pretty strange!

No worries! Thanks - will definitely investigate raw logits + nn.CrossEntropyLoss :smile:

I think I have been running into the same problem trying to implement WGAN-GP (I never tried WGAN or LSGAN). I’ve had success on my own data with what the author calls ‘GoodGenerator’ and ‘GoodDiscriminator’ on his GitHub (I took them from the nice PyTorch implementation here). You should definitely try those networks if you haven’t already.

I got decent results, but after adding a few more residual blocks I am running into the instability you describe. I don’t pay much attention to -D(G(z)) (a.k.a. loss_G), but I do watch D(G(z)) - D(x), and I see behaviour like you’re describing: D is doing fine, then a crash. D seems to recover its margin, but after that the samples I get from G let me know the thing is broken.

In a smaller pair of networks, I once cured this problem by totally removing normalization from D, but maybe that was a fluke just because it was a small network.
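To be concrete, by “removing normalization” I just mean dropping the batch norm layers from the critic blocks (the WGAN-GP paper itself avoids batch norm in the critic, since the penalty is computed per sample). Something like this sketch:

import torch.nn as nn

num_disc_filters = 64

# One critic block without batch norm; LayerNorm / InstanceNorm are common
# drop-in replacements if some normalization is still wanted.
critic_block = nn.Sequential(
    nn.Conv2d(num_disc_filters, num_disc_filters * 2, 4, 2, 1, bias=False),
    nn.InstanceNorm2d(num_disc_filters * 2, affine=True),  # or leave normalization out entirely
    nn.LeakyReLU(0.2, inplace=True),
)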

The thing I am most curious about is which network is screwing up first, and then: why?
I’ll be interested to know if you figure anything out about this instability, and whether your generator makes any good samples after the breakdown.