GAN Loss going Haywire

I am trying to implement the Spectral Normalization paper for GANs. I removed spectral normalization
from the discriminator, just to see its effect on the performance of the GAN. Run this way, the loss
of the generator started going haywire. I learned that I need to add some regularizer to the
discriminator, so I introduced a BN layer after each conv layer.
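For reference, a minimal sketch of a DCGAN-style discriminator with BatchNorm after the conv layers (layer sizes here are illustrative, not my actual model):

```python
import torch
import torch.nn as nn

# Sketch only: a small DCGAN-style discriminator for 32x32 RGB input,
# with BatchNorm inserted after the inner conv layer(s).
disc = nn.Sequential(
    nn.Conv2d(3, 64, 4, stride=2, padding=1),    # 32x32 -> 16x16
    nn.LeakyReLU(0.2),
    nn.Conv2d(64, 128, 4, stride=2, padding=1),  # 16x16 -> 8x8
    nn.BatchNorm2d(128),
    nn.LeakyReLU(0.2),
    nn.Conv2d(128, 1, 8),                        # 8x8 -> one logit per image
)

logits = disc(torch.randn(2, 3, 32, 32))
```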

I am still getting losses such as this:

disc loss 0.0 gen loss 15.3624906539917
disc loss 0.0 gen loss 20.93044662475586
disc loss 0.0 gen loss 21.227508544921875
disc loss 9.47e-05 gen loss 21.953622817993164
disc loss 9.21943e-07 gen loss 19.166460037231445
disc loss 0.0 gen loss 11.971349716186523
disc loss 3.78e-07 gen loss 21.656723022460938
disc loss 2.293e-08 gen loss 24.056575775146484
disc loss 0.0 gen loss 23.89940643310547
disc loss 6.073e-05 gen loss 20.947172164916992
disc loss 1.82e-08 gen loss 27.6265258789062

The discriminator loss (BCE) is reaching 0, whereas the generator loss stays high.
With spectral normalization, the losses of both the discriminator and the generator stay in the range (0, 2).

This observation definitely suggests that the discriminator has become ‘too strong’. Is
there some trick that can be employed?

Nitin Bansal

Hi Nitin,

great observation!
My very biased (ha) view on this is the following:
In fancy talk, spectral normalization limits the l_2 norm of the operators represented by each layer of the discriminator (i.e. with Euclidean (aka l_2) distance for both the inputs and the targets we have |f(x) - f(y)|_l2 <= C |x - y|_l2, and C could e.g. be one for linear layers). In that sense it is very similar to the original WGAN, which bounded the coefficients and thus limited the l_1-l_infinity norm of each layer (i.e. where you measure the input distance in l_1 and the output distance in l_infinity).

In this sense, one great achievement of the SNGAN authors is that they found an efficient way to match input and output norms in order to have tighter control over the activation norms (a mismatch really means you don’t have good control).

In contrast to that, WGAN-GP and SLOGAN attempt to enforce a Lipschitz constraint between input and output directly (without resorting to the individual layers). That has the advantage of matching the theoretical interpretation of the Wasserstein distance more closely, but the enforcement is much weaker, as it is only on a small sample rather than on the operators directly.
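The core of spectral normalization is just estimating the largest singular value of each weight matrix by power iteration and dividing the weight by it. A minimal numpy sketch (the paper amortizes this to one iteration per training step; here I run it to convergence for clarity):

```python
import numpy as np

def spectral_norm_estimate(W, n_iters=500, seed=0):
    """Estimate sigma_max(W) (the spectral norm) by power iteration."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(W.shape[0])
    for _ in range(n_iters):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    return u @ W @ v  # u^T W v approximates the largest singular value

rng = np.random.default_rng(1)
W = rng.standard_normal((64, 128))
sigma = spectral_norm_estimate(W)
W_sn = W / sigma  # the normalized weight has spectral norm ~ 1
```

In PyTorch you would get this for free by wrapping each layer with torch.nn.utils.spectral_norm.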

I can also offer a very simple, self-contained SNGAN notebook; in the same folder there are toy implementations of the WGAN-GP and SLOGAN penalties.

So basically, I see the leading options as:

  • implement a layerwise constraint as in WGAN and SNGAN;
    for variety, one could try limiting the “rowwise” absolute sums, i.e. the
    l_infinity, l_infinity - norm of the operators. I’m sure someone has tried that, published a paper, and invented a new name, but I didn’t look. When you do, be careful about PyTorch transposing weights etc.
  • implement a “full model” constraint as in WGAN-GP and SLOGAN. People do all sorts of things here; e.g. DRAGAN only limits the norm in a neighbourhood of the data, etc.
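For the second option, the “full model” constraint is typically the WGAN-GP gradient penalty: penalize the critic’s input-gradient norm at points interpolated between real and fake samples. A hedged sketch (names and the linear toy critic are illustrative, not from the paper’s code):

```python
import torch

def gradient_penalty(critic, real, fake):
    """WGAN-GP style penalty: push the critic's input-gradient norm
    towards 1 at random interpolates between real and fake samples."""
    eps = torch.rand(real.size(0), 1)
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = critic(x_hat)
    grads, = torch.autograd.grad(scores.sum(), x_hat, create_graph=True)
    return ((grads.norm(2, dim=1) - 1) ** 2).mean()

critic = torch.nn.Linear(16, 1)  # stand-in for a real discriminator
real = torch.randn(8, 16)
fake = torch.randn(8, 16)
gp = gradient_penalty(critic, real, fake)
# add lambda * gp to the critic loss (lambda = 10 in the WGAN-GP paper)
```

Note the `create_graph=True`: the penalty itself must be differentiable so it can be backpropagated into the critic’s parameters.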

Best regards


Thanks indeed, Thomas!

That was an exhaustive explanation! I will definitely go through
the resources you have pointed out, and I can definitely try some of the
options you have suggested.

Nitin Bansal

@tom I just had a follow-up question regarding the Inception Score and the Fréchet Inception Distance used to verify image quality. We usually use an Inception model pre-trained on ImageNet. But considering that we use the GAN on the CIFAR-10/100 datasets, which differ from ImageNet in distribution and image size, shouldn’t the Inception model used for the CIFAR datasets be a different one?
Any comments or suggestion on this front would be really appreciated.


I’m not much of an expert on the scoring, but if these are a proxy for “image quality”, I’d say that while it’s clear that ImageNet has higher-quality pictures than CIFAR-10/100, the criterion “quality” would be rather similar (“looking like natural photographs” as much as possible at the given resolution). In that sense, I don’t see this as the gravest imperfection of these metrics.
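For what it’s worth, the FID itself is just a distance between Gaussian fits of the feature statistics; the choice of feature extractor (ImageNet Inception pool3 activations, conventionally) enters only through the features you feed it. A minimal numpy/scipy sketch of the formula, on made-up stand-in features rather than real Inception activations:

```python
import numpy as np
from scipy import linalg

def fid(mu1, cov1, mu2, cov2):
    """Frechet distance between two Gaussians:
    ||mu1 - mu2||^2 + Tr(cov1 + cov2 - 2*sqrt(cov1*cov2))."""
    covmean = linalg.sqrtm(cov1 @ cov2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts from sqrtm
    diff = mu1 - mu2
    return diff @ diff + np.trace(cov1 + cov2 - 2 * covmean)

# Stand-in "activations"; in practice these would be Inception features
# of real and generated images.
rng = np.random.default_rng(0)
a = rng.standard_normal((500, 8))
b = rng.standard_normal((500, 8)) + 0.5  # shifted distribution
score = fid(a.mean(0), np.cov(a, rowvar=False),
            b.mean(0), np.cov(b, rowvar=False))
```

Identical statistics give a score of 0, and any shift between the two feature distributions makes it positive.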

Best regards