I’m not quite sure if this is the ideal forum for this question, but it seems the most likely place to get an answer.
I have been reimplementing the WGAN network for use in my own problem. I’ve gone through the paper and the GitHub code, and everything seems to match (in terms of operation order). However, when I clamp the parameters to the [-0.01, 0.01] cube, I end up with discriminator losses of -0.01 for both real and fake data, and a total discriminator error on the order of 10^-6 (i.e. floating-point precision differences).
When I increase the parameter clamping to [-0.1, 0.1], I get more varied losses (they’re not completely clamped), but they still hover around 0.1.
Basically I am finding that clamping the parameters of the network restricts my loss to that clamping interval as well. Hence, I can’t seem to understand the loss graphs from the paper where the discriminator loss starts around 1.5.
Has anyone else come across this phenomenon? The only part of WGAN that I’m not using is the custom weight initialization. Does that play a key role?
What the clamping is meant to achieve is a bound on the Lipschitz constant L of the mapping provided by the discriminator, i.e. you want |D(x_fake) - D(x_real)| <= L |x_fake - x_real| for some reasonable L, for example L = 1.
You could output the two norms above to check, and if the clamp to [-0.01, 0.01] turns out not to be appropriate for your architecture, do increase the bound.
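As a rough illustration of that check, here is a minimal PyTorch sketch. The helper names (make_critic, clamp_weights, lipschitz_ratios) and the tiny MLP are made up for the example, not taken from the WGAN code, and random tensors stand in for real and generated batches. It prints the largest observed ratio |D(x_fake) - D(x_real)| / |x_fake - x_real| for two clamp values.

```python
import torch
import torch.nn as nn

def make_critic(in_dim=8, hidden=16):
    # Hypothetical toy critic; substitute your own discriminator.
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

def clamp_weights(critic, c):
    # WGAN-style clipping: force every parameter into [-c, c].
    for p in critic.parameters():
        p.data.clamp_(-c, c)

def lipschitz_ratios(critic, x_real, x_fake, eps=1e-12):
    # Per-pair ratio |D(x_fake) - D(x_real)| / |x_fake - x_real| (L2 norms).
    with torch.no_grad():
        out_diff = (critic(x_fake) - critic(x_real)).flatten(1).norm(dim=1)
        in_diff = (x_fake - x_real).flatten(1).norm(dim=1)
    return out_diff / (in_diff + eps)

torch.manual_seed(0)
x_real = torch.randn(32, 8)  # stand-in for a batch of real samples
x_fake = torch.randn(32, 8)  # stand-in for a batch of generated samples

max_ratio = {}
for c in (0.01, 0.1):
    critic = make_critic()   # fresh critic so each clamp starts from the same kind of init
    clamp_weights(critic, c)
    ratios = lipschitz_ratios(critic, x_real, x_fake)
    max_ratio[c] = ratios.max().item()
    print(f"clamp={c}: max ratio = {max_ratio[c]:.6f}")
```

For a small network like this one, the ratios with c = 0.01 come out far below 1, which is consistent with the losses collapsing into a tiny range. The observed maximum is only a lower bound on the true Lipschitz constant, so treat it as a diagnostic rather than a guarantee.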
While the exact Lipschitz constant enforced by the clamping interval is not terribly important for theoretical purposes, there is a subtle interaction between the magnitude of the discriminator output, the clamping, and the size of the gradients / optimization steps. The WGAN-GP article actually lists as one of its achievements that by controlling L directly, it reduces the need to tune hyperparameters such as the clamping interval (even if they seem to need a different penalty weight in the toy example, which I think stems from the non-convexity of the penalty).
Thanks for the detailed answer! I read through WGAN-GP and implemented the gradient penalty instead of using parameter clamping, but I’m now seeing exploding gradients. I’m using the paper’s parameters (notably lambda = 10), yet my D_real and D_fake losses are staying fairly low, on the order of -1e-1 or -1e-2, while the gradient penalty is on the order of 10. It’s actually quite strange: all of a sudden, the loss will plummet from -10 or so to -2000 and then keep jumping. Maybe it’s the use of the Adam optimizer (WGAN warns that momentum-based optimizers can cause this, but their plots depict much lower magnitudes). In any case, it seems I’ve traded one problem for another without really fixing the loss. Are you familiar with this effect with WGAN-GP?
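For reference when comparing implementations, here is a minimal sketch of the gradient penalty term as I understand it from the WGAN-GP paper, written in PyTorch. The function name gradient_penalty and the toy linear critic are assumptions for illustration, not the authors' code. The norm is taken per sample (not over the whole batch), and the gradient norms are returned so they can be logged.

```python
import torch
import torch.nn as nn

def gradient_penalty(critic, x_real, x_fake, lam=10.0):
    # Hypothetical helper: lam * E[(||grad_x D(x_hat)||_2 - 1)^2] on random interpolates.
    batch = x_real.size(0)
    # One interpolation coefficient per sample, broadcast over the remaining dims.
    alpha = torch.rand(batch, *([1] * (x_real.dim() - 1)), device=x_real.device)
    x_hat = (alpha * x_real.detach() + (1 - alpha) * x_fake.detach()).requires_grad_(True)
    d_hat = critic(x_hat)
    grads, = torch.autograd.grad(
        outputs=d_hat,
        inputs=x_hat,
        grad_outputs=torch.ones_like(d_hat),
        create_graph=True,  # keep the graph so the penalty can be backpropagated
    )
    grad_norm = grads.flatten(1).norm(2, dim=1)  # one norm per sample
    penalty = lam * ((grad_norm - 1) ** 2).mean()
    return penalty, grad_norm

# Sanity check on a critic whose input gradient is known analytically:
# D(x) = 3 * x[0], so the gradient norm is 3 everywhere and the
# penalty should be 10 * (3 - 1)^2 = 40.
toy_critic = nn.Linear(4, 1, bias=False)
with torch.no_grad():
    toy_critic.weight.copy_(torch.tensor([[3.0, 0.0, 0.0, 0.0]]))

torch.manual_seed(0)
gp, norms = gradient_penalty(toy_critic, torch.randn(8, 4), torch.randn(8, 4))
print(gp.item(), norms.mean().item())
```

Logging the mean of grad_norm next to D_real and D_fake during training may help show whether the penalty is actually holding the gradient norm near 1 when the loss starts to jump.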
It is worth noting that I’ve just added LayerNorm to my discriminator per the WGAN-GP paper’s suggestion (shout out to the devs who unknowingly had spot-on timing merging that PR into Master today). Perhaps that will resolve the exploding gradients (and losses going to -30k)?
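A minimal sketch of what such a discriminator might look like, assuming a plain MLP (the layer sizes and the make_critic name are made up for the example). The relevant property is that nn.LayerNorm normalizes each sample on its own, so the output for one sample does not depend on the rest of the batch, which fits a penalty that is computed per sample.

```python
import torch
import torch.nn as nn

def make_critic(in_dim=8, hidden=32):
    # Hypothetical MLP critic with LayerNorm in place of BatchNorm.
    return nn.Sequential(
        nn.Linear(in_dim, hidden),
        nn.LayerNorm(hidden),
        nn.LeakyReLU(0.2),
        nn.Linear(hidden, hidden),
        nn.LayerNorm(hidden),
        nn.LeakyReLU(0.2),
        nn.Linear(hidden, 1),  # unbounded scalar score, no sigmoid
    )

torch.manual_seed(0)
critic = make_critic()
x = torch.randn(16, 8)

with torch.no_grad():
    full_batch = critic(x)   # scores computed with the whole batch
    single = critic(x[:1])   # score for the first sample on its own

# With LayerNorm the two should agree up to floating-point error.
same = torch.allclose(full_batch[:1], single, atol=1e-5)
print(same)
```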