Looking at the PyTorch implementation of DCGAN, but also at GANs in general…
To me it would seem intuitive to set G to train mode and D to eval mode while training G, and vice versa while training D. However, I don’t see this done in any GAN implementation.
Without switching between train/eval modes, it would seem that doing backprop on D(G(noise)) trains both D and G simultaneously. Is that incorrect?
Furthermore, switching between eval and train like I suggested above seems to destabilize the GAN: D gets 100% accuracy on real samples but 0% accuracy on fake samples.
Maybe .eval() makes D too powerful, or makes G too weak.
Maybe freezing D and G with param.requires_grad = False is better? Trying that out, though, seems to make little if any difference compared to the original training scheme.
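For reference, a minimal sketch of what freezing via requires_grad does. The model here is a stand-in nn.Module, not the tutorial's actual discriminator; the point is that freezing stops gradient accumulation in the frozen parameters, while gradients still flow *through* the module to whatever produced its input:

```python
import torch
import torch.nn as nn

# Toy stand-in for the discriminator (hypothetical, not the tutorial's netD)
netD = nn.Sequential(nn.Linear(8, 1))

def set_requires_grad(model, flag):
    # Toggles gradient *storage* for these parameters only; backward
    # through the module still propagates to earlier tensors/modules.
    for p in model.parameters():
        p.requires_grad = flag

set_requires_grad(netD, False)
x = torch.randn(4, 8, requires_grad=True)
netD(x).sum().backward()
print(netD[0].weight.grad)   # None: frozen params accumulate no gradient
print(x.grad is not None)    # True: gradient still flows through D
```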
model.eval() and model.train() do not change the gradient calculation or the backpropagation in general. Instead, the behavior of certain layers is changed. E.g. dropout layers are disabled during eval(), and batchnorm layers will use their internal running stats to normalize the input activations.
I think this is the reason why this particular tutorial keeps both models in train() mode the whole time.
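A small illustration of the point above, using toy layers: .eval() makes dropout an identity op and switches batchnorm to its running stats, but autograd is unaffected either way.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
bn = nn.BatchNorm1d(4)
x = torch.randn(8, 4)

drop.train()
train_out = drop(x)              # random elements zeroed, rest scaled by 1/(1-p)
drop.eval()
eval_out = drop(x)               # identity in eval mode
print(torch.equal(eval_out, x))  # True

bn.train()
bn(x)   # normalizes with batch statistics and updates the running stats
bn.eval()
bn(x)   # normalizes with the stored running stats instead
```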
I don’t think this would make any difference, since the output of the generator is already detached when the discriminator is trained as seen here.
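To make the detach point concrete, here is a toy sketch (netG/netD are placeholder linear layers mirroring the tutorial's roles): because D is trained on fake.detach(), backward stops at the detach boundary and G receives no gradient from that pass.

```python
import torch
import torch.nn as nn

# Placeholder generator and discriminator (hypothetical shapes)
netG = nn.Linear(3, 5)
netD = nn.Linear(5, 1)

noise = torch.randn(2, 3)
fake = netG(noise)

# Training D on the detached fake: the graph ends at fake.detach(),
# so no gradient reaches netG from this backward pass.
loss_d = netD(fake.detach()).mean()
loss_d.backward()
print(netG.weight.grad)              # None: G untouched
print(netD.weight.grad is not None)  # True: D receives gradients
```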
Thanks for clearing up some points for me ptr
Further, like you said, the reason for not freezing G is that we detach its output when training D to discriminate fake samples as fake. So freezing G at any time would be pointless.
While this was my original sentiment, some further training with D frozen here and unfrozen here seems to speed up the “convergence”.
If we look at the code I think this makes sense. What the DCGAN code does is essentially:
Train D that real → 1
Train D that fake → 0
Train D(G) that fake → 1
In one iteration it seems like we backprop through D so that it should learn fake → 0, but also fake → 1 (in the step where we still have the graph connecting fake to G). Maybe this would make D learn something in between, like fake → 0.5 or 0.25. Freezing D at those points might mitigate this.
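The three steps above can be sketched as a minimal training iteration. The names (netD, netG, criterion, the optimizers) are assumptions standing in for the tutorial's objects, and the models are toy linear layers:

```python
import torch
import torch.nn as nn

# Toy stand-ins for the tutorial's models (hypothetical shapes)
netG = nn.Linear(4, 8)
netD = nn.Sequential(nn.Linear(8, 1), nn.Sigmoid())
criterion = nn.BCELoss()
optD = torch.optim.SGD(netD.parameters(), lr=0.01)
optG = torch.optim.SGD(netG.parameters(), lr=0.01)

real = torch.randn(16, 8)
noise = torch.randn(16, 4)

# Steps 1 + 2: update D so that real -> 1 and (detached) fake -> 0
optD.zero_grad()
loss_real = criterion(netD(real), torch.ones(16, 1))
fake = netG(noise)
loss_fake = criterion(netD(fake.detach()), torch.zeros(16, 1))
(loss_real + loss_fake).backward()
optD.step()

# Step 3: update G so that D(fake) -> 1; gradients are computed in D
# here too, but they are never applied, since only optG.step() is called
optG.zero_grad()
loss_g = criterion(netD(fake), torch.ones(16, 1))
loss_g.backward()
optG.step()
```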
I am currently at work so I can’t share training results, but this does indeed look like an improvement.
On another note, can we still train networks then even if they are in eval mode?
Freezing the parameters of the discriminator could yield a speed improvement, but shouldn’t change the convergence, since the gradients in D are zeroed out in each iteration (but you are right that the gradients are computed without being used).
Yes, that’s possible.
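A quick check of this on a toy module: .eval() only changes layer behavior (dropout, batchnorm), so backward and optimizer steps work exactly as in train mode.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 1)
model.eval()  # changes layer behavior only, not autograd

out = model(torch.randn(2, 4)).sum()
out.backward()
print(model.weight.grad is not None)  # True: gradients flow in eval mode
```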
Aha, so this backprop is not affecting D’s parameters because we don’t call optimizerD.step() after it?
Yes, exactly, and in the next iteration the gradients in netD will be zeroed here. So you might see a speed difference, but shouldn’t see any difference in the parameter updates or convergence.
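A toy demonstration of this last point (assumed names, not the tutorial's exact code): gradients accumulated in D during the G update leave D's parameters untouched without a step(), and zero_grad() discards them before the next D update.

```python
import torch
import torch.nn as nn

netD = nn.Linear(4, 1)
opt = torch.optim.SGD(netD.parameters(), lr=0.1)

# Gradients from the G-update pass accumulate in netD ...
netD(torch.randn(2, 4)).sum().backward()
before = netD.weight.clone()

# ... but without opt.step() the parameters stay unchanged,
# and zero_grad() discards the gradients before the next D pass.
# (set_to_none=False keeps the grad tensors so we can inspect them.)
opt.zero_grad(set_to_none=False)
print(torch.equal(before, netD.weight))     # True: no update happened
print(netD.weight.grad.abs().sum().item())  # 0.0: gradients wiped
```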