Looking at the PyTorch implementation of DCGAN, but also at GANs in general…
To me it would seem intuitive to set G to train mode and D to eval mode while training G, and vice versa while training D. However, I don’t see this done in any GAN implementation.
Without switching between train/eval modes, it would seem that doing backprop on D(G(noise)) trains both D and G simultaneously. Is that incorrect?
Furthermore, switching between eval and train like I suggested above seems to destabilize the GAN: D gets 100% accuracy on real samples but 0% accuracy on fake samples.
Maybe .eval() makes D too powerful, or makes G too weak.
Maybe freezing D and G with param.requires_grad = False is better? Trying that out, though, seems to make little if any difference compared to the original training scheme.
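For reference, a minimal sketch of what freezing via requires_grad does. The model here is a stand-in nn.Module, not the tutorial's actual discriminator; the point is that freezing stops gradient accumulation in the frozen parameters, while gradients still flow *through* the module to whatever produced its input:

```python
import torch
import torch.nn as nn

# Toy stand-in for the discriminator (hypothetical, not the tutorial's netD)
netD = nn.Sequential(nn.Linear(8, 1))

def set_requires_grad(model, flag):
    # Toggles gradient *storage* for these parameters only; backward
    # through the module still propagates to earlier tensors/modules.
    for p in model.parameters():
        p.requires_grad = flag

set_requires_grad(netD, False)
x = torch.randn(4, 8, requires_grad=True)
netD(x).sum().backward()
print(netD[0].weight.grad)   # None: frozen params accumulate no gradient
print(x.grad is not None)    # True: gradient still flows through D
```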
model.eval() and model.train() do not change the gradient calculation or the backpropagation in general. Instead, the behavior of certain layers is changed. E.g. dropout layers are disabled during eval(), and batchnorm layers will use their internal running stats to normalize the input activations.
I think this is the reason why this particular tutorial keeps both models in train() mode the whole time.
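A small illustration of the point above, using toy layers: .eval() makes dropout an identity op and switches batchnorm to its running stats, but autograd is unaffected either way.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
bn = nn.BatchNorm1d(4)
x = torch.randn(8, 4)

drop.train()
train_out = drop(x)              # random elements zeroed, rest scaled by 1/(1-p)
drop.eval()
eval_out = drop(x)               # identity in eval mode
print(torch.equal(eval_out, x))  # True

bn.train()
bn(x)   # normalizes with batch statistics and updates the running stats
bn.eval()
bn(x)   # normalizes with the stored running stats instead
```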
I don’t think this would make any difference, since the output of the generator is already detached when the discriminator is trained as seen here.
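To make the detach point concrete, here is a toy sketch (netG/netD are placeholder linear layers mirroring the tutorial's roles): because D is trained on fake.detach(), backward stops at the detach boundary and G receives no gradient from that pass.

```python
import torch
import torch.nn as nn

# Placeholder generator and discriminator (hypothetical shapes)
netG = nn.Linear(3, 5)
netD = nn.Linear(5, 1)

noise = torch.randn(2, 3)
fake = netG(noise)

# Training D on the detached fake: the graph ends at fake.detach(),
# so no gradient reaches netG from this backward pass.
loss_d = netD(fake.detach()).mean()
loss_d.backward()
print(netG.weight.grad)              # None: G untouched
print(netD.weight.grad is not None)  # True: D receives gradients
```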
Thanks for clearing up some points for me ptr
Further, like you said, the reason for not freezing G is that we detach its output when training D to discriminate fake samples as fake. So freezing G at any time would be pointless.
While this was my original sentiment, some further training with D frozen here and unfrozen here seems to speed up the “convergence”.
If we look at the code I think this makes sense. What the DCGAN code does is essentially:
Train D that real → 1
Train D that fake → 0
Train D(G) that fake → 1
In one iteration it seems like we backprop through D so that it should learn fake → 0, but also fake → 1 (in the step where we still have the graph connecting fake to G). Maybe this would make D learn something in between, like fake → 0.5 or 0.25. Freezing D at those points might mitigate this.
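The three steps above can be sketched as a minimal training iteration. The names (netD, netG, criterion, the optimizers) are assumptions standing in for the tutorial's objects, and the models are toy linear layers:

```python
import torch
import torch.nn as nn

# Toy stand-ins for the tutorial's models (hypothetical shapes)
netG = nn.Linear(4, 8)
netD = nn.Sequential(nn.Linear(8, 1), nn.Sigmoid())
criterion = nn.BCELoss()
optD = torch.optim.SGD(netD.parameters(), lr=0.01)
optG = torch.optim.SGD(netG.parameters(), lr=0.01)

real = torch.randn(16, 8)
noise = torch.randn(16, 4)

# Steps 1 + 2: update D so that real -> 1 and (detached) fake -> 0
optD.zero_grad()
loss_real = criterion(netD(real), torch.ones(16, 1))
fake = netG(noise)
loss_fake = criterion(netD(fake.detach()), torch.zeros(16, 1))
(loss_real + loss_fake).backward()
optD.step()

# Step 3: update G so that D(fake) -> 1; gradients are computed in D
# here too, but they are never applied, since only optG.step() is called
optG.zero_grad()
loss_g = criterion(netD(fake), torch.ones(16, 1))
loss_g.backward()
optG.step()
```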
I am currently at work so I can’t share training results, but this does indeed look like an improvement.
On another note, can we still train networks then even if they are in eval mode?
Freezing the parameters of the discriminator could yield a speed improvement, but shouldn’t change the convergence, since the gradients in D are zeroed out in each iteration (but you are right that the gradients are computed without being used).
Yes, that’s possible.
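A quick check of this on a toy module: .eval() only changes layer behavior (dropout, batchnorm), so backward and optimizer steps work exactly as in train mode.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 1)
model.eval()  # changes layer behavior only, not autograd

out = model(torch.randn(2, 4)).sum()
out.backward()
print(model.weight.grad is not None)  # True: gradients flow in eval mode
```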
Aha, so this backprop is not affecting D’s parameters because we don’t call optimizerD.step() after it?
Yes, exactly, and in the next iteration the gradients in netD will be zeroed here. So you might see a speed difference, but shouldn’t see any difference in the parameter updates or convergence.
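A toy demonstration of this last point (assumed names, not the tutorial's exact code): gradients accumulated in D during the G update leave D's parameters untouched without a step(), and zero_grad() discards them before the next D update.

```python
import torch
import torch.nn as nn

netD = nn.Linear(4, 1)
opt = torch.optim.SGD(netD.parameters(), lr=0.1)

# Gradients from the G-update pass accumulate in netD ...
netD(torch.randn(2, 4)).sum().backward()
before = netD.weight.clone()

# ... but without opt.step() the parameters stay unchanged,
# and zero_grad() discards the gradients before the next D pass.
# (set_to_none=False keeps the grad tensors so we can inspect them.)
opt.zero_grad(set_to_none=False)
print(torch.equal(before, netD.weight))     # True: no update happened
print(netD.weight.grad.abs().sum().item())  # 0.0: gradients wiped
```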