I am trying to train a 3D GAN using a 2D discriminator that takes slices of the original data,
and wanted to get your opinion on two points:
1- Is it better to have 3 discriminators, one per plane, or a single discriminator that takes an embedding of the plane as input?
2- My current implementation is something like this:
disc real training backprop
disc fake training backprop
R1 regularisation backprop
gen training backprop
What would be the expected effect of summing up the losses and doing one backprop per model? Which method is better?
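For reference, a minimal PyTorch sketch of the step order described above, with one backward per loss. This assumes a non-saturating GAN loss and an R1 penalty on the real logits; `gen`, `disc`, `opt_d`, and `opt_g` are tiny stand-in modules, not the actual 3D models:

```python
import torch
import torch.nn as nn

# Stand-in modules; the real setup would use a 3D generator and 2D discriminator.
gen = nn.Linear(8, 16)
disc = nn.Linear(16, 1)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)
opt_g = torch.optim.Adam(gen.parameters(), lr=1e-4)
softplus = nn.functional.softplus

real = torch.randn(4, 16)
z = torch.randn(4, 8)

# 1) disc real training backprop
opt_d.zero_grad()
real.requires_grad_(True)                  # needed for the R1 gradient penalty
real_logits = disc(real)
real_loss = softplus(-real_logits).mean()
real_loss.backward(retain_graph=True)      # keep the graph alive for R1

# 2) disc fake training backprop (detach so gen receives no gradients)
fake = gen(z).detach()
fake_loss = softplus(disc(fake)).mean()
fake_loss.backward()

# 3) R1 regularisation backprop (penalise gradient of real logits w.r.t. input)
grad_real, = torch.autograd.grad(real_logits.sum(), real, create_graph=True)
r1 = 0.5 * grad_real.pow(2).sum(dim=1).mean()
r1.backward()
opt_d.step()

# 4) gen training backprop
opt_g.zero_grad()
gen_loss = softplus(-disc(gen(z))).mean()
gen_loss.backward()
opt_g.step()
```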
for the first question:
I would say each strategy has its drawbacks. If you go with 3 discriminators, that will cost more compute, and you'll have to balance them carefully or you may end up with one discriminator dominating the others; given GANs' reputation, you'll need careful tuning.
If you choose a single discriminator, how you design the aggregation will matter most.
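As a rough sketch of the single-discriminator option from the question, here is one way a plane embedding could be wired in. The projection-style conditioning, the channel sizes, and all names are illustrative assumptions, not a prescription:

```python
import torch
import torch.nn as nn

class PlaneConditionedDisc(nn.Module):
    """One 2D conv trunk shared by all planes; a learned embedding of the
    plane index (e.g. 0=axial, 1=coronal, 2=sagittal) conditions the score."""
    def __init__(self, ch=32):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(1, ch, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.plane_emb = nn.Embedding(3, ch)   # one vector per plane
        self.head = nn.Linear(ch, 1)

    def forward(self, slices, plane_idx):
        h = self.trunk(slices)                 # (B, ch) features per slice
        # projection-discriminator-style conditioning on the plane identity
        proj = (h * self.plane_emb(plane_idx)).sum(dim=1, keepdim=True)
        return self.head(h) + proj

disc = PlaneConditionedDisc()
x = torch.randn(4, 1, 64, 64)                  # a batch of 2D slices
planes = torch.tensor([0, 1, 2, 0])            # which plane each slice came from
logits = disc(x, planes)                       # (4, 1)
```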
one idea I would suggest giving a shot:
for each layer of the 2D discriminator:
pass all slices through the discriminator layer,
then pass all slices as a 3D block through an NdLinear block, acting as a communication layer,
and repeat.
I think NdLinear layers offer a good trade-off in this case: you get regular communication points between the discriminator layers with minimal extra parameters (check the paper). Of course, other intermediate layers can do the trick too.
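A simplified sketch of that interleaving, using a plain linear mix along the slice axis as a stand-in for a full NdLinear block (see the paper for the per-axis factorised version). All shapes, layer counts, and names here are illustrative:

```python
import torch
import torch.nn as nn

class SliceMix(nn.Module):
    """Linear layer applied along the depth/slice dimension only:
    a cheap communication layer between otherwise independent 2D slices."""
    def __init__(self, n_slices):
        super().__init__()
        self.mix = nn.Linear(n_slices, n_slices)

    def forward(self, x):              # x: (B, C, D, H, W)
        x = x.movedim(2, -1)           # (B, C, H, W, D)
        x = self.mix(x)                # mix information across slices
        return x.movedim(-1, 2)        # back to (B, C, D, H, W)

class InterleavedDisc(nn.Module):
    def __init__(self, n_slices=16, ch=16):
        super().__init__()
        self.conv1 = nn.Conv2d(1, ch, 3, padding=1)
        self.comm1 = SliceMix(n_slices)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)
        self.comm2 = SliceMix(n_slices)
        self.head = nn.Linear(ch, 1)

    def forward(self, vol):            # vol: (B, 1, D, H, W)
        b, _, d, h, w = vol.shape
        x = vol
        for conv, comm in [(self.conv1, self.comm1), (self.conv2, self.comm2)]:
            # fold slices into the batch so the 2D conv sees each slice alone
            y = x.movedim(2, 1).reshape(b * d, -1, h, w)   # (B*D, C, H, W)
            y = torch.relu(conv(y))
            c = y.shape[1]
            x = y.reshape(b, d, c, h, w).movedim(1, 2)     # (B, C, D, H, W)
            x = comm(x)                                    # cross-slice communication
        return self.head(x.mean(dim=(2, 3, 4)))            # one score per volume

disc = InterleavedDisc(n_slices=16)
scores = disc(torch.randn(2, 1, 16, 32, 32))               # (2, 1)
```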
for your second question,
I would recommend summing all the losses and backpropagating once per model:
(real_loss + fake_loss + r1_reg).backward(); gen_loss.backward()
As written, this won’t work (at least not without further explanation of the intended details).
When you call fake_loss.backward(), you will populate gen’s parameters with gradients
from fake_loss that won’t be correct for training gen. This will also free the computation
graph built during the forward pass through gen. (Calling gen_loss.backward() will
analogously populate disc with incorrect gradients.)
(However, you can call (real_loss + fake_loss + r1_reg).backward() if it is done solely
for the purpose of training disc, for example by calling opt_disc.step() after disc's
gradients have been populated.)
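A minimal sketch of that corrected pattern, again with stand-in modules and a non-saturating loss. The key details: detach gen's output before the disc fake loss, and recompute disc(gen(z)) without detaching for the gen loss:

```python
import torch
import torch.nn as nn

# Stand-in modules; names mirror the discussion above, not a real model.
gen = nn.Linear(8, 16)
disc = nn.Linear(16, 1)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)
opt_g = torch.optim.Adam(gen.parameters(), lr=1e-4)
sp = nn.functional.softplus

real = torch.randn(4, 16, requires_grad=True)   # requires_grad for the R1 term
z = torch.randn(4, 8)

# --- disc step: one backward for real + fake + R1 ---
opt_d.zero_grad()
real_logits = disc(real)
fake = gen(z).detach()                          # block gradients into gen
real_loss = sp(-real_logits).mean()
fake_loss = sp(disc(fake)).mean()
grad_real, = torch.autograd.grad(real_logits.sum(), real, create_graph=True)
r1_reg = 0.5 * grad_real.pow(2).sum(dim=1).mean()
(real_loss + fake_loss + r1_reg).backward()     # single backward for disc
opt_d.step()

# --- gen step: its own backward ---
opt_g.zero_grad()
gen_loss = sp(-disc(gen(z))).mean()             # fresh forward, no detach
gen_loss.backward()                             # also writes grads into disc,
opt_g.step()                                    # discarded by the next zero_grad()
```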
the images tend to have a lot of noise and they are not sharp; that is on top of the balancing issues and parameter fine-tuning.
I am not done experimenting with the single-model setup yet.
> I think NdLinear layers offer a good trade-off in this case
I am not sure about using NdLinear here, because the 3D images are quite big, and multiplied by the batch size they will fill up memory quite rapidly.
As for the second one, the trade-off is that the original NVIDIA code did 3 backprops for the disc, which is memory-intensive and forces you to retain the computation graph in memory for a long time, but you take three optimizer steps.
To my knowledge, loss accumulation should lead to the same direction but with a lesser magnitude, so I was a bit confused about their choice. Maybe they wanted the multi-disc-step methodology but without skipping any generator training iteration?
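For what it's worth, summing losses before a single backward() gives exactly the same accumulated gradients as sequential backward() calls; what changes in the multi-step version is the optimizer step taken between backprops, which updates the parameters mid-iteration. A quick check on a toy tensor:

```python
import torch

w = torch.randn(5, requires_grad=True)
x = torch.randn(5)

def losses(w):
    # three arbitrary toy losses sharing the same parameter
    return (w * x).sum(), (w ** 2).sum(), w.abs().sum()

# three separate backwards: gradients accumulate in w.grad
a, b, c = losses(w)
a.backward(); b.backward(); c.backward()
g_seq = w.grad.clone()

# one backward on the summed loss
w.grad = None
a, b, c = losses(w)
(a + b + c).backward()
g_sum = w.grad.clone()

same = torch.allclose(g_seq, g_sum)
print(same)   # True: identical gradients, not just the same direction
```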