I am trying to train a 3D GAN using a 2D discriminator that takes slices of the original data,
and wanted to get your opinion on two points:
1- Is it better to have 3 discriminators, one per plane, or a single discriminator that takes an embedding of the plane as input?
2- My current implementation is something like this:
disc real training backprop
disc fake training backprop
R1 regularisation backprop
gen training backprop
What would be the expected effect of summing up the losses and doing one backprop per model? Which method is better?
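For reference, a minimal PyTorch sketch of the step order described above, with one backward per loss. This assumes a non-saturating GAN loss and an R1 penalty on the real logits; `gen`, `disc`, `opt_d`, and `opt_g` are tiny stand-in modules, not the actual 3D models:

```python
import torch
import torch.nn as nn

# Stand-in modules; the real setup would use a 3D generator and 2D discriminator.
gen = nn.Linear(8, 16)
disc = nn.Linear(16, 1)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)
opt_g = torch.optim.Adam(gen.parameters(), lr=1e-4)
softplus = nn.functional.softplus

real = torch.randn(4, 16)
z = torch.randn(4, 8)

# 1) disc real training backprop
opt_d.zero_grad()
real.requires_grad_(True)                  # needed for the R1 gradient penalty
real_logits = disc(real)
real_loss = softplus(-real_logits).mean()
real_loss.backward(retain_graph=True)      # keep the graph alive for R1

# 2) disc fake training backprop (detach so gen receives no gradients)
fake = gen(z).detach()
fake_loss = softplus(disc(fake)).mean()
fake_loss.backward()

# 3) R1 regularisation backprop (penalise gradient of real logits w.r.t. input)
grad_real, = torch.autograd.grad(real_logits.sum(), real, create_graph=True)
r1 = 0.5 * grad_real.pow(2).sum(dim=1).mean()
r1.backward()
opt_d.step()

# 4) gen training backprop
opt_g.zero_grad()
gen_loss = softplus(-disc(gen(z))).mean()
gen_loss.backward()
opt_g.step()
```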
for the first question:
I would say each strategy has its drawbacks. If you go with 3 discriminators, that will cost more compute, and you'll have to balance them carefully or you may end up with one discriminator dominating the others; given GANs' reputation, you'll need careful tuning.
If you choose a single discriminator, how you design the aggregation will matter most.
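As a rough sketch of the single-discriminator option from the question, here is one way a plane embedding could be wired in. The projection-style conditioning, the channel sizes, and all names are illustrative assumptions, not a prescription:

```python
import torch
import torch.nn as nn

class PlaneConditionedDisc(nn.Module):
    """One 2D conv trunk shared by all planes; a learned embedding of the
    plane index (e.g. 0=axial, 1=coronal, 2=sagittal) conditions the score."""
    def __init__(self, ch=32):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(1, ch, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.plane_emb = nn.Embedding(3, ch)   # one vector per plane
        self.head = nn.Linear(ch, 1)

    def forward(self, slices, plane_idx):
        h = self.trunk(slices)                 # (B, ch) features per slice
        # projection-discriminator-style conditioning on the plane identity
        proj = (h * self.plane_emb(plane_idx)).sum(dim=1, keepdim=True)
        return self.head(h) + proj

disc = PlaneConditionedDisc()
x = torch.randn(4, 1, 64, 64)                  # a batch of 2D slices
planes = torch.tensor([0, 1, 2, 0])            # which plane each slice came from
logits = disc(x, planes)                       # (4, 1)
```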
one idea I would suggest giving a shot:
for each layer of the 2D discriminator:
pass all slices through the discriminator layer,
then pass all slices as a 3D block through an NdLinear block, acting as a communication layer,
and repeat.
I think NdLinear layers offer a good trade-off in this case: you get regular communication points between the discriminator layers with minimal extra parameters (check the paper). Of course, other intermediate layers can do the trick too.
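A simplified sketch of that interleaving, using a plain linear mix along the slice axis as a stand-in for a full NdLinear block (see the paper for the per-axis factorised version). All shapes, layer counts, and names here are illustrative:

```python
import torch
import torch.nn as nn

class SliceMix(nn.Module):
    """Linear layer applied along the depth/slice dimension only:
    a cheap communication layer between otherwise independent 2D slices."""
    def __init__(self, n_slices):
        super().__init__()
        self.mix = nn.Linear(n_slices, n_slices)

    def forward(self, x):              # x: (B, C, D, H, W)
        x = x.movedim(2, -1)           # (B, C, H, W, D)
        x = self.mix(x)                # mix information across slices
        return x.movedim(-1, 2)        # back to (B, C, D, H, W)

class InterleavedDisc(nn.Module):
    def __init__(self, n_slices=16, ch=16):
        super().__init__()
        self.conv1 = nn.Conv2d(1, ch, 3, padding=1)
        self.comm1 = SliceMix(n_slices)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)
        self.comm2 = SliceMix(n_slices)
        self.head = nn.Linear(ch, 1)

    def forward(self, vol):            # vol: (B, 1, D, H, W)
        b, _, d, h, w = vol.shape
        x = vol
        for conv, comm in [(self.conv1, self.comm1), (self.conv2, self.comm2)]:
            # fold slices into the batch so the 2D conv sees each slice alone
            y = x.movedim(2, 1).reshape(b * d, -1, h, w)   # (B*D, C, H, W)
            y = torch.relu(conv(y))
            c = y.shape[1]
            x = y.reshape(b, d, c, h, w).movedim(1, 2)     # (B, C, D, H, W)
            x = comm(x)                                    # cross-slice communication
        return self.head(x.mean(dim=(2, 3, 4)))            # one score per volume

disc = InterleavedDisc(n_slices=16)
scores = disc(torch.randn(2, 1, 16, 32, 32))               # (2, 1)
```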
for your second question,
I would recommend summing all the losses and backpropagating once per model:
(real_loss + fake_loss + r1_reg).backward(); gen_loss.backward()
As written, this won’t work (at least not without further explanation of the intended details).
When you call fake_loss.backward(), you will populate gen’s parameters with gradients
from fake_loss that won’t be correct for training gen. This will also free the computation
graph built during the forward pass through gen. (Calling gen_loss.backward() will
analogously populate disc with incorrect gradients.)
(However, you can call (real_loss + fake_loss + r1_reg).backward() if it is done solely
for the purpose of training disc, for example by calling opt_disc.step() after disc's
gradients have been populated.)
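A minimal sketch of that corrected pattern, again with stand-in modules and a non-saturating loss. The key details: detach gen's output before the disc fake loss, and recompute disc(gen(z)) without detaching for the gen loss:

```python
import torch
import torch.nn as nn

# Stand-in modules; names mirror the discussion above, not a real model.
gen = nn.Linear(8, 16)
disc = nn.Linear(16, 1)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)
opt_g = torch.optim.Adam(gen.parameters(), lr=1e-4)
sp = nn.functional.softplus

real = torch.randn(4, 16, requires_grad=True)   # requires_grad for the R1 term
z = torch.randn(4, 8)

# --- disc step: one backward for real + fake + R1 ---
opt_d.zero_grad()
real_logits = disc(real)
fake = gen(z).detach()                          # block gradients into gen
real_loss = sp(-real_logits).mean()
fake_loss = sp(disc(fake)).mean()
grad_real, = torch.autograd.grad(real_logits.sum(), real, create_graph=True)
r1_reg = 0.5 * grad_real.pow(2).sum(dim=1).mean()
(real_loss + fake_loss + r1_reg).backward()     # single backward for disc
opt_d.step()

# --- gen step: its own backward ---
opt_g.zero_grad()
gen_loss = sp(-disc(gen(z))).mean()             # fresh forward, no detach
gen_loss.backward()                             # also writes grads into disc,
opt_g.step()                                    # discarded by the next zero_grad()
```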
the images tend to have a lot of noise and they are not sharp; that is on top of the balancing issues and parameter fine-tuning.
I am not done experimenting with the single-model setup yet.
> I think NdLinear layers offer a good trade-off in this case
I am not sure about using NdLinear here, because the 3D images are quite big, and multiplied by the batch size they will fill up memory quite rapidly.
As for the second one, the trade-off is that the original NVIDIA code did 3 backprops for the disc, which is memory-intensive and forces you to retain the computation graph in memory for a long time, but you take three optimizer steps.
To my knowledge, loss accumulation should lead to the same direction but with a lesser magnitude, so I was a bit confused about their choice. Maybe they wanted the multi-disc-step methodology but without skipping any generator training iteration?
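For what it's worth, summing losses before a single backward() gives exactly the same accumulated gradients as sequential backward() calls; what changes in the multi-step version is the optimizer step taken between backprops, which updates the parameters mid-iteration. A quick check on a toy tensor:

```python
import torch

w = torch.randn(5, requires_grad=True)
x = torch.randn(5)

def losses(w):
    # three arbitrary toy losses sharing the same parameter
    return (w * x).sum(), (w ** 2).sum(), w.abs().sum()

# three separate backwards: gradients accumulate in w.grad
a, b, c = losses(w)
a.backward(); b.backward(); c.backward()
g_seq = w.grad.clone()

# one backward on the summed loss
w.grad = None
a, b, c = losses(w)
(a + b + c).backward()
g_sum = w.grad.clone()

same = torch.allclose(g_seq, g_sum)
print(same)   # True: identical gradients, not just the same direction
```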