How to skip a part of the model during backpropagation?

Hi there,
I’m new to PyTorch and I’d like to implement a GAN model. I see there’s a tutorial (DCGAN Tutorial — PyTorch Tutorials 1.12.1+cu102 documentation) that uses two optimizers to update the parameters of the generator and the discriminator separately, so that the gradient of the generator loss is not backpropagated into the discriminator parameters, and the gradient of the discriminator loss is not backpropagated into the generator parameters.
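For reference, the tutorial sets up the two optimizers along these lines (lr and beta1 are the tutorial’s hyperparameters):

import torch.optim as optim

# One optimizer per network, so each step only updates that network's parameters
optimizerD = optim.Adam(netD.parameters(), lr=lr, betas=(beta1, 0.999))
optimizerG = optim.Adam(netG.parameters(), lr=lr, betas=(beta1, 0.999))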

But what if I want to build a more complicated model that needs to share some lower layers between the generator and the discriminator? Since the two networks share many parameters, I can’t simply put their parameters into two separate optimizers. How can I train such a model correctly and efficiently?

Here’s the sample network structure:

class Generator(nn.Module):
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder     # shared network between generator and discriminator
        self.top = nn.Linear(...)  # generator head (linear layer)

    def forward(self, input):
        return self.top(self.encoder(input))

class Discriminator(nn.Module):
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder     # the same encoder instance as the generator's
        self.top = nn.Linear(...)  # discriminator head (linear layer)

    def forward(self, input):
        return self.top(self.encoder(input))

shared_encoder = Encoder()  # the shared lower layers (placeholder Encoder module)
netG = Generator(shared_encoder)
netD = Discriminator(shared_encoder)

training loop:

############################
# (1) Update D network: maximize log(D(x)) + log(1 - D(G(z)))
############################
netD.zero_grad()
## Train with all-real batch
# Calculate D's loss on the all-real batch
errD_real = criterion(netD(real_img), real_label)

## Train with all-fake batch
# Calculate D's loss on the all-fake batch; .detach() stops gradients
# from flowing back into netG here
errD_fake = criterion(netD(netG(noise).detach()), fake_label)

# Calculate gradients for D in backward pass
errD = errD_real + errD_fake
errD.backward()

############################
# (2) Update G network: maximize log(D(G(z)))
############################
netG.zero_grad()
# Since we just updated D, perform another forward pass of the all-fake batch
# through D; fake labels are treated as real for the generator cost
errG = criterion(netD(netG(noise)), real_label)
# !!! We don't need the gradient of errG with respect to netD's parameters,
# but we still need the gradient of errG with respect to netG (and, above,
# the gradient of errD with respect to netD). How can I skip backpropagation
# through netD for this loss when netG and netD share some lower layers?
# Calculate gradients for G
errG.backward()

When I compute the D_fake loss, the whole forward function is Discriminator_top(Discriminator_encoder(Generator_top(Generator_encoder(input)))), and I only need the gradients of Discriminator_top and Discriminator_encoder.

When I compute the G loss, the whole forward function is the same, but I only need the gradients of Generator_top and Generator_encoder.

Discriminator_encoder and Generator_encoder are the same network, so in every training step this shared encoder should receive a backward gradient from the D_fake loss (flowing to it as Discriminator_encoder) and from the G loss (flowing to it as Generator_encoder), and each loss should reach it along only that one path.
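Note that the usual trick of freezing netD’s parameters before the G step does not work here: the shared encoder’s parameters are the very same Parameter objects in both networks, so freezing them in netD freezes them for netG as well. A minimal sketch of that failure mode, assuming the shared wiring above:

# This naive freeze does NOT do what we want when the encoder is shared:
for p in netD.parameters():
    p.requires_grad_(False)  # also freezes shared_encoder's parameters
errG = criterion(netD(netG(noise)), real_label)
errG.backward()              # netG.top still gets a gradient, but the shared
                             # encoder receives no gradient from errG at all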

Hi Werther!

The best way I can think of doing this (if I correctly understand your use
case) is to break your backpropagation up into two pieces using two calls
to torch.autograd.grad() (rather than a single call to .backward()).

Here are two calls to autograd.grad() applied to a simplified version of
what I think you are trying to do:

>>> import torch
>>> print (torch.__version__)
1.12.0
>>>
>>> input = torch.tensor ([2.0, 10.0])
>>>
>>> p1 = torch.tensor ([1.0], requires_grad = True)   # separate-parameters version
>>> p2 = torch.tensor ([1.0], requires_grad = True)   # separate-parameters version
>>>
>>> o  = p1 * input + p2 * input**2   # run "model" -- vector output
>>> loss = o.sum()                    # scalar loss
>>> loss
tensor(116., grad_fn=<SumBackward0>)
>>>
>>> loss.backward()
>>> p1.grad
tensor([12.])
>>> p2.grad
tensor([104.])
>>>
>>> p = torch.tensor ([1.0], requires_grad = True)   # shared-parameter version
>>>
>>> o = p * input + p * input**2   # run shared-parameter "model"
>>> loss = o.sum()
>>> loss   # loss is the same
tensor(116., grad_fn=<SumBackward0>)
>>>
>>> loss.backward()
>>> p.grad   # grad gets accumulated for both halves of "model"
tensor([116.])
>>>
>>> p.grad = None
>>>
>>> o1 = p * input           # run shared-parameter "model" in two pieces -- first half
>>> o2 = o1 + p * input**2   # second half
>>> loss = o2.sum()
>>> loss   # loss is the same
tensor(116., grad_fn=<SumBackward0>)
>>>
>>> grad2 = torch.autograd.grad (loss, o1)       # gradient of second half of "model" with respect to o1
>>> grad2    # grad2, as expected, is not a scalar
(tensor([1., 1.]),)
>>> p.grad   # doesn't populate p.grad
>>> grad1 = torch.autograd.grad (o1, p, grad2)   # gradient of the first half of "model" with respect to p
>>> grad1    # same as p1.grad
(tensor([12.]),)
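
Translated to your GAN setting, the generator step might look roughly like this (an untested sketch; netG, netD, criterion, noise, real_label, and optimizerG are the names from your snippet and the tutorial):

fake = netG(noise)                        # first piece: generator (includes the shared encoder)
errG = criterion(netD(fake), real_label)  # second piece: discriminator

# gradient of the G loss with respect to the fake images -- this backpropagates
# through netD's graph, but populates no parameter .grad fields
grad_fake, = torch.autograd.grad(errG, fake)

# now push that gradient through the generator piece only
g_params = list(netG.parameters())
g_grads = torch.autograd.grad(fake, g_params, grad_fake, allow_unused=True)
for p, g in zip(g_params, g_grads):
    if g is not None:                     # allow_unused covers parameters off the path
        p.grad = g if p.grad is None else p.grad + g
optimizerG.step()

Because the shared encoder’s parameters are reached here through netG.parameters(), they receive exactly the generator-branch gradient, while netD’s own layers get no gradient from errG.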

As an aside, I’m not convinced that it makes sense to share layers between
your generator and discriminator. The two networks are doing rather different
things, and although it may well make sense for them to both have a layer
of the same architecture, it’s not clear to me that such layers should have
the same values for their parameters.

One could argue that the “features” your encoder layer learns to produce
make sense for both the generation and discrimination process, and that
it is helpful to train the same set of parameters for both, favoring sharing
the encoder.

But, typically, you input some sort of random noise into your generator, from
which I assume the generator’s encoder produces features, whereas you
input a structured image (or whatever the samples are) – either real or fake,
but still not random noise – into the discriminator’s encoder. Even if you
want the encoder to produce similar features in both cases, my intuition tells
me that the parameters you would use for producing features from random
noise are likely to be quite different from those you would use for producing
(similar) features from a structured image.

I’m sufficiently skeptical of the shared-encoder idea that I would want to see
a comparison of the shared-encoder approach with the separate-encoder
approach that shows that the former actually works better.

Best.

K. Frank

Thank you, Frank, that’s exactly the case. Yes, I am running a study to verify the effect of sharing generator and discriminator representations through a single encoder with huge capacity, which is why I’d like a correct and efficient way to break the backpropagation up into two pieces. I used to make a frozen copy of the discriminator when backpropagating the generator loss, but I suspect that isn’t efficient enough.
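Concretely, the frozen-copy workaround looked roughly like this (deepcopying netD on every generator step, which is what worries me about efficiency):

import copy

# Duplicate D and disable grads on the copy; note that deepcopy also copies
# the shared encoder, so netG keeps using the original, trainable encoder
netD_frozen = copy.deepcopy(netD)
for p in netD_frozen.parameters():
    p.requires_grad_(False)

errG = criterion(netD_frozen(netG(noise)), real_label)
errG.backward()   # gradients flow into netG (and the shared encoder) only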

I have not used the autograd.grad() method directly before; your solution inspired me a lot. Since it’s a big encoder with a huge amount of data, I usually train with the AMP module, so I wonder whether this solution is suitable for use with AMP?

o1 = p * input           # run shared-parameter "model" in two pieces -- first half
o2 = o1 + p * input**2   # second half
loss = o2.sum()

grad2 = torch.autograd.grad(loss, o1)       # gradient of second half of "model" with respect to o1

grad1 = torch.autograd.grad(o1, p, grad2)   # gradient of the first half of "model" with respect to p

Is it safe to convert the code above into the following?

scaler = torch.cuda.amp.GradScaler()
with torch.cuda.amp.autocast():
    o1 = p * input           # run shared-parameter "model" in two pieces -- first half
    o2 = o1 + p * input**2   # second half
    loss = o2.sum()

grad2 = torch.autograd.grad(scaler.scale(loss), o1)   # gradient of second half of "model" with respect to o1

grad1 = torch.autograd.grad(o1, p, grad2)             # gradient of the first half of "model" with respect to p

Thank you for your help!

Gratefully,

Werther