VAEGAN - Multiple losses and multiple networks training

Hi Martin!

I haven’t looked at your code in detail, but the likely cause is as follows:

opt_enc.step() performs inplace modifications on the model parameters it is optimizing.

But you then perform a second .backward() through the graph that you retained, whose
saved parameters have now been modified inplace, hence the error.
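
To make this concrete, here is a minimal sketch (nothing to do with your actual model,
just the bare pattern) that reproduces the error:

```python
import torch

# step() updates the parameters in place, so the second backward() through the
# retained graph finds that a saved tensor has changed and raises the error.
model = torch.nn.Sequential(torch.nn.Linear(2, 2), torch.nn.Linear(2, 2))
opt = torch.optim.SGD(model.parameters(), lr=0.1)

out = model(torch.randn(4, 2))

out.pow(2).mean().backward(retain_graph=True)   # first backward, graph retained
opt.step()                                      # inplace update of the weights

out.sum().backward()   # RuntimeError: one of the variables needed for gradient
                       # computation has been modified by an inplace operation
```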

It appears that opt_dec.step() does not optimize any of the encoder parameters. If this
is the case, then the encoder portion of the graph (the part you backward through a
second time in dec_loss.backward()) does not need to be in the decoder graph at all.

You are already computing dis_x_rec twice – the first time detached from the encoder
graph. Would it be possible to keep the detached version (instead of discarding it) and
use it to compute gen_loss and hence dec_loss? Then dec_loss.backward() won’t
re-traverse the inplace-modified encoder graph.
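
Here is a hedged sketch of that ordering (the module and loss names are placeholders,
not the ones in your code, and the discriminator update is left out): detach the latent
before the decoder, keep the resulting dis_x_rec, and build dec_loss from it, so that
dec_loss.backward() never touches the encoder graph.

```python
import torch
import torch.nn as nn

encoder = nn.Linear(8, 4)             # placeholder modules, not your architecture
decoder = nn.Linear(4, 8)
dis = nn.Linear(8, 1)

opt_enc = torch.optim.SGD(encoder.parameters(), lr=0.1)
opt_dec = torch.optim.SGD(decoder.parameters(), lr=0.1)

x = torch.randn(16, 8)
z = encoder(x)

enc_loss = z.pow(2).mean()            # stand-in for the encoder / KL term
opt_enc.zero_grad()
enc_loss.backward()                   # no retain_graph needed, nothing else
opt_enc.step()                        # backwards through the encoder graph

x_rec = decoder(z.detach())           # in the decoder graph, not the encoder graph
dis_x_rec = dis(x_rec)                # keep this result and reuse it below

gen_loss = -dis_x_rec.mean()          # stand-in for the adversarial term
dec_loss = gen_loss + (x_rec - x).pow(2).mean()
opt_dec.zero_grad()
dec_loss.backward()                   # never re-traverses the modified encoder
opt_dec.step()                        # (dis also gets grads here, but isn't stepped)
```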

Of course, you should do some debugging to make sure that this is actually what is happening
(and that there aren’t similar errors that would have shown up after the first error is raised).

You can find a discussion of how to debug and fix inplace-modification errors in this post.

I always recommend understanding what is going on and addressing it directly, but pytorch
does offer a sweep-inplace-modification-errors-under-the-rug context manager.
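
If it helps, here is a sketch of how that would look. I’m assuming the context manager
in question is torch.autograd.graph.allow_mutation_on_saved_tensors() (available in
reasonably recent pytorch versions), which clones tensors saved for backward when they
are mutated inplace, so the later backward() still sees the original values:

```python
import torch

lin = torch.nn.Linear(2, 2)
x = torch.randn(4, 2, requires_grad=True)

with torch.autograd.graph.allow_mutation_on_saved_tensors():
    out = lin(x)
    out.pow(2).mean().backward(retain_graph=True)
    with torch.no_grad():
        lin.weight -= 0.1 * lin.weight.grad   # inplace update, as step() would do
    out.sum().backward()                      # works: uses the cloned saved weight
```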

In reference to your second post:

creates a new tensor that doesn’t depend on x (and doesn’t carry requires_grad = True).

So when you backpropagate through the encoder, you backpropagate through self.mu
and self.var, but not through self.conv. If the parameters in self.conv are the only ones
that cause problems when modified inplace, you’ve avoided the problem by not
backpropagating through them.
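
If it’s useful, here is a small sketch of that effect (the module names are just
placeholders mirroring the ones you quoted, and torch.randn stands in for whatever line
creates the new tensor): the freshly created tensor cuts the graph, so gradients reach
self.mu and self.var but never self.conv.

```python
import torch
import torch.nn as nn

class Enc(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Linear(8, 8)   # stand-in for the conv stack
        self.mu = nn.Linear(8, 4)
        self.var = nn.Linear(8, 4)

    def forward(self, x):
        h = self.conv(x)
        h = torch.randn(h.shape)      # new tensor: no dependence on x, no grad_fn
        return self.mu(h), self.var(h)

enc = Enc()
mu, var = enc(torch.randn(2, 8))
(mu.sum() + var.sum()).backward()

print(enc.mu.weight.grad is None)     # False: self.mu does get gradients
print(enc.conv.weight.grad is None)   # True: self.conv gets none
```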

Best.

K. Frank