How can I skip gradient computation for a specific operation without copying the encoder?

I have an autoencoder consisting of an encoder and a decoder, and I want to reconstruct an image. So I am optimizing the parameters of both the encoder and the decoder, but at the same time I need to run the encoder layers a second time (usually one would use a VGG for this) to compute the reconstruction loss. For that second encoder pass I don't want the gradient to be taken into account, but torch.no_grad() completely discards the graph. Here is what the code should have looked like:

import torch
import torch.nn.functional as F

latent = encoder(x)
x2 = decoder(latent)
with torch.no_grad():              # discards the graph for this pass entirely
    latent2 = encoder(x2)
loss = F.mse_loss(x2, x) + F.mse_loss(latent2, latent)
loss.backward()

The problem is not that it's impossible. It can be done by simply copying the encoder, but that is memory- and compute-inefficient, since the copy has to be made on every iteration.

import copy

latent = encoder(x)
x2 = decoder(latent)
latent2 = copy.deepcopy(encoder)(x2)   # a fresh copy of the encoder on every iteration
loss = F.mse_loss(x2, x) + F.mse_loss(latent2, latent)
loss.backward()

Hi, you can detach a tensor from the computational graph with latent2.detach(). You can find more details here.

loss = F.mse_loss(x2, x) + F.mse_loss(latent2.detach(), latent)

This way the decoder will not be optimized for the latent loss; only the image reconstruction MSE term will drive the decoder.
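Put in context, the detached version of the snippet would look something like this (just a sketch). The latent term then still pushes gradient into the encoder through latent, but it no longer reaches the decoder or the second encoder pass at all:

latent = encoder(x)
x2 = decoder(latent)
latent2 = encoder(x2)              # the graph is still built through this pass ...

# ... but detaching latent2 before the loss cuts that branch off, so the
# latent term cannot send gradients back through encoder(x2) or the decoder;
# it only pulls `latent` (and hence the encoder's first pass) towards the
# detached target.
loss = F.mse_loss(x2, x) + F.mse_loss(latent2.detach(), latent)
loss.backward()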
Let's suppose I make a copy of the encoder in each iteration. Here are the computation graphs:
x → encoder → latent
x → encoder → decoder → x2
x → encoder → decoder → encoder_copy → latent2
The optimizer holds only the encoder and decoder parameters, not the encoder_copy parameters, so only the parts inside square brackets actually get updated:
x → [encoder] → latent
x → [encoder → decoder] → x2
x → [encoder → decoder] → encoder_copy → latent2
encoder_copy is not being optimized. I don't want to copy the encoder in each iteration, but without the copy I end up computing (and applying) gradients for the encoder from the second forward pass, which I don't want (it causes the model to diverge). Here is what I end up with (graphs below, with the equivalent code after them):
x → [encoder] → latent
x → [encoder → decoder] → x2
x → [encoder → decoder → encoder] → latent2
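In code, that failure mode is simply the plain second pass through the same encoder (sketch):

latent = encoder(x)
x2 = decoder(latent)
latent2 = encoder(x2)    # same encoder, second pass: its weights also get gradients from this path

loss = F.mse_loss(x2, x) + F.mse_loss(latent2, latent)
loss.backward()          # encoder parameters now accumulate gradients from both passes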

But what I want:
x → [encoder → decoder] → encoder → latent2
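For reference, one possible way to get exactly this graph (a sketch, assuming a PyTorch version that provides torch.func.functional_call; the name frozen_state below is just illustrative) is to run the second encoder pass with detached parameters. Gradients then flow through the encoder's operations back into x2 and the decoder, but nothing is accumulated into the encoder's weights from that path, and no weight data is copied:

import torch.nn.functional as F
from torch.func import functional_call   # available in recent PyTorch versions

latent = encoder(x)
x2 = decoder(latent)

# Detached views of the encoder's parameters and buffers: the weight data is
# shared, not copied, and these tensors do not require grad.
frozen_state = {name: tensor.detach() for name, tensor in encoder.state_dict().items()}

# The second pass uses the frozen tensors, so the graph still runs through the
# encoder's operations (gradient reaches x2 and the decoder), but the encoder's
# own parameters receive no gradient from this path.
latent2 = functional_call(encoder, frozen_state, (x2,))

loss = F.mse_loss(x2, x) + F.mse_loss(latent2, latent)
loss.backward()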