Error running multiple models in Torch 1.7.1 but works in Torch 1.0

mancunian1792 · November 26, 2020, 5:25am

I am trying to implement the ClusterGAN architecture in PyTorch. The following steps work in PyTorch 1.0 but not in torch 1.7.0+cu101.

optimizer_ge = Adam(itertools.chain(encoder.parameters(), generator.parameters()) ....)
opt_disc = Adam(discriminator.parameters() .....)

The generator and the encoder are updated together and the discriminator is updated separately.

The following is done for each batch of images

generator.train()
encoder.train()

generator.zero_grad()
encoder.zero_grad()
discriminator.zero_grad()

optimizer_ge.zero_grad()

fake_image = generator(random_z)
fake_op = discriminator(fake_image)
real_op = discriminator(real_image)
zn, zc, zc_idx = encoder(fake_image)

ge_loss = (Cross_entropy loss) + (Clustering_loss) 
ge_loss.backward(retain_graph=True)
optimizer_ge.step()

opt_disc.zero_grad()
# Compute the vanilla GAN discriminator loss disc_loss using the BCE loss function
disc_loss.backward()
opt_disc.step()

The above code works fine in torch 1.0, but torch 1.7 throws the following error.

one of the variables needed for gradient computation has been modified by an inplace operation: 
[torch.cuda.FloatTensor [64, 1, 4, 4]] is at version 2; expected version 1 instead. 
Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

The error seems to be resolved when I do

fake_op = discriminator(fake_image.detach())

or

ge_loss.backward(retain_graph=True)
disc_loss.backward()
optimizer_ge.step()
opt_disc.step()

However, the results after either of these changes do not match the results of the same code run in torch 1.0.

Can someone help me debug this?
Thanks.

ptrblck · November 28, 2020, 6:21am

Could you check whether you are facing an issue similar to the one described here?

mancunian1792 · November 28, 2020, 4:33pm

Hi @ptrblck. Thanks for the reply. I took a look at the thread.
I actually want to implement it the way you suggested in that thread, but I am currently failing to do so.

opt1.zero_grad()
loss1.backward(retain_graph=True)
opt1.step()

opt2.zero_grad()
loss2.backward()
opt2.step()

The above fails in torch 1.7 but works in torch 1.0.

ptrblck · November 30, 2020, 12:48am

If you call loss2.backward() after opt1.step(), the parameters used to calculate loss2 have already been updated, so loss2 is stale.
The proper way is to execute a new forward pass to compute loss2 and then call loss2.backward().
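
A minimal sketch of that pattern, assuming toy nn.Linear stand-ins for the generator and discriminator (the layer sizes, the SGD optimizers, and the BCE-with-logits loss are illustrative assumptions, not the ClusterGAN code). The generator is updated first, and loss2 is then computed from a fresh forward pass through the updated generator:

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy stand-ins with assumed shapes; not the ClusterGAN networks.
generator = nn.Sequential(nn.Linear(4, 8), nn.Tanh(), nn.Linear(8, 4))
discriminator = nn.Linear(4, 1)
opt1 = torch.optim.SGD(generator.parameters(), lr=0.1)
opt2 = torch.optim.SGD(discriminator.parameters(), lr=0.1)
bce = nn.BCEWithLogitsLoss()

random_z = torch.randn(16, 4)
real_image = torch.randn(16, 4)
ones, zeros = torch.ones(16, 1), torch.zeros(16, 1)

# Step 1: generator update. No retain_graph is needed because
# this graph is not reused afterwards.
loss1 = bce(discriminator(generator(random_z)), ones)
opt1.zero_grad()
loss1.backward()
opt1.step()

# Step 2: fresh forward pass with the updated generator. detach() keeps
# the discriminator loss from backpropagating into the generator.
d_weight_before = discriminator.weight.detach().clone()
fake_image = generator(random_z).detach()
loss2 = bce(discriminator(real_image), ones) + bce(discriminator(fake_image), zeros)
opt2.zero_grad()  # also clears the gradients loss1.backward() left on the discriminator
loss2.backward()
opt2.step()
```

The cost is one extra generator forward pass per batch. In exchange, every backward call sees exactly the parameters that produced its loss.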

mancunian1792 · November 30, 2020, 11:34pm

opt1.zero_grad()
opt2.zero_grad()
loss1.backward(retain_graph=True)
loss2.backward()
opt1.step()
opt2.step()

Does it need to be modified like this?

ptrblck · December 1, 2020, 5:06am

Yes, this approach should work fine.
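
One thing worth checking with this ordering: gradients accumulate in .grad, so when both backward calls run before either step(), each optimizer may step on the sum of both losses' gradients unless the graphs are separated. This may be one reason the numbers differ from the original ordering. The sketch below uses toy nn.Linear stand-ins and assumed shapes (not the ClusterGAN code) to compare the generator gradient with and without detach():

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy stand-ins with assumed shapes; not the ClusterGAN networks.
generator = nn.Sequential(nn.Linear(4, 8), nn.Tanh(), nn.Linear(8, 4))
discriminator = nn.Linear(4, 1)
bce = nn.BCEWithLogitsLoss()
random_z = torch.randn(16, 4)
ones, zeros = torch.ones(16, 1), torch.zeros(16, 1)


def generator_grads(detach_fake):
    """Return the generator gradient after loss1 alone and after both losses."""
    generator.zero_grad()
    discriminator.zero_grad()
    fake_image = generator(random_z)
    loss1 = bce(discriminator(fake_image), ones)
    d_input = fake_image.detach() if detach_fake else fake_image
    loss2 = bce(discriminator(d_input), zeros)

    loss1.backward(retain_graph=True)
    after_loss1 = generator[0].weight.grad.clone()
    loss2.backward()
    after_both = generator[0].weight.grad.clone()
    return after_loss1, after_both


# Without detach, loss2.backward() adds to the generator gradients.
g1, g_both = generator_grads(detach_fake=False)
leaks_into_generator = not torch.allclose(g1, g_both)

# With detach, the generator gradients come from loss1 only.
g1, g_both = generator_grads(detach_fake=True)
isolated_with_detach = torch.allclose(g1, g_both)
```

The same accumulation happens in the other direction: loss1.backward() also writes gradients into the discriminator parameters, and in this ordering there is no opt2.zero_grad() between the two backward calls to clear them.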

mancunian1792 · December 1, 2020, 5:24am

Thanks. Just to understand a few things better: did something in the way autograd works change between torch 1.0 and torch 1.7?

The PyTorch implementation of the ClusterGAN architecture (torch 1.0) updates its networks as follows.

opt1.zero_grad()
loss1.backward(retain_graph=True)
opt1.step()

opt2.zero_grad()
loss2.backward()
opt2.step()

The above doesn’t work in torch 1.7.
I am not able to reproduce the results using the following suggested change.

opt1.zero_grad()
opt2.zero_grad()
loss1.backward(retain_graph=True)
loss2.backward()
opt1.step()
opt2.step()

Do you have any insights as to why this could happen?

ptrblck · December 1, 2020, 7:23am

Yes, in-place updates of parameters now raise an error if you are using stale gradients, as described in the 1.5 release notes (see the section "torch.optim optimizers changed to fix in-place checks for the changes made by the optimizer").

The reason is that the gradient computation would be incorrect. In your example, you calculate loss1 and loss2 using the model parameters in the initial state s0. loss1.backward() calculates the gradients and opt1.step() updates the parameters to state s1.
loss2 was computed using the model in state s0, so loss2.backward() would calculate the gradients of loss2 w.r.t. the parameters in s0, while the model has already been updated to s1. These gradients would thus be wrong, which is why the error is raised.
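
The s0/s1 mismatch can be reproduced in a few lines. This is a sketch under assumptions: toy nn.Linear stand-ins with made-up shapes and a plain SGD optimizer, on PyTorch 1.5 or later. The private attribute _version exposes the version counter that autograd compares against when it reads a saved tensor during backward:

```python
import torch
from torch import nn

torch.manual_seed(0)


def stale_backward():
    # Toy stand-ins with assumed shapes; not the ClusterGAN networks.
    generator = nn.Sequential(nn.Linear(4, 8), nn.Tanh(), nn.Linear(8, 4))
    discriminator = nn.Linear(4, 1)
    opt1 = torch.optim.SGD(generator.parameters(), lr=0.1)

    # Both losses come from the same forward pass (parameters in state s0).
    out = discriminator(generator(torch.randn(16, 4)))
    loss1, loss2 = -out.mean(), out.mean()

    weight = generator[2].weight
    version_s0 = weight._version
    opt1.zero_grad()
    loss1.backward(retain_graph=True)
    opt1.step()  # in-place update: parameters move to state s1
    version_s1 = weight._version

    try:
        loss2.backward()  # needs the s0 weights saved in the graph
        message = None
    except RuntimeError as err:
        message = str(err)
    return version_s0, version_s1, message


version_s0, version_s1, message = stale_backward()
print(version_s0, version_s1)
print(message)
```

The second Linear layer's weight is saved in the graph because it is needed to backpropagate into the hidden activation. opt1.step() bumps that weight's version counter, so the later loss2.backward() finds a newer version than the one recorded in the forward pass and refuses to proceed.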

jsktt01 · June 5, 2023, 2:03pm

Hi @ptrblck, I have a similar but different problem. I want to calculate loss2 based on the parameters in s1, get the gradients w.r.t. s0, and then update s0 with those gradients. But there is a problem when calculating the gradients, which I described in the linked topic. Could you please give me some suggestions? Thank you.

ptrblck · June 5, 2023, 4:29pm

I’ve answered in the linked topic.