Reusing input variables between models changes the computation graph? (reinforcement learning)

This is a basic question about how PyTorch computation graphs work in a slightly nontrivial case. The specific code/algorithm below does not matter, but I don't have a better example on hand to illustrate my question, so bear with me:

My observation comes from @kostrikov's PPO source code, lines 197-211 of main.py. I've made some changes for the sake of simplicity to arrive at the following pseudo-code, which illustrates the computation graph mechanics I'm unsure of:

output1 = model(Variable(batch))
output2 = old_model(Variable(batch))

ratio = torch.exp(output1 - Variable(output2.data))

optimizer.zero_grad()
ratio.backward()
optimizer.step()  # params = model.parameters()

My observation is that the above code works correctly, but the following, very similar code does NOT:

batch_var = Variable(batch)

output1 = model(batch_var)
output2 = old_model(batch_var)

ratio = torch.exp(output1 - Variable(output2.data))

optimizer.zero_grad()
ratio.backward()
optimizer.step()  # params = model.parameters()

I don't understand why these two routines produce different results. As far as I understand, wrapping output2.data in a fresh Variable when defining ratio should prevent backward() from propagating through the original output2 computation graph and changing gradients that way. I also don't see why merely sharing an input Variable between the two models should change the result of backward() (the forward() of model/old_model contains no in-place operations).
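For reference, here is a minimal self-contained sketch of the two variants as I understand them. The toy linear models, the .mean() reduction (so backward() has a scalar to start from), and the gradient comparison at the end are my own additions for illustration, not the actual PPO code:

import copy
import torch
import torch.nn as nn
from torch.autograd import Variable

torch.manual_seed(0)
model = nn.Linear(4, 1)
old_model = copy.deepcopy(model)        # stand-in for the "old" policy
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
batch = torch.randn(8, 4)

# Variant 1: a separate Variable wrapper for each forward pass
output1 = model(Variable(batch))
output2 = old_model(Variable(batch))
ratio = torch.exp(output1 - Variable(output2.data))  # output2 detached via .data

optimizer.zero_grad()
ratio.mean().backward()
grads_v1 = [p.grad.clone() for p in model.parameters()]

# Variant 2: one shared input Variable for both models
optimizer.zero_grad()
batch_var = Variable(batch)
output1 = model(batch_var)
output2 = old_model(batch_var)
ratio = torch.exp(output1 - Variable(output2.data))

ratio.mean().backward()
grads_v2 = [p.grad.clone() for p in model.parameters()]

# In this toy case I would expect the two variants to give identical gradients
print([torch.allclose(a, b) for a, b in zip(grads_v1, grads_v2)])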

I assume something is missing from my understanding. Any insights?
