Variable.data.copy_ vs variable_a = variable_b

Hi, I'm wondering about the difference between the following two lines.

 var_a.data.copy_(var_b.data)
 var_a = var_b

Questions:

  1. Will var_a = var_b + 1 create a new tensor for var_a?
  2. Is var_a.data.copy_(var_b.data) just a numerical copy, i.e. NOT differentiable?
  • var_a = var_b just makes the Python variable var_a refer to the same tensor as the Python variable var_b. Whatever var_a referred to before is discarded.
  • var_a.data.copy_(var_b.data) copies the content of the tensor var_b into the tensor referred to by the Python variable var_a.
  1. What happens is that var_b + 1 creates a new tensor containing the result, and this new tensor is then associated with the Python variable var_a.
  2. The use of .data here bypasses the autograd engine. This means the operation won’t be tracked, so gradient computation for var_a and var_b might be wrong, as some of the operations you perform on them are not recorded.
    If you were doing var_a.copy_(var_b) instead, then this is a differentiable operation: the gradient for the original values in var_a is just 0 everywhere (as the output is independent of them), and the gradients for the elements in var_b are 1. See the sketch below.
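To make this concrete, here is a minimal sketch (the values are arbitrary, and dest is just an illustrative name for a destination tensor that does not require grad):

import torch

var_b = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Rebinding: var_a is a brand-new tensor produced by the addition.
var_a = var_b + 1
print(var_a.grad_fn)              # <AddBackward0 ...>: the addition is tracked

# Differentiable in-place copy into a tensor that does not require grad.
dest = torch.zeros(3)             # plain tensor, not part of any graph yet
dest.copy_(var_b)                 # copy_ is recorded by autograd
dest.sum().backward()
print(var_b.grad)                 # tensor([1., 1., 1.]): the gradient of copy_ w.r.t. var_b is 1

# .data bypasses autograd: the values change, but nothing is recorded.
var_a.data.copy_(torch.zeros(3))
print(var_a.grad_fn)              # still <AddBackward0 ...>; the copy was invisible to autograd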

Hi @albanD, thanks so much for your detailed explanation.

One more question:
What is a typical application of var_a.copy_(var_b)? Why don't we just use var_a = var_b?

One valid case is when you have other references to var_a and want those references to see the new values too.
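For example (a small sketch, the names are just illustrative):

import torch

var_a = torch.zeros(3)
alias = var_a                     # a second reference to the same tensor

var_a = torch.ones(3)             # rebinding: only var_a changes, alias still holds the zeros
print(alias)                      # tensor([0., 0., 0.])

var_a = alias                     # point var_a back at the shared tensor
var_a.copy_(torch.ones(3))        # in-place copy: alias sees the new values too
print(alias)                      # tensor([1., 1., 1.])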

More generally, copy_ changes elements in place, which is useful when that is exactly what you want:

import torch

batch, n_chan, dim = 8, 4, 16  # example sizes, just for illustration
big_tensor = torch.rand(batch, n_chan + 1, dim)

# I want to fill the last channel with the mean of the others (as an example):
last_channel = big_tensor.select(1, -1)  # a view of the last channel, no copy is made
last_channel.copy_(big_tensor.narrow(1, 0, n_chan).mean(1))  # overwrite the last channel in place

# Now you can use big_tensor, where the last channel has been changed.
out = my_net(big_tensor)  # my_net stands for whatever network you are using
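As a quick sanity check (using the example sizes above), you can verify that the in-place copy really modified big_tensor itself:

print(torch.allclose(big_tensor.select(1, -1), big_tensor.narrow(1, 0, n_chan).mean(1)))  # True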