I’m using the multiprocessing module to compute functions and their gradients in different processes before gathering the results. The code works when the tensor type used is DoubleTensor, but not when it is FloatTensor. Here is a minimal example that reproduces the problem:
import numpy as np
import torch
from torch.multiprocessing import Pool

def foo_double(np_array):
    tensor = torch.from_numpy(np_array).requires_grad_(True).type(torch.DoubleTensor)
    squared_norm = torch.sum(tensor**2)
    squared_norm.backward()
    return tensor.grad

def foo_float(np_array):
    tensor = torch.from_numpy(np_array).requires_grad_(True).type(torch.FloatTensor)
    squared_norm = torch.sum(tensor**2)
    squared_norm.backward()
    return tensor.grad

if __name__ == '__main__':
    array_1 = np.ones(1)
    array_2 = np.ones(1)
    args = [array_1, array_2]

    with Pool(processes=2) as pool:
        results = pool.map(foo_double, args)  # works as expected
        # I get [tensor([ 2.], dtype=torch.float64), tensor([ 2.], dtype=torch.float64)]
        print(results)

    with Pool(processes=2) as pool:
        results = pool.map(foo_float, args)  # Gradient is None!
        # I get [None, None]
        print(results)
Is this the expected behaviour, and if so, why? Also, is this kind of multiprocessing to get gradients in separate processes good practice with the autograd mechanics?
Thanks for the example.
This odd behavior is most likely happening because you cast the tensor after setting requires_grad_ on it.
Since your numpy array is already np.float64, the DoubleTensor cast is probably a no-op, which masks the behavior.
In your second call, the FloatTensor cast creates a new tensor with a grad_fn named CopyBackwards.
It’s still strange, that the grad is None, so this might still be a bug.
You could fix it currently by setting requires_grad_ after casting the tensor:
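For example, something like this should work (a minimal sketch, just swapping the order of the cast and requires_grad_ in your function):

import numpy as np
import torch

def foo_float(np_array):
    # Cast first, then mark the resulting (leaf) tensor as requiring gradients
    tensor = torch.from_numpy(np_array).type(torch.FloatTensor).requires_grad_(True)
    squared_norm = torch.sum(tensor**2)
    squared_norm.backward()
    return tensor.grad  # tensor([2.]) instead of None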
It might be unintuitive, but I think this is the expected behavior.
In general, one should never do my_tensor_that_needs_grad = create_tensor().requires_grad_().another_op(). You will most likely end up with an intermediate tensor that isn’t a leaf requiring grad.
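A quick way to check whether you ended up with a leaf (my own illustrative snippet, not from your code):

import torch

# requires_grad_ before the cast: the result of .float() is an intermediate,
# non-leaf tensor, so .grad will not be populated on it after backward()
x = torch.ones(1, dtype=torch.float64).requires_grad_(True).float()
print(x.is_leaf)   # False
print(x.grad_fn)   # a copy/cast backward node (exact name varies by version)

# Cast first, then requires_grad_: y is a leaf, so y.grad will be populated
y = torch.ones(1, dtype=torch.float64).float().requires_grad_(True)
print(y.is_leaf)   # True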
I agree with Simon, this is expected behavior.
I guess it can be confusing that in the case where the type is already correct, the cast is a no-op and so it happens to work.
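To make that concrete, here is a small sketch (my own; as far as I know, .type() returns the original tensor when no actual cast is needed):

import torch

t = torch.ones(1, dtype=torch.float64).requires_grad_(True)

same = t.type(torch.DoubleTensor)   # already float64: no copy, same leaf tensor
other = t.type(torch.FloatTensor)   # real cast: new tensor with a grad_fn

print(same is t, same.is_leaf)      # True True   -> gradients still land in its .grad
print(other is t, other.is_leaf)    # False False -> its .grad stays None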