I’m using the multiprocessing module to compute functions and their gradients in different processes before gathering the results. The code works when the tensor type used is DoubleTensor, but not when it is FloatTensor. Here is a minimal example that reproduces the problem:
import numpy as np
import torch
from torch.multiprocessing import Pool

def foo_double(np_array):
    tensor = torch.from_numpy(np_array).requires_grad_(True).type(torch.DoubleTensor)
    squared_norm = torch.sum(tensor**2)
    squared_norm.backward()
    return tensor.grad

def foo_float(np_array):
    tensor = torch.from_numpy(np_array).requires_grad_(True).type(torch.FloatTensor)
    squared_norm = torch.sum(tensor**2)
    squared_norm.backward()
    return tensor.grad

if __name__ == '__main__':
    array_1 = np.ones(1)
    array_2 = np.ones(1)
    args = [array_1, array_2]

    with Pool(processes=2) as pool:
        results = pool.map(foo_double, args)  # works as expected
        # I get [tensor([ 2.], dtype=torch.float64), tensor([ 2.], dtype=torch.float64)]
        print(results)

    with Pool(processes=2) as pool:
        results = pool.map(foo_float, args)  # Gradient is None!
        # I get [None, None]
        print(results)
Is this the expected behaviour, and if so, why? Also, is this kind of multiprocessing to get gradients in separate processes good practice with the autograd mechanics?
Thanks for the example.
This odd behavior is most likely happening because you cast the tensor after setting requires_grad_ on it.
Since your numpy array is already np.float64, the DoubleTensor cast is probably a no-op, which masks the behavior.
In your second call, the FloatTensor cast creates a new tensor with a grad_fn named CopyBackwards.
It’s still strange, that the grad is None, so this might still be a bug.
You could fix it currently by setting requires_grad_ after casting the tensor:
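For example, something like this should work (a minimal sketch, just swapping the order of the cast and requires_grad_ in your function):

import numpy as np
import torch

def foo_float(np_array):
    # Cast first, then mark the resulting (leaf) tensor as requiring gradients
    tensor = torch.from_numpy(np_array).type(torch.FloatTensor).requires_grad_(True)
    squared_norm = torch.sum(tensor**2)
    squared_norm.backward()
    return tensor.grad  # tensor([2.]) instead of None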
It might be unintuitive, but I think this is the expected behavior.
In general, one should never do my_tensor_that_needs_grad = create_tensor().requires_grad_().another_op(). You will most likely end up with an intermediate tensor that isn’t a leaf requiring grad.
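A quick way to check whether you ended up with a leaf (my own illustrative snippet, not from your code):

import torch

# requires_grad_ before the cast: the result of .float() is an intermediate,
# non-leaf tensor, so .grad will not be populated on it after backward()
x = torch.ones(1, dtype=torch.float64).requires_grad_(True).float()
print(x.is_leaf)   # False
print(x.grad_fn)   # a copy/cast backward node (exact name varies by version)

# Cast first, then requires_grad_: y is a leaf, so y.grad will be populated
y = torch.ones(1, dtype=torch.float64).float().requires_grad_(True)
print(y.is_leaf)   # True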
I agree with Simon, this is expected behavior.
I guess it can be confusing that in the case where the type is already correct, the cast is a no-op and so it happens to work.
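To make that concrete, here is a small sketch (my own; as far as I know, .type() returns the original tensor when no actual cast is needed):

import torch

t = torch.ones(1, dtype=torch.float64).requires_grad_(True)

same = t.type(torch.DoubleTensor)   # already float64: no copy, same leaf tensor
other = t.type(torch.FloatTensor)   # real cast: new tensor with a grad_fn

print(same is t, same.is_leaf)      # True True   -> gradients still land in its .grad
print(other is t, other.is_leaf)    # False False -> its .grad stays None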