Why is nn.Linear failing gradcheck? Is there a bug in gradcheck?

My test:

import torch
import torch.nn as nn

x = torch.rand(256, 2, requires_grad=True)
y = torch.randint(0, 10, (256,))  # unused below; integer tensors cannot require grad
custom_op = nn.Linear(2, 10)
res = torch.autograd.gradcheck(custom_op, (x,))
print(res)

My result:

RuntimeError: Jacobian mismatch for output 0 with respect to input 0,
numerical:tensor([[-0.1639, -0.4768,  0.3874,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.6892, -0.0894, -0.0894,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0894, -0.3278,  0.4321],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.3278,  0.1788, -0.2086]])
analytical:tensor([[-0.1609, -0.4752,  0.3847,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.6913, -0.0883, -0.0926,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.1128, -0.3249,  0.4321],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.3346,  0.1746, -0.2108]])

Hi,

The default step size (eps) in gradcheck is chosen for double-precision numbers.
If you cast all your tensors and modules to double, the check will work.
If you want to stay in single precision, you will need to increase eps to something like 1e-3 or 1e-4.
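
For example, staying in float32, something along these lines (a sketch; atol may also need to be relaxed, and the check can still be flaky in single precision):

import torch
import torch.nn as nn

x = torch.rand(256, 2, requires_grad=True)   # float32 input
custom_op = nn.Linear(2, 10)                 # float32 module
# Larger finite-difference step for single precision; the tolerance is
# relaxed as well, and the check may still fail intermittently.
res = torch.autograd.gradcheck(custom_op, (x,), eps=1e-3, atol=1e-2)
print(res)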

Why is the default double precision? Aren’t most computations in deep learning done using float32?

The finite difference tests can be quite unreliable when done in single precision. Hence they are usually done in double precision.
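
A small illustration of why (a sketch; the exact numbers depend on rounding):

import torch

eps = 1e-6  # gradcheck's default step size

# Central difference of f(x) = x**2 at x = 1; the true derivative is 2.0.
# In float32, f(x + eps) and f(x - eps) agree in most of their digits, so
# subtracting them cancels away much of the available precision.
for dtype in (torch.float32, torch.float64):
    x = torch.tensor(1.0, dtype=dtype)
    num = ((x + eps) ** 2 - (x - eps) ** 2) / (2 * eps)
    print(dtype, num.item())
# float32 is typically visibly off from 2.0; float64 is accurate.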

I tried increasing eps, and it only passes with 1e-2.

I tried casting the tensors to double, but it fails with this error:

RuntimeError: Expected object of type torch.FloatTensor but found type torch.DoubleTensor for argument #4 'mat1'

Seems like nn.Linear only works with float32.

You need to cast the module to double as well, the same way you do it for the tensors:
custom_op = nn.Linear(2, 10).double()
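
Putting it together, a minimal version of the original test in double precision, which should then pass gradcheck:

import torch
import torch.nn as nn

x = torch.rand(256, 2, dtype=torch.double, requires_grad=True)
custom_op = nn.Linear(2, 10).double()  # cast the weight and bias to float64 as well
res = torch.autograd.gradcheck(custom_op, (x,))
print(res)  # True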


Thanks, it works. Just wondering, are there networks that have to be trained with double tensors? I keep getting NaN for my network when I use float tensors, but it works fine with double tensors.

Is that a sign that my model might have a bug?

I guess that means that your network requires very precise floats. In general, unless you're doing very specific things, this should not be the case. You should check that you don't have any operation that is sensitive to numerical precision, like dividing by something so small that it becomes 0 in single precision.
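
For example (a hypothetical illustration; the value 1e-46 is just chosen because it underflows in float32 but is representable in float64):

import torch

d64 = torch.tensor(1e-46, dtype=torch.float64)  # representable in float64
d32 = d64.float()                               # underflows to exactly 0.0 in float32

print(d32.item())          # 0.0
print((1.0 / d64).item())  # 1e+46, still finite in double precision
print((1.0 / d32).item())  # inf in single precision; downstream ops can turn this into NaN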