[nonissue] Autograd fails when using half precision: overflow on matrix size


(Dimitry Pletnikov) #1

I got a Titan V and have been experimenting with half precision.

In half-precision mode I can’t backpropagate through a matmul of two all-zeros matrices, because the number of elements in the resulting matrix is outside the FP16 range.

I get the same error with Conv1d, Conv2d, or bmm.

This minimal computation graph replicates the problem:

import torch, torch.autograd, torch.nn, numpy
with torch.cuda.device(0):

    test_input = torch.autograd.Variable(torch.zeros(257, 509)).cuda().half()  # all-zeros FP16 input
    test_w = torch.nn.Parameter(torch.zeros(509, 263)).cuda().half()  # all-zeros FP16 weights

    matmul_result = torch.matmul(test_input, test_w)
    print(matmul_result.size())
    print(numpy.prod(matmul_result.size()))  # total number of elements: 67591

    test_output = matmul_result.abs().mean()  # backward of mean() converts 67591 to Half and overflows
    test_output.backward()

And the result is:

torch.Size([257, 263])
67591
Traceback (most recent call last):
  File "<stdin>", line 11, in <module>
  File "/home/dzmitry/miniconda3/lib/python3.6/site-packages/torch/autograd/variable.py", line 167, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, retain_variables)
  File "/home/dzmitry/miniconda3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99, in backward
    variables, grad_variables, retain_graph)
RuntimeError: value cannot be converted to type Half without overflow: 67591

I am using PyTorch 0.3 and could reproduce the issue with both CUDA 8 and CUDA 9.


(Dimitry Pletnikov) #2

I found my mistake. It actually fails because I use .mean(): its backward pass needs the total number of elements (67591 here) as a scalar, and that value exceeds the FP16 maximum of 65504, so it cannot be converted to Half.
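
For reference, the largest finite FP16 value is 65504, so the element count 67591 is simply not representable and the conversion raises the overflow error above. A minimal sketch of the limit (note: torch.finfo and torch.tensor come from PyTorch releases newer than the 0.3 used here):

import torch

# Largest finite value representable in half precision
print(torch.finfo(torch.half).max)    # 65504.0

# The element count from the example does not fit into FP16
print(torch.tensor(67591.0).half())   # tensor(inf, dtype=torch.float16)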


(Rahul Bhalley) #4

So what is the solution to this problem? I’m still stuck! Would you recommend any changes to the mean() call?


#5

Reductions are sensitive to overflow if you are using FP16.
You should perform all reductions in FP32 to make sure you get a valid result.
Just call .float() on your tensor before passing it to torch.mean().
Autograd will reverse this cast in the backward pass, so your model will still be in half precision.
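
Applied to the snippet from the first post, the fix could look like this (a sketch that keeps the original 0.3-style Variable API; only the .float() call before the reduction is new):

import torch, torch.autograd, torch.nn

with torch.cuda.device(0):
    test_input = torch.autograd.Variable(torch.zeros(257, 509)).cuda().half()
    test_w = torch.nn.Parameter(torch.zeros(509, 263)).cuda().half()

    # The matmul itself stays in FP16
    matmul_result = torch.matmul(test_input, test_w)

    # Cast to FP32 before the reduction so the element count (67591) is
    # representable; autograd applies the matching cast back to FP16 in
    # the backward pass
    test_output = matmul_result.float().abs().mean()
    test_output.backward()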