I have implemented my own convolution layer as a learning experience in extending PyTorch. Looking at the printouts for small tensors, my outputs and gradients appear to match those of the native PyTorch convolutions.
When I move to much larger tensors, I have to rely on a sum-of-absolute-differences (SAD) metric between my implementation's outputs/grads and the builtin's. I've found that while my forward-pass outputs and my gradients w.r.t. the inputs match the native implementation every time, my gradients w.r.t. the weights and biases are often slightly off. The per-element error is fairly small for small tensors (< 1e-5), but as the tensor size grows to something like (1, 512, 128, 256), I see per-element average errors of 1e-2 or higher.
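For concreteness, the metric I mean is essentially the following (a simplified pure-Python sketch, not my actual comparison code, which works on flattened tensors):

```python
# Hypothetical sketch of the comparison metric: the mean of elementwise
# absolute differences between a custom layer's output (or gradient)
# and the builtin one, both flattened to plain lists here.
def mean_abs_diff(custom, builtin):
    assert len(custom) == len(builtin)
    return sum(abs(c - b) for c, b in zip(custom, builtin)) / len(custom)

# Identical values compare to exactly zero; a tiny perturbation on one
# element shows up as a small nonzero average.
ref = [0.1 * i for i in range(8)]
print(mean_abs_diff(ref, ref))  # 0.0

perturbed = ref[:]
perturbed[3] += 1e-5
print(mean_abs_diff(ref, perturbed))  # small but nonzero
```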
When I set all the input and grad-output (i.e. the argument to `backward()`) elements to whole numbers, the error disappears. Hence I thought it was a precision issue.
Sure enough, my implementation passes `gradcheck` when all values are whole-number `float`s or `double`s. It fails (which `gradcheck` warns will happen) when using fractional `float`s. Why does this happen?
Weirder still, the errors when using `float`s are often the same set of values, but they move around the returned tensor from run to run (e.g. all the SAD errors are 0.002 or 0.0039 with one set of convolution parameters, or 0.0625, 0.125, and 0.25 with another). I thought I was accidentally accessing uninitialized memory in my cuBLAS GEMM call, but switching to `double`s largely resolves the issue.
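For what it's worth, errors that keep the same values but move around between runs are consistent with plain float non-associativity rather than bad memory: a GPU reduction (such as the one inside a cuBLAS GEMM) may accumulate partial sums in a different order on each launch. A minimal illustration (not my layer code):

```python
# Float addition is not associative: the same three numbers summed in
# two different orders give different answers.
a, b, c = 1e16, 1.0, -1e16

left = (a + b) + c   # b is absorbed into the huge a before c cancels it
right = (a + c) + b  # the huge terms cancel first, so b survives

print(left, right)  # 0.0 1.0
```

The larger and more varied the reduction, the more the final value depends on accumulation order, which is why bigger tensors show bigger discrepancies.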
Am I looking at a precision issue? If so, why does it only affect the weight and bias gradients and not the input gradients?