NaNs in torch.nn.functional.grid_sample with Mixed Precision

Hi,
I get NaNs when I run my training in Mixed precision training mode.
I have pointed the error at the following place

torch.nn.functional.grid_sample(input, grid, padding_mode="zeros",
                    mode='bilinear',
                    align_corners=True,)

when I change the sampling mode to nearest, it does not happen anymore.

torch.nn.functional.grid_sample(input, grid, padding_mode="zeros",
                    mode='nearest',
                    align_corners=True,)

both sampling modes work fine in 32 bit training.
My inputs and the grid does not contain NaNs or Infs , what could be the possible reason for this and the solution.

Could you post the shapes as well as the min. and max. values for input and grid so that we can reproduce this issue, as this code works fine in float16:

input = torch.arange(4*4).view(1, 1, 4, 4).half().cuda()
print(input)

# Create grid to upsample input
d = torch.linspace(-1, 1, 8)
meshx, meshy = torch.meshgrid((d, d))
grid = torch.stack((meshy, meshx), 2)
grid = grid.unsqueeze(0) # add batch dim
grid = grid.half().cuda()

output = torch.nn.functional.grid_sample(input, grid, mode='bilinear')
print(torch.isfinite(output).all())

@ptrblck thanks for the kind feedback.
Actually on investigation Grid infact does have the inf as maximum value, sorry for the incorrect information above.
Then does it have undefined behavior or the following still applies according to the documentation ?

If grid has values outside the range of [-1, 1], the corresponding outputs are handled as defined by padding_mode. Options are .... 
input size: torch.Size([1, 32, 124, 384])
input min , max: 0.0,15.419422149658203
Grid size: torch.Size([1, 16, 200, 2])
Grid min, max: -0.9350162744522095,inf

and what could be done in this case

will it be valid to do following
grid[grid[..., :] > 1.0] = 2.0

It depends on your use case what the valid workaround would be.
I.e. why are Inf values in the grid and what would they mean?
If you think these grid values should use the padding, your workaround might work, on the other hand you might want to investigate why these Inf values are created and avoid them.

@ptrblck
Any pointers as to why grid would have Inf during mixed precision training only ?

following is the exact line where Infs happen in half precision

pts_2d = torch.mm(pts_3d_rect, torch.transpose(proj_matrix, 0, 1))

float16 can easily overflow if you are using values with a value close to the min. and max. values:

torch.finfo(torch.float16).max
> 65504.0

E.g. this code snippet overflows in the second approach and yields Infs in the result after applying torch.mm:

x = torch.randn(1024, 1024, device='cuda').half()
y = torch.mm(x, x)
print(torch.isfinite(y).all())
> tensor(True, device='cuda:0')

x = torch.randn(1024, 1024, device='cuda').half() * 2**13
print(torch.isfinite(x).all())
> tensor(True, device='cuda:0')

y = torch.mm(x, x)
print(torch.isfinite(y).all())
> tensor(False, device='cuda:0')

Since the grid would contain valid values in [-1, 1], this would usually not be a problem.
However, it seems you are expecting to work with large values (which would then use the padding values), so you might disable autocast for the grid sampling operation and the grid creation.

1 Like

thanks so much for the suggestion of disabling autocast , that indeed is nice solution.
and also i found out that it depends on the input data , it overflows for certain datasets (with larger values for camera intrinsics values for example.)

@ptrblck
actually coming back to this, because explicit autocasting does not sit well while converting the model to torchscript and complains

torch.jit.frontend.UnsupportedNodeError: function definitions aren't supported:
def foo(
    x: torch.Tensor, y: torch.Tensor
) -> torch.Tensor:
    with autocast(enabled=False):
       ..........
    return z

how can we make autocast context manager and torchscript conversion co-exist ?

Automatic mixed-precision is not supported yet in torchscript and we are working on it.
If your model doesn’t use data dependent conditions etc. you might be able to use tracing (although you should verify the correctness).