I am seeing a weird inconsistency between the results of an `nn.Conv2d` operation on CPU and on an AMD GPU.

Here is a snippet showing the drastic difference in an environment with ROCm:

```
>>> import torch
>>> torch.__version__
'2.0.0+rocm5.4.2'
>>> layer = torch.nn.Conv2d(1, 768, kernel_size=16, stride=10)
>>> x = torch.rand(10, 1, 128, 66)
>>> layer.to('cuda:0')(x.to('cuda:0')).sum()
tensor(10210.0078, device='cuda:0', grad_fn=<SumBackward0>)
>>> layer.to('cpu')(x.to('cpu')).sum()
tensor(2468.2070, grad_fn=<SumBackward0>)
```

Locally, I have an NVIDIA GPU, and here is the same snippet with the results under CUDA:

```
>>> import torch
>>> torch.__version__
'2.0.0'
>>> layer = torch.nn.Conv2d(1, 768, kernel_size=16, stride=10)
>>> x = torch.rand(10, 1, 128, 66)
>>> layer.to('cuda:0')(x.to('cuda:0')).sum()
tensor(244.4510, device='cuda:0', grad_fn=<SumBackward0>)
>>> layer.to('cpu')(x.to('cpu')).sum()
tensor(244.4510, grad_fn=<SumBackward0>)
# `torch.allclose` returns True
```

Interestingly, `nn.Linear` does not exhibit this behaviour, i.e. the numbers match.
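
Since `.sum()` can hide where the outputs diverge, a per-element comparison may be more informative. Below is a minimal sketch of such a check. The helper name `compare_devices` and the tolerances `rtol=1e-4` / `atol=1e-4` are my own hypothetical choices, not anything PyTorch prescribes, and the script falls back to CPU when no GPU is visible, so it only says something useful on a machine with a GPU.

```python
import torch


def compare_devices(layer, x, device="cuda:0", rtol=1e-4, atol=1e-4):
    """Run `layer` on CPU and on `device` and report how far the outputs differ.

    Hypothetical helper for illustration; the tolerances are arbitrary assumptions.
    """
    with torch.no_grad():
        out_cpu = layer.to("cpu")(x.to("cpu"))
        # Move the device result back to CPU so the two tensors can be compared.
        out_dev = layer.to(device)(x.to(device)).to("cpu")
    max_abs_diff = (out_cpu - out_dev).abs().max().item()
    return torch.allclose(out_cpu, out_dev, rtol=rtol, atol=atol), max_abs_diff


torch.manual_seed(0)
layer = torch.nn.Conv2d(1, 768, kernel_size=16, stride=10)
x = torch.rand(10, 1, 128, 66)

# ROCm builds of PyTorch also expose the GPU under the "cuda" device name.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
match, max_abs_diff = compare_devices(layer, x, device)
print(f"device={device} allclose={match} max_abs_diff={max_abs_diff:.6g}")
```

The same helper can be called with an `nn.Linear` layer and a matching input to check that case in the same way.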