Conv*d returns drastically different results on ROCm vs CPU

I've encountered a strange inconsistency between the results of the nn.Conv2d operation on CPU and on an AMD GPU.

Here is a snippet showing the drastic difference in an environment with ROCm:

>>> import torch
>>> torch.__version__
'2.0.0+rocm5.4.2'
>>> layer = torch.nn.Conv2d(1, 768, kernel_size=16, stride=10)
>>> x = torch.rand(10, 1, 128, 66)
>>> layer.to('cuda:0')(x.to('cuda:0')).sum()
tensor(10210.0078, device='cuda:0', grad_fn=<SumBackward0>)
>>> layer.to('cpu')(x.to('cpu')).sum()
tensor(2468.2070, grad_fn=<SumBackward0>)
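
To quantify the mismatch beyond the sums, the outputs can be compared element-wise (a minimal sketch continuing the session above; the exact magnitudes depend on the random initialization):

>>> out_gpu = layer.to('cuda:0')(x.to('cuda:0')).detach().cpu()
>>> out_cpu = layer.to('cpu')(x).detach()
>>> torch.allclose(out_gpu, out_cpu, atol=1e-5)  # False on this ROCm setup
>>> (out_gpu - out_cpu).abs().max()  # far larger than float32 rounding noise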

Locally, I have an NVIDIA GPU; here is the same snippet with CUDA, where the results match:

>>> import torch
>>> torch.__version__
'2.0.0'
>>> layer = torch.nn.Conv2d(1, 768, kernel_size=16, stride=10)
>>> x = torch.rand(10, 1, 128, 66)
>>> layer.to('cuda:0')(x.to('cuda:0')).sum()
tensor(244.4510, device='cuda:0', grad_fn=<SumBackward0>)
>>> layer.to('cpu')(x.to('cpu')).sum()
tensor(244.4510, grad_fn=<SumBackward0>)
# `torch.allclose` returns True

Interestingly, nn.Linear does not exhibit this behaviour, i.e. the numbers match across devices.
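
For comparison, the analogous check with nn.Linear agrees across devices, consistent with the observation above (a sketch; the feature sizes here are just illustrative):

>>> lin = torch.nn.Linear(66, 768)
>>> y = torch.rand(10, 1, 128, 66)
>>> torch.allclose(lin.to('cuda:0')(y.to('cuda:0')).cpu(), lin.to('cpu')(y), atol=1e-5)
True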

It seems to be a bug and is now tracked here: Conv2d returns drastically different results on ROCm (MI250X) vs CPU · Issue #102968 · pytorch/pytorch (https://github.com/pytorch/pytorch/issues/102968).

As a workaround, it was suggested to turn off MIOpen's implicit GEMM convolution algorithm by setting MIOPEN_DEBUG_CONV_IMPLICIT_GEMM=0.
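
In Python, that means exporting the variable before the first convolution runs; setting it before importing torch is the safest (a sketch of how one might apply the workaround, not a confirmed fix):

import os
# Disable MIOpen's implicit GEMM algorithm before torch/MIOpen initializes
os.environ["MIOPEN_DEBUG_CONV_IMPLICIT_GEMM"] = "0"

import torch

layer = torch.nn.Conv2d(1, 768, kernel_size=16, stride=10)
x = torch.rand(10, 1, 128, 66)
# With the flag set, the GPU sum should track the CPU sum much more closely
print(layer.to('cuda:0')(x.to('cuda:0')).sum(), layer.to('cpu')(x).sum())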