Locating the buggy operator in ResNet18

I am developing an operator for the AMD platform. When I compare it with the corresponding operator on the CUDA platform, the two produce the same output. However, when I use this operator to build ResNet18 (loaded from torchvision.models.resnet18), I see a large difference in accuracy between the AMD and CUDA platforms after training. How can I quickly identify the problematic operator?

Dataset: CIFAR-10
Accuracy: 12% on AMD, 65% on CUDA.
The settings and hyperparameters are identical on both platforms.
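
One approach I am considering is sketched below. It is a rough sketch, not something I have verified on my setup. Matching forward outputs does not by itself rule out a difference in the backward pass, so the idea is to run a single forward/backward step on each platform with identical weights and an identical batch, record every leaf module's output and every parameter's gradient, and then look for the first tensor that disagrees. The helper names capture_trace and first_mismatch and the tolerance value are my own assumptions; only standard PyTorch calls are used.

```python
import torch
import torch.nn as nn


def capture_trace(model, inputs, targets):
    """Run one forward/backward step and record per-layer tensors.

    Assumes `model`, `inputs` and `targets` already live on the same device.
    Returns {"forward": {...}, "grad": {...}} with CPU copies of the tensors.
    """
    trace = {"forward": {}, "grad": {}}
    calls = {}    # how many times each module has been called so far
    handles = []

    def make_hook(name):
        def hook(module, inp, out):
            if isinstance(out, torch.Tensor):
                idx = calls.get(name, 0)
                calls[name] = idx + 1
                # clone() so a later in-place op (e.g. ReLU(inplace=True))
                # cannot modify the recorded value
                trace["forward"][f"{name}#{idx}"] = out.detach().clone().cpu()
        return hook

    for name, module in model.named_modules():
        if len(list(module.children())) == 0:   # leaf modules only
            handles.append(module.register_forward_hook(make_hook(name)))

    model.train()   # compare the same mode that is used for training
    model.zero_grad()
    loss = nn.functional.cross_entropy(model(inputs), targets)
    loss.backward()
    for h in handles:
        h.remove()

    # Reverse registration order roughly follows the backward pass,
    # so the first mismatch is the one closest to the loss.
    for name, p in reversed(list(model.named_parameters())):
        if p.grad is not None:
            trace["grad"][name] = p.grad.detach().clone().cpu()
    return trace


def first_mismatch(ref, other, atol=1e-3):
    """Return (name, max_abs_diff) for the first tensor that differs, else None.

    `atol` is an assumed tolerance; it needs tuning for the dtype in use.
    """
    for name, ref_t in ref.items():
        diff = (ref_t.float() - other[name].float()).abs().max().item()
        if not diff <= atol:   # written this way so NaN also counts as a mismatch
            return name, diff
    return None
```

On each platform I would load the same saved state_dict and the same saved batch, call capture_trace, and store the result with torch.save. Loading both files on one machine and calling first_mismatch on the "forward" dictionaries, then on the "grad" dictionaries, should point at the first layer whose output or gradient diverges. If the forward tensors all agree and only the gradients differ, that would suggest the operator's backward implementation is where to look.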