ROCm inference is very slow!

Hi, everyone. I have run into a really confusing problem!
I am using a supercomputing center in China to speed up my training. It provides DCU cards and has PyTorch for ROCm 4.2 (beta) installed.
But I encountered a strange issue:
As the picture above shows, the value in the left circle is the time taken by out = model(in), and the value in the right circle is the time taken by backward().
On an NVIDIA 960, both of these values are 0.0x, so this seems very weird to me. The input fed to the network is an array of shape (2500, 12).
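For reference, here is a minimal sketch of how these two timings could be measured so that the numbers are comparable across devices. GPU work in PyTorch is queued asynchronously, so the sketch synchronizes before and after the timed region and runs a few warm-up iterations first. The helper name mean_seconds and the small stand-in network are hypothetical; only the (2500, 12) input shape comes from this post. It also assumes the ROCm build exposes the device through torch.cuda, as standard ROCm builds of PyTorch do.

```python
import time

def mean_seconds(step, sync, warmup=3, iters=10):
    """Average wall-clock seconds per call of step()."""
    for _ in range(warmup):      # early calls may include one-time setup costs
        step()
    sync()                       # drain queued device work before starting the clock
    start = time.perf_counter()
    for _ in range(iters):
        step()
    sync()                       # wait until the timed work has actually finished
    return (time.perf_counter() - start) / iters

try:
    import torch
except ImportError:              # lets the helper above be used without PyTorch
    torch = None

if torch is not None:
    use_gpu = torch.cuda.is_available()
    device = torch.device("cuda" if use_gpu else "cpu")
    sync = torch.cuda.synchronize if use_gpu else (lambda: None)

    # Hypothetical stand-in model; the real network is not shown in this post.
    model = torch.nn.Sequential(
        torch.nn.Linear(12, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1)
    ).to(device)
    x = torch.randn(2500, 12, device=device)

    def forward():
        return model(x)

    def forward_backward():
        model.zero_grad()
        model(x).sum().backward()

    print("forward:", mean_seconds(forward, sync))
    print("forward + backward:", mean_seconds(forward_backward, sync))
```

If the per-call time drops sharply after the warm-up iterations, the slowdown may come from one-time setup rather than from the steady-state kernels; if it stays high, the kernels themselves are the more likely cause.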
I also tested at two different supercomputing centers in China, and both have the same problem.
I understand that I may need to make some optimizations when moving from CUDA to DCU, but I can't find a solution, and I would like to know why this happens.
Thanks very much!