Why does autograd::engine::evaluate_function: CudnnConvolutionBackward0 take so much host time?

I ran the same training code on two machines (an AMD 5950X and an Intel 4210R) and the time per epoch varied widely, from 20 min to 60 min.
To find the bottleneck, I profiled the run with torch.profiler. The overview summary points to a CPU bottleneck, yet in the operator profile the top 4 cuDNN backward operations account for most of the device total duration, and 3 of those 4 have a device self duration of 0.
Does this mean they are waiting for the CPU?
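For reference, a profile like this can be collected roughly as follows. This is only a minimal sketch, not my actual training code: the tiny convolutional model, the input shape, and the iteration count are placeholders I made up for illustration, and it falls back to CPU-only profiling when no GPU is available.

```python
import torch
from torch import nn
from torch.profiler import ProfilerActivity, profile

# Placeholder model and input (hypothetical, not the real training setup).
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(8, 8, kernel_size=3, padding=1),
).to(device)
x = torch.randn(4, 3, 32, 32, device=device)

# Always record host-side activity; add device activity when a GPU is present.
activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities, record_shapes=True) as prof:
    for _ in range(3):
        model.zero_grad()
        loss = model(x).sum()
        loss.backward()  # the convolution backward ops are recorded here

# Per-operator summary, sorted by time spent in the operator itself on the host.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```

Sorting by host self time makes it easier to see which operators spend time in their own host-side code rather than in the child operators they call.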
Additionally, my code uses GRU cells in a for loop to iterate over sequences, and I found a thread describing similar autograd overhead (Why is my training bottlenecked on CPU? - #2 by smth).
I understand that in my case the CPU performance gap seems to be the dominant cause, but I am still wondering whether there is any way to reduce this gap in execution time.
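To make the loop pattern concrete, here is a minimal sketch of a per-step nn.GRUCell loop next to a single nn.GRU call over the whole sequence. The sizes are hypothetical and this is not my real model. It assumes that nothing else happens between time steps, which may not hold in practice. The cell's parameters are copied into the nn.GRU layer so that both paths compute the same function.

```python
import torch
from torch import nn

torch.manual_seed(0)
seq_len, batch, input_size, hidden_size = 16, 4, 8, 32  # hypothetical sizes

cell = nn.GRUCell(input_size, hidden_size)
gru = nn.GRU(input_size, hidden_size)  # expects input of shape (seq_len, batch, input_size)

# Copy the cell's parameters into the single-layer GRU so both are equivalent.
with torch.no_grad():
    gru.weight_ih_l0.copy_(cell.weight_ih)
    gru.weight_hh_l0.copy_(cell.weight_hh)
    gru.bias_ih_l0.copy_(cell.bias_ih)
    gru.bias_hh_l0.copy_(cell.bias_hh)

x = torch.randn(seq_len, batch, input_size)
h0 = torch.zeros(batch, hidden_size)

# Loop pattern: one GRUCell call (and one set of autograd nodes) per time step.
h = h0
loop_outputs = []
for t in range(seq_len):
    h = cell(x[t], h)
    loop_outputs.append(h)
loop_out = torch.stack(loop_outputs)  # (seq_len, batch, hidden_size)

# Whole-sequence pattern: a single nn.GRU call handles all time steps.
fused_out, h_n = gru(x, h0.unsqueeze(0))

max_diff = (loop_out - fused_out).abs().max().item()
print(f"max abs difference between the two paths: {max_diff:.2e}")
```

The loop issues one module call per time step from Python, while nn.GRU handles the whole sequence in one call. Whether that would actually narrow the gap between my two machines is something I have not measured.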

Overview summary:

| Category | Time Duration (us) | Percentage (%) |
|---|---:|---:|
| Average Step Time | 8,347,976 | 100 |
| Kernel | 1,211,537 | 14.51 |
| Memcpy | 73,742 | 0.88 |
| Memset | 7,329 | 0.09 |
| Runtime | 574,164 | 6.88 |
| DataLoader | 0 | 0 |
| CPU Exec | 5,360,633 | 64.21 |
| Other | 1,120,572 | 13.42 |

Operator profile:

| Name | Calls | Device Self Duration (us) | Device Total Duration (us) | Host Self Duration (us) | Host Total Duration (us) | Tensor Cores Eligible | Tensor Cores Self (%) | Tensor Cores Total (%) |
|---|---:|---:|---:|---:|---:|---|---:|---:|
| autograd::engine::evaluate_function: CudnnConvolutionBackward0 | 3924 | 0 | 1386348 | 128347 | 1348531 | Yes | 0 | 3.53 |
| aten::cudnn_convolution_backward | 3924 | 0 | 1375288 | 93613 | 1157994 | Yes | 0 | 3.55 |
| CudnnConvolutionBackward0 | 3924 | 0 | 1375288 | 58735 | 1216729 | Yes | 0 | 3.55 |
| aten::cudnn_convolution_backward_weight | 3924 | 1119203 | 1119203 | 301089 | 520406 | Yes | 0 | 0 |
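
As a sanity check on how I read these numbers, the short script below recomputes the percentages in the overview summary and the average host time per call for the first operator row. It uses only the values listed above. The variable names are my own, and nothing here comes from the profiler API.

```python
# Values copied from the overview summary (microseconds).
average_step_time_us = 8_347_976
categories_us = {
    "Kernel": 1_211_537,
    "Memcpy": 73_742,
    "Memset": 7_329,
    "Runtime": 574_164,
    "DataLoader": 0,
    "CPU Exec": 5_360_633,
    "Other": 1_120_572,
}

# Share of the average step time spent in each category.
percentages = {
    name: round(100 * duration / average_step_time_us, 2)
    for name, duration in categories_us.items()
}
for name, pct in percentages.items():
    print(f"{name:<10} {pct:6.2f} %")

# Values copied from the first row of the operator profile.
calls = 3924
host_total_us = 1_348_531
host_per_call_us = host_total_us / calls
print(f"host total per call: {host_per_call_us:.1f} us")
```

The recomputed shares match the reported ones, with CPU Exec at roughly 64% of the step, and the per-call figure puts the first operator at a few hundred microseconds of host time per invocation.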