I ran my training code on two different machines (AMD 5950X and Intel 4210R) and got very different epoch times, ranging from 20 min per epoch on one to 60 min per epoch on the other.
So I used torch.profiler to look for the bottleneck. The overview summary suggests a CPU bottleneck, but the operator view shows the top 4 cuDNN backward operations contributing most of the device total time, while 3 of those 4 operators report zero self time.
Does this mean they are waiting for the CPU?
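For reference, this is roughly how I collect the trace (a minimal sketch; `model`, `loader`, and `criterion` stand in for my actual training objects):

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Profile a few training steps on both CPU and CUDA so the trace shows
# whether GPU kernels are idle while waiting for the CPU.
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
) as prof:
    for step, (x, y) in enumerate(loader):
        if step >= 5:  # a handful of steps is enough for the summary tables
            break
        x, y = x.cuda(), y.cuda()
        loss = criterion(model(x), y)
        loss.backward()

# Sort by device time to see which operators dominate on the GPU.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```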
Additionally, my code uses GRU cells with a Python for loop to iterate over the sequence, and I found a thread describing a similar autograd-overhead problem (Why is my training bottlenecked on CPU? - #2 by smth).
I understand that in my case the CPU performance gap seems to be the dominant cause, but I'm still wondering whether there is any way to reduce this execution-time gap.
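For context, my per-timestep loop looks roughly like the sketch below (sizes and names are made up for illustration). My understanding is that swapping the GRUCell loop for nn.GRU lets cuDNN process the whole sequence in one fused call, so the per-step Python/autograd bookkeeping that a slower CPU stretches out would mostly disappear:

```python
import torch
import torch.nn as nn

batch, seq_len, input_size, hidden_size = 32, 100, 64, 128  # illustrative sizes
x = torch.randn(seq_len, batch, input_size, device="cuda")

# Current pattern: one GRUCell call per timestep inside a Python loop.
# Each iteration launches small kernels and records autograd nodes on the CPU,
# so CPU speed directly affects how fast the loop can feed the GPU.
cell = nn.GRUCell(input_size, hidden_size).cuda()
h = torch.zeros(batch, hidden_size, device="cuda")
outputs = []
for t in range(seq_len):
    h = cell(x[t], h)
    outputs.append(h)
out_loop = torch.stack(outputs)

# Alternative: nn.GRU consumes the whole sequence in a single (cuDNN) call,
# avoiding the per-timestep trip through the Python interpreter and autograd.
gru = nn.GRU(input_size, hidden_size).cuda()
out_seq, h_n = gru(x)
```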
@Kenan_Yip were you able to find a solution? I'm having a similar issue, with autograd::engine::evaluate_function: CudnnConvolutionBackward0 taking a lot of time.