Why does autograd::engine::evaluate_function: CudnnConvolutionBackward0 take too much host time?

Kenan_Yip · November 6, 2022, 2:43pm

I ran my training code on different machines and got very different epoch times, from 20 min to 60 min per epoch (AMD 5950X and Intel 4210R).
So I used torch.profiler to look for the bottleneck. The overview summary points to a CPU bottleneck, but the operator profile shows that the top 4 cuDNN backward operations account for most of the device total duration, and 3 of those 4 operators have a device self duration of 0.
Does this mean they are waiting for the CPU?
Additionally, my code uses GRU cells in a for loop to iterate over sequences, and I found a thread describing similar autograd overhead (Why is my training bottlenecked on CPU? - #2 by smth).
I understand that in my case the CPU performance gap seems to be the dominant cause, but I am still wondering whether there is any way to reduce this gap in execution time.
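
For reference, the loop pattern described above can be sketched next to a single `nn.GRU` call. This is a minimal illustration, not the model from this post: the sizes and the names `cell`, `gru`, `run_cell_loop` and `run_fused` are hypothetical, and replacing the loop is only an option when nothing custom happens between time steps. Whether it would shrink the host time seen in this profile is not something the numbers here can confirm.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes, chosen only for illustration.
seq_len, batch, input_size, hidden_size = 12, 4, 8, 16

cell = nn.GRUCell(input_size, hidden_size)
gru = nn.GRU(input_size, hidden_size)  # expects input of shape (seq_len, batch, input_size)

# Copy the cell's parameters into the GRU so both compute the same function.
with torch.no_grad():
    gru.weight_ih_l0.copy_(cell.weight_ih)
    gru.weight_hh_l0.copy_(cell.weight_hh)
    gru.bias_ih_l0.copy_(cell.bias_ih)
    gru.bias_hh_l0.copy_(cell.bias_hh)


def run_cell_loop(x, h0):
    """Step through time in Python: one GRUCell call per time step."""
    h = h0
    outputs = []
    for t in range(x.size(0)):
        h = cell(x[t], h)
        outputs.append(h)
    return torch.stack(outputs)  # (seq_len, batch, hidden_size)


def run_fused(x, h0):
    """Process the whole sequence with a single nn.GRU call."""
    out, _ = gru(x, h0.unsqueeze(0))  # h0 must be (num_layers, batch, hidden_size)
    return out


x = torch.randn(seq_len, batch, input_size)
h0 = torch.zeros(batch, hidden_size)

loop_out = run_cell_loop(x, h0)
fused_out = run_fused(x, h0)

# Both paths should agree up to floating-point tolerance.
print(torch.allclose(loop_out, fused_out, atol=1e-5))
```

The Python loop issues one module call per time step, and each call adds its own nodes to the autograd graph, so the backward pass has correspondingly more small functions for the host to dispatch. The single-call version hands the whole sequence to one operator.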

Category	Time Duration (us)	Percentage (%)
Average Step Time	8,347,976	100
Kernel	1,211,537	14.51
Memcpy	73,742	0.88
Memset	7,329	0.09
Runtime	574,164	6.88
DataLoader	0	0
CPU Exec	5,360,633	64.21
Other	1,120,572	13.42
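
As a quick sanity check on the summary above, the percentages can be recomputed from the raw durations. The sketch below is plain Python over the numbers copied from the table; treating Kernel + Memcpy + Memset as "GPU busy" time is an assumption made here for illustration, not something the profiler reports under that name.

```python
# Durations copied from the overview summary above, in microseconds.
step_time_us = 8_347_976
breakdown_us = {
    "Kernel": 1_211_537,
    "Memcpy": 73_742,
    "Memset": 7_329,
    "Runtime": 574_164,
    "DataLoader": 0,
    "CPU Exec": 5_360_633,
    "Other": 1_120_572,
}


def share(category_us, total_us=step_time_us):
    """Return a category's share of the average step time, in percent."""
    return 100.0 * category_us / total_us


# Assumption: these three categories approximate the time the GPU is doing work.
gpu_busy_us = sum(breakdown_us[k] for k in ("Kernel", "Memcpy", "Memset"))
gpu_busy_pct = share(gpu_busy_us)

for name, us in breakdown_us.items():
    print(f"{name:<10} {share(us):6.2f} %")
print(f"{'GPU busy':<10} {gpu_busy_pct:6.2f} %")
```

Under that assumption the GPU is doing work for only about 15% of each step, while CPU Exec alone accounts for about 64%, which is consistent with reading the profile as host-bound.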

Name	Calls	Device Self Duration (us)	Device Total Duration (us)	Host Self Duration (us)	Host Total Duration (us)	Tensor Cores Eligible	Tensor Cores Total(%)
autograd::engine::evaluate_function: CudnnConvolutionBackward0	3924	0	1386348	128347	1348531	Yes	3.53
aten::cudnn_convolution_backward	3924	0	1375288	93613	1157994	Yes	3.55
CudnnConvolutionBackward0	3924	0	1375288	58735	1216729	Yes	3.55
aten::cudnn_convolution_backward_weight	3924	1119203	1119203	301089	520406	Yes	0

CCLDArjun · March 12, 2024, 8:37pm

@Kenan_Yip were you able to find a solution? I’m having a similar issue with autograd::engine::evaluate_function: CudnnConvolutionBackward0 taking a lot of time