Can't speed up depthwise conv with cudnn7102

I tested depthwise convolution in environments with and without cuDNN 7102 by setting

However, for depthwise convolution the running time stays the same, whereas for an ordinary convolution with groups=1 the speedup from cuDNN is clearly visible.
It seems that cuDNN versions above 7.0.0 support grouped convolution, but how can I benefit from that?
Any suggestions?
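
For reference, the comparison described above could be reproduced with something like the sketch below. The post does not show how cuDNN was switched on and off, so this assumes the `torch.backends.cudnn.flags` context manager; the helper name `time_conv`, the tensor sizes, and the iteration counts are all made up for illustration. It falls back to CPU when no GPU is present, in which case the cuDNN flag has no effect.

```python
import time

import torch
import torch.nn as nn


def time_conv(channels, groups, use_cudnn, size=56, batch=32, iters=50):
    """Return the mean forward time (seconds) of a 3x3 conv with the given groups.

    Hypothetical helper: groups=1 is an ordinary convolution,
    groups=channels is a depthwise convolution.
    """
    device = "cuda" if torch.cuda.is_available() else "cpu"
    conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1,
                     groups=groups, bias=False).to(device)
    x = torch.randn(batch, channels, size, size, device=device)

    # Assumption: cuDNN is toggled with this context manager.
    with torch.backends.cudnn.flags(enabled=use_cudnn), torch.no_grad():
        for _ in range(5):  # warm-up runs, not timed
            conv(x)
        if device == "cuda":
            torch.cuda.synchronize()  # finish pending GPU work before timing
        start = time.perf_counter()
        for _ in range(iters):
            conv(x)
        if device == "cuda":
            torch.cuda.synchronize()  # wait for the timed kernels to complete
        elapsed = time.perf_counter() - start
    return elapsed / iters


if __name__ == "__main__":
    channels = 128
    print("cuDNN version:", torch.backends.cudnn.version())
    for name, groups in [("regular (groups=1)", 1),
                         ("depthwise (groups=C)", channels)]:
        for use_cudnn in (True, False):
            t = time_conv(channels, groups, use_cudnn)
            print(f"{name:22s} cudnn={use_cudnn!s:5s} {t * 1e3:.3f} ms")
```

The explicit `torch.cuda.synchronize()` calls matter because CUDA kernels launch asynchronously; without them the timer would mostly measure launch overhead rather than the convolution itself.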

Hi, it looks like there is still no way to speed up grouped convolution operations. I have tested even cuDNN 7.6.3 with PyTorch 1.3, and the improvement is negligible. None of the solutions suggested on GitHub helped. If you find something that works, please email me. Thanks.