Speed up depthwise convolution with AMP/Tensor Cores on an NVIDIA V100 GPU

Hello,

I am currently training a vision model with AMP (automatic mixed precision). The model uses a lot of depthwise convolutions, and I am trying to speed up training.
The profiler shows that most of the GPU time is spent in kernels related to depthwise convolution, and that these kernels are not using Tensor Cores.
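
For context, here is a rough sketch of what I mean by depthwise convolution: a grouped convolution where the number of groups equals the number of channels, so each output channel only reads one input channel. The helper name `conv_macs` and the layer shape (56x56 output, 128 channels, 3x3 kernel) are made up for illustration and are not taken from my actual model. It only counts multiply-accumulates; it says nothing about how any particular GPU kernel is implemented.

```python
def conv_macs(h_out, w_out, c_in, c_out, k, groups=1):
    """Count multiply-accumulates for a k x k convolution (hypothetical helper)."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    # Each output element sums over (c_in / groups) input channels and k*k taps.
    return h_out * w_out * c_out * (c_in // groups) * k * k


# Illustrative layer shape, assumed for this example only.
standard = conv_macs(56, 56, 128, 128, 3)               # groups=1
depthwise = conv_macs(56, 56, 128, 128, 3, groups=128)  # groups == channels
ratio = standard // depthwise

print(f"standard:  {standard} MACs")
print(f"depthwise: {depthwise} MACs")
print(f"ratio:     {ratio}x")
```

With these assumed shapes the depthwise layer does 128x fewer multiply-accumulates than the standard one while reading and writing activations of the same size, so there is much less arithmetic per byte moved. That may be part of why the kernels do not map well onto Tensor Cores, but I have not confirmed it.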

Does anyone know whether it is possible to speed up depthwise convolution using Tensor Cores? Thanks a lot!