Difference in depthwise convolution performance between PyTorch and ONNX

Hi, I ran into a strange problem when I tried to speed up model inference using depthwise convolution. I constructed a MobileNet-style model with depthwise conv (Model A) and then replaced the depthwise conv with standard conv (Model B). I expected Model A to be faster than Model B because it has fewer parameters and a lower computation cost, but what I observed was that Model B ran faster than Model A in PyTorch. Then I exported both models to ONNX format and benchmarked them again, and the result flipped: in onnxruntime, Model B was much slower than Model A (about 3x). Very confusing. I also tested the two models in MXNet, and Model A was again slower than Model B, just as in PyTorch. So I wonder whether there is some difference in the internal implementation of grouped/depthwise convolution between PyTorch and onnxruntime. Help would be much appreciated!
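For reference, here is a minimal sketch of the two layer variants I mean (the channel count and layer names are my own, since I haven't posted the full model code). In PyTorch, a depthwise conv is just `nn.Conv2d` with `groups` equal to the number of input channels, which shrinks the weight tensor by roughly a factor of the channel count:

```python
import torch
import torch.nn as nn

C = 64  # example channel count, not the actual model's

# Model A style: depthwise conv (one filter per input channel)
dw = nn.Conv2d(C, C, kernel_size=3, padding=1, groups=C)

# Model B style: standard conv (every filter sees all input channels)
std = nn.Conv2d(C, C, kernel_size=3, padding=1)

n_dw = sum(p.numel() for p in dw.parameters())
n_std = sum(p.numel() for p in std.parameters())
print(f"depthwise params: {n_dw}, standard params: {n_std}")

# Both accept the same input shape and produce the same output shape,
# so swapping one for the other is a drop-in change.
x = torch.randn(1, C, 32, 32)
assert dw(x).shape == std(x).shape
```

Despite the much smaller parameter and FLOP count of the depthwise version, the wall-clock timings I measured went the other way in PyTorch and MXNet, which is the core of my question.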