No Speedup with Depthwise Convolutions

I would expect execution to be slower with groups=1, or more precisely, faster when groups is equal to the number of input channels (groups=in_channels). The nn.Conv2d docs page says that this setting is how you get a depthwise convolution in PyTorch.
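To make the expectation concrete, here is a rough sketch that counts multiply-accumulate operations (MACs) for a grouped 2D convolution. The helper name `conv2d_macs` and the layer sizes are made up for illustration and are not part of PyTorch. It only relies on the fact that an nn.Conv2d weight has shape (out_channels, in_channels // groups, kH, kW). This is a theoretical operation count, not a timing measurement.

```python
def conv2d_macs(in_channels, out_channels, kernel_size, out_h, out_w, groups=1):
    """Count multiply-accumulates for one forward pass on a single sample.

    Hypothetical helper for illustration: each output element sums over
    (in_channels // groups) input channels times kernel_size * kernel_size taps.
    """
    if in_channels % groups != 0 or out_channels % groups != 0:
        raise ValueError("in_channels and out_channels must be divisible by groups")
    macs_per_output = (in_channels // groups) * kernel_size * kernel_size
    return out_h * out_w * out_channels * macs_per_output


# Assumed example layer: 64 -> 64 channels, 3x3 kernel, 56x56 output map.
channels = 64
dense_macs = conv2d_macs(channels, channels, 3, 56, 56, groups=1)
depthwise_macs = conv2d_macs(channels, channels, 3, 56, 56, groups=channels)

print("groups=1:          ", dense_macs)
print("groups=in_channels:", depthwise_macs)
print("ratio:             ", dense_macs // depthwise_macs)
```

With groups equal to in_channels, each output channel reads a single input channel, so the operation count drops by a factor of in_channels. That is the reasoning behind expecting the depthwise layer to run faster.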