Does pytorch optimize the group parameter in convs?

Does pytorch optimize the groups parameter in convolutions? That would make the MobileNet architecture efficient.

No, the groups parameter is currently implemented via a naive for loop.

Does this do depthwise convolution when the groups parameter is set equal to the input depth?

From the pytorch docs I couldn't find any information about this feature. I can imagine that if I do

model = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=2),
            nn.MaxPool2d(kernel_size=3, stride=2),  # output has 64 channels
            nn.Conv2d(64, 64, kernel_size=3, groups=64),  # will this layer use SpatialDepthWiseConvolution instead of the group loop?
            ...  # next layers

will this model use SpatialDepthWiseConvolution?


No, it won’t use that depthwise function, just the standard codepath for groups.
Also note that that implementation of depthwise convolutions is very naive, and will be as slow as setting groups to 64 in your case.

I didn’t really get your point here.
So if he used groups=64, doesn’t that mean that each output channel is convolved with only one input channel?
I’m testing this function, but the network is not converging, so I really want to know whether the code is working or not.
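One way to check that groups=in_channels really gives you depthwise behavior is to inspect the weight shapes: Conv2d stores weights as (out_channels, in_channels // groups, kH, kW), so each depthwise filter spans a single input channel. A minimal sketch (padding and bias settings here are just illustrative choices):

```python
import torch.nn as nn

# Depthwise convolution: groups equals the number of input channels,
# so each output channel convolves with exactly one input channel.
depthwise = nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=64, bias=False)

# Weight shape is (out_channels, in_channels // groups, kH, kW):
# each filter only spans a single input channel.
print(depthwise.weight.shape)  # torch.Size([64, 1, 3, 3])

# A standard convolution with the same channel counts uses 64x more weights.
standard = nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=1, bias=False)
print(depthwise.weight.numel(), standard.weight.numel())  # 576 36864
```

If the weight shape shows 1 in the second dimension, the layer is wired as depthwise; a failure to converge would then point elsewhere (learning rate, initialization, etc.).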


Conv2d(256, 256, kernel_size=(3, 3), padding=(1, 1), groups=4, bias=False)
is almost twice as slow (in a sequence of similar calls) as:
Conv2d(128, 128, kernel_size=(3, 3), padding=(1, 1), groups=1, bias=False)
while the number of operations is about the same. Is there a way to do any better? Is there a chance that, in the near future, a more efficient implementation than a naive loop becomes available?
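For reference, a simple way to reproduce this comparison is to time repeated forward passes of the two layers after a warmup call. The input size (56x56) and iteration count below are arbitrary choices, not from the original measurement:

```python
import time
import torch
import torch.nn as nn

def bench(conv, x, iters=20):
    # Warm up once, then average the time over repeated forward passes.
    with torch.no_grad():
        conv(x)
        start = time.perf_counter()
        for _ in range(iters):
            conv(x)
    return (time.perf_counter() - start) / iters

grouped = nn.Conv2d(256, 256, kernel_size=(3, 3), padding=(1, 1), groups=4, bias=False)
plain = nn.Conv2d(128, 128, kernel_size=(3, 3), padding=(1, 1), groups=1, bias=False)

x_grouped = torch.randn(1, 256, 56, 56)
x_plain = torch.randn(1, 128, 56, 56)

print(f"groups=4: {bench(grouped, x_grouped):.4f}s per call")
print(f"groups=1: {bench(plain, x_plain):.4f}s per call")
```

Note that both layers perform roughly the same number of multiply-adds (256·256/4 vs. 128·128 channel pairs per spatial position), which is why the slowdown is surprising.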

Best regards,

Georges Quénot.

Assuming you are using cudnn, then the answer is: yes, faster kernels will be implemented for more workloads in the future.
You could use torch.backends.cudnn.benchmark = True at the beginning of your script to use the cudnn heuristics, which should pick the currently fastest available kernel.
I would also recommend using the latest cudnn version.

Note that benchmark=True will profile the kernels in the first run for each new input shape, so your profiling should start after a warmup phase.

You could also compare the current speed to the native implementation by disabling cudnn via torch.backends.cudnn.enabled = False.
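Putting those two suggestions together, a script comparing the cudnn and native codepaths might look like this (the layer and input shape are just placeholders; the CUDA guard is so the sketch also runs on a CPU-only machine):

```python
import torch
import torch.nn as nn

# Let cudnn profile kernels for each new input shape and pick the fastest one.
# The first call for a given shape is slow, so time only later calls.
torch.backends.cudnn.benchmark = True

conv = nn.Conv2d(256, 256, kernel_size=3, padding=1, groups=4, bias=False)
if torch.cuda.is_available():
    conv = conv.cuda()
    x = torch.randn(1, 256, 56, 56, device="cuda")
    with torch.no_grad():
        conv(x)  # warmup / kernel selection pass

# To compare against the native (non-cudnn) implementation, disable cudnn:
torch.backends.cudnn.enabled = False
```

Remember to flip `torch.backends.cudnn.enabled` back to True afterwards if the rest of the script should keep using cudnn.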
