I was trying to find out why grouped convolution is implemented in PyTorch and whether it is based on a paper that demonstrates the efficacy of this operation.
Are there any benefits other than being able to distribute parameters across different devices?
Can you please mention a few references that have discussed the reason why grouped convolution can be effective?
There are some papers using grouped convolutions.
Depthwise separable convolutions are used in the Xception paper and here (a depthwise convolution is the extreme case of a grouped convolution, with one channel per group). Both present grouped convolutions as a way to reduce the total number of network parameters and the computational cost.
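To make the parameter savings concrete, here is a small sketch of the parameter arithmetic. The function name `conv_params` and the channel sizes are my own illustrative choices; the weight-shape convention `(c_out, c_in // groups, k, k)` matches how PyTorch stores grouped-convolution weights.

```python
def conv_params(c_in, c_out, k, groups=1):
    # weight tensor shape: (c_out, c_in // groups, k, k); bias ignored
    assert c_in % groups == 0 and c_out % groups == 0
    return c_out * (c_in // groups) * k * k

standard  = conv_params(256, 256, 3)              # plain 3x3 conv: 589,824 weights
depthwise = conv_params(256, 256, 3, groups=256)  # depthwise 3x3: 2,304 weights
pointwise = conv_params(256, 256, 1)              # 1x1 conv: 65,536 weights
separable = depthwise + pointwise                 # 67,840, roughly 8.7x fewer
```

The same arithmetic applies to FLOPs, since each weight is used once per output spatial location; that is the cost reduction both papers point to.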
There are a number of other references, but I haven't read them closely enough to cite them with confidence.
ShuffleNet also uses grouped convolutions, without going to the extreme of one channel per group. Again, the justification is the reduction in parameters and computation.
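Beyond the parameter count, grouping changes connectivity: each group of output channels only sees its own slice of input channels. A minimal NumPy sketch, with function names of my own choosing (this is not PyTorch's implementation, which dispatches to optimized kernels):

```python
import numpy as np

def conv2d(x, w):
    """Naive valid cross-correlation.
    x: (c_in, h, w_), w: (c_out, c_in, k, k)."""
    c_out, c_in, k, _ = w.shape
    _, h, w_ = x.shape
    out = np.zeros((c_out, h - k + 1, w_ - k + 1))
    for o in range(c_out):
        for i in range(h - k + 1):
            for j in range(w_ - k + 1):
                out[o, i, j] = np.sum(x[:, i:i + k, j:j + k] * w[o])
    return out

def grouped_conv2d(x, w, groups):
    """Grouped conv = independent convs on channel slices, concatenated.
    w: (c_out, c_in // groups, k, k), as in PyTorch's weight layout."""
    xs = np.split(x, groups, axis=0)          # split input channels
    ws = np.split(w, groups, axis=0)          # split output filters
    return np.concatenate([conv2d(xi, wi) for xi, wi in zip(xs, ws)], axis=0)
```

With `groups=1` this reduces to an ordinary convolution; raising `groups` shrinks the second weight dimension, which is exactly where the parameter and FLOP savings come from.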