GroupNorm is slower and consumes more GPU memory than BatchNorm

I replaced BatchNorm with GroupNorm in PyTorch and kept everything else fixed. On the ImageNet dataset, GroupNorm is about 40% slower than BatchNorm and consumes about 33% more GPU memory. I am really confused, because GroupNorm shouldn't require more computation than BatchNorm. The details are listed below.
For BatchNorm, one minibatch takes 12.8 seconds and uses 7.51 GB of GPU memory;
For GroupNorm, one minibatch takes 17.9 seconds and uses 10.02 GB of GPU memory.
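
For reference, here is a minimal sketch of how such a comparison might be set up. The model (ResNet-50), batch size (256), group count (32), and the `replace_bn_with_gn` helper are my assumptions for illustration; the original post does not state them:

```python
import time
import torch
import torch.nn as nn
from torchvision.models import resnet50

def replace_bn_with_gn(module, num_groups=32):
    # Recursively swap every BatchNorm2d for a GroupNorm with the same
    # channel count. 32 groups is an assumed setting, not from the post.
    for name, child in module.named_children():
        if isinstance(child, nn.BatchNorm2d):
            setattr(module, name, nn.GroupNorm(num_groups, child.num_features))
        else:
            replace_bn_with_gn(child, num_groups)

def benchmark(model, batch_size=256, iters=10):
    # Measure average time per training iteration and peak GPU memory.
    model = model.cuda().train()
    x = torch.randn(batch_size, 3, 224, 224, device="cuda")
    target = torch.randint(0, 1000, (batch_size,), device="cuda")
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        optimizer.zero_grad()
        loss = criterion(model(x), target)
        loss.backward()
        optimizer.step()
    torch.cuda.synchronize()
    elapsed = (time.time() - start) / iters
    peak_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
    print(f"{elapsed:.2f} s/iter, peak memory {peak_gb:.2f} GB")

benchmark(resnet50())                 # BatchNorm baseline

model = resnet50()
replace_bn_with_gn(model)             # same architecture, GroupNorm instead
benchmark(model)
```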