PyTorch ReLU6 uses more GPU memory than ReLU

I found that torchvision.models.mobilenetv2 with ReLU (I replaced every ReLU6 with ReLU) can train with a batch size of 450+ (450, 3, 224, 224) on a Tesla P40. However, the default MobileNetV2 with ReLU6 cannot even run at batch size 384 (384, 3, 224, 224). Why does ReLU6 cost more GPU memory than ReLU? All activations use inplace=True. PyTorch 1.10.0.
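For reference, this is roughly how I swapped the activations; the helper below is a sketch (the function name `replace_relu6_with_relu` and the toy model are my own, not from torchvision), but the same recursion works on `torchvision.models.mobilenet_v2()`:

```python
import torch.nn as nn


def replace_relu6_with_relu(module: nn.Module) -> None:
    # Recursively replace every nn.ReLU6 child with nn.ReLU,
    # preserving the inplace flag.
    for name, child in module.named_children():
        if isinstance(child, nn.ReLU6):
            setattr(module, name, nn.ReLU(inplace=child.inplace))
        else:
            replace_relu6_with_relu(child)


# Toy stand-in for MobileNetV2 (hypothetical, just to show the swap).
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3),
    nn.ReLU6(inplace=True),
    nn.Conv2d(8, 8, kernel_size=3),
    nn.ReLU6(inplace=True),
)
replace_relu6_with_relu(model)
assert not any(isinstance(m, nn.ReLU6) for m in model.modules())
```

After the swap the model trains at the larger batch size, which is what made the memory difference visible in the first place.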