PyTorch in-place operations slow down the inference process on GPU

Hi,
I’m trying to add some in-place operations inside the forward function, but it seems like these operations slow down the inference process.

My forward function is like this:

def forward(self, x):
    x = x.permute(0, 3, 1, 2)
    x = F.interpolate(x, [224, 224], mode="bilinear")
    x[:, 0, :, :] -= 149.89  # in-place operation
    x[:, 0, :, :] /= 37.35   # in-place operation
    x[:, 1, :, :] -= 113.11  # in-place operation
    x[:, 1, :, :] /= 37.16   # in-place operation
    x[:, 2, :, :] -= 130.63  # in-place operation
    x[:, 2, :, :] /= 37.48   # in-place operation
    x = self.model(x)
    return nn.Softmax(dim=1)(x)

As you can see, I add some in-place normalization to the BGR channels. In this case, inference on a batch of 512 inputs takes 0.83 s on the GPU.
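As a side note, those six slice operations can be collapsed into one broadcasted subtract and divide over all channels at once (the per-channel constants below are the ones from the post; this is a sketch of an equivalent formulation, not a claim about where the slowdown comes from):

```python
import torch

def normalize_slices(x):
    # per-channel normalization via six in-place slice ops, as in the post
    x = x.clone()
    x[:, 0, :, :] -= 149.89
    x[:, 0, :, :] /= 37.35
    x[:, 1, :, :] -= 113.11
    x[:, 1, :, :] /= 37.16
    x[:, 2, :, :] -= 130.63
    x[:, 2, :, :] /= 37.48
    return x

def normalize_broadcast(x):
    # same math in two ops: (1, 3, 1, 1) tensors broadcast over NCHW input
    mean = torch.tensor([149.89, 113.11, 130.63]).view(1, 3, 1, 1)
    std = torch.tensor([37.35, 37.16, 37.48]).view(1, 3, 1, 1)
    return (x - mean) / std

x = torch.rand(2, 3, 4, 4) * 255
assert torch.allclose(normalize_slices(x), normalize_broadcast(x), atol=1e-4)
```

The broadcasted version launches two kernels instead of six, which is the more idiomatic way to express per-channel normalization.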

However, if I get rid of these in-place operations, the code would be:

def forward(self, x):
    x = x.permute(0, 3, 1, 2)
    x = F.interpolate(x, [224, 224], mode="bilinear")
    x = self.model(x)
    return nn.Softmax(dim=1)(x)

And it takes only 0.23 s (almost a quarter of the former time) to do the inference on the same batch of 512.

Is this normal? I can’t believe the in-place operations take so much time. Any ideas for optimizing the code? Thanks a lot.

Could you share the code you’ve used to profile these operations, please?
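One thing worth checking: CUDA kernels are launched asynchronously, so wall-clock timings taken without synchronizing the device can attribute time to the wrong operation (the timer may stop before the queued kernels have actually finished). A minimal timing sketch that synchronizes before starting and before stopping the clock (the `timed` helper is hypothetical, just for illustration):

```python
import time
import torch

def timed(fn, *args, device="cpu"):
    # synchronize so the measured interval covers kernel execution,
    # not just the (asynchronous) kernel launches
    if device == "cuda" and torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    out = fn(*args)
    if device == "cuda" and torch.cuda.is_available():
        torch.cuda.synchronize()
    return out, time.perf_counter() - start

# usage: out, seconds = timed(model, batch, device="cuda")
```

Without the second `synchronize()` call, the elapsed time only measures how long it took to enqueue the work, which can make one part of the forward pass look much slower or faster than it really is.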