Autograd.grad gets much slower with increasing batch size

I am training a GAN with a gradient penalty on the discriminator's scores w.r.t. the real images (R1 regularization).

My code contains a snippet like this:

real_imgs.requires_grad_(True)  # inputs must require grad so autograd.grad can differentiate w.r.t. them
real_scores = discriminator(real_imgs)
# create_graph=True keeps the graph of this gradient, so the penalty itself is differentiable
grads = torch.autograd.grad(real_scores.sum(), real_imgs, retain_graph=True, create_graph=True)[0]
r1_penalty = torch.mean(grads ** 2)
loss = loss + self.r1_gamma * r1_penalty

The strange thing about this code is how the processing time grows with batch size: the same total number of real images is processed about 4x slower with batch size >= 128 (e.g. 128, 256, or 512) than with batch size 64. If I remove the gradient penalty, all timings are roughly the same.
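In case it helps, here is a minimal self-contained sketch of how I measure this. The discriminator here is a hypothetical stand-in conv net (not my real model), and `time_r1` / `make_discriminator` are names I made up for this repro; the penalty computation matches the snippet above:

```python
import time

import torch
import torch.nn as nn


def make_discriminator():
    # Hypothetical small conv net standing in for the real discriminator.
    return nn.Sequential(
        nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
        nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
        nn.Flatten(),
        nn.Linear(64 * 8 * 8, 1),  # 32x32 input -> 8x8 after two stride-2 convs
    )


def time_r1(batch_size, n_images=512, device="cpu"):
    """Time the R1 penalty (forward + double backward) over n_images images."""
    disc = make_discriminator().to(device)
    steps = n_images // batch_size
    start = time.perf_counter()
    for _ in range(steps):
        real_imgs = torch.randn(batch_size, 3, 32, 32,
                                device=device, requires_grad=True)
        real_scores = disc(real_imgs)
        grads = torch.autograd.grad(real_scores.sum(), real_imgs,
                                    create_graph=True)[0]
        r1_penalty = torch.mean(grads ** 2)
        r1_penalty.backward()  # include the double backward in the timing
    if device == "cuda":
        torch.cuda.synchronize()  # CUDA launches are async; sync before reading the clock
    return time.perf_counter() - start


if __name__ == "__main__":
    for bs in (64, 128, 256):
        print(f"batch_size={bs}: {time_r1(bs):.3f}s for 512 images")
```

On GPU the `torch.cuda.synchronize()` call matters: without it the measured time reflects kernel launch, not execution.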

This seems very strange to me. Any ideas what could cause it?