Gradient penalty is very slow!

I am using CUDA 8.0 with cuDNN, and PyTorch 0.4.0.
Training becomes very slow when I apply a gradient penalty (GP) while training ResNet-18 on CIFAR-10.
I measured the average running time per step with and without GP:
without GP: 0.065s
with GP: 0.330s

Here is my code:

import torch
from torch import nn, autograd
from torch.autograd import Variable
from models import PreActResNet18
import time

net = PreActResNet18().cuda()  # the model must be on the same device as the CUDA inputs below
opt = torch.optim.SGD(net.parameters(), 0.01, momentum=0.9, weight_decay=1e-4)
batchsize = 128

GP = True
start = time.time()

for i in range(100):
    x = torch.rand((batchsize, 3, 32, 32)).cuda()
    y = torch.randint(0, 10, (batchsize,)).cuda().long()
    x, y = Variable(x, requires_grad=True), Variable(y)
    preds = net(x)
    loss_c = nn.CrossEntropyLoss()(preds, y)

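    if GP:
        # autograd.grad returns a tuple; [0] is the gradient of the loss w.r.t. x.
        # create_graph=True keeps the graph so the penalty itself can be backpropagated.
        grad = autograd.grad(loss_c, x, create_graph=True, retain_graph=True)[0]
        loss_g = (grad ** 2).mean() * batchsize
        (loss_c + loss_g).backward()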

end = time.time()
print("{:.3f}s for each step".format((end-start)/100))
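For readers without the `models` module, here is a minimal self-contained sketch of the same measurement. It is not the code from the post. It assumes a recent PyTorch (it uses `nn.Flatten` and the `device=` keyword, which are newer than 0.4.0), and the names `make_net`, `train_step`, and `time_steps` are hypothetical. The small network is only a stand-in for PreActResNet18. Unlike the snippet above, the sketch also runs a plain backward pass and an optimizer step when GP is off, and it calls `torch.cuda.synchronize()` around the timed loop on GPU, since CUDA kernels launch asynchronously.

```python
import time

import torch
from torch import nn, autograd


def make_net():
    # Hypothetical small stand-in for PreActResNet18, used only to keep the sketch runnable.
    return nn.Sequential(
        nn.Conv2d(3, 8, 3, padding=1),
        nn.ReLU(),
        nn.Flatten(),
        nn.Linear(8 * 32 * 32, 10),
    )


def train_step(net, opt, x, y, use_gp):
    """Run one optimisation step and return (loss_c, loss_g) as Python floats."""
    x = x.clone().detach().requires_grad_(True)
    loss_c = nn.functional.cross_entropy(net(x), y)
    if use_gp:
        # create_graph=True records the gradient computation so the penalty is differentiable.
        (grad,) = autograd.grad(loss_c, x, create_graph=True)
        loss_g = (grad ** 2).mean() * x.size(0)
        loss = loss_c + loss_g
    else:
        loss_g = torch.zeros(())
        loss = loss_c
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss_c.item(), loss_g.item()


def time_steps(use_gp, steps=5, batchsize=16, device="cpu"):
    """Return (average seconds per step, losses of the last step)."""
    torch.manual_seed(0)
    net = make_net().to(device)
    opt = torch.optim.SGD(net.parameters(), 0.01, momentum=0.9, weight_decay=1e-4)
    x = torch.rand(batchsize, 3, 32, 32, device=device)
    y = torch.randint(0, 10, (batchsize,), device=device)
    if device != "cpu":
        torch.cuda.synchronize()  # make sure setup work has finished before timing
    start = time.time()
    for _ in range(steps):
        losses = train_step(net, opt, x, y, use_gp)
    if device != "cpu":
        torch.cuda.synchronize()  # wait for queued kernels before reading the clock
    return (time.time() - start) / steps, losses


if __name__ == "__main__":
    for use_gp in (False, True):
        sec, _ = time_steps(use_gp)
        print("GP={}: {:.3f}s for each step".format(use_gp, sec))
```

Pass `device="cuda"` to `time_steps` to time on a GPU. The GP branch backpropagates through the gradient computation itself (a double backward), so some slowdown relative to the plain step is expected; how much depends on the model and hardware.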

Is there a better implementation?

I am running into a very similar issue. Have you heard anything back?