I'm using CUDA 8.0 with cuDNN.

My PyTorch version is 0.4.0.

Training ResNet-18 on CIFAR-10 becomes very slow when I apply a gradient penalty (GP).

I measured the average running time per step with and without GP:

Without GP: 0.065s

With GP: 0.330s (about 5x slower)
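
For anyone reproducing these numbers, here is a minimal sketch of how a per-step average like this can be measured. The helper `average_step_time` and its parameters (`step_fn`, `n_steps`, `warmup`, `sync`) are hypothetical names for illustration, not part of PyTorch. The optional `sync` callback is there because GPU work is queued asynchronously, so the clock can otherwise stop before the work has finished.

```python
import time


def average_step_time(step_fn, n_steps=100, warmup=5, sync=None):
    """Return the mean wall-clock seconds per call of step_fn."""
    # Warm-up calls are excluded so one-time setup costs do not skew the mean.
    for _ in range(warmup):
        step_fn()
    if sync is not None:
        sync()  # wait for queued asynchronous work before starting the clock
    start = time.perf_counter()
    for _ in range(n_steps):
        step_fn()
    if sync is not None:
        sync()  # wait for the timed work to finish before stopping the clock
    return (time.perf_counter() - start) / n_steps
```

For a CUDA training loop, passing `sync=torch.cuda.synchronize` would be the natural choice, with `step_fn` wrapping one full training step.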

Here is my code:

```
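import time

import torch
from torch import nn, autograd
from models import PreActResNet18  # assumed to be a local models.py

net = PreActResNet18()
net.cuda()
opt = torch.optim.SGD(net.parameters(), 0.01, momentum=0.9, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss()  # built once instead of on every step
batchsize = 128
GP = True  # toggle the gradient penalty

torch.cuda.synchronize()  # CUDA calls are asynchronous; sync before timing
start = time.time()
for i in range(100):
    # Random CIFAR-10-shaped batch; x requires grad so the loss can be
    # differentiated with respect to the input
    x = torch.rand((batchsize, 3, 32, 32)).cuda().requires_grad_()
    y = torch.randint(0, 10, (batchsize,)).cuda().long()
    opt.zero_grad()
    preds = net(x)
    loss_c = criterion(preds, y)
    if GP:
        # create_graph=True keeps the graph of the gradient itself, so the
        # penalty can be backpropagated (double backward)
        grad = autograd.grad(loss_c, x, create_graph=True, retain_graph=True,
                             only_inputs=True)[0]
        loss_g = (grad ** 2).mean() * batchsize
        (loss_c + loss_g).backward()
    else:
        loss_c.backward()
    opt.step()
torch.cuda.synchronize()  # wait for pending kernels before stopping the clock
end = time.time()
print("{:.3f}s for each step".format((end - start) / 100))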
```
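
Written out, the objective minimized in the `GP` branch of the code above is the following. The symbols are my own notation, not from any library: $L_c$ is the cross-entropy loss, $B$ is `batchsize`, and $N = B \cdot 3 \cdot 32 \cdot 32$ is the number of elements in `x`.

```latex
L = L_c + \frac{B}{N} \sum_{i=1}^{N} \left( \frac{\partial L_c}{\partial x_i} \right)^2
```

Backpropagating through the second term differentiates the input gradient itself, so each step does a double-backward pass on top of the usual forward and backward passes.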

Is there a better implementation?