# Why is autograd slower than manually computing the derivative?

This is related to this post I made last week.

I am trying to implement simple linear regression, minimizing (y - Xw)^2, in two ways:
1. manually computing the derivative.
2. using only the `torch.autograd.grad` function.

I see a significant speed difference between the two methods, and the size of the gap varies across GPUs. Here's my code; you can run it on Google Colab:

``````
import torch
import time
import torch.nn as nn
INPUT_SIZE = 256
HIDDEN_SIZE = 2048
BASIS_SIZE = [INPUT_SIZE, HIDDEN_SIZE]
batch_size = 100
device_name = 'cuda:0'
device = torch.device(device_name)
print("Running on " + device_name)
eta=0.01

X = torch.nn.Linear(BASIS_SIZE[1],BASIS_SIZE[0],bias=False).to(device)
target = torch.randn([batch_size, INPUT_SIZE]).to(device)
w = torch.FloatTensor(batch_size,HIDDEN_SIZE).fill_(0).to(device)
start_time = time.time()
for k in range(5000):
    Res = -target+X(w)
    loss = 0.5*(Res**2).sum()
manual_time = time.time()-start_time
print('without autograd -- Runtime: {}, loss: {}'.format(manual_time,loss))

X = torch.nn.Linear(BASIS_SIZE[1],BASIS_SIZE[0],bias=False).to(device)
target = torch.randn([batch_size, INPUT_SIZE]).to(device)
w = torch.FloatTensor(batch_size,HIDDEN_SIZE).fill_(0).to(device)
start_time = time.time()
for k in range(5000):
    Res = -target+X(w)
    loss = 0.5*(Res**2).sum()
auto_time = time.time()-start_time
print('Using autograd -- Runtime: {}, loss: {}'.format(auto_time,loss))

``````
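Both loops in the listing above show only the forward pass, so the gradient step being compared is not visible. Here is a minimal sketch of what the two variants might look like. The closed-form gradient `Res @ X.weight`, the helper names `manual_step` and `autograd_step`, the `torch.no_grad()` wrapper, and the CPU fallback are my assumptions, not code from the post:

```python
import torch

torch.manual_seed(0)
# Assumption: fall back to CPU when no GPU is available.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
INPUT_SIZE, HIDDEN_SIZE, batch_size, eta = 256, 2048, 100, 0.01

X = torch.nn.Linear(HIDDEN_SIZE, INPUT_SIZE, bias=False).to(device)
target = torch.randn(batch_size, INPUT_SIZE, device=device)


def manual_step(w):
    # Hypothetical helper: closed-form gradient of 0.5 * ||X(w) - target||^2 w.r.t. w.
    with torch.no_grad():  # assumption: no graph is recorded in the manual variant
        Res = X(w) - target
        loss = 0.5 * (Res ** 2).sum()
        grad = Res @ X.weight  # [batch, INPUT] @ [INPUT, HIDDEN] -> [batch, HIDDEN]
    return loss, grad


def autograd_step(w):
    # Hypothetical helper: same loss, gradient obtained from torch.autograd.grad.
    w = w.detach().requires_grad_(True)
    Res = X(w) - target
    loss = 0.5 * (Res ** 2).sum()
    (grad,) = torch.autograd.grad(loss, w)
    return loss.detach(), grad


w = torch.zeros(batch_size, HIDDEN_SIZE, device=device)
_, g_manual = manual_step(w)
_, g_auto = autograd_step(w)
# The two gradients should agree up to floating-point error.
print(torch.allclose(g_manual, g_auto, atol=1e-4))

w = w - eta * g_manual  # one gradient-descent update on w
```

Checking that the two gradients match before timing anything helps rule out a mistake in the manual derivative as the source of the difference.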

Result:

On my 1080Ti:

``````
Running on cuda:2
without autograd -- Runtime: 1.0496978759765625, loss: 0.0001592347107362002
Using autograd -- Runtime: 2.8707797527313232, loss: 0.00017234217375516891
``````

On my CPU:

``````
Running on cpu
without autograd -- Runtime: 6.226852655410767, loss: 0.00017741750343702734
Using autograd -- Runtime: 9.111348152160645, loss: 0.00020874488109257072
``````

On the Tesla T4 provided in Google Colab:

``````
Running on cuda:0
without autograd -- Runtime: 0.8135528564453125, loss: 0.00016563094686716795
Using autograd -- Runtime: 2.0092411041259766, loss: 0.000202339724637568
``````

Can someone confirm whether I am using autograd correctly, or explain why the behavior differs across devices?

CUDA operations are executed asynchronously, so you would need to synchronize the code via `torch.cuda.synchronize()` manually before starting and stopping the timers.
Alternatively, you could also use `torch.utils.benchmark`, which will add warmup iterations and synchronize the code for you.
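Both suggestions can be sketched roughly as follows. This is only an illustration: the helper names `forward_loss` and `sync`, the iteration counts, and the CPU fallback are my own choices, and only the forward pass is timed here.

```python
import time

import torch
import torch.utils.benchmark as benchmark

# Assumption: fall back to CPU when no GPU is available.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
X = torch.nn.Linear(2048, 256, bias=False).to(device)
target = torch.randn(100, 256, device=device)
w = torch.zeros(100, 2048, device=device)


def forward_loss(X, w, target):
    # Hypothetical helper: the forward pass from the original post.
    Res = X(w) - target
    return 0.5 * (Res ** 2).sum()


def sync():
    # Wait for all queued CUDA kernels to finish; does nothing on CPU.
    if device.type == "cuda":
        torch.cuda.synchronize()


# Option 1: manual timing with explicit synchronization.
n_iter = 100
for _ in range(10):  # warmup iterations
    forward_loss(X, w, target)
sync()  # make sure the warmup work is done before starting the timer
start = time.perf_counter()
for _ in range(n_iter):
    forward_loss(X, w, target)
sync()  # make sure the timed work is done before stopping the timer
manual_us = (time.perf_counter() - start) / n_iter * 1e6

# Option 2: torch.utils.benchmark.Timer, which synchronizes CUDA for you.
timer = benchmark.Timer(
    stmt="forward_loss(X, w, target)",
    globals={"forward_loss": forward_loss, "X": X, "w": w, "target": target},
)
measurement = timer.timeit(n_iter)
bench_us = measurement.mean * 1e6  # mean time per run, in microseconds

print(f"manual: {manual_us:.2f} us, benchmark: {bench_us:.2f} us")
```

Without the `sync()` calls, the timer on a GPU would mostly measure how long it takes to enqueue the kernels, not how long they take to run.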

I modified my code and rewrote it using `torch.utils.benchmark` with warmup iterations. However, I still see the same behavior: using `autograd` is much slower than calculating the gradient manually, and the difference depends on the GPU you use:

My code:

``````
import timeit
import time
import torch.nn as nn
import torch
INPUT_SIZE = 256
HIDDEN_SIZE = 2048
BASIS_SIZE = [INPUT_SIZE, HIDDEN_SIZE]
batch_size = 100
device_name = 'cuda:2'
device = torch.device(device_name)
BASIS_SIZE = [INPUT_SIZE, HIDDEN_SIZE]
X = torch.nn.Linear(BASIS_SIZE[1],BASIS_SIZE[0],bias=False).to(device)
target = torch.randn([batch_size, INPUT_SIZE]).to(device)

setup = '''import torch
import time
import torch.nn as nn
INPUT_SIZE = 256
HIDDEN_SIZE = 2048
BASIS_SIZE = [INPUT_SIZE, HIDDEN_SIZE]
batch_size = 100
device_name = 'cuda:2'
device = torch.device(device_name)
eta=0.01

w = torch.FloatTensor(batch_size,HIDDEN_SIZE).fill_(0).to(device)

Res = -target+X(w)
loss = 0.5*(Res**2).sum()
'''
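# Statement timed on every iteration; `w` is created inside `setup`.
stmt = '''Res = -target+X(w)
loss = 0.5*(Res**2).sum()
'''

t0 = timeit.Timer(
    stmt=stmt,
    setup=setup,
    globals={'X': X,'target':target})

t1 = timeit.Timer(
    stmt=stmt,
    setup=setup,
    globals={'X': X,'target':target})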
num_iter = 2000
# warmUP
t0.timeit(num_iter)
noautograd_time = t0.timeit(num_iter) / num_iter * 1e6
# warmUP
t1.timeit(num_iter)
autograd_time = t1.timeit(num_iter) / num_iter * 1e6
``````

Result (on 1080Ti):

``````
noautograd:  165.49 us
``````

Result (on Tesla T4):

``````
noautograd:  146.71 us
``````

Thanks for the code. It seems you are using `timeit` instead of `torch.utils.benchmark`.

I have one last question. If I want to calculate the gradient with respect to a single parameter in a large model, would `torch.autograd` be my best bet in terms of performance, or is there a better method? I am trying to do something like feature visualization for a CNN, where you update the input image while keeping the parameters of a pretrained CNN model fixed.
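For reference, the setup described in this question might look roughly like the sketch below. It does not settle whether this is the fastest option. The tiny `model` is a hypothetical stand-in for a pretrained CNN, and `target_class`, `step_size`, and the number of steps are made-up values for illustration:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for a pretrained CNN (randomly initialized here).
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 10),
).eval()

# Freeze the parameters so autograd does not compute gradients for them.
for p in model.parameters():
    p.requires_grad_(False)

# The input image is the only tensor that requires a gradient.
img = torch.rand(1, 3, 32, 32, requires_grad=True)
target_class, step_size = 3, 0.1  # illustrative values

for _ in range(5):
    score = model(img)[0, target_class]
    # Ask only for the gradient w.r.t. the input image.
    (grad,) = torch.autograd.grad(score, img)
    with torch.no_grad():
        img += step_size * grad  # gradient ascent on the image
```

With the parameters frozen and `torch.autograd.grad` given only `img` as input, the weight gradients are never computed; backpropagation still has to pass through every layer to reach the image.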