Why is autograd slower than manually computing the derivative?

This is related to this post I made last week.

I tried to code simple linear regression, minimizing (y - Xw)^2, in two ways:
1. manually computing the derivative.
2. using only the torch.autograd.grad function.

I see a significant speed difference between the two methods, and the gap varies across GPUs. Here is my code; you can run it on Google Colab:

import torch
import time
from torch.autograd import grad
import torch.nn as nn

INPUT_SIZE = 256
HIDDEN_SIZE = 2048
BASIS_SIZE = [INPUT_SIZE, HIDDEN_SIZE]
batch_size = 100
device_name = 'cuda:0'
device = torch.device(device_name)
print("Running on " + device_name)
eta = 0.01

# Fixed random linear map X and regression target
X = torch.nn.Linear(BASIS_SIZE[1], BASIS_SIZE[0], bias=False).to(device)
target = torch.randn([batch_size, INPUT_SIZE]).to(device)
w = torch.FloatTensor(batch_size, HIDDEN_SIZE).fill_(0).to(device)
w.requires_grad = True

# 1. Gradient descent with the manually derived gradient
start_time = time.time()
for k in range(5000):
    Res = -target + X(w)
    loss = 0.5 * (Res**2).sum()
    # d(loss)/dw = Res @ X.weight
    w.data = w.data.add(-eta * X.weight.data.t().mm(Res.T).T)
manual_time = time.time() - start_time
print('without autograd -- Runtime: {}, loss: {}'.format(manual_time, loss))


# 2. The same loop, but with the gradient computed by torch.autograd.grad
X = torch.nn.Linear(BASIS_SIZE[1], BASIS_SIZE[0], bias=False).to(device)
target = torch.randn([batch_size, INPUT_SIZE]).to(device)
w = torch.FloatTensor(batch_size, HIDDEN_SIZE).fill_(0).to(device)
w.requires_grad = True

start_time = time.time()
for k in range(5000):
    Res = -target + X(w)
    loss = 0.5 * (Res**2).sum()
    w_grad = grad(outputs=loss, inputs=w)
    w.data = w.data.add(-eta * w_grad[0].data)
auto_time = time.time() - start_time
print('Using autograd -- Runtime: {}, loss: {}'.format(auto_time, loss))


print('Using autograd is {}x slower than without autograd'.format(round(auto_time/manual_time,3)))

Result:

On my 1080Ti:

Running on cuda:2
without autograd -- Runtime: 1.0496978759765625, loss: 0.0001592347107362002
Using autograd -- Runtime: 2.8707797527313232, loss: 0.00017234217375516891
Using autograd is 2.735x slower than without autograd

On my CPU:

Running on cpu
without autograd -- Runtime: 6.226852655410767, loss: 0.00017741750343702734
Using autograd -- Runtime: 9.111348152160645, loss: 0.00020874488109257072
Using autograd is 1.463x slower than without autograd

On the Tesla T4 provided in Google Colab:

Running on cuda:0
without autograd -- Runtime: 0.8135528564453125, loss: 0.00016563094686716795
Using autograd -- Runtime: 2.0092411041259766, loss: 0.000202339724637568
Using autograd is 2.47x slower than without autograd

Can someone confirm whether I am using autograd correctly, or explain why the behavior differs across devices?

CUDA operations are executed asynchronously, so you would need to synchronize the code via torch.cuda.synchronize() manually before starting and stopping the timers.
Alternatively, you could also use torch.utils.benchmark, which will add warmup iterations and synchronize the code for you.
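
For reference, a minimal sketch of the synchronized version of the manual-gradient timing (same shapes and hyperparameters as above; the only substantive change is the two torch.cuda.synchronize() calls around the timed region):

import time
import torch

device = torch.device('cuda:0')
eta = 0.01
X = torch.nn.Linear(2048, 256, bias=False).to(device)
target = torch.randn(100, 256, device=device)
w = torch.zeros(100, 2048, device=device, requires_grad=True)

torch.cuda.synchronize()              # finish pending kernels before starting the timer
start_time = time.time()
for k in range(5000):
    Res = X(w) - target
    loss = 0.5 * (Res**2).sum()
    w.data.add_(-eta * Res.mm(X.weight.data))   # same manual update, written as Res @ X.weight
torch.cuda.synchronize()              # wait for the queued GPU work before stopping the timer
manual_time = time.time() - start_time
print('without autograd (synchronized) -- Runtime: {}, loss: {}'.format(manual_time, loss.item()))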

I modified my code and rewrote it using torch.utils.benchmark with warmup iterations; however, I still see the same behavior: using autograd is much slower than calculating the gradient manually, and the difference depends on the GPU you use.

my code:

import timeit
import time
from torch.autograd import grad
import torch.nn as nn
import torch

INPUT_SIZE = 256
HIDDEN_SIZE = 2048
BASIS_SIZE = [INPUT_SIZE, HIDDEN_SIZE]
batch_size = 100
device_name = 'cuda:2'
device = torch.device(device_name)

# X and target are passed to the timers via globals
X = torch.nn.Linear(BASIS_SIZE[1], BASIS_SIZE[0], bias=False).to(device)
target = torch.randn([batch_size, INPUT_SIZE]).to(device)

setup = '''import torch
import time
from torch.autograd import grad
import torch.nn as nn
INPUT_SIZE = 256
HIDDEN_SIZE = 2048
BASIS_SIZE = [INPUT_SIZE, HIDDEN_SIZE]
batch_size = 100
device_name = 'cuda:2'
device = torch.device(device_name)
eta=0.01

w = torch.FloatTensor(batch_size,HIDDEN_SIZE).fill_(0).to(device)
w.requires_grad = True'''

no_autograd='''
Res = -target+X(w)
loss = 0.5*(Res**2).sum()
w.data = w.data.add(-eta*X.weight.data.t().mm(Res.T).T)
'''
autograd='''
Res = -target+X(w)
loss = 0.5*(Res**2).sum()
w_grad = grad(outputs=loss, inputs=w)
w.data = w.data.add(-eta*w_grad[0].data)'''


t0 = timeit.Timer(
    stmt=no_autograd,
    setup=setup,
    globals={'X': X,'target':target})

t1 = timeit.Timer(
    stmt=autograd,
    setup=setup,
    globals={'X': X,'target':target})
num_iter = 2000
# warm-up
t0.timeit(num_iter)
noautograd_time = t0.timeit(num_iter) / num_iter * 1e6
print(f'noautograd:  {round(noautograd_time,2)} us')
# warm-up
t1.timeit(num_iter)
autograd_time = t1.timeit(num_iter) / num_iter * 1e6
print(f'autograd:      {round(autograd_time,2)} us')
print('Using autograd is {}x slower than without autograd'.format(round(autograd_time/noautograd_time,3)))

Result (on 1080Ti):

noautograd:  165.49 us
autograd:      592.85 us
Using autograd is 3.583x slower than without autograd

Result (on Tesla T4):

noautograd:  146.71 us
autograd:      401.26 us
Using autograd is 2.735x slower than without autograd

Thanks for the code. It seems you are using timeit instead of torch.utils.benchmark.
Adding this utility to your code, I get a ~2x slowdown, which might be expected given that you are calculating the gradients manually for a single parameter and not in a larger model.
Autograd thus adds some overhead, which is visible for tiny workloads (and if that's your use case, I don't see an issue with manually calculating the gradients if needed).
I would expect to see a benefit with a larger model and when specifying some internal parameters, as you would otherwise have to write the backward pass manually.
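
For completeness, a minimal sketch of that comparison with torch.utils.benchmark.Timer (the manual_step/autograd_step helpers are just my own wrappers around your two update rules; blocked_autorange() takes care of warmup and CUDA synchronization):

import torch
import torch.utils.benchmark as benchmark
from torch.autograd import grad

device = torch.device('cuda:0')
eta = 0.01
X = torch.nn.Linear(2048, 256, bias=False).to(device)
target = torch.randn(100, 256, device=device)
w = torch.zeros(100, 2048, device=device, requires_grad=True)

def manual_step():
    # hand-derived gradient: d(loss)/dw = Res @ X.weight
    Res = X(w) - target
    w.data.add_(-eta * Res.mm(X.weight.data))

def autograd_step():
    loss = 0.5 * ((X(w) - target) ** 2).sum()
    w.data.add_(-eta * grad(outputs=loss, inputs=w)[0])

t_manual = benchmark.Timer(stmt='manual_step()', globals={'manual_step': manual_step})
t_auto = benchmark.Timer(stmt='autograd_step()', globals={'autograd_step': autograd_step})
print(t_manual.blocked_autorange())   # per-iteration time, manual gradient
print(t_auto.blocked_autorange())     # per-iteration time, autograd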

Thank you so much for the help~

I have one last question. If I do want to calculate the gradient of a single parameter in a large model, would torch.autograd be my best bet (in terms of performance), or is there a better method? I am trying to do something like feature visualization for a CNN, where you update the input image while keeping the parameters of a pretrained CNN model fixed.
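
For reference, a minimal sketch of that setup (the small CNN below is just a stand-in for a real pretrained model; only the input image requires grad, so torch.autograd.grad computes gradients only for it):

import torch
import torch.nn as nn
from torch.autograd import grad

# Stand-in for a pretrained CNN; in practice this would be a loaded, frozen model.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, 10),
).eval()
for p in model.parameters():
    p.requires_grad_(False)          # freeze the network

img = torch.randn(1, 3, 64, 64, requires_grad=True)   # the only tensor we update
eta = 0.1

for step in range(100):
    objective = model(img)[0, 3]     # e.g. maximize one output unit's activation
    img_grad, = grad(outputs=objective, inputs=img)
    with torch.no_grad():
        img += eta * img_grad        # gradient ascent on the input image only

With the parameters frozen this way, autograd only tracks the input, so the overhead compared to a hand-written backward pass should be small.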