This is related to this post I made last week.
I tried to code simple linear regression (minimizing (y - Xw)^2) in two ways:
1. manually computing the derivative;
2. using only the torch.autograd.grad function.
I see a significant speed difference between the two methods, and the size of the gap varies across devices. Here is my code; you can run it on Google Colab:
import torch
import time
from torch.autograd import grad
import torch.nn as nn
INPUT_SIZE = 256
HIDDEN_SIZE = 2048
BASIS_SIZE = [INPUT_SIZE, HIDDEN_SIZE]
batch_size = 100
device_name = 'cuda:0'
device = torch.device(device_name)
print("Running on " + device_name)
eta = 0.01  # learning rate
# fixed random "design matrix", implemented as a bias-free linear layer
X = torch.nn.Linear(BASIS_SIZE[1], BASIS_SIZE[0], bias=False).to(device)
target = torch.randn([batch_size, INPUT_SIZE]).to(device)
# parameters to fit, one row per sample, initialised to zero
w = torch.FloatTensor(batch_size, HIDDEN_SIZE).fill_(0).to(device)
w.requires_grad = True
start_time = time.time()
for k in range(5000):
    Res = -target + X(w)
    loss = 0.5 * (Res ** 2).sum()  # only used for logging
    # hand-coded gradient step: d(loss)/dw = Res @ X.weight
    w.data = w.data.add(-eta * X.weight.data.t().mm(Res.T).T)
manual_time = time.time()-start_time
print('without autograd -- Runtime: {}, loss: {}'.format(manual_time,loss))
# re-initialise everything for the autograd version
X = torch.nn.Linear(BASIS_SIZE[1], BASIS_SIZE[0], bias=False).to(device)
target = torch.randn([batch_size, INPUT_SIZE]).to(device)
w = torch.FloatTensor(batch_size, HIDDEN_SIZE).fill_(0).to(device)
w.requires_grad = True
start_time = time.time()
for k in range(5000):
    Res = -target + X(w)
    loss = 0.5 * (Res ** 2).sum()
    # grad() returns a tuple with one entry per input
    w_grad = grad(outputs=loss, inputs=w)
    w.data = w.data.add(-eta * w_grad[0].data)
auto_time = time.time()-start_time
print('Using autograd -- Runtime: {}, loss: {}'.format(auto_time,loss))
print('Using autograd is {}x slower than without autograd'.format(round(auto_time/manual_time,3)))
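One caveat about my timing: CUDA kernels launch asynchronously, and I read the clock with time.time() without calling torch.cuda.synchronize() first, so the GPU numbers may be slightly off. A synchronised variant of the timing would look roughly like this (just a sketch of how the loops above could be wrapped; the results below were measured without it):

if device.type == 'cuda':
    torch.cuda.synchronize(device)  # make sure pending GPU work is done before starting the clock
start_time = time.time()
for k in range(5000):
    ...  # same loop body as above
if device.type == 'cuda':
    torch.cuda.synchronize(device)  # wait for the GPU to finish before stopping the clock
elapsed = time.time() - start_time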
Results:
On my 1080Ti:
Running on cuda:2
without autograd -- Runtime: 1.0496978759765625, loss: 0.0001592347107362002
Using autograd -- Runtime: 2.8707797527313232, loss: 0.00017234217375516891
Using autograd is 2.735x slower than without autograd
On my CPU:
Running on cpu
without autograd -- Runtime: 6.226852655410767, loss: 0.00017741750343702734
Using autograd -- Runtime: 9.111348152160645, loss: 0.00020874488109257072
Using autograd is 1.463x slower than without autograd
On the Tesla T4 provided by Google Colab:
Running on cuda:0
without autograd -- Runtime: 0.8135528564453125, loss: 0.00016563094686716795
Using autograd -- Runtime: 2.0092411041259766, loss: 0.000202339724637568
Using autograd is 2.47x slower than without autograd
Can someone confirm whether I am using autograd correctly, or explain why the behavior differs across devices?
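For reference, I believe the autograd update could also be written without assigning to .data, roughly like this (just a sketch, not the version I benchmarked above):

for k in range(5000):
    Res = -target + X(w)
    loss = 0.5 * (Res ** 2).sum()
    # grad() returns a tuple; take the gradient with respect to w
    w_grad = grad(outputs=loss, inputs=w)[0]
    # apply the step outside the graph instead of touching .data
    with torch.no_grad():
        w -= eta * w_grad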