I recently updated to the master branch on 08/03/2018 (PyTorch version: 0.4.0a0+363de58) so that I could take advantage of torch.tensor(data, ..., requires_grad=True). I went through some of the tutorials to ensure that I was getting equivalent answers.
However, in the basic neural network autograd example, my losses came out slightly different (off by a percent or so) despite using exactly the same initialisations. The discrepancy only got worse as I increased the size of the initial parameters (losses were off by around 10% or more).
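One thing worth ruling out here, offered as an assumption rather than a confirmed diagnosis: if copy.copy on a tensor is a shallow copy that shares memory with the original, the in-place updates in the manual function would also change the saved "initial" weights before they are returned. The sketch below uses clone() instead, which allocates new memory; the names w1_snapshot and snapshot_is_independent are made up for illustration.

```python
import torch

torch.manual_seed(0)
w1 = torch.randn(4, 3)

# clone() allocates new memory, so the snapshot does not alias w1.
w1_snapshot = w1.clone()

# In-place update, in the same style as the manual backprop loop.
w1 -= 0.1 * torch.ones_like(w1)

# The snapshot still holds the pre-update values and lives at a different address.
snapshot_is_independent = (
    w1_snapshot.data_ptr() != w1.data_ptr()
    and not torch.equal(w1_snapshot, w1)
)
print(snapshot_is_independent)
```

Running the same check with copy.copy(w1) in place of w1.clone() would show whether the saved weights track the in-place update in a given PyTorch build.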
I also found that, inside the for loop, you can no longer call loss.backward() for more than one iteration. This is because the gradients of the weights are set to None after the first iteration, so you have to recast the weights as Variables or torch.tensor() separately for each iteration.
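A possible explanation for the missing gradients, stated as an assumption: an update written as w = w - lr * grad rebinds the name to the result of an operation, which is a non-leaf tensor, and autograd only populates .grad on leaf tensors. A minimal sketch with a toy loss (the names w and w_new are hypothetical):

```python
import torch

w = torch.ones(3, requires_grad=True)   # leaf tensor created directly by the user
loss = (2.0 * w).pow(2).sum()            # loss = sum(4 * w^2), so d(loss)/dw = 8 * w
loss.backward()
grad_before = w.grad.clone()             # populated, because w is a leaf

# Out-of-place update: the result is a new tensor produced by an operation.
w_new = w - 0.1 * w.grad

print(w.is_leaf)      # the original weight is a leaf
print(w_new.is_leaf)  # the updated weight is not, so backward() will not fill its .grad
```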
In addition to this, Python in-place special methods such as __iadd__, i.e. w1 += learning_rate*grad_w1 or w1 -= learning_rate*grad_w1, can no longer be used, as they raise: RuntimeError: a leaf Variable that requires grad has been used in an in-place operation.
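For comparison, here is a minimal sketch of a loop that keeps the weights as leaf tensors by performing the in-place update inside torch.no_grad(), so the update itself is not recorded by autograd. It assumes a build where torch.no_grad() is available; the sizes and learning rate mirror the reproduction script, and the losses list is added only for illustration.

```python
import torch

torch.manual_seed(0)
N, D_in, H, D_out = 6, 100, 10, 1
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)
w1 = torch.randn(D_in, H, requires_grad=True)
w2 = torch.randn(H, D_out, requires_grad=True)
learning_rate = 1e-6

losses = []
for t in range(5):
    # Forward pass and loss, as in the tutorial.
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    loss = (y_pred - y).pow(2).sum()
    losses.append(loss.item())

    # Populates w1.grad and w2.grad, since both are leaf tensors.
    loss.backward()

    # Update in place without tracking, then reset the accumulated gradients.
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad
        w1.grad.zero_()
        w2.grad.zero_()
```

Because w1 and w2 are never rebound to new tensors, they stay leaves and loss.backward() can be called on every iteration.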
I understand that, as this has only been implemented recently, the likelihood of bugs is high. It may also be a mistake on my part, but if so, I can’t see it.
Please find the code to reproduce this below. OS: macOS El Capitan, Python 3.6, PyTorch version: 0.4.0a0+363de58.
# -*- coding: utf-8 -*-
import torch
import copy

dtype = torch.FloatTensor


def simple_nn_torch():
    '''
    A simple nn in torch following the pytorch tutorial. Manually construct forward and backward pass
    '''
    # N, D_in, H, D_out = 64, 1000, 100, 10  # discrepancies grow much faster with larger sizes; compare the
    # output of both functions as the sizes are varied. For example:
    N, D_in, H, D_out = 6, 100, 10, 1
    # Create random input and output data
    x = torch.randn(N, D_in).type(dtype)
    y = torch.randn(N, D_out).type(dtype)
    # randomly initialize weights
    w1 = torch.randn(D_in, H).type(dtype)
    w2 = torch.randn(H, D_out).type(dtype)
    # keep copies of the initial weights to pass to the autograd version
    w3 = copy.copy(w1)
    w4 = copy.copy(w2)
    # print('w1 init : ', w3)
    # print('w2 init : ', w4)
    learning_rate = 1e-6
    for t in range(1):
        # Forward pass: compute predicted y
        h = x.mm(w1)
        h_relu = h.clamp(min=0)
        y_pred = h_relu.mm(w2)
        # compute and print the loss
        loss = (y_pred - y).pow(2).sum()
        # print(t, loss)
        # Backprop manually to compute the gradients of the loss with respect to w1 and w2
        grad_y_pred = 2.0 * (y_pred - y)
        grad_w2 = h_relu.t().mm(grad_y_pred)
        grad_h_relu = grad_y_pred.mm(w2.t())
        grad_h = grad_h_relu
        grad_h[h < 0] = 0
        grad_w1 = x.t().mm(grad_h)
        # update the weights using gradient descent
        print(t)
        print(grad_w1)
        print(grad_w2)
        w1 -= learning_rate*grad_w1
        w2 -= learning_rate*grad_w2
    print(' the loss with manual backprop {}'.format(loss))
    return x, y, w3, w4


def simple_nn_torch_with_ag(x, y, w1, w2):
    '''
    A simple nn in torch following the pytorch tutorial - with autograd
    '''
    # N, D_in, H, D_out = 64, 1000, 100, 10 not needed as the data is passed in from above.
    # same initializations as above
    x = torch.tensor(x, requires_grad=False)
    y = torch.tensor(y, requires_grad=False)
    w1 = torch.tensor(w1, requires_grad=True)
    w2 = torch.tensor(w2, requires_grad=True)
    learning_rate = 1e-6
    # print('w1', w1)
    # print('w2', w2)
    for t in range(1):
        # Forward pass: compute predicted y
        y_pred = x.mm(w1).clamp(min=0).mm(w2)
        # compute and print the loss
        loss = (y_pred - y).pow(2).sum()
        # print(t, loss.item())
        # loss.backward()  # It seems to only allow you to differentiate once, as the grad property of the weights
        # disappears in the next iteration.
        # grad_w1 = torch.autograd.grad(loss, [w1], retain_graph=True)  # Just making sure this works.
        # grad_w2 = torch.autograd.grad(loss, [w2], retain_graph=True)
        grads = torch.autograd.grad(loss, [w1, w2], retain_graph=True)
        print(t)
        print(grads[0])
        print(grads[1])
        # update the weights using gradient descent
        # w1 -= learning_rate*w1.grad.data  # raises in-place error
        # w2 -= learning_rate*w2.grad.data
        w1 = w1 - learning_rate*grads[0]
        w2 = w2 - learning_rate*grads[1]
        # zero the gradients to ensure that they are not accumulated with each epoch
        # print(w1.grad.data)  # Error: does not exist in the second iteration, so the lines below
        # also fail.
        # w1.grad.data.zero_()
        # w2.grad.data.zero_()
    print('The loss with ag {}'.format(loss.item()))


x,y,w1,w2 = simple_nn_torch()
simple_nn_torch_with_ag(x,y,w1,w2)
'''
## Errors
- cannot use Python special methods, i.e. the __iadd__ special method: w1 += ...
- gradients cannot be zeroed in the autograd example. This is because the gradients of the weights seem to be set to None
after being used once. This means that the weights have to be reinitialised as variables ( torch.tensor() ) again inside the for
loop. This restricts the use of backward; torch.autograd.grad seems to work fine.
- For readability the N, D_in, H and D_out params were set to smaller values. For both functions, the losses should be
the same; however, they are not. This only gets worse as the number of iterations
and the size of the matrices increase.
I am not too sure why these errors occur. If you look at the gradients, they are close but never exact, and sometimes
they can be quite different, despite the fact that they should be equal. We traverse the graph manually in the
first function to construct the gradients, and I would have assumed that torch.autograd.grad does the same behind the scenes.
'''
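To check the expectation that the manual and autograd gradients should be equal, the two can be compared directly on the same weights. This is a minimal sketch under stated assumptions: it uses float64 to keep rounding noise small, gives autograd its own copies of the weights via clone(), and the names w1_ag, w2_ag and grads_match are made up for illustration.

```python
import torch

torch.manual_seed(0)
N, D_in, H, D_out = 6, 100, 10, 1
x = torch.randn(N, D_in, dtype=torch.float64)
y = torch.randn(N, D_out, dtype=torch.float64)
w1 = torch.randn(D_in, H, dtype=torch.float64)
w2 = torch.randn(H, D_out, dtype=torch.float64)

# Manual forward and backward pass, as in simple_nn_torch.
h = x.mm(w1)
h_relu = h.clamp(min=0)
y_pred = h_relu.mm(w2)
grad_y_pred = 2.0 * (y_pred - y)
grad_w2 = h_relu.t().mm(grad_y_pred)
grad_h = grad_y_pred.mm(w2.t())
grad_h[h < 0] = 0
grad_w1 = x.t().mm(grad_h)

# Autograd on independent copies of the same weights.
w1_ag = w1.clone().requires_grad_(True)
w2_ag = w2.clone().requires_grad_(True)
loss = (x.mm(w1_ag).clamp(min=0).mm(w2_ag) - y).pow(2).sum()
ag_w1, ag_w2 = torch.autograd.grad(loss, [w1_ag, w2_ag])

# Compare the two sets of gradients up to floating-point tolerance.
grads_match = torch.allclose(grad_w1, ag_w1) and torch.allclose(grad_w2, ag_w2)
print(grads_match)
```

If the gradients agree here but not in the full script, the difference is more likely to come from the inputs each function receives than from autograd itself.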