Strange autograd behaviour with master branch

I updated to the master branch on 08/03/2018 (PyTorch version 0.4.0a0+363de58) so that I could take advantage of torch.tensor(data, ... , requires_grad=True). I then went through some of the tutorials to check that I was getting equivalent answers.

However, in the basic neural network autograd example, my losses came out slightly different (off by a percent or so) despite using exactly the same initialisations. The gap only got worse as I increased the size of the initial parameters (losses were off by around 10% or more).

I also found that, within the for loop, you can no longer call loss.backward() for more than one iteration. The gradients of the weights appear to be set to None after the first iteration, so you have to re-wrap the weights as Variables or with torch.tensor() on each iteration.
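A minimal sketch of what may be going on here, assuming a recent PyTorch release rather than the 0.4.0a0 build above (the tensor names and shapes are hypothetical). Rebinding a weight to the result of an arithmetic expression produces a new non-leaf tensor, and autograd only populates .grad on leaf tensors, which could explain a missing gradient on the second iteration:

```python
import torch

# Hypothetical small example; shapes and values are arbitrary.
w = torch.randn(3, 2, requires_grad=True)   # created directly by the user: a leaf
x = torch.randn(4, 3)

loss = x.mm(w).pow(2).sum()
loss.backward()
leaf_has_grad = w.grad is not None          # .grad is populated on the leaf

# Rebinding to the result of an operation gives a tensor that is part of
# the graph, not a leaf, so backward() would not fill in its .grad.
w_new = w - 1e-6 * w.grad

print("w is leaf:", w.is_leaf)
print("w_new is leaf:", w_new.is_leaf)
```

If this is the cause, the weights do not need to be re-created each iteration; they only need to stay leaf tensors across updates.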

In addition, in-place Python special methods such as __iadd__ and __isub__ (i.e. w1 += learning_rate*grad_w1 or w1 -= learning_rate*grad_w1) can no longer be used on the weights, as they raise: RuntimeError: a leaf Variable that requires grad has been used in an in-place operation.
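One way the in-place update is usually written so that this error does not fire is to perform it while gradient tracking is switched off. The sketch below assumes a PyTorch release that provides torch.no_grad() (it may not match the 0.4.0a0 build above); the seed, sizes and three-iteration loop are illustrative only:

```python
import torch

torch.manual_seed(0)                       # arbitrary seed for repeatability
x = torch.randn(6, 100)
y = torch.randn(6, 1)
w1 = torch.randn(100, 10, requires_grad=True)
w2 = torch.randn(10, 1, requires_grad=True)
learning_rate = 1e-6

losses = []
for t in range(3):
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    loss = (y_pred - y).pow(2).sum()
    losses.append(loss.item())

    loss.backward()

    # In-place updates on leaf tensors are allowed when autograd is not recording.
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad
        # Reset the accumulated gradients before the next iteration.
        w1.grad.zero_()
        w2.grad.zero_()

print(losses)
```

Because w1 and w2 are never rebound, they remain leaf tensors and backward() keeps working on every iteration.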

I understand that, as this was only implemented recently, the likelihood of bugs is high. The mistake may also be mine, but if so, I can’t see it.

Please find the code to reproduce this below. OS: macOS El Capitan, Python 3.6, PyTorch version: 0.4.0a0+363de58

# -*- coding: utf-8 -*-
import torch
import copy
dtype = torch.FloatTensor

def simple_nn_torch():
    """A simple nn in torch following the PyTorch tutorial. Manually construct the forward and backward pass."""

    # N, D_in, H, D_out = 64, 1000, 100, 10  # the discrepancy grows much faster with larger sizes.
    # Compare the output of both functions as these sizes are varied. For example:

    N, D_in, H, D_out = 6, 100, 10, 1

    # Create random input and output data

    x = torch.randn(N, D_in).type(dtype)
    y = torch.randn(N, D_out).type(dtype)

    # randomly initialize weights

    w1 = torch.randn(D_in, H).type(dtype)
    w2 = torch.randn(H, D_out).type(dtype)
    w3  = copy.copy(w1)
    w4 = copy.copy(w2)
    # print('w1 init : ', w3)
    # print('w2 init : ',w4)
    learning_rate = 1e-6

    for t in range(1):
        # Forward pass : compute predicted y
        h = x.mm(w1)
        h_relu = h.clamp(min=0)
        y_pred = h_relu.mm(w2)

        # compute and print the loss
        loss = (y_pred - y).pow(2).sum()
        # print(t, loss)

        # Backprop by hand to compute the gradients of the loss wrt w1 and w2
        grad_y_pred = 2.0 * (y_pred - y)
        grad_w2 = h_relu.t().mm(grad_y_pred)
        grad_h_relu = grad_y_pred.mm(w2.t())
        grad_h = grad_h_relu
        grad_h[h < 0] = 0
        grad_w1 = x.t().mm(grad_h)

        # update the weights using the gradient descent
        w1 -= learning_rate*grad_w1
        w2 -= learning_rate*grad_w2

    print(' the loss with manual backprop {}'.format(loss))
    return x, y, w3, w4

def simple_nn_torch_with_ag(x,y,w1,w2):
    """A simple nn in torch following the PyTorch tutorial - with autograd"""
    # N, D_in, H, D_out = 64, 1000, 100, 10 not needed as the data is passed in from above.
    # same initializations as above

    x = torch.tensor(x, requires_grad=False)
    y = torch.tensor(y, requires_grad=False)

    w1 = torch.tensor(w1, requires_grad=True)
    w2 = torch.tensor(w2, requires_grad= True)

    learning_rate = 1e-6
    # print('w1', w1)
    # print('w2', w2)
    for t in range(1):
        # Forward pass : compute predicted y
        y_pred = x.mm(w1).clamp(min=0).mm(w2)

        # compute and print the loss
        loss = (y_pred - y).pow(2).sum()
        # print(t, loss.item())

        # loss.backward() # It seems to only allow you to differentiate once, as the grad attribute of the weights
        # disappears in the next iteration.
        # grad_w1 = torch.autograd.grad(loss, [w1], retain_graph=True) # Just making sure this works.
        # grad_w2 = torch.autograd.grad(loss, [w2], retain_graph=True)
        grads = torch.autograd.grad(loss, [w1,w2], retain_graph= True)

        # update the weights using gradient descent

        # w1 -= learning_rate*  # raises  inplace error
        # w2 -= learning_rate*
        w1 = w1 -  learning_rate*grads[0]
        w2 = w2 - learning_rate*grads[1]

        # zeros the gradients to ensure that they are not accumulated with each epoch
        # print( # Error: does not exist in the second iteration and so the lines below
        # also fail.
    print('The loss with ag {}'.format(loss.item()))

x,y,w1,w2 = simple_nn_torch()
simple_nn_torch_with_ag(x,y,w1,w2)

## Errors 

- Cannot use in-place Python special methods such as __iadd__, i.e. w1 += ...
- Gradients cannot be zeroed in the autograd example, because the gradients of the weights seem to be set to None
    after being used once. This means that the weights have to be re-created with torch.tensor() on every pass through the for
    loop. This restricts the use of backward(); torch.autograd.grad seems to work fine.
- For readability, N, D_in, H and D_out were set to smaller values. Both functions should give the same loss,
    but they do not. The gap only gets worse as the number of iterations
    and the size of the matrices increase.
    I am not too sure why these errors occur. The gradients are close but never exact, and sometimes
    quite different, even though they should be equal. The first function traverses the graph manually
    to construct the gradients, and I would have assumed that torch.autograd.grad does the same thing behind the scenes.
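One thing that may be worth ruling out is whether w3 and w4 still hold the initial values once w1 and w2 have been updated in place. The sketch below is a hedged check, not a diagnosis: it assumes a recent PyTorch release, uses clone() to make copies that are definitely independent, and compares the hand-written gradients with torch.autograd.grad on identical inputs. The seed and tolerances are arbitrary choices.

```python
import torch

torch.manual_seed(0)                       # arbitrary seed for repeatability
N, D_in, H, D_out = 6, 100, 10, 1
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)
w1 = torch.randn(D_in, H)
w2 = torch.randn(H, D_out)

# Manual forward and backward pass, as in simple_nn_torch.
h = x.mm(w1)
h_relu = h.clamp(min=0)
y_pred = h_relu.mm(w2)
grad_y_pred = 2.0 * (y_pred - y)
grad_w2 = h_relu.t().mm(grad_y_pred)
grad_h = grad_y_pred.mm(w2.t()).clone()
grad_h[h < 0] = 0
grad_w1 = x.t().mm(grad_h)

# Autograd on independent copies of the same weights.
w1_ag = w1.clone().requires_grad_(True)
w2_ag = w2.clone().requires_grad_(True)
loss_ag = (x.mm(w1_ag).clamp(min=0).mm(w2_ag) - y).pow(2).sum()
ag_w1, ag_w2 = torch.autograd.grad(loss_ag, [w1_ag, w2_ag])

# Compare up to float32 rounding.
match_w1 = torch.allclose(grad_w1, ag_w1, rtol=1e-4, atol=1e-3)
match_w2 = torch.allclose(grad_w2, ag_w2, rtol=1e-4, atol=1e-3)
print(match_w1, match_w2)
```

If the gradients agree here but not in the original script, the difference is more likely to come from the weights passed to the second function than from autograd itself.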

I can’t address all of this, but have you looked into retain_variables=True for .backward() (named retain_graph in newer versions)? This will allow you to call backward multiple times on the same graph. See this.
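A small sketch of that suggestion, assuming a PyTorch release where the keyword is retain_graph (the tensor values are made up for illustration). Note that gradients accumulate into .grad across calls, so the second backward() adds to the first:

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)
loss = (w ** 2).sum()

# Keep the graph's buffers so that backward() can be called again.
loss.backward(retain_graph=True)
first = w.grad.clone()       # d(loss)/dw = 2 * w

# The second call reuses the retained graph and accumulates into w.grad.
loss.backward()
second = w.grad.clone()      # 2 * w + 2 * w

print(first, second)
```

In a training loop the forward pass is normally recomputed on every iteration, which builds a fresh graph, so retain_graph is only needed when the same graph is differentiated more than once.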