Leaf variable was used in an inplace operation

I wrote a function that uses several operations, including torch.mm, torch.index_select and torch.cc. However, it fails with an AssertionError: leaf variable was used in an inplace operation.

In the source code of Variable.py (line 199), I found the assertion assert self.__version == 0, but it is not clear what is going wrong here. Could anyone help me with this?

Loosely, tensors you create directly are leaf variables. Tensors that are the result of a differentiable operation are not leaf variables.

For example:

w = torch.tensor([1.0, 2.0, 3.0]) # leaf variable
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True) # also leaf variable
y = x + 1  # not a leaf variable

(The PyTorch documentation for is_leaf contains a more precise definition.)

An in-place operation is something which modifies the data of a variable. For example:

x += 1  # in-place
y = x + 1 # not in place

PyTorch doesn’t allow in-place operations on leaf variables that have requires_grad=True (such as parameters of your model) because the developers could not decide how such an operation should behave. If you want the operation to be differentiable, you can work around the limitation by cloning the leaf variable (or use a non-inplace version of the operator).

x2 = x.clone()  # clone the variable
x2 += 1  # in-place operation

If you don’t intend for the operation to be differentiable, you can use torch.no_grad:

with torch.no_grad():
    x += 1
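
As a small check of the torch.no_grad route (a sketch, assuming a recent PyTorch release), the update below changes the values of x in place while x remains a leaf that requires grad:

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

with torch.no_grad():
    x += 1  # not recorded by autograd, so no error is raised

print(x)                # values have each been increased by 1
print(x.is_leaf)        # still a leaf
print(x.requires_grad)  # still requires grad
```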

Is it bad practice to get around PyTorch disallowing in-place operations by assigning to Variable().data?

That depends on the situation.
For example, to initialize or update parameters, assigning to .data is the way to go. Usually, you cannot backprop when changing Variables’ .data in the middle of a forward pass…
I’d summarise my experience as “don’t do it unless you have reason to believe it’s the right thing”.

Best regards

Thomas
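
To illustrate the point about changing .data in the middle of a forward pass, here is a hedged sketch, assuming a recent PyTorch release. The tensors and the toy computation are made up for the example. Autograd does not notice the change made through .data, so backward runs without an error but uses the modified values:

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

a = x * 2            # intermediate result in the forward pass
b = (a ** 2).sum()   # autograd keeps `a` to compute the gradient of the power

a.data.zero_()       # modify through .data: autograd does not see this change

b.backward()
# The correct gradient would be 8 * x, but it is computed from the zeroed `a`.
print(x.grad)
```

Doing this for initialization, before any forward pass has used the tensor, avoids the problem because nothing has been saved for backward yet.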

Yes, I remember now, messing with backprop and autograd was why I was running into problems with in-place assignment before. Using .data as I am currently for initialising word embeddings seems ok then.

Hi,

Trying to read the code for optim (I want to implement something a bit differently) and your previous example/explanation of what is a leaf Variable doesn’t seem to be valid anymore.

In particular,
you wrote

y = x + 1  # not a leaf variable

Well, here’s the output from my terminal for the code you mentioned:

>>> x = torch.autograd.Variable(torch.Tensor([1, 2, 3, 4]))
>>> x.is_leaf
True
>>> y = x + 1
>>> y.is_leaf
True
>>> y
Variable containing:
 2
 3
 4
 5
[torch.FloatTensor of size (4,)]

So, can someone please explain what is a leaf Variable, and what is not a leaf variable? Clearly a non-leaf-variable cannot be optimized, but what is it?

I came across a similar issue. The reason is requires_grad.

x = torch.autograd.Variable(torch.Tensor([1, 2, 3, 4]), requires_grad=True)
x.is_leaf    
#True
y = x + 1
y.is_leaf
#False
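
The same comparison with the current tensor API, as a sketch assuming a recent PyTorch release. A result computed only from tensors with requires_grad=False is still reported as a leaf, which matches the terminal output quoted earlier in the thread:

```python
import torch

a = torch.tensor([1.0, 2.0])                      # requires_grad=False
b = a + 1                                         # nothing to differentiate, so still a leaf
c = torch.tensor([1.0, 2.0], requires_grad=True)  # leaf created by the user
d = c + 1                                         # result of a tracked operation

# Columns: name, is_leaf, requires_grad, has a grad_fn
for name, t in [("a", a), ("b", b), ("c", c), ("d", d)]:
    print(name, t.is_leaf, t.requires_grad, t.grad_fn is not None)
```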

Hi, I used to create a leaf variable like this:

y = torch.autograd.Variable(torch.zeros([batch_size, c, h, w]), requires_grad=True)

Then I want to assign values to indexed parts of y as shown below. (y_local is a Variable computed from other variables; I want to assign the value of y_local to part of y and ensure that gradients from y can flow to y_local.)

y.data[:,:,local_x[i]:local_x[i+1],local_y[i]:local_y[i+1]] = y_local.data

Does such an operation support the normal backward pass of gradients to the y_local variable?

I am also facing a similar issue. Please let me know how you were able to resolve it.

A leaf variable, in essence, is a variable, that is, a tensor with requires_grad=True. So a tensor with requires_grad=False is not a variable at all, let alone a leaf variable.

But y is a variable in this case, it’s not a leaf node.

Could you please explain why it’s not correct?

That usually isn’t correct… well… what if it IS correct? What should I do?
I’m pretty sure of what I’m doing, but I can’t do it in PyTorch.

I had a similar case. I used:

y = torch.zeros([batch_size, c, h, w], requires_grad=False)

Then I updated the values of y from the network output and applied a loss function to y, and it worked for me.
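
A hedged sketch of that approach, assuming a recent PyTorch release. The sizes, the tensor p (standing in for network parameters) and the slice bounds are made up for illustration. Assigning to a slice of y directly, without going through .data, lets autograd record the copy, so gradients from y reach y_local:

```python
import torch

batch_size, c, h, w = 2, 1, 4, 4   # illustrative sizes

p = torch.ones(batch_size, c, 2, 2, requires_grad=True)  # hypothetical parameters
y_local = p * 3                    # computed from other tensors, so not a leaf

y = torch.zeros(batch_size, c, h, w)   # plain buffer, requires_grad=False
y[:, :, 0:2, 0:2] = y_local            # assign directly, without .data

loss = y.sum()
loss.backward()

print(y.requires_grad)  # y now depends on y_local
print(p.grad)           # d(loss)/dp is 3 for every entry
```

By contrast, copying through y.data and y_local.data moves only the values, so no gradient path from y to y_local is recorded.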

Is the reason this is not “usually correct” that we could have just initialized it directly with the data we wanted in the first place, instead of doing an in-place op?

@colesbury can you address this question please?

Why is y a leaf when you claim it should not be?

I also agree it shouldn’t be a leaf, but PyTorch disagrees with us… why?

@pinocchio, I’m updating my reply and correcting the example. The not “usually correct” wasn’t a good explanation. The actual reason is that the PyTorch developers could not come to a consensus on reasonable semantics for such an operation.

I think you are wrong, y is indeed not a leaf. Maybe you had an unusual version of PyTorch?

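def inplace_playground():
    import torch

    x = torch.tensor([1,2,3.], requires_grad=True)  # leaf that requires grad
    y = x + 1  # result of a tracked operation, so not a leaf
    print(f'x.is_leaf = {x.is_leaf}')
    print(f'y.is_leaf = {y.is_leaf}')
    x += 1  # in-place op on a leaf that requires grad: this line raises an error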

output:

x.is_leaf = True
y.is_leaf = False
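
The last line of that function, x += 1, is the operation this thread is about. A small sketch, assuming a recent PyTorch release where the failure surfaces as a RuntimeError rather than the AssertionError from the original post:

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

try:
    x += 1  # in-place op on a leaf that requires grad
    raised = False
except RuntimeError as err:
    raised = True
    print(err)  # the message names the leaf / in-place problem

print(raised)
```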

@colesbury I think you were correct. I’m not sure what you corrected, but I tried the leaf example and it seems you’re right that y is not a leaf (as expected).

Thanks!

Why not? What were the competing semantics? What’s the difficulty in defining the semantics for leaves + in-place ops?

Perhaps you could try wrapping the weight update in the following context manager:

with torch.no_grad():

and it works:

import torch

N, D_in, H, D_out = 64, 1000, 100, 10

# use the GPU when one is available, otherwise fall back to the CPU
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

X = torch.randn(N, D_in, device=device)
y = torch.randn(N, D_out, device=device)

# w1 and w2 are leaf tensors that require grad
w1 = torch.randn(D_in, H, requires_grad=True, device=device)
w2 = torch.randn(H, D_out, requires_grad=True, device=device)

learning_rate = 1e-6
for it in range(10):
    # 1. forward pass
    y_pred = X.mm(w1).clamp(min=0).mm(w2)
    # 2. compute loss
    loss = (y_pred - y).pow(2).sum()
    print(f'iter {it}, loss {loss.item()}')
    # 3. backward pass
    loss.backward()
    # 4. update w1 and w2 in place, without recording the update in autograd
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad
        # reset the gradients so they do not accumulate across iterations
        w1.grad.zero_()
        w2.grad.zero_()