Why am I getting this error about a Leaf Variable?

I’m trying to train some entity embeddings, but during the backward pass I get an error about a leaf variable being moved into the graph interior. Other threads about this error point to in-place operations or assignments into tensors, but I’m not doing anything like that, so I’m unsure what’s causing it.

My code is based on the code here; here’s my network:

import torch

class GameEmbedding(torch.nn.Module):
    def __init__(self, num_games, num_links, embedding_dim=64):
        super(GameEmbedding, self).__init__()
        self.game_embedding = torch.nn.Embedding(num_games, embedding_dim, max_norm=1.0)
        self.link_embedding = torch.nn.Embedding(num_links, embedding_dim, max_norm=1.0)
        self.embedding_dim = embedding_dim

    def forward(self, batch):
        # in the batch each input is [game, link, label]
        # label is 1 (true) or -1 (false)
        t1 = self.game_embedding(torch.LongTensor([v[0] for v in batch]))
        t2 = self.link_embedding(torch.LongTensor([v[1] for v in batch]))
        dot_products = torch.bmm(
            t1.contiguous().view(len(batch), 1, self.embedding_dim),
            t2.contiguous().view(len(batch), self.embedding_dim, 1)
        )
        return dot_products.contiguous().view(len(batch))

Here’s my training loop, skipping the initialization and hyperparameters:

for i in range(num_epochs):
    for j in range(num_steps_per_epoch):
        optimizer.zero_grad()
        minibatch = build_minibatch(num_positives, num_negatives)
        y = model.forward(minibatch)
        target = torch.FloatTensor([v[2] for v in minibatch])
        loss = loss_function(y, target)
        if i == 0 and j == 0:
            print('r: loss = %.3f' % float(loss))
        loss.backward(retain_graph=True)
        optimizer.step()
    print('%s: loss = %.3f' % (i, float(loss)))
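
In case it’s relevant, the setup I’m skipping looks roughly like this; the loss, optimizer, and sizes below are placeholders rather than my exact values:

# rough placeholder for the skipped initialization, not the exact values
model = GameEmbedding(num_games=10000, num_links=50000, embedding_dim=64)
loss_function = torch.nn.SoftMarginLoss()  # placeholder loss; the labels are +1 / -1
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# build_minibatch(num_positives, num_negatives) returns a list of
# [game_index, link_index, label] triples, with label 1 for true pairs and -1 for negatives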

And here’s the detailed error output:

Traceback (most recent call last):
  File "train.py", line 88, in <module>
    loss.backward(retain_graph=True)
  File "/mnt/pool/code/gamesearch/env/lib/python3.7/site-packages/torch/tensor.py", line 150, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/mnt/pool/code/gamesearch/env/lib/python3.7/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: leaf variable has been moved into the graph interior

Sorry I can’t ask a more specific question, but after reading up on other cases of this error and on how leaf variables work, I have no idea why it’s happening here.

The loss.backward(retain_graph=True) part looks suspicious to me.
If you check the official docs, you’ll see that retain_graph is generally something to avoid.

Will you try backward without retain_graph=True and let us know? Also, is your train.py available somewhere to look at?

My whole train.py is here.

retain_graph=True is from the code I’m using as a base. I tried removing it but my error didn’t change.

Sorry, I meant to say retain_graph=False.

Hi,

If you can try to get a small code sample that reproduces the issue, that would help a lot to find out what the problem is.
My approach for this is usually, in a new file:

  • copy/paste your module definitions
  • copy/paste your training inner loop
  • add code that generates one random input

If the error does not appear anymore, add back more things from your model initialization / data preprocessing.
If the error still occurs, start removing stuff from this new file until it disappears.
Once you have such a minimal example, could you post it here?
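
For your case, a stripped-down file could look roughly like this; the table sizes and batch size are just placeholders, and I inlined the two embedding lookups instead of copying the whole GameEmbedding class:

import torch

# 1) the module, reduced to the two embedding tables from GameEmbedding
game_embedding = torch.nn.Embedding(1000, 64, max_norm=1.0)
link_embedding = torch.nn.Embedding(1000, 64, max_norm=1.0)

# 2) one random input: batch_size (game, link) index pairs
batch_size = 500
games = torch.randint(0, 1000, (batch_size,))
links = torch.randint(0, 1000, (batch_size,))

# 3) the inner loop reduced to a single forward/backward, no optimizer needed
t1 = game_embedding(games)
t2 = link_embedding(links)
dot_products = torch.bmm(t1.view(batch_size, 1, 64), t2.view(batch_size, 64, 1)).view(batch_size)
dot_products.sum().backward()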

So I worked on making a minimal example of this, and while fiddling with the parameters I found that reducing the minibatch size makes the error go away. I’m glad it’s fixed, but this doesn’t make sense to me: should changing the minibatch size change my computation graph? Based on the error message I would never have guessed it was related to my batch size.

The simplified version of the code I made is here, though the fix is the same as in the original.

That’s interesting. I can run your code sample with no issue locally, even after increasing the batch size drastically to (5000, 5000).
Which version of PyTorch are you using?
Does running this code consistently give you the error above?

I ran the minimal version’s code several more times; at 500 it gives me the error every time, and at 499 it works.

I’m using torch 1.3.0, installed via pip, on Linux.

OK, running with 1.3 on Linux I can repro.
Also, I am quite worried, as I reduced your example even further to:

import torch
import torch.nn as nn

# <= 1000 works
# > 1000 does not work
val = 1001

mod = torch.nn.Embedding(100, 64, max_norm=1.0)
v0 = torch.rand(val).mul(100).long()

out = mod(v0)
print(out.grad_fn)
print(out.grad_fn.next_functions) # This should show an AccumulateGrad. If it's a CopySlices, it will fail below
print(out.sum().backward())

Could you check what the breakpoint value is for you? Is it the same?

Your code works for me at 999 but fails at 1000 as well as 1001. As you suggest in your comment, at 1000 it’s a CopySlices, but at 999 it’s an AccumulateGrad.

I guess it’s safe to say this is a bug?

Yes!
I opened an issue here: https://github.com/pytorch/pytorch/issues/28370
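
In the meantime, a possible workaround (not tested against your exact setup) is to drop max_norm from the Embedding modules and clamp the row norms yourself after each optimizer step, for example with torch.renorm:

# sketch of a manual alternative to max_norm=1.0; note that max_norm only renormalizes
# the rows that were looked up, while this clamps every row of the table
embedding = torch.nn.Embedding(100, 64)  # no max_norm here

# ... then, after optimizer.step() in the training loop:
with torch.no_grad():
    embedding.weight.copy_(torch.renorm(embedding.weight, p=2, dim=0, maxnorm=1.0))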
