retain_graph=True works, but why tho?

Hello!

I have the following model and training routine:

model = nn.Sequential(
        nn.Linear(feats_size, feats_size // 2),
        nn.ReLU(),
        nn.Linear(feats_size // 2, feats_size // 4),
        nn.ReLU(),
        nn.Linear(feats_size // 4, num_images),
        nn.Tanh()
)

for batch_id, batch in enumerate(self.t_loader):
        feats = batch[0].to(device)
        labels = batch[1].to(device)
        output = model(feats)
        loss = criterion(output, labels)
        loss.backward(retain_graph=True)
        optimizer.step()
        optimizer.zero_grad()

From other related discussion threads, I gathered that retain_graph=True is required when some part of the computation graph is shared between/across batches. There is clearly no such sharing in the above case, but PyTorch still throws an error without retain_graph=True. What am I missing?

Retaining the graph adds overhead, as the computation graph with the stored intermediate forward activations won’t be freed.
I also don’t see why your code should fail, so could you describe the failure you are expecting in a bit more detail?
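
For reference, the canonical situation that actually needs retain_graph=True is calling backward on the same graph twice. A minimal sketch:

import torch

x = torch.randn(4, requires_grad=True)
loss = (x * x).sum()

loss.backward()  # frees the intermediate tensors saved for the backward pass
loss.backward()  # raises: "Trying to backward through the graph a second time ..."
# passing retain_graph=True to the first backward() would keep the buffers alive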

My bad, by “error” I was referring to the following message, which is raised when retain_graph=True is removed from the above code:

Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.

And I don’t understand why the previous batch’s graph would be required for later backward passes? :thinking:

I cannot reproduce the error using your model definition and your training loop with random input data:

import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

feats_size = 16
num_images = 10
model = nn.Sequential(
        nn.Linear(feats_size, feats_size // 2),
        nn.ReLU(),
        nn.Linear(feats_size // 2, feats_size // 4),
        nn.ReLU(),
        nn.Linear(feats_size // 4, num_images),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

num_samples = 100
dataset = TensorDataset(torch.randn(num_samples, feats_size), torch.randint(0, num_images, (num_samples,)))
loader = DataLoader(dataset, batch_size=5)

criterion = nn.CrossEntropyLoss()
device = 'cpu'

for batch_id, batch in enumerate(loader):
        feats = batch[0].to(device)
        labels = batch[1].to(device)
        output = model(feats)
        loss = criterion(output, labels)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

I also don’t see how this error could be raised in your code as the training loop doesn’t seem to append operations to the computation graph from previous iterations.
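
For comparison, here is a minimal sketch (not from your code, just an illustration with a made-up recurrent model) of a loop that does carry graph state across iterations and raises exactly this error on its second iteration:

import torch
import torch.nn as nn

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
criterion = nn.MSELoss()
hidden = torch.zeros(1, 4, 16)

for step in range(2):
        x = torch.randn(4, 10, 8)
        out, hidden = rnn(x, hidden)  # hidden still points into the previous iteration's graph
        loss = criterion(out, torch.zeros_like(out))
        loss.backward()  # step 1 backpropagates through step 0's already freed graph

The usual fix there is hidden = hidden.detach() at the end of each iteration, so the graphs stay separated per batch.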

Yes! The above code works fine (without retain_graph=True).
It made me realise that the problem is neither with the model nor with the training procedure. It was, of course, with the dataset/loader:

  • For constructing the Dataset, I was loading data (feature tensors) from disk which had requires_grad=True (they were saved that way by some other process).

Just doing data = data.detach() before constructing the Dataset worked. Thanks for helping out :slight_smile:
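
For anyone else hitting this: I’m guessing at the exact mechanism, but the error reproduces when the stored features share autograd history that every batch’s backward has to pass through, e.g. a transform applied once to the whole tensor before building the Dataset (the names below are made up for illustration):

import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

feats_size, num_images, num_samples = 16, 10, 100

raw = torch.randn(num_samples, feats_size, requires_grad=True)
feats = torch.tanh(raw)  # one shared graph node; tanh saves its output for backward
dataset = TensorDataset(feats, torch.randint(0, num_images, (num_samples,)))
loader = DataLoader(dataset, batch_size=5)

model = nn.Linear(feats_size, num_images)
criterion = nn.CrossEntropyLoss()

for x, y in loader:
        loss = criterion(model(x), y)
        loss.backward()  # batch 0 frees tanh's saved output; batch 1 raises the error

Building the dataset from feats.detach() instead removes the shared history, which matches the fix above.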

Oh, that’s tricky to narrow down, but good to hear you’ve found the issue and it’s working now. :slight_smile:
