RuntimeError: One of the differentiated Tensors appears to not have been used in the graph. Set allow_unused=True if this is the desired behavior

For learn2learn, this solved my issue:

Solution:

My current solution is to:

  • make sure the learn2learn loop that manually averages the accumulated gradients (for some reason) skips parameters whose .grad field is None
  • instantiate MAML with allow_unused=True
  • I don't know how to set the flag globally, but a global option might be nice for other people's problems
  • if you're using torch.autograd.grad directly, set its allow_unused=True flag

code:

#!/usr/bin/env python3

"""
Demonstrates how to:
    * use the MAML wrapper for fast-adaptation,
    * use the benchmark interface to load mini-ImageNet, and
    * sample tasks and split them in adaptation and evaluation sets.
To contrast the use of the benchmark interface with directly instantiating mini-ImageNet datasets and tasks, compare with `protonet_miniimagenet.py`.
"""

import random
import numpy as np

import torch
from torch import nn, optim

import learn2learn as l2l
from learn2learn.data.transforms import (NWays,
                                         KShots,
                                         LoadData,
                                         RemapLabels,
                                         ConsecutiveLabels)


def accuracy(predictions, targets):
    predictions = predictions.argmax(dim=1).view(targets.shape)
    return (predictions == targets).sum().float() / targets.size(0)


def fast_adapt(batch, learner, loss, adaptation_steps, shots, ways, device):
    data, labels = batch
    data, labels = data.to(device), labels.to(device)

    # Separate data into adaptation/evaluation sets
    adaptation_indices = np.zeros(data.size(0), dtype=bool)
    adaptation_indices[np.arange(shots * ways) * 2] = True
    evaluation_indices = torch.from_numpy(~adaptation_indices)
    adaptation_indices = torch.from_numpy(adaptation_indices)
    adaptation_data, adaptation_labels = data[adaptation_indices], labels[adaptation_indices]
    evaluation_data, evaluation_labels = data[evaluation_indices], labels[evaluation_indices]

    # Adapt the model
    for step in range(adaptation_steps):
        adaptation_error = loss(learner(adaptation_data), adaptation_labels)
        learner.adapt(adaptation_error)

    # Evaluate the adapted model
    predictions = learner(evaluation_data)
    evaluation_error = loss(predictions, evaluation_labels)
    evaluation_accuracy = accuracy(predictions, evaluation_labels)
    return evaluation_error, evaluation_accuracy


def main(
        ways=5,
        shots=5,
        meta_lr=0.003,
        fast_lr=0.5,
        meta_batch_size=32,
        adaptation_steps=1,
        num_iterations=60000,
        cuda=True,
        seed=42,
):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    device = torch.device('cpu')
    if cuda and torch.cuda.device_count():
        torch.cuda.manual_seed(seed)
        device = torch.device('cuda')

    # Create Tasksets using the benchmark interface
    tasksets = l2l.vision.benchmarks.get_tasksets('mini-imagenet',
                                                  train_samples=2 * shots,
                                                  train_ways=ways,
                                                  test_samples=2 * shots,
                                                  test_ways=ways,
                                                  root='~/data/l2l_data/',
                                                  )

    # Create model
    # model = l2l.vision.models.MiniImagenetCNN(ways)
    from uutils.torch_uu.models.hf_uu.vit_uu import get_vit_get_vit_model_and_model_hps_mi
    model, _ = get_vit_get_vit_model_and_model_hps_mi()
    model.to(device)
    maml = l2l.algorithms.MAML(model, lr=fast_lr, first_order=False, allow_unused=True)
    opt = optim.Adam(maml.parameters(), meta_lr)
    loss = nn.CrossEntropyLoss(reduction='mean')

    for iteration in range(num_iterations):
        opt.zero_grad()
        meta_train_error = 0.0
        meta_train_accuracy = 0.0
        meta_valid_error = 0.0
        meta_valid_accuracy = 0.0
        for task in range(meta_batch_size):
            # Compute meta-training loss
            learner = maml.clone()
            batch = tasksets.train.sample()
            evaluation_error, evaluation_accuracy = fast_adapt(batch,
                                                               learner,
                                                               loss,
                                                               adaptation_steps,
                                                               shots,
                                                               ways,
                                                               device)
            evaluation_error.backward()
            meta_train_error += evaluation_error.item()
            meta_train_accuracy += evaluation_accuracy.item()

            # Compute meta-validation loss
            learner = maml.clone()
            batch = tasksets.validation.sample()
            evaluation_error, evaluation_accuracy = fast_adapt(batch,
                                                               learner,
                                                               loss,
                                                               adaptation_steps,
                                                               shots,
                                                               ways,
                                                               device)
            meta_valid_error += evaluation_error.item()
            meta_valid_accuracy += evaluation_accuracy.item()

        # Print some metrics
        print('\n')
        print('Iteration', iteration)
        print('Meta Train Error', meta_train_error / meta_batch_size)
        print('Meta Train Accuracy', meta_train_accuracy / meta_batch_size)
        print('Meta Valid Error', meta_valid_error / meta_batch_size)
        print('Meta Valid Accuracy', meta_valid_accuracy / meta_batch_size)

        # Average the accumulated gradients and optimize
        for p in maml.parameters():
            if p.grad is not None:
                p.grad.data.mul_(1.0 / meta_batch_size)
        opt.step()

    meta_test_error = 0.0
    meta_test_accuracy = 0.0
    for task in range(meta_batch_size):
        # Compute meta-testing loss
        learner = maml.clone()
        batch = tasksets.test.sample()
        evaluation_error, evaluation_accuracy = fast_adapt(batch,
                                                           learner,
                                                           loss,
                                                           adaptation_steps,
                                                           shots,
                                                           ways,
                                                           device)
        meta_test_error += evaluation_error.item()
        meta_test_accuracy += evaluation_accuracy.item()
    print('Meta Test Error', meta_test_error / meta_batch_size)
    print('Meta Test Accuracy', meta_test_accuracy / meta_batch_size)


if __name__ == '__main__':
    """
python ~/ultimate-utils/tutorials_for_myself/my_l2l/serial_maml_l2l_hf_vit_simple.py
    """
    main()

It depends a bit on how “production-ready” you want this to be.
But if it is for your own code and not for a library, I would say that the following will do the trick just fine :)

import torch
from torch.autograd import grad as original_grad

def new_grad(*args, **kwargs):
    # Force allow_unused=True on every torch.autograd.grad call.
    kwargs['allow_unused'] = True
    return original_grad(*args, **kwargs)

torch.autograd.grad = new_grad
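
A quick sanity check (a minimal sketch, assuming the monkey-patch above has been applied): with allow_unused forced on, differentiating with respect to a tensor that never entered the graph returns None instead of raising the RuntimeError, so downstream code has to be prepared to handle None gradients.

used = torch.randn(3, requires_grad=True)
unused = torch.randn(3, requires_grad=True)
out = (used * 2).sum()

# `unused` plays no part in computing `out`, so without allow_unused=True this
# call would raise the RuntimeError from the title of this thread.
grad_used, grad_unused = torch.autograd.grad(out, [used, unused])
print(grad_used)    # tensor([2., 2., 2.])
print(grad_unused)  # None -- the unused tensor simply gets no gradient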

Could you please explain further what is a leaf tensor? And why is it problematic?

A leaf tensor is a tensor the user has created directly, or which was implicitly created as a parameter or buffer in a module. It thus doesn’t have a gradient history.
The view operation is problematic, since a.view(-1) was never used in the computation graph but was instead only created in the autograd.grad call.
Using a non-leaf tensor by itself in autograd.grad is not problematic as long as it’s part of the computation graph.
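
A minimal sketch of that point (the tensor names here are made up for illustration): differentiating with respect to the leaf itself, or with respect to a view that was actually used in the forward computation, works, while a freshly created view that never entered the graph reproduces the error.

import torch

a = torch.randn(3, requires_grad=True)                  # leaf tensor
out = (a * 2).sum()
print(torch.autograd.grad(out, a, retain_graph=True))   # works: (tensor([2., 2., 2.]),)

# a.view(-1) is a brand-new non-leaf tensor that never took part in computing
# `out`, so autograd has nothing in the graph to differentiate with respect to:
try:
    torch.autograd.grad(out, a.view(-1))
except RuntimeError as err:
    print(err)  # "One of the differentiated Tensors appears to not have been used in the graph..."

# A view that is used in the computation is fine, because it is part of the graph:
b = torch.randn(3, requires_grad=True)
b_flat = b.view(-1)
out2 = (b_flat * 2).sum()
print(torch.autograd.grad(out2, b_flat))                # works: b_flat is a non-leaf node in the graph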

Hi, is autograd.grad compatible with nn.DataParallel then?

import torch
import torch.nn as nn
import torch.nn.functional as F


class Model(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.m1 = nn.Sequential(
            nn.Linear(2, 4),
            nn.ReLU(),
        )
        self.m2 = nn.Linear(4, 4)

    def forward(self, x):
        f = self.m1(x)
        res = self.m2(f)

        return res, f

x = torch.randn(4, 2).cuda()
x.requires_grad = True
y = torch.arange(4).cuda()

model = Model().cuda()
model_p = nn.DataParallel(model)
res, f = model_p(x)
loss_single = F.cross_entropy(res, y)

print(torch.autograd.grad(loss_single, x, retain_graph=True, allow_unused=True)[0])
print(torch.autograd.grad(loss_single, f, retain_graph=True, allow_unused=True)[0])

tensor([[ 0.0000,  0.0000],
        [ 0.0000,  0.0000],
        [ 0.0278, -0.0232],
        [ 0.0015,  0.0003]], device='cuda:0')
None
I tried to check the gradient passed to the intermediate result "f", but it returned None. Is this not a proper way to do it? (This only happens on multiple GPUs.)
Thank you!

I’m unsure about the feature support of nn.DataParallel as it’s already deprecated.
We generally recommend using DistributedDataParallel, which should support your workflow.
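
For reference, a minimal single-process sketch of the same check under DistributedDataParallel (run on CPU with the gloo backend here purely to keep it self-contained; Model is the class from the snippet above). Because DDP runs the forward pass on the local replica rather than scattering and gathering like DataParallel, the returned intermediate f stays connected to the autograd graph:

import os
import torch
import torch.distributed as dist
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process process group, purely for illustration.
os.environ.setdefault('MASTER_ADDR', 'localhost')
os.environ.setdefault('MASTER_PORT', '29500')
dist.init_process_group('gloo', rank=0, world_size=1)

ddp_model = DDP(Model())                  # Model as defined above

x = torch.randn(4, 2, requires_grad=True)
y = torch.arange(4)
res, f = ddp_model(x)
loss_single = F.cross_entropy(res, y)

print(torch.autograd.grad(loss_single, x, retain_graph=True)[0])  # gradient w.r.t. the input
print(torch.autograd.grad(loss_single, f, retain_graph=True)[0])  # gradient w.r.t. the intermediate, no longer None

dist.destroy_process_group()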

For my code, I need to change the shape of a sample, and if I use view or reshape I get the same error. How do I change the shape or add a dimension such that it works?

view and reshape will work, so make sure their output is used in the computation graph. If you get stuck, please post a minimal and executable code snippet to reproduce the issue.
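
A minimal sketch of that advice (the tensor names are made up): reshape first, use the reshaped tensor in the computation, and then differentiate with respect to the tensor that actually appears in the graph.

import torch

x = torch.randn(4, requires_grad=True)
x2 = x.view(2, 2)                 # reshape first...
out = (x2 * 3).sum()              # ...then use the reshaped tensor in the computation

grad_x2, = torch.autograd.grad(out, x2, retain_graph=True)  # works: x2 is part of the graph
grad_x, = torch.autograd.grad(out, x)                       # also works: x is the leaf behind the view
print(grad_x2)
print(grad_x)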