The computation graph is breaking in the outer loop of meta-learning: meta-gradients are None (FOMAML)

Gradient computation in meta-learning: the computation graph breaks in the outer loop (first-order MAML).

I am relying on this PyTorch Lightning tutorial: https://uvadlc-notebooks.readthedocs.io/en/latest/tutorial_notebooks/tutorial16/Meta_Learning.html

In outer_loop()

p_global.grad is None (every param.grad in self.model.parameters() is None, while those of local_model.parameters() are not). As a consequence, p_global.grad += p_local.grad cannot be performed:

    for p_global, p_local in zip(self.model.parameters(), local_model.parameters()):
        p_global.grad += p_local.grad  # First-order approx. -> add gradients of fine-tuned and base model

It could be related to

    local_model = deepcopy(self.model)

in adapt_few_shot(), but deepcopy() is needed. Or maybe create_graph=True is needed?

I tried the following, but I am not sure that it is correct:

    for p_global, p_local in zip(self.model.parameters(), local_model.parameters()):
        if p_global.grad is None:
            p_global.grad = torch.zeros_like(p_local.grad)
        p_global.grad += p_local.grad

Doing this leads to:

p_global.grad = p_local.grad rather than p_global.grad += p_local.grad, since after each outer update (opt.step() followed by opt.zero_grad()) every p_global.grad in self.model.parameters() becomes None again.

Code:

    def adapt_few_shot(self, support_imgs, support_targets):
        # Determine prototype initialization
        support_feats = self.model(support_imgs)
        prototypes, classes = ProtoNet.calculate_prototypes(support_feats, support_targets)
        support_labels = (classes[None,:] == support_targets[:,None]).long().argmax(dim=-1)
        # Create inner-loop model and optimizer
        local_model = deepcopy(self.model)
        local_model.train()
        local_optim = optim.SGD(local_model.parameters(), lr=self.hparams.lr_inner)
        local_optim.zero_grad()
        # Create output layer weights with prototype-based initialization
        init_weight = 2 * prototypes
        init_bias = -torch.norm(prototypes, dim=1)**2
        output_weight = init_weight.detach().requires_grad_()
        output_bias = init_bias.detach().requires_grad_()

        # Optimize inner loop model on support set
        for _ in range(self.hparams.num_inner_steps):
            # Determine loss on the support set
            loss, _, _ = self.run_model(local_model, output_weight, output_bias, support_imgs, support_labels)
            # Calculate gradients and perform inner loop update
            loss.backward()
            local_optim.step()
            # Update output layer via SGD
            # (https://discuss.pytorch.org/t/the-difference-between-torch-tensor-data-and-torch-tensor/25995/4):
            with torch.no_grad():
                output_weight.copy_(output_weight - self.hparams.lr_output * output_weight.grad)
                output_bias.copy_(output_bias - self.hparams.lr_output * output_bias.grad)

            # Reset gradients
            local_optim.zero_grad()
            output_weight.grad.fill_(0)
            output_bias.grad.fill_(0)

        # Re-attach computation graph of prototypes
        output_weight = (output_weight - init_weight).detach() + init_weight
        output_bias = (output_bias - init_bias).detach() + init_bias

        return local_model, output_weight, output_bias, classes

    def outer_loop(self, batch, mode="train"):
        accuracies = []
        losses = []
        self.model.zero_grad()

        # Determine gradients for batch of tasks
        for task_batch in batch:
            imgs, targets = task_batch
            support_imgs, query_imgs, support_targets, query_targets = split_batch(imgs, targets)
            # Perform inner loop adaptation
            local_model, output_weight, output_bias, classes = self.adapt_few_shot(support_imgs, support_targets)
            # Determine loss of query set
            query_labels = (classes[None,:] == query_targets[:,None]).long().argmax(dim=-1)
            loss, preds, acc = self.run_model(local_model, output_weight, output_bias, query_imgs, query_labels)
            # Calculate gradients for query set loss
            if mode == "train":
                loss.backward()

                for p_global, p_local in zip(self.model.parameters(), local_model.parameters()):
                    p_global.grad += p_local.grad  # First-order approx. -> add gradients of finetuned and base model

            accuracies.append(acc.mean().detach())
            losses.append(loss.detach())

        # Perform update of base model
        if mode == "train":
            opt = self.optimizers()
            opt.step()
            opt.zero_grad()

        self.log(f"{mode}_loss", sum(losses) / len(losses))
        self.log(f"{mode}_acc", sum(accuracies) / len(accuracies))

    def training_step(self, batch, batch_idx):
        self.outer_loop(batch, mode="train")
        return None  # Returning None means we skip the default training optimizer steps by PyTorch Lightning

I have checked the following discussions:

It seems like you are implementing first-order MAML (foMAML).

To achieve this, keep in mind that PyTorch does not initialize the gradient attributes of a module/parameter during instantiation, so these are set to None. As a result, when you attempt to add the inner-loop grads to the meta-model, you are adding a tensor to None (the uninitialized meta-model's gradient attributes).
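For example, this behavior can be seen with a standalone snippet (not part of the tutorial code):

    import torch
    import torch.nn as nn

    layer = nn.Linear(4, 2)
    print(layer.weight.grad)        # None: .grad is only allocated by a backward pass (or manually)

    layer(torch.randn(3, 4)).sum().backward()
    print(layer.weight.grad.shape)  # torch.Size([2, 4])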

You can resolve this by manually setting the meta-model's parameters' .grad attributes to a zero tensor with the same shape as the learnable weights, e.g. using torch.zeros_like. Or add an if statement that sets the outer-loop grad to p_local.grad when p_global.grad is None.

Thank you @Harbar-Inbound for your response. Indeed it is FOMAML.

Yes, I have tried the following, but I am not sure that it is correct:

    for p_global, p_local in zip(self.model.parameters(), local_model.parameters()):
        if p_global.grad is None:
            p_global.grad = torch.zeros_like(p_local.grad)
        p_global.grad += p_local.grad

Doing this leads to:

p_global.grad = p_local.grad rather than p_global.grad += p_local.grad, since after each outer update (opt.step() followed by opt.zero_grad()) every p_global.grad in self.model.parameters() becomes None again.

You are describing expected behavior, as opt.zero_grad() sets the .grad attributes of the optimized parameters to None.

See also the difference between step, backward, and zero_grad.

If you want to change this behavior (i.e., set grads to zero instead of None), see the set_to_none parameter of torch.optim.Optimizer.zero_grad().
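For instance (assuming a recent PyTorch version, where set_to_none defaults to True):

    # Keep .grad as zero-filled tensors instead of None after zeroing,
    # so the += accumulation across tasks keeps working
    opt.zero_grad(set_to_none=False)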

If you want to check that gradients are properly applied, you can implement the following:

  1. retain a copy of the meta-model (e.g., deep-copy),
  2. step with an SGD optimizer (without any momentum, etc.),
  3. compare the difference between the updated meta-model and copied model, and the gradient of the local-model.

If the result of step 3 is torch.allclose, then gradients were applied properly*.

*These results may differ slightly unless you use an optimizer without momentum, etc., and a learning rate of 1.0.
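A minimal sketch of that check, assuming plain SGD at lr=1.0 and that the meta-gradients have already been accumulated on self.model (the variable names are illustrative):

    from copy import deepcopy

    import torch

    # 1. Retain a copy of the meta-model before the update
    model_before = deepcopy(self.model)

    # 2. Step with plain SGD (no momentum, no weight decay) at lr=1.0
    opt = torch.optim.SGD(self.model.parameters(), lr=1.0)
    opt.step()

    # 3. With lr=1.0, old_param - new_param should equal the applied gradient
    for p_before, p_after in zip(model_before.parameters(), self.model.parameters()):
        assert torch.allclose(p_before.data - p_after.data, p_after.grad)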

To be sure I understand your trick @Harbar-Inbound, you mean testing the following:

    opt.step()
    torch.allclose(p_global.grad, p_local.grad)
    torch.allclose(p_global, p_local)
    opt.zero_grad()

?