To(device) causing 'ValueError: can't optimize a non-leaf Tensor'

Hello, I am currently implementing an automated machine learning algorithm with weight inheritance. I implemented my NNs as networkx graphs. Since I mutate the graphs from one generation to the next, I add new layers and hence need to move those to the CUDA device again. I do this in a for loop over the nodes of the graph in the forward pass. In each iteration of the loop, I first move the node's nn.Module to the CUDA device and then forward the data through it. My issue is that at some point I get the error message above. Is there any way to avoid this issue?
def forward(self, inputs):
    # Evaluate the graph in topological ordering
    topological_order = nx.algorithms.dag.topological_sort(self)

    self.nodes[self.get_input_nodes()]['output'] = inputs

    for node in topological_order:
        # try:
        node_info = self.nodes[node]

        preds = list(self.predecessors(node))

        if len(preds) > 0:
            cell_input = [self.nodes[pred]['output'] for pred in preds]
            if node_info['type'] == 'merge':
                node_info['output'] = node_info['op'](cell_input)
            else:
                node_info['output'] = node_info['op'](cell_input[0])

            node_info['params']['output_dim'] = node_info['output'].size()

    return [self.nodes[node]['output'] for node in self.get_output_nodes()][0]

I don’t fully follow, but since you are adding new layers, I imagine you have to pass them to the optimizer at some point, right?
So the leaf tensors are the parameters of the original nn.Module. Once you move such a tensor to CUDA, the result of that copy is a non-leaf tensor whose backward is something like “copy to the CPU leaf node”.
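
To make the leaf / non-leaf distinction concrete, here is a minimal sketch that should reproduce the error from the title. It is not the original poster's code. As an assumption, a dtype conversion stands in for the CPU to CUDA copy so that it also runs without a GPU; both are differentiable copies that produce a non-leaf tensor.

```python
import torch
from torch import nn

# A freshly created parameter is a leaf: nothing in autograd produced it.
weight = nn.Parameter(torch.ones(3))

# Assumption: the dtype conversion stands in for weight.cuda(). Like a device
# copy, it is recorded by autograd, so the result is a non-leaf tensor.
moved = weight.to(torch.float64)

print(weight.is_leaf, weight.grad_fn)
print(moved.is_leaf, moved.grad_fn)

# Optimizers refuse tensors that are not leaves.
try:
    torch.optim.Adam([moved])
    error_message = None
except ValueError as err:
    error_message = str(err)

print(error_message)
```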

So I would suggest passing the CPU module's parameters to the optimizer, or converting the CUDA copy into a leaf tensor.
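
A rough sketch of those two options, assuming a plain nn.Linear as a stand-in for one node's op. The names `non_leaf` and `new_leaf` are only for illustration, and a dtype conversion again stands in for the CUDA copy so the snippet runs on a CPU-only machine.

```python
import torch
from torch import nn

layer = nn.Linear(4, 2)

# Option 1: give the optimizer the module's own parameters, which are leaves.
optimizer = torch.optim.Adam(layer.parameters())

# Option 2: turn a non-leaf copy back into a leaf.
# Assumption: the dtype conversion stands in for layer.weight.cuda().
non_leaf = layer.weight.to(torch.float64)

# detach() cuts the tensor out of the autograd graph, and requires_grad_()
# marks the detached tensor as a trainable leaf again.
new_leaf = non_leaf.detach().requires_grad_()
optimizer_for_copy = torch.optim.Adam([new_leaf])

print(non_leaf.is_leaf, new_leaf.is_leaf)
```

Keep in mind that with option 2 the new leaf is no longer connected to layer.weight, so the module itself has to use the new tensor in its forward pass for training to have any effect.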

Anyway, it would be nice if you could paste some code to reproduce it.

What I do is the following:

  • First, I create some networks by hand and train them (this works perfectly fine).
  • I create a deep copy of the networks that I want to mutate by adding layers.
  • Then I create an evaluator, which trains the mutated child networks (this is where I get the error about the leaf nodes).

This is the code of the evaluator:
class Evaluator:
    def __init__(self, graph: NodeOpGraph, train_loader, *args, **kwargs):
        self.graph = graph
        self.train_loader = train_loader
        self.optimizer = torch.optim.Adam(self.graph.parameters())
        self.criterion = torch.nn.BCELoss(reduction='none')

    def train(self, n_samples_per_epoch, epochs=1, log_interval=10, verbose=True):
        print('Device is {}'.format(device))

        batch_size = next(self.train_loader)[0].shape[0]
        n_steps_per_epoch = int(np.ceil(n_samples_per_epoch / batch_size))

        print('BATCH SIZE:', batch_size)
        print('N_STEPS:', n_steps_per_epoch)

        for epoch in range(epochs):

            if not verbose:
                old_stdout = sys.stdout
                sys.stdout = open(os.devnull, 'w')
            print('EPOCH #', epoch)
            total_loss = 0.
            total_epoch_loss = 0.
            total_size = 0

            for step, (inputs, labels, sample_weights) in enumerate(self.train_loader):
                s = inputs.shape
                inputs = np.reshape(inputs, (s[0], s[3], s[1], s[2]))
                inputs = torch.Tensor(inputs)
                labels = torch.Tensor(labels)
                sample_weights = torch.Tensor(sample_weights).to(device)
                # Run the batch on the same device as the sample weights
                preds = self.graph(inputs.to(device))
                preds = torch.reshape(preds, (preds.shape[0],))
                loss = self.criterion(preds, labels.to(device))
                loss = loss * sample_weights
                loss = loss.mean()

                # Todo float(loss.item()) otherwise maybe memory issues
                total_loss += loss.item()
                total_epoch_loss += loss.item()
                total_size += labels.size(0)

                # if step % log_interval == log_interval-1:
                print('Step {} Avg loss: {}'.format(str(step), str(total_loss / total_size)))
                total_loss = 0.

                if step >= n_steps_per_epoch:
                    break

            if not verbose:
                sys.stdout = old_stdout

            print('*' * 25, '\nEpoch {} Avg loss: {}\n'.format(str(epoch), str(total_epoch_loss / total_size)), '*' * 25)

    def eval(self):

The problem is that, after adding a layer, part of the network is no longer on the GPU by that point. Therefore, I first move each node of the network graph to the GPU. This works fine for the forward passes, but it fails at the optimizer step.
Is it because I moved the graph to the GPU in the forward pass?
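
One way to sidestep the question is to move the module once after each mutation and only then build the optimizer, instead of moving nodes inside forward. Below is a minimal sketch of that ordering. `TinyNet` is a hypothetical stand-in for the graph class (a plain list of layers, not a networkx graph), and the device falls back to the CPU when no GPU is available.

```python
import torch
from torch import nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


class TinyNet(nn.Module):
    """Hypothetical stand-in for the graph: layers applied in sequence."""

    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(4, 4)])

    def forward(self, x):
        for layer in self.layers:
            x = torch.relu(layer(x))
        return x


net = TinyNet().to(device)

# "Mutation": the new layer is created on the CPU by default.
net.layers.append(nn.Linear(4, 4))

# Move the whole module once, outside forward(). Module.to() keeps the
# registered parameters as leaf tensors.
net.to(device)

# Build the optimizer after the mutation so it sees the new parameters too.
optimizer = torch.optim.Adam(net.parameters())

# One training step to check that backward and step both work.
x = torch.randn(8, 4, device=device)
loss = net(x).sum()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

If the optimizer state of the parent network should be kept, `optimizer.add_param_group({'params': new_layer.parameters()})` is an alternative to rebuilding the optimizer from scratch.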

When I first move all the nodes to the CPU with the following code:

    def to_cpu(self):
        topological_order = nx.algorithms.dag.topological_sort(self)

        for node in topological_order:
            # try:
            node_info = self.nodes[node]

and then call:

self.optimizer = torch.optim.Adam(self.graph.parameters())

I get the following warning:

UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the gradient for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See for more informations.

I would already be thankful if you could give me any tips on how to locate the tensors causing this issue :)
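
For locating such tensors, one possible approach is to walk the module tree and report every tensor attribute that is not a leaf. The helper below is a hypothetical sketch (`find_non_leaf_tensors` and `Suspect` are not part of PyTorch or of the code above). It only looks at plain tensor attributes stored on nn.Module instances, and it uses a dtype conversion as a stand-in for `.cuda()` so it runs without a GPU.

```python
import torch
from torch import nn


def find_non_leaf_tensors(module):
    """Return (name, grad_fn) pairs for non-leaf tensors stored as attributes."""
    found = []
    for mod_name, sub in module.named_modules():
        # Tensors that are not nn.Parameter end up as plain attributes.
        for attr, value in vars(sub).items():
            if isinstance(value, torch.Tensor) and not value.is_leaf:
                name = f"{mod_name}.{attr}" if mod_name else attr
                found.append((name, value.grad_fn))
    return found


class Suspect(nn.Module):
    """Hypothetical module with one proper parameter and one non-leaf copy."""

    def __init__(self):
        super().__init__()
        self.good = nn.Parameter(torch.ones(2))
        # Assumption: the dtype conversion stands in for .cuda().
        self.bad = nn.Parameter(torch.ones(2)).to(torch.float64)


report = find_non_leaf_tensors(Suspect())
print(report)
```

With ops stored in node dictionaries, the same check could be run on each node's op in turn, printing the node name next to any hit.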

So, can you paste a standalone snippet with a toy example?
It seems you are wrapping everything in an nn.Module class (the graph).

But maybe node_info is a standard dict? It’s a bit difficult to say.

I think you are doing something like

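import torch
from torch import nn

class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        wrong = nn.Parameter(torch.Tensor([5]))
        # .cuda() returns a non-leaf copy; the leaf itself stays on the CPU
        self.w = wrong.cuda()
        self.good = nn.Parameter(torch.Tensor([5]))

    def forward(self):
        pass

m = Toy()
print(m.w.is_leaf, m.w.grad_fn)
print(m.good.is_leaf, m.good.grad_fn)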

False <CopyBackwards object at 0x7f65a3de7f98>
True None

Keep in mind that this doesn’t have to happen inside __init__; the same thing can occur wherever you instantiate new nodes.
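
For comparison, a sketch of a variant of the toy module that keeps both tensors as leaves. `FixedToy` is a made-up name, and the device falls back to the CPU when CUDA is not available. The idea is to wrap the tensor in nn.Parameter after it is on the target device, or to register it first and move the whole module afterwards.

```python
import torch
from torch import nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


class FixedToy(nn.Module):
    def __init__(self):
        super().__init__()
        # Create the data on the target device first, then wrap it.
        self.w = nn.Parameter(torch.tensor([5.0], device=device))
        # Or register on the CPU and let Module.to() move it later.
        self.good = nn.Parameter(torch.tensor([5.0]))

    def forward(self):
        return self.w + self.good


m = FixedToy().to(device)

print(m.w.is_leaf, m.w.grad_fn)
print(m.good.is_leaf, m.good.grad_fn)

# Both parameters are accepted by the optimizer.
optimizer = torch.optim.Adam(m.parameters())
```

Both print statements should show `True None` here, in contrast to the `False <CopyBackwards ...>` line above.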