Model not learning with custom loss function and parameters not updated

Dear community,

I have an ML model (a GNN, to be more precise) with a softmax activation function in its output layer, with num_classes = K.

The way my custom loss function works is as follows (summary):

  • Takes the output of the model for the whole dataset.
  • Samples a class for each data point.
  • Builds a graph structure with at most K nodes.
  • Runs an evaluation algorithm of my own that returns a value. The higher the value, the better the classification.

Then I call PyTorch's loss.backward() to compute the gradients.

def train(data):
    optimizer.zero_grad()  # Clear gradients.
    out = model(data.x, data.edge_index)  # Perform a single forward pass.
    L = sampling_classes(values, torch.exp(out))
    graph = build_graph(L)
    loss = loss_value_function(graph, out)
    loss.backward()  # Derive gradients.
    optimizer.step()  # Update parameters based on gradients.
    return loss

def loss_value_function(graph, out):
    val = evaluate_graph(graph)
    return (torch.sum(torch.exp(out)) - torch.sum(torch.tensor(val)))

In the line return (torch.sum(torch.exp(out)) - torch.sum(torch.tensor(val))), I realised I have to subtract torch.sum(torch.tensor(val)) from torch.sum(torch.exp(out)) (which is always the number of data points). Otherwise I break the gradient: if I only return (-1) * torch.sum(torch.tensor(val)), the parameters are not updated. Is this correct? Is there any way I can just return (-1) * torch.sum(torch.tensor(val)) and still get the parameters updated when I perform the backward pass?

I tried returning (-1) * torch.sum(torch.tensor(val)) as the loss and then doing loss = torch.tensor(loss, requires_grad=True), but the parameters are never updated.

Any suggestions would be appreciated.

Thank you!



When you do torch.tensor(val) you’re breaking your computational graph, which is why your parameters are not updating. So, remove all of these torch.tensor calls, re-run your code, and see if it updates then!
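A minimal sketch of why this happens, using a toy tensor w as a stand-in for the model parameters (w and the other names here are hypothetical, not from the original code). Wrapping a computed value in torch.tensor copies the number into a brand new leaf tensor, so nothing links it back to w, even with requires_grad=True:

```python
import torch

w = torch.ones(3, requires_grad=True)    # stand-in for model parameters
y = (w * 2).sum()                         # built from w, so it carries a grad_fn
has_history = y.grad_fn is not None

# Re-wrapping the value copies the number and drops the history.
detached = torch.tensor(y.item())
rewrapped = torch.tensor(y.item(), requires_grad=True)  # new leaf, unrelated to w

(rewrapped * 1.0).backward()              # runs, but only fills rewrapped.grad
w_grad_after_rewrapped = w.grad           # still None: nothing flowed back to w

y.backward()                              # the original tensor does reach w
```

After the last line w.grad is populated, while the backward pass through rewrapped never touched it.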

Hi @AlphaBetaGamma96, thank you for your reply. I have tried this. However, if I remove torch.tensor(), then loss.backward() fails with this error: AttributeError: 'numpy.float64' object has no attribute 'backward':

def loss_value_function(graph, out):
    val = evaluate_graph(graph)
    return np.sum(val)

You cannot use numpy arrays directly as this will also break the computation graph. Either use PyTorch operations only or write a custom autograd.Function if you need to use numpy operations.
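As a rough illustration of the custom autograd.Function route, here is a sketch that runs a NumPy operation in forward and supplies its derivative by hand in backward. NumpySquare is a made-up example, not something from the code in this thread, and the approach only helps when the NumPy computation has a derivative you can write down, which may not be the case for an evaluation over sampled discrete classes:

```python
import numpy as np
import torch

class NumpySquare(torch.autograd.Function):
    """Computes x**2 with NumPy in forward; the gradient is written by hand."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        # NumPy op: autograd cannot see inside this call.
        result = np.square(x.detach().cpu().numpy())
        return torch.from_numpy(result).to(x.device)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * 2 * x   # d(x^2)/dx = 2x

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
loss = NumpySquare.apply(x).sum()
loss.backward()                      # x.grad now holds 2 * x
```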

Hi @ptrblck, thank you for your response. That’s true. I also tried return torch.sum(torch.tensor(VK, requires_grad=True)), but the computational graph still breaks and the parameters don’t update either.
However, if I do return (torch.sum(torch.exp(out)) - torch.sum(torch.tensor(VK))), then the parameters are updated, which I find pretty weird. Any suggestions?

Recreating a tensor will break the computation graph as already mentioned. If your second approach works fine I would guess that another tensor might be attached to a computation graph (out) while VK acts as a constant and won’t affect the gradient calculation.
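A small sketch of that point, with made-up logits standing in for the model output: the gradient that reaches the logits is the same whatever constant is subtracted, so the value term has no influence on the update. In this toy case the gradient is also close to zero, because each row of exp(out) sums to one.

```python
import torch

def grad_wrt_logits(constant):
    # Toy logits in place of the GNN output (assumed values).
    logits = torch.tensor([[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]], requires_grad=True)
    out = torch.log_softmax(logits, dim=1)
    # Same shape of loss as in the question: graph-attached term minus a constant.
    loss = torch.sum(torch.exp(out)) - torch.sum(torch.tensor(constant))
    loss.backward()
    return logits.grad

g_small = grad_wrt_logits(1.0)
g_large = grad_wrt_logits(1000.0)    # a much better "value", same gradient
```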

Thank you @ptrblck .

Yes, this is what I thought. out is a tensor attached to the computation graph, and VK acts as a constant. So I just use VK as a value to be subtracted from out. The greater VK, the smaller the loss.

Thank you

OK, I found the problem: I was doing random sampling on the output of my log_softmax layer, and that breaks the gradient. To update the parameters when sampling randomly, we need to create a categorical distribution, sample from it, and then update the parameters using the log-derivative trick (the score-function, or REINFORCE, estimator):

def train(data):
    optimizer.zero_grad()  # Clear gradients.
    out = model(data.x, data.edge_index)  # Perform a single forward pass.
    # out comes from log_softmax (log-probabilities), so pass it as logits, not probs.
    m = torch.distributions.categorical.Categorical(logits=out)
    L = m.sample()  # Sample from the distribution: one class per data point.
    graph = build_graph(L)
    val = loss_value_function(graph)  # Non-differentiable score, treated as a constant.
    loss = -m.log_prob(L) * val  # Score-function surrogate: higher val, lower loss.
    loss.sum().backward()  # Derive gradients.
    optimizer.step()  # Update parameters based on gradients.
    return loss

def loss_value_function(graph):
    val = evaluate_graph(graph)
    return val
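For anyone who wants to try the idea without the GNN, here is a self-contained sketch of the same pattern. The linear model, the random inputs, and toy_value are all stand-ins I made up for illustration; toy_value plays the role of evaluate_graph and returns a plain Python float.

```python
import torch

torch.manual_seed(0)

num_points, num_features, num_classes = 8, 4, 3
x = torch.randn(num_points, num_features)
model = torch.nn.Sequential(
    torch.nn.Linear(num_features, num_classes),
    torch.nn.LogSoftmax(dim=1),          # outputs log-probabilities
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

def toy_value(labels):
    # Hypothetical non-differentiable score: 1 plus the fraction of points in class 0.
    return 1.0 + float((labels == 0).float().mean())

weights_before = model[0].weight.detach().clone()

optimizer.zero_grad()
out = model(x)
m = torch.distributions.Categorical(logits=out)
labels = m.sample()                       # one class per data point
val = toy_value(labels)                   # no graph needed for the score
loss = -(m.log_prob(labels) * val).sum()  # gradient flows through log_prob only
loss.backward()
optimizer.step()

weights_changed = not torch.equal(weights_before, model[0].weight.detach())
```

The score itself never needs a gradient here: it only scales the log-probability of the sampled classes, which is the part attached to the model.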

Hope this helps.
