Model not learning with custom loss function and parameters not updated

JFM_AI · October 10, 2022, 11:36pm

Dear community,

I have an ML model (A GNN to be more precise) with a softmax activation function in its output layer, with num_classes = K.

The way my custom loss function works is as follows (summary):

Takes the output of the whole dataset.
Samples a class for each data point.
Builds a graph structure with at most K nodes.
Runs an evaluation algorithm of my own that returns a value. The higher the value, the better the classification.

Then, I use PyTorch's loss.backward() to compute the gradients.

def train(data):
    optimizer.zero_grad()  # Clear gradients.
    out = model(data.x, data.edge_index)  # Perform a single forward pass.
    L = sampling_classes(values, torch.exp(out))
    graph = build_graph(L)
    loss = loss_value_function(graph, out)
    loss.backward()  # Derive gradients.
    optimizer.step()  # Update parameters based on gradients.
    return loss

def loss_value_function(graph, out):
    val = evaluate_graph(graph)
    return (torch.sum(torch.exp(out)) - torch.sum(torch.tensor(val)))

In loss_value_function, I realised I have to subtract torch.sum(torch.tensor(val)) from torch.sum(torch.exp(out)) (which is always the number of data points); otherwise I break the gradient, and the parameters are not updated if I only return (-1) * torch.sum(torch.tensor(val)). Is this correct? Is there any way I can just return (-1) * torch.sum(torch.tensor(val)) and still get the parameters updated when I perform the backward pass?

I tried returning (-1) * torch.sum(torch.tensor(val)) as the loss and then doing loss = torch.tensor(loss, requires_grad=True), but the parameters are never updated.

Any suggestions would be appreciated.

Thank you!

Cheers.

AlphaBetaGamma96 · October 10, 2022, 11:42pm

Hi @JFM_AI,

When you do torch.tensor(val) you’re breaking your computational graph, which is why your parameters are not updating. So, remove all these torch.tensor and re-run your code and see if it updates then!
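
A minimal sketch of the effect, assuming a toy linear layer in place of the GNN (`model`, `x`, `connected` and `detached` are illustrative names, not from the original code). It only checks whether the resulting tensors still carry autograd history:

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 3)              # stand-in for the GNN (assumption)
x = torch.randn(5, 4)
out = torch.log_softmax(model(x), dim=1)   # log-probabilities

# Staying inside PyTorch keeps the graph: the result has a grad_fn.
connected = out.exp().sum()

# Leaving PyTorch and re-wrapping the value creates a new tensor with no
# history, so nothing links it back to the model parameters.
val = out.detach().numpy().sum()           # plain NumPy scalar
detached = torch.tensor(val)

has_graph = connected.grad_fn is not None
lost_graph = detached.grad_fn is None and not detached.requires_grad
print(has_graph, lost_graph)
```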

JFM_AI · October 10, 2022, 11:51pm

Hi @AlphaBetaGamma96, thank you for your reply. I have tried this; however, if I remove torch.tensor(), then loss.backward() raises AttributeError: 'numpy.float64' object has no attribute 'backward':

def loss_value_function(graph, out):
    val = evaluate_graph(graph)
    return np.sum(val)

ptrblck · October 11, 2022, 5:56am

You cannot use numpy arrays directly as this will also break the computation graph. Either use PyTorch operations only or write a custom autograd.Function if you need to use numpy operations.
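
As a rough sketch of the second option, here is a custom autograd.Function whose forward pass runs in NumPy. The operation (an elementwise square) and the class name `NumpySquare` are hypothetical, chosen only because the derivative is easy to write by hand; this approach assumes the NumPy computation is differentiable and that you can supply its gradient yourself:

```python
import numpy as np
import torch

class NumpySquare(torch.autograd.Function):
    """Hypothetical example: y = x**2 computed in NumPy, backward written by hand."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        result = np.square(x.detach().cpu().numpy())   # NumPy does the forward math
        return torch.as_tensor(result, dtype=x.dtype, device=x.device)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * 2 * x                     # dy/dx = 2x, supplied manually

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
loss = NumpySquare.apply(x).sum()
loss.backward()
grad = x.grad.tolist()
print(grad)
```

A score obtained by sampling classes and evaluating a discrete graph generally has no such derivative, so this route may not apply to every custom evaluation.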

JFM_AI · October 11, 2022, 6:31am

Hi @ptrblck, thank you for your response. That’s true. I also tried return torch.sum(torch.tensor(VK, requires_grad=True)), but the computational graph breaks and the parameters don’t update either.
However, if I do return (torch.sum(torch.exp(out)) - torch.sum(torch.tensor(VK))), then the parameters are updated, although I find this pretty weird. Any suggestions?

ptrblck · October 11, 2022, 7:14am

Recreating a tensor will break the computation graph as already mentioned. If your second approach works fine I would guess that another tensor might be attached to a computation graph (out) while VK acts as a constant and won’t affect the gradient calculation.
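
To illustrate both points, here is a small sketch with a toy parameter `w` standing in for the model (an assumption; `VK` is just a fixed number here). It compares the gradient with and without the subtracted constant, then shows that a recreated tensor with requires_grad=True is a new leaf that sends nothing back to `w`:

```python
import torch

torch.manual_seed(0)
w = torch.randn(3, requires_grad=True)   # stand-in for a model parameter (assumption)
VK = 7.5                                 # value computed outside autograd

# Loss with the constant subtracted.
out = w * 2.0
(out.sum() - torch.tensor(VK)).backward()
grad_with_constant = w.grad.clone()

# Same loss without the constant.
w.grad = None
out = w * 2.0
out.sum().backward()
grad_without_constant = w.grad.clone()

same_gradient = torch.equal(grad_with_constant, grad_without_constant)

# A freshly created tensor with requires_grad=True is a new leaf: backward()
# fills its own .grad, but nothing flows back to w.
w.grad = None
leaf = torch.tensor(VK, requires_grad=True)
leaf.sum().backward()
w_untouched = w.grad is None
print(same_gradient, w_untouched)
```

In other words, the parameters move only because of the term built from out; the subtracted value changes the loss number but not its gradient.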

JFM_AI · October 12, 2022, 12:05am

Thank you @ptrblck.

Yes, this is what I thought. out is a tensor, and it is constant. Then, I just use VK as a value to be subtracted from out. The greater VK, the smaller the loss.

Thank you

JFM_AI · October 14, 2022, 12:39am

OK, I found the problem. I was doing random sampling on my log_softmax output layer, and sampling breaks the gradient. To update the parameters when random sampling is involved, we need to create a categorical distribution, sample from that distribution, and then update the parameters using the gradient of the log-probability of the sampled classes, weighted by the value (the score-function, or REINFORCE, estimator):

def train(data):
    optimizer.zero_grad()  # Clear gradients.
    out = model(data.x, data.edge_index)  # Perform a single forward pass.
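    # out holds log-probabilities (log_softmax), so pass it as logits;
    # the first positional argument of Categorical is probs.
    m = torch.distributions.categorical.Categorical(logits=out)  # Creates a categorical distribution.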
    L = m.sample() # Samples from the distribution. Each datapoint is assigned to a class.
    graph = build_graph(L)
    val = loss_value_function(graph)
    loss  = -m.log_prob(L) * val 
    loss.sum().backward()  # Derive gradients.
    optimizer.step()  # Update parameters based on gradients.
    return loss

def loss_value_function(graph):
    val = evaluate_graph(graph)
    return val
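
For anyone who wants to try the idea in isolation, below is a self-contained sketch of one such training step. It assumes a toy linear layer instead of the GNN, log-probability outputs, and a made-up scoring function `toy_value` in place of build_graph and evaluate_graph; none of these come from the code above:

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 3)                     # stand-in for the GNN (assumption)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(8, 4)

def toy_value(labels):
    # Hypothetical, non-differentiable score: 2.0 for class 0, 1.0 otherwise.
    return 1.0 + (labels == 0).float()

before = model.weight.detach().clone()

optimizer.zero_grad()
out = torch.log_softmax(model(x), dim=1)          # log-probabilities, shape (8, 3)
m = torch.distributions.Categorical(logits=out)
L = m.sample()                                    # one class per data point
val = toy_value(L)                                # treated as a constant by autograd
loss = (-m.log_prob(L) * val).sum()               # score-function surrogate loss
loss.backward()
optimizer.step()

weights_changed = not torch.equal(before, model.weight.detach())
print(weights_changed)
```

The gradient reaches the parameters through m.log_prob(L), not through the sampled labels or the value, which is why the value can come from arbitrary non-PyTorch code.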

Hope this helps.

Cheers,