Clone and detach used properly in a loss function [FIXED]

Hi,

I have a use case where I’m trying to predict a few targets (6) at the same time. In my custom loss function implementation (called ‘custom_loss’ in the code shared below) I’m using the tensor functions ‘clone’ and ‘detach’ in a way that might be incorrect.

My custom loss function is supposed to be a Spearman (rank) correlation. As mentioned, to be able to apply the Spearman correlation to the different targets at the same time, I might be using ‘clone’ and ‘detach’ incorrectly. After some trial and error, I’ve noticed that without them my custom loss ends up being NaN.

With the implementation I’m sharing, things seem to be working well, but I’m not sure if I’m doing something conceptually wrong. Any feedback or advice would be really appreciated. I’m not sure whether the way I’m solving the problem is negatively affecting the gradients somehow or not passing the proper information.

As a note, the initial ‘targets’ and ‘preds’ tensors have the same shape. The ‘targets’ tensor contains the 6 targets as columns, and the number of rows is the batch size.

Thanks in advance.

import torch
import torchsort

def custom_loss(preds, targets):
    # Negative Spearman correlation: maximizing the correlation minimizes the loss
    return -rank_corr(preds, targets)

def rank_corr(preds_original, targets):
    # Spearman correlation = Pearson correlation computed on the ranked predictions
    preds = rank_preds(preds_original)
    return pearson_corr(preds, targets)

def rank_preds(preds):
    preds_out = preds.detach().clone()
    num_cols = len(preds.T)
    for i in range(num_cols):
        pred = preds.T[i].reshape(1, -1)
        rr = torchsort.soft_rank(pred, regularization_strength=.0001)
        # Scale the soft ranks into (0, 1)
        pred = (rr - .5) / rr.shape[1]
        preds_out.T[i] = pred[0].detach().clone()
    return preds_out

def pearson_corr(preds, targets):
    # Center and normalize each column, then sum the per-target correlations
    preds = preds - preds.mean(dim=0)
    preds = preds / preds.norm(dim=0)
    targets = targets - targets.mean(dim=0)
    targets = targets / targets.norm(dim=0)
    return (preds * targets).sum()
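
For context, a minimal usage sketch (the shapes here are assumptions based on the description above: one row per sample in the batch and 6 target columns):

preds = torch.randn(128, 6, requires_grad=True)   # hypothetical batch of 128 samples, 6 targets
targets = torch.randn(128, 6)
loss = custom_loss(preds, targets)                # scalar: negative sum of the per-target correlations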

I assume preds is the model output and you would like to call backward() on the final loss to calculate the gradients for the parameters in the model.
If so, then detaching the pred tensor would be wrong since you are cutting the computation graph and the model would never get any gradients from this loss.
I would also assume that the backward() call would fail as the loss would not be attached to any computation graph.
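
As a small illustration (a minimal sketch; ‘w’ just stands in for a model parameter): a detached tensor has no grad_fn, so backward() cannot propagate through it.

import torch

w = torch.randn(3, requires_grad=True)    # stands in for a model parameter
attached = (w * 2.0).sum()
print(attached.grad_fn)                   # <SumBackward0 ...> -> gradients can reach w

detached = (w * 2.0).detach().sum()
print(detached.grad_fn)                   # None -> the graph is cut at detach()
# detached.backward()  # would raise: "element 0 of tensors does not require grad and does not have a grad_fn"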

I assume preds is the model output and you would like to call backward() on the final loss to calculate the gradients for the parameters in the model.

Your assumption is totally right.

If so, then detaching the pred tensor would be wrong since you are cutting the computation graph and the model would never get any gradients from this loss.

Any suggestion on how I could implement this behaviour without cutting the computation graph, so that the model would get the gradients from this loss?

I would also assume that the backward() call would fail as the loss would not be attached to any computation graph.

The call itself does not return an error, but if I don’t .detach().clone() I start getting a NaN loss after a few epochs (that’s not happening with the current implementation, which, as I imagined, is conceptually wrong).

Thanks in advance.

Not calling detach() would be the right approach and we should try to narrow down where the NaN values are coming from. Are you seeing the NaNs already in the model output (or any intermediate activations) or only after the loss calculation?
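
One way to keep the computation graph intact would be to build the ranked tensor directly from the soft_rank outputs instead of writing them into a detached copy. A minimal sketch (rank_preds_differentiable is just an illustrative name, not part of the code above):

def rank_preds_differentiable(preds):
    # Rank each target column with soft_rank; gradients flow through torchsort
    ranked_cols = []
    for col in preds.T:
        rr = torchsort.soft_rank(col.reshape(1, -1), regularization_strength=.0001)
        ranked_cols.append((rr[0] - .5) / rr.shape[1])   # scale the soft ranks into (0, 1)
    return torch.stack(ranked_cols, dim=1)               # same shape as preds, still attached to the graph

rank_corr could then call this instead of rank_preds, and the loss would carry gradients back to the model.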

If I don’t call detach, after a few epochs the ‘preds_original’ tensor in the ‘rank_corr’ function is the first to become a full NaN tensor. After that, and for the following iterations, even the output of the model is a full NaN tensor. I’m a bit lost on the reason why. Here is the training loop, pretty simple:

def train(model, optimizer, full_train_df, feature_names, target_names, batch_id_list):
    train_df = full_train_df.copy()
    for epoch in range(5000):
        batch_count = 0
        acc_loss_train = 0
        for batch in batch_id_list:
            # Build the feature/target tensors for the current group-based batch
            features = (torch.tensor(train_df[train_df.group_id == batch].filter(items=feature_names).values)).cuda()
            targets = (torch.tensor(train_df[train_df.group_id == batch].filter(items=target_names).values)).cuda()
            optimizer.zero_grad()
            model.train()
            outputs = model(features)
            loss = custom_loss(outputs, targets)
            acc_loss_train += loss   # accumulate the loss for epoch-level reporting
            loss.backward()
            optimizer.step()
            batch_count += 1

        loss_train = acc_loss_train / batch_count
        if epoch % 5 == 0:
            print(f'Epoch: {epoch}; Loss train: {loss_train.data.item()}')
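
To narrow down where the NaNs first show up, one option (a sketch that assumes the names used in the loop above) is to check the model output and the loss for each batch, and optionally enable anomaly detection:

torch.autograd.set_detect_anomaly(True)   # raise an error during backward() pointing at the op that produced NaN gradients

outputs = model(features)
if torch.isnan(outputs).any():
    print('NaNs already present in the model output')

loss = custom_loss(outputs, targets)
if torch.isnan(loss).any():
    print('NaNs introduced during the loss calculation')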

Removing the .detach() worked. I had an issue in the creation of a middle layer of the NN that was generating the NaNs.

Thanks for pointing to that possibility!

Good to hear it’s working now! :slight_smile: