Implementing new loss function (sigmoid+thresh+hamming loss)

hleb · July 9, 2020, 9:29am

Hi everyone,

I am working on a multi-label text classification with XLNet and I am using BCEWithLogitsLoss as loss function (sigmoid + CrossEntropy).

I am not satisfied of the model performance and one of the modifications I tried is to change the loss function to the following one that is not implemented in pytorch, it has 3 steps:

1- Apply sigmoid to output logits
2- Select active classes using dynamic threshold ‘Mean + Standard deviation’ (i.e. class is active (=1) if its value is greater than threshold, (=0) otherwise)
3- Apply hamming loss (normalized sum of different class values (xor) in predicted and target output)

I tried to implement it with simple function using only torch operations so that the back-propagation could be automatically implemented. Here is my code:

def hamming_loss(logits, labels):
    """
    Returns Hamming Loss of logits and labels
    """
    # logits and labels of shape [batch_size, number_output_classes]
    batch_size = len(labels)
    num_classes = len(labels[0])
    sigmoid = torch.nn.Sigmoid()

    # Applying Sigmoid to output logits
    logits = sigmoid(logits)

    # Selecting active classes in logits using dynamic threshold MpSD (Mean plus Std Dev)
    for i in range(batch_size):
        threshold = logits[i].mean() + logits[i].std()
        for j in range(num_classes):
            # if logits[i][j]>threshold logits[i][j]=1, else 0
            logits[i][j] = torch.floor(logits[i][j]-threshold+1)

    # Computing hamming loss
    hamming_loss = torch.tensor(1, dtype=torch.float, requires_grad=True)
    for i in range(batch_size):
        for j in range(num_classes):
            if (logits[i][j]!=labels[i][j]):
                torch.add(hamming_loss, 1)
    torch.div(hamming_loss, (batch_size*num_classes))

    return hamming_loss

Now I want to know if my approach is correct and how to make sure the back-propagation is well done and there is no operation untracked??

Also, I can run the training using this loss function without code error but I have CUDA out of memory issue

CUDA out of memory. Tried to allocate 168.00 MiB (GPU 1; 31.75 GiB total capacity; 24.93 GiB already allocated; 67.69 MiB free; 30.70 GiB reserved in total by PyTorch

could this change be the source of the problem (the gradients stored being huge)? and how can I solve it?

Could you please help me out !!

ptrblck · July 11, 2020, 8:23am

Your loss implementation won’t work, unfortunately.
You are creating hamming_loss as a new leaf variable, which is not connected to the computation graph. You are using logits to manipulate hamming_loss, but the future backward() call will only calculate the gradients up the the hamming_loss creation, not the complete model (i.e. Autograd will not calculate the gradients for any parameters, which were used to create logits).

Also, torch.add and torch.div are not inplace methods, so you are currently not using the result at all. Inplace operations use an underscore (such as tensor.add_()), but might create errors during the backward call, if you have overwritten an activation, which would be needed for the gradient calculation.

To fix this you would have to calculate the hamming_loss from operations on logits (assuming that logits is the model output).
Also note that threshold operations (such as torch.floor) would create mostly zero gradients.