Hi everyone,
I am working on a multi-label text classification task with XLNet, and I am using BCEWithLogitsLoss as the loss function (sigmoid + binary cross-entropy).
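For reference, this is roughly my current setup (a minimal sketch with made-up shapes, not my actual XLNet pipeline):

```python
import torch

# BCEWithLogitsLoss applies a sigmoid internally and then computes
# binary cross-entropy per class. Shapes/values below are illustrative only.
criterion = torch.nn.BCEWithLogitsLoss()
logits = torch.randn(4, 10, requires_grad=True)  # [batch_size, num_classes]
labels = torch.randint(0, 2, (4, 10)).float()    # multi-hot targets
loss = criterion(logits, labels)
loss.backward()  # gradients flow back to the logits
```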
I am not satisfied with the model's performance, and one of the modifications I tried is changing the loss function to the following one, which is not implemented in PyTorch. It has 3 steps:
1- Apply sigmoid to the output logits
2- Select active classes using the dynamic threshold 'mean + standard deviation' (i.e. a class is active (=1) if its value is greater than the threshold, and inactive (=0) otherwise)
3- Apply the Hamming loss (the normalized count of positions where the predicted and target class values differ, i.e. an XOR)
I tried to implement it as a simple function using only torch operations, so that back-propagation would be handled automatically by autograd. Here is my code:
```python
import torch

def hamming_loss(logits, labels):
    """
    Returns the Hamming loss between thresholded logits and labels.
    logits and labels have shape [batch_size, num_classes].
    """
    batch_size, num_classes = labels.shape
    # Step 1: apply sigmoid to the output logits
    probs = torch.sigmoid(logits)
    # Step 2: select active classes with the dynamic threshold MpSD (mean plus std dev)
    for i in range(batch_size):
        threshold = probs[i].mean() + probs[i].std()
        for j in range(num_classes):
            # probs[i][j] = 1 if probs[i][j] > threshold, else 0
            probs[i][j] = torch.floor(probs[i][j] - threshold + 1)
    # Step 3: Hamming loss = fraction of positions where prediction != target
    mismatch_count = torch.tensor(0.0, requires_grad=True)
    for i in range(batch_size):
        for j in range(num_classes):
            if probs[i][j] != labels[i][j]:
                # out-of-place add so the running count is actually kept
                mismatch_count = mismatch_count + 1
    return mismatch_count / (batch_size * num_classes)
```
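For reference, I believe the loops above are equivalent to this vectorized version (a sketch under that assumption, with dummy tensors standing in for my real batch):

```python
import torch

logits = torch.randn(4, 10)
labels = torch.randint(0, 2, (4, 10)).float()

probs = torch.sigmoid(logits)                                        # step 1
threshold = probs.mean(dim=1, keepdim=True) + probs.std(dim=1, keepdim=True)
preds = (probs > threshold).float()                                  # step 2: hard 0/1 predictions
loss = (preds != labels).float().mean()                              # step 3: Hamming loss
```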
Now I would like to know whether my approach is correct, and how I can make sure that back-propagation works properly and that no operation goes untracked by autograd.
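For example, this is the kind of check I have in mind (a small sketch using the function above with dummy tensors):

```python
import torch

logits = torch.randn(2, 5, requires_grad=True)
labels = torch.randint(0, 2, (2, 5)).float()
loss = hamming_loss(logits, labels)
print(loss.grad_fn)   # None would mean the loss is detached from the graph
loss.backward()
print(logits.grad)    # stays None if no gradient reaches the logits
```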
Also, I can run the training with this loss function without any code error, but I run into a CUDA out-of-memory issue:
```
CUDA out of memory. Tried to allocate 168.00 MiB (GPU 1; 31.75 GiB total capacity; 24.93 GiB already allocated; 67.69 MiB free; 30.70 GiB reserved in total by PyTorch
```
Could this change be the source of the problem (e.g. the stored gradients being huge)? And how can I solve it?
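For reference, I am watching GPU memory with the standard torch.cuda utilities (nothing custom here):

```python
import torch

# Inspect memory around the loss computation during training
print(torch.cuda.memory_allocated() / 1024**2, "MiB currently allocated")
print(torch.cuda.memory_reserved() / 1024**2, "MiB reserved by the caching allocator")
print(torch.cuda.memory_summary())  # detailed breakdown per memory pool
```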
Could you please help me out!