CrossEntropyLoss and using the result of a function of the original class values

Hi,

I have a multiclass classification problem in NLP. For simplicity, let us say there are just three target classes: "SPORT", "CULTURE", "TRAVEL", with corresponding labels 0, 1, 2.

Let us say that, during training, we want to add a penalty to the cross-entropy loss according to some relationship between the two original classes (the expected one and the predicted one). For example (just a dummy example), if the predicted class name has more letters than the expected one, we want to add ‘some value’ as a regularization term in the cross-entropy loss.

For example, take the following values for a batch of 4, which are passed to the (possibly custom?) cross-entropy class:

y: tensor([1, 0, 2, 2], device='cuda:0')

y_hat: tensor([1, 0, 2, 1], device='cuda:0')

In this case, we would like to compare the original values "CULTURE" and "TRAVEL" and, since they differ in length, add some penalization factor; the exact value is not important at this point.

The fundamental question behind this example is: how can the value of a custom function be used in the cross-entropy loss if that function has to work on the original labels and return its value to the loss function as a regularization term?

I would appreciate advice on how to solve this kind of problem. Small code snippets would be more than welcome.

Thank you

Hi,

I am not sure I understand exactly what you want to do here. But if you want to add special penalties to your loss, you can simply add them before calling backward:

loss = crit(out, target)
final_loss = loss + penalty(out, target)
final_loss.backward()

Hi,

thank you for your reply and time. I appreciate it.

I think my example did not convey what I meant, so let me try to correct it. Consider a similar problem with some custom distance between labels; let us call it a custom Hamming distance (although in this example I’ll use the ‘plain’ Hamming distance).

Let’s say we have classes ‘ABC’, ‘CBA’, ‘BAC’. Also, let these classes have corresponding labels: 0,1,2.

For example, take the following values for a batch of 4, which are passed to the (possibly custom?) cross-entropy class:

y: tensor([1, 0, 2, 2], device='cuda:0')

y_hat: tensor([1, 0, 2, 1], device='cuda:0')

In this case, we would like to compare the original values "BAC" and "ABC" and, since the order of the characters in the strings differs, I would add some penalty, for example 0.9*2 (since the Hamming distance is 2). I would like that term to help the model learn better.

So, that function could be something like:

def my_loss(output, target):  # let's call it Hamming distance
    output = getKeysByValue(label_dict, output)[0]
    target = getKeysByValue(label_dict, target)[0]
    loss = torch.tensor(dt.hamming.normalized_similarity(output, target), dtype=torch.float)
    return loss
...
def li_regularizer(net, loss, y_pred, y):
    ...
    my_sum =torch.tensor(0.0, requires_grad=True)
    for i in range(batch_size):
        my_sum =torch.sum(my_loss(y[i],t_pred[i]).clone().detach())
    avg_hamm_dist = torch.tensor((my_sum/num_examples))
    return (avg_hamm_dist*0.9)

and I would call it during training:

loss = F.cross_entropy(y_pred, y)
loss = loss+li_regularizer(model, loss, y_pred, y)
loss.backward()

However, it does not work: a single epoch on a small amount of data takes about 1.5 h, and the penalty has no effect on learning. I am not sure that I have chosen the right path. I also get this warning:

<ipython-input-23-afb6d9389b03>:8: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  avg_hamm_dist = torch.tensor((my_sum/num_examples))

I appreciate any advice.

thx

I see a few things:

  • If you want a function to be differentiable with autograd, you have to use only torch Tensors and torch ops. If you extract values as Python numbers or NumPy arrays, it won’t work.
  • torch.tensor() creates a new Tensor from the raw data that is given to it. In particular, it does not propagate gradients, because it treats that data as plain numbers rather than as Tensors.
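
To make the two points above concrete, here is a minimal sketch of one possible way to get a penalty that autograd can see. It is not something proposed earlier in the thread: since the hard predicted label (an argmax) is not differentiable, the sketch swaps it for the expected Hamming distance under the softmax of the model output. The names `CLASSES`, `DIST`, `hamming_penalty` and `weight` are hypothetical, and the random `logits` only stand in for a real model output. The distance table is built once in plain Python (it is constant data), and everything that touches the model output stays in torch ops, which also removes the per-sample Python loop.

```python
import torch
import torch.nn.functional as F

# Class names from the example above; list index = label id.
CLASSES = ["ABC", "CBA", "BAC"]


def hamming(a: str, b: str) -> int:
    """Plain Hamming distance between two equal-length strings."""
    return sum(ca != cb for ca, cb in zip(a, b))


# Built once, outside the training loop. This table is constant data,
# so plain Python is fine here: no gradient has to flow through it.
DIST = torch.tensor(
    [[hamming(t, p) for p in CLASSES] for t in CLASSES], dtype=torch.float
)


def hamming_penalty(logits, target, weight=0.9):
    """Expected Hamming distance under the predicted class distribution.

    Assumption: softmax probabilities replace the hard prediction so that
    the penalty stays differentiable with respect to `logits`.
    """
    probs = F.softmax(logits, dim=1)        # (batch, num_classes)
    cost = DIST.to(logits.device)[target]   # one row of DIST per sample
    return weight * (probs * cost).sum(dim=1).mean()


logits = torch.randn(4, 3, requires_grad=True)  # stand-in for model output
y = torch.tensor([1, 0, 2, 2])

loss = F.cross_entropy(logits, y) + hamming_penalty(logits, y)
loss.backward()  # gradients reach `logits` through both terms

# For contrast: re-wrapping a value as raw data cuts it out of the graph.
detached = torch.tensor(loss.item())
```

After `backward()`, `logits.grad` is populated, whereas `detached` has no gradient history, which is the situation the original `li_regularizer` ends up in. Whether an expected-distance penalty actually helps the model is something to check empirically.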