Using torch.nn.L1Loss to do regularization

Hello!

I’m doing a text classification project. Apart from the cross-entropy loss, I also add one more regularization term to encourage the score the model gives to a word, score(w), to be close to the ideal score, r.
score(w) is between -1 and 1, r is either -1 or 1.

e.g. input: I had lunch.
score(‘lunch’)=0.01
r=1
I try to use L1 loss to encourage the score of ‘lunch’ to be 1.

Below is the code:

L1_loss = torch.nn.L1Loss(reduction='sum')    # 'sum' replaces the deprecated size_average=False
r = torch.tensor([r]).float().reshape(-1, 1)  # target, either 1.0 or -1.0, shape (1, 1)
loss = reg_strength * L1_loss(score(w), r)    # scaled L1 distance between score(w) and r
loss.backward(retain_graph=True)

Here are the values of score(w), r, and loss:
score(w): tensor(0.9046, grad_fn=<...>)
r: tensor([[1.]])
loss: tensor(0.0095, grad_fn=<...>)

score(w): tensor(0.8485, grad_fn=<...>)
r: tensor([[-1.]])
loss: tensor(0.1849, grad_fn=<...>)

Here is the question: although I tried to encourage some terms to be 1 and others to be -1, after training on 700 instances I found that almost all the terms are pushed towards 1. I’m not sure whether the code and this way of encouraging the scores are correct.
I would appreciate it if you could help!

Thank you!

Hi Iris!

I’m skeptical that this is a good idea.

First, CrossEntropyLoss should be enough to train what I imagine
your use case to be. (By the way, what is your use case?)

Second, CrossEntropyLoss and L1Loss are rather different in
character and I don’t see them as blending together very effectively.

You say that your “ideal” scores are -1 or 1. This sounds like you are
performing a binary classification problem. You can treat this as a
two-class, multi-class classification problem and use CrossEntropyLoss
(but it would be slightly preferable to use BCEWithLogitsLoss), but you
wouldn’t want to feed a “score” that runs from -1 to 1 to CrossEntropyLoss.

Where does score come from, and do you pass score or something else
to CrossEntropyLoss?

You could use L1Loss to train a classifier, but some version of cross
entropy is likely to work much better.

Also, what is the purpose of retain_graph = True? If this is so that you can
combine your “regular” loss with L1_loss, you’re better off just summing
them together and calling .backward() once:

total_loss = regular_loss + L1_loss
total_loss.backward()
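
For example, in terms of the names from your snippet (the stand-in tensors below are made up just for the sketch, and reg_strength = 0.1 is an assumed value; substitute your actual model outputs and labels):

import torch

prediction = torch.randn(1, 2, requires_grad=True)            # stand-in for the model’s class logits
label = torch.tensor([1])                                     # stand-in for the class label
ce_loss = torch.nn.functional.cross_entropy(prediction, label)

score_w = torch.rand(1, 1, requires_grad=True)                # stand-in for score(w)
r = torch.tensor([[1.0]])                                     # ideal score as a (1, 1) tensor
l1_term = 0.1 * torch.nn.L1Loss(reduction='sum')(score_w, r)  # reg_strength assumed to be 0.1

total_loss = ce_loss + l1_term
total_loss.backward()    # one backward pass covers both terms; no retain_graph needed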

I don’t see any obvious error that would encourage scores that should
have been near -1 to be near 1.

What fraction of your scores are positive vs. negative if you train just
with CrossEntropyLoss? What fraction if you train just with L1Loss?
Do you get almost all scores becoming positive only when you combine
the two losses together?

Best.

K. Frank

Thanks for your reply, Frank!

The use case is sentiment analysis, predicting a product review to be positive or negative.

Say the input is “This dress looks cute but uncomfortable” and the label is negative. I have the following objective function:
L=L(f(x),y)+L(score(w),r(w))

The first loss term is the cross entropy loss which encourages the model prediction to be close to the label.
The second loss term is to encourage score(w) to be close to r(w), the ideal score. score(w) is obtained from an explanation method (a post-hoc method), which outputs the feature importance of each token given the model, so that we know how the prediction model captures each token.
In the example “This dress looks cute but uncomfortable”, suppose score(‘cute’)=0.1 and score(‘uncomfortable’)=0.01. I want to use the second loss term to encourage score(‘cute’) to be 1, since it contributes to positive sentiment, and score(‘uncomfortable’) to be -1.

The reason I use retain_graph = True is that I currently call loss.backward() for each selected word (e.g. ‘cute’, ‘uncomfortable’) iteratively, so I do backward multiple times for one input text and therefore need to retain the graph.

As for the fraction of the scores, there are a lot more ideal scores r(w) equal to 1 than to -1, so this might be the reason why almost all the terms are encouraged to be 1.

Thank you!
Iris

Hi Iris!

As I understand your use case – more or less – you have a “sentence,”
x, made up of “words,” w. You have a “sentence” model, f (x), that
predicts the sentiment of the whole sentence, x. You also have a “word”
model, score (w), that predicts the sentiment of a given word, w.

(The word model might be complicated and depend on some of the
processing performed by the sentence model, but that’s okay.)

Your ground-truth labels – for both the sentiment of sentences and of
words – are “negative” vs. “positive.”

So I would look at this as a collection of binary classifications. You
classify your sentence, x, as “class-0” (negative) vs. “class-1” (positive).
You also classify each “important” word in the sentence, w, as “class-0”
vs. “class-1”.

You say you are using cross entropy for the sentence classifier. This
makes sense, but because you have a binary (two-class) classification
problem, it would be slightly preferable to structure it as a binary problem
and use BCEWithLogitsLoss as your loss criterion (if you are not already
doing so).

But your word classifier is also performing a binary classification. One
could use L1Loss (or MSELoss, etc.) as a loss criterion, but experience
shows that, as a general rule, cross entropy should be your first choice
for classification problems. So I would also use BCEWithLogitsLoss
as the loss criterion for your word classifier.

So for the sentence x, with w_i being the important words in x, I would
structure the total loss something like:

total_loss = binary_cross_entropy (f (x), y) + a * sum_i (binary_cross_entropy (score (w_i), r (w_i)))

where a is some sort of weighting / normalization constant. First, maybe
you would want to average over the w_i, rather than sum. So maybe
a should look something like 1 / N, where N is the number of words
in the specific sentence x. Second, what should the relative weight
between the whole-sentence loss and the word losses be? This relative
weight would also be incorporated into a.

When structured like this, you can use a single backward pass, which
you would perform by calling total_loss.backward(). This will then
compute the gradients for the parameters of the sentence model f
as well as for any parameters in the word model, score. If the result
of score depends on some of the processing performed in f, the
gradients from that dependency will flow back properly to the parameters
of f.
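
Here is a minimal sketch of what I mean (the tensors below are made-up stand-ins for the outputs of f (x) and score (w_i) and for your labels; adapt the names and the value of a to your own code):

import torch

criterion = torch.nn.BCEWithLogitsLoss()                   # used for both the sentence and the word terms

sentence_logit = torch.randn(1, requires_grad=True)        # stand-in for f(x): one raw logit for the sentence
y = torch.tensor([1.0])                                    # sentence label: 0.0 (negative) or 1.0 (positive)
word_logits = torch.randn(3, requires_grad=True)           # stand-in for score(w_i) of the important words
word_targets = torch.tensor([1.0, 0.0, 1.0])               # r(w_i), recoded from -1 / 1 to 0.0 / 1.0

a = 0.5                                                    # relative weight of the word term (to be tuned)

sentence_loss = criterion(sentence_logit, y)
word_loss = criterion(word_logits, word_targets)           # the default 'mean' reduction averages over the
                                                           # w_i, playing the role of the 1 / N factor above
total_loss = sentence_loss + a * word_loss
total_loss.backward()                                      # single backward pass for both terms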

You mention that many more of your ideal scores r (w) are 1 than -1. This could
well be why score (w) tends to predict mostly positive results. The technical
term for this situation is unbalanced data, meaning that you have significantly
more data samples in one class than in the other.

BCEWithLogitsLoss has a pos_weight constructor argument that
you can use to compensate for the class imbalance in your data. (Note,
if your sentence labels are also imbalanced – say, significantly more
positive-sentiment sentences than negative – you could also use
pos_weight in the sentence loss criterion.)
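
As a concrete example of setting pos_weight for the word loss (the counts below are made up; compute them from your own training data):

import torch

n_positive = 600   # words whose ideal score r(w) is 1 (hypothetical count)
n_negative = 100   # words whose ideal score r(w) is -1 (hypothetical count)

# pos_weight multiplies the positive-class term of the loss; with many more
# positive than negative labels this ratio is < 1, so positive words are
# down-weighted relative to negative ones.
pos_weight = torch.tensor([n_negative / n_positive])
word_criterion = torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight)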

If you go the route of using BCEWithLogitsLoss, you will want both
your sentence and word models to output “scores” that run from -inf
to inf (that would be converted to the probability of being in the “positive”
class by passing them through sigmoid(), a step that you don’t need
to do explicitly). These “scores” would typically be the output of a final
Linear layer of the model, and become the so-called input of
BCEWithLogitsLoss.

Your labels – the so-called target – should be (floating-point) 0.0 for
sentences and words labelled as having negative sentiment (that is, in
the “negative” class), and 1.0 for sentences and words that have positive
sentiment (that is, in the “positive” class). (Note, these target labels could
actually range from 0.0 to 1.0 if it makes sense in your use case to label
your data with the strength of the sentiment, rather than just being “negative”
or “positive.”)
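
To make the “logits in, 0.0 / 1.0 targets” convention concrete, here is a sketch
(the Linear head and the feature size are placeholders standing in for whatever
actually produces your word scores):

import torch

word_head = torch.nn.Linear(16, 1)            # hypothetical final layer producing one raw score per word
features = torch.randn(4, 16)                 # hypothetical per-word features

logits = word_head(features).squeeze(1)       # raw scores running from -inf to inf
targets = torch.tensor([1.0, 0.0, 1.0, 1.0])  # 0.0 = negative sentiment, 1.0 = positive sentiment

criterion = torch.nn.BCEWithLogitsLoss()
loss = criterion(logits, targets)             # sigmoid() is applied internally by the loss
probs = torch.sigmoid(logits)                 # only if you explicitly want probabilities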

Good luck!

K. Frank