# Using torch.nn.L1Loss to do regularization

Hello!

I’m doing a text classification project. In addition to the cross-entropy loss, I add a regularization term that encourages the word scores given by the model, score(w), to be close to the ideal scores, r.
score(w) is between -1 and 1, and r is either -1 or 1.

score(‘lunch’)=0.01
r=1
I try to use L1 loss to encourage the score of ‘lunch’ to be 1.

Below is the code:

L1_loss=torch.nn.L1Loss(size_average=False)
r=torch.tensor([r]).float().reshape(-1,1)
loss=reg_strength*L1_loss(score(w),r)
loss.backward(retain_graph=True)

Here are the values of score(w) and r
r: tensor([[1.]])

r: tensor([[-1.]])

Here is the question: although I tried to encourage some terms to be 1 and others to be -1, after training on 700 instances I found that almost all the terms are pushed toward 1. I’m not sure whether the code, or this way of encouraging the scores, is correct.
I would appreciate it if you could help!

Thank you!

Hi Iris!

I’m skeptical that this is a good idea.

First, `CrossEntropyLoss` should be enough to train what I imagine
your use case to be. (By the way, what is your use case?)

Second, `CrossEntropyLoss` and `L1Loss` are rather different in
character and I don’t see them as blending together very effectively.

You say that your “ideal” scores are -1 or 1. This sounds like you are
performing a binary classification problem. You can treat this as a
two-class, multi-class classification problem and use `CrossEntropyLoss`
(but it would be slightly preferable to use `BCEWithLogitsLoss`), but you
wouldn’t want to feed a “score” that runs from -1 to 1 to `CrossEntropyLoss`.

Where does `score` come from, and do you pass `score` or something else
to `CrossEntropyLoss`?

You could use `L1Loss` to train a classifier, but some version of cross
entropy is likely to work much better.

Also, what is the purpose of `retain_graph = True`? If this is so you can
combine your “regular” loss with `L1_loss`, you’re better off just summing
them together and calling `.backward()` once:

```
total_loss = regular_loss + L1_loss
total_loss.backward()
```
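As a minimal sketch of that pattern, here is a toy version in which both terms share one graph and one backward pass. The `Linear` model, the tensor shapes, the way `score` is derived, and the `reg_strength` value are all assumptions made up for illustration, not your actual setup. Note also that `size_average=False` is deprecated; `reduction='sum'` is the current equivalent.

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-ins: a tiny classifier and a made-up "score" in (-1, 1).
model = torch.nn.Linear(4, 2)
x = torch.randn(3, 4)                      # batch of 3 inputs
y = torch.tensor([0, 1, 1])                # integer class labels
r = torch.tensor([[1.0], [-1.0], [1.0]])   # ideal scores, shape (3, 1)

logits = model(x)
# Assumed score: tanh of the logit difference, so values lie in (-1, 1).
score = torch.tanh(logits[:, 1:] - logits[:, :1])   # shape (3, 1)

ce_loss = torch.nn.CrossEntropyLoss()
l1_loss = torch.nn.L1Loss(reduction='sum')   # replaces size_average=False
reg_strength = 0.1                           # assumed regularization weight

total_loss = ce_loss(logits, y) + reg_strength * l1_loss(score, r)
total_loss.backward()   # one backward pass, no retain_graph needed
```

After this single call, `model.weight.grad` holds the gradient of the combined loss.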

I don’t see any obvious error that would encourage scores that should
have been near -1 to be near 1.

What fraction of your scores are positive vs. negative if you train just
with `CrossEntropyLoss`? What fraction if you train just with `L1Loss`?
Do you get almost all scores becoming positive only when you combine
the two losses together?

Best.

K. Frank

The use case is sentiment analysis, predicting a product review to be positive or negative.

Say the input is “This dress looks cute but uncomfortable”, and the label is negative. I have the following objective function:
L=L(f(x),y)+L(score(w),r(w))

The first loss term is the cross entropy loss which encourages the model prediction to be close to the label.
The second loss term encourages score(w) to be close to r(w), the ideal score. score(w) is obtained from an explanation method (a post-hoc method), which outputs the feature importance of each token given the model, so that we know how the prediction model captures each token.
In the example “This dress looks cute but uncomfortable”, suppose score(‘cute’)=0.1 and score(‘uncomfortable’)=0.01. I want to use the second loss term to encourage score(‘cute’) to be 1, since it contributes to positive sentiment, and score(‘uncomfortable’) to be -1.

I use retain_graph = True because I currently call loss.backward() separately for each selected word (e.g. ‘cute’, ‘uncomfortable’), so backward runs multiple times for one input text and the graph has to be retained.

As for the fraction of the scores, there are many more ideal scores r(w) equal to 1 than to -1, which might be why almost all the terms are pushed toward 1.

Thank you!
Iris

Hi Iris!

As I understand your use case – more or less – you have a “sentence,”
`x`, made up of “words,” `w`. You have a “sentence” model, `f (x)`, that
predicts the sentiment of the whole sentence, `x`. You also have a “word”
model, `score (w)`, that predicts the sentiment of a given word, `w`.

(The word model might be complicated and depend on some of the
processing performed by the sentence model, but that’s okay.)

Your ground-truth labels – for both the sentiment of sentences and of
words – are “negative” vs. “positive.”

So I would look at this as a collection of binary classifications. You
classify your sentence, `x`, as “class-0” (negative) vs. “class-1” (positive).
You also classify each “important” word in the sentence, `w`, as “class-0”
vs. “class-1”.

You say you are using cross entropy for the sentence classifier. This
makes sense, but because you have a binary (two-class) classification
problem, it would be slightly preferable to structure it as a binary problem
and use `BCEWithLogitsLoss` as your loss criterion (if you are not already
doing so).

But your word classifier is also performing a binary classification. One
could use `L1Loss` (or `MSELoss`, etc.) as a loss criterion, but experience
shows that, as a general rule, cross entropy should be your first choice
for classification problems. So I would also use `BCEWithLogitsLoss`
as the loss criterion for your word classifier.

So for the sentence `x`, with `w_i` being the important words in `x`, I would
structure the total loss something like:

```
total_loss = binary_cross_entropy (f (x), y) + a * sum_i (binary_cross_entropy (score (w_i), r (w_i)))
```

where `a` is some sort of weighting / normalization constant. First, maybe
you would want to average over the `w_i`, rather than sum. So maybe
`a` should look something like `1 / N`, where `N` is the number of important
words in the specific sentence `x`. Second, what should the relative weight
between the whole-sentence loss and the word losses be? This relative
weight would also be incorporated into `a`.

When structured like this, you can use a single backward pass, which
you would perform by calling `total_loss.backward()`. This will then
compute the gradients for the parameters of the sentence model `f`
as well as for any parameters in the word model, `score`. If the result
of `score` depends on some of the processing performed in `f`, the
gradients from that dependency will flow back properly to the parameters
of `f`.
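To make the single-backward structure concrete, here is a rough sketch. The shared `encoder`, the two `Linear` heads, the random word features, and the `word_weight` value are hypothetical placeholders, and `score` here is a trainable head rather than your post-hoc explanation method, so treat it only as an illustration of how the gradients combine.

```python
import torch

torch.manual_seed(0)

# Hypothetical shared encoder with two heads: one per sentence, one per word.
encoder = torch.nn.Linear(8, 8)
sentence_head = torch.nn.Linear(8, 1)
word_head = torch.nn.Linear(8, 1)

word_features = torch.randn(5, 8)             # made-up features for 5 important words w_i
y = torch.tensor([0.0])                       # sentence label: negative
r = torch.tensor([1.0, 0.0, 1.0, 1.0, 0.0])   # word labels as 0.0 / 1.0

hidden = torch.relu(encoder(word_features))
sentence_logit = sentence_head(hidden.mean(dim=0))   # shape (1,)
word_logits = word_head(hidden).squeeze(1)           # shape (5,)

# The default reduction='mean' averages over the w_i, playing the role of 1 / N.
bce = torch.nn.BCEWithLogitsLoss()
word_weight = 0.5                             # assumed relative weight of the word term

total_loss = bce(sentence_logit, y) + word_weight * bce(word_logits, r)
total_loss.backward()   # gradients reach both heads and the shared encoder
```

Because both heads read from `hidden`, the encoder receives gradient contributions from the sentence term and the word term in the same pass.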

You mention that many more of your ideal scores `r (w)` are 1 than -1.
This could well be why `score (w)` tends to predict mostly positive results.
The technical term for this situation is unbalanced data, meaning that
you have significantly more data samples in one class than in the other.

`BCEWithLogitsLoss` has a `pos_weight` constructor argument that
you can use to compensate for the class imbalance in your data. (Note,
if your sentence labels are also imbalanced – say, significantly more
positive-sentiment sentences than negative – you could also use
`pos_weight` in the sentence loss criterion.)
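A common choice is to set `pos_weight` to the ratio of negative to positive samples. The sketch below uses made-up labels and placeholder logits just to show the mechanics; in practice you would compute the ratio over your whole training set.

```python
import torch

# Made-up word labels with a 4:1 positive-to-negative imbalance.
targets = torch.tensor([1.0, 1.0, 1.0, 1.0, 0.0])
logits = torch.zeros(5)   # placeholder model outputs

num_pos = targets.sum()
num_neg = targets.numel() - num_pos
# Ratio negatives / positives; a value below 1 down-weights the
# over-represented positive class.
pos_weight = (num_neg / num_pos).reshape(1)

criterion = torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight)
loss = criterion(logits, targets)
```

With this weighting, the four positive samples together contribute as much to the loss as the single negative one.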

If you go the route of using `BCEWithLogitsLoss`, you will want both
your sentence and word models to output “scores” that run from `-inf`
to `inf` (that would be converted to the probability of being in the “positive”
class by passing them through `sigmoid()`, a step that you don’t need
to do explicitly). These “scores” would typically be the output of a final
`Linear` layer of the model, and become the so-called `input` of
`BCEWithLogitsLoss`.

Your labels – the so-called `target` – should be (floating-point) `0.0` for
sentences and words labelled as having negative sentiment (that is, in
the “negative” class), and `1.0` for sentences and words that have positive
sentiment (that is, in the “positive” class). (Note, these `target` labels could
actually range from `0.0` to `1.0` if it makes sense in your use case to label
your data with the strength of the sentiment, rather than just being “negative”
or “positive.”)
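Since your current ideal scores are -1 or 1, they would need to be remapped to 0.0 or 1.0 first. A small sketch, with made-up logits, that also checks that `BCEWithLogitsLoss` matches applying `sigmoid()` by hand:

```python
import torch

r = torch.tensor([1.0, -1.0, 1.0])        # ideal scores as -1 / 1
targets = (r + 1) / 2                     # remapped to 0.0 / 1.0

logits = torch.tensor([2.0, -0.5, 0.1])   # made-up unbounded model outputs

with_logits = torch.nn.BCEWithLogitsLoss()(logits, targets)
# Equivalent value computed by applying sigmoid explicitly and using BCELoss
# (shown only for comparison; the fused version is more numerically stable).
explicit = torch.nn.BCELoss()(torch.sigmoid(logits), targets)
```

The two loss values agree up to floating-point error, which is why the explicit `sigmoid()` step is unnecessary.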

Good luck!

K. Frank