In fact, when it’s the default value (None), the weight parameter is a vector of ones, that is, weight = (1, 1, …, 1). The length of the vector equals the batch size.

Assume the size of batch is 4, and I set the weight = (w1, w2, w3, w4).
Then the value of loss function:
Loss = (loss_sample_1 * w1 + loss_sample_2 * w2 + loss_sample_3 * w3 + loss_sample_4 * w4) / (w1 + w2 + w3 + w4)
Here, each loss_sample is also a vector whose length is the number of output nodes.

Actually, it adjusts the size of the learning rate for each sample.

So the larger the value of the weight parameter, the greater the impact on the Loss.

It is generally used when the size of various data sets is not balanced.

Is this the right understanding?
I tried to see the source code, but I could not find it.

The weight parameter usually applies per label, not per batch sample!
It will apply the weight depending on the ground truth label for the given sample.
It will indeed rescale the gradients and is very useful if your dataset is unbalanced for example.

In fact, it just adjusts the size of the label.
Maybe I can adjust the size of the label directly, instead of using this parameter?
Is this understanding right?

Yes, this is correct, but specifically for binary_cross_entropy
(and binary_cross_entropy_with_logits). As Alban pointed
out, it is not true for CrossEntropyLoss.

(As an aside, my guess is that when the weight parameter is None, there isn’t actually a weight vector of ones. I assume
that no multiplication by (default) weights occurs. Of course,
not multiplying by anything is equivalent to multiplying by one,
so the effect of not multiplying by anything and multiplying by
a default weight vector of ones is the same.)
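
A quick way to check the equivalence described above, as a minimal sketch (the probabilities and targets are made-up values, and this only shows that the results agree, not how PyTorch handles weight=None internally):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
probs = torch.rand(4)                          # made-up predicted probabilities
targets = torch.tensor([0.0, 1.0, 1.0, 0.0])   # made-up binary targets

# weight=None (the default) versus an explicit weight vector of ones
loss_default = F.binary_cross_entropy(probs, targets)
loss_ones = F.binary_cross_entropy(probs, targets, weight=torch.ones(4))

print(torch.allclose(loss_default, loss_ones))
```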

If I understand what you’re saying, this isn’t quite right.

I’m assuming that by loss_sample you mean the (batch of
four) output vectors of your network. (These would often be
called predictions.) The loss is a measure of how much
your predictions differ from your “ground truth,” that is, your targets. I will use the term predict rather than loss_sample.

Then your weighted loss is:

loss = loss (predict_1, target_1) * w1 + loss (predict_2, target_2) * w2 + loss (predict_3, target_3) * w3 + loss (predict_4, target_4) * w4
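
The weighted sum above can be checked with a small sketch. The predictions, targets, and weights are made-up values. Note that the plain sum corresponds to reduction="sum"; with the default reduction="mean", binary_cross_entropy divides the weighted sum by the number of elements (not by the sum of the weights).

```python
import torch
import torch.nn.functional as F

probs = torch.tensor([0.9, 0.2, 0.6, 0.4])     # predict_1 .. predict_4 (probabilities)
targets = torch.tensor([1.0, 0.0, 0.0, 1.0])   # target_1 .. target_4
w = torch.tensor([1.0, 2.0, 0.5, 3.0])         # hypothetical sample weights w1 .. w4

# Unweighted per-sample losses, then the weighted sum computed by hand
per_sample = F.binary_cross_entropy(probs, targets, reduction="none")
manual_sum = (w * per_sample).sum()

# The same quantity computed by passing weight= to the loss function
weighted_sum = F.binary_cross_entropy(probs, targets, weight=w, reduction="sum")
weighted_mean = F.binary_cross_entropy(probs, targets, weight=w)  # reduction="mean"

print(torch.allclose(weighted_sum, manual_sum))
print(torch.allclose(weighted_mean, manual_sum / probs.numel()))
```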

Yes, you could say this (but I don’t think I would phrase it this way).

Yes.

Well, this would depend on what you mean by “various data sets.”

If you are training a classifier, and your training dataset has
many more samples for one class than another – that is, it is unbalanced – then yes, weights are often used to increase
the contribution of underrepresented classes to the loss (and
hence to the gradients used for training).

Basically yes, for weights that are applied to samples, such as
with binary_cross_entropy (but not exactly right for class
weights, such as with CrossEntropyLoss).

[quote]
I tried to see the source code, but I could not find it.
[/quote]

Yes. CrossEntropyLoss (and the function form, cross_entropy,
you linked to) takes class weights (not sample weights).

Yes, because having an “output node” of size 100 implies that
you are building a 100-class classifier, so your class weights are
then a vector of 100 weights.

No. When you say label_vector, I assume you mean the
known, ground truth class labels for your training data. These
are what are often called the targets. For CrossEntropyLoss,
they have to be integer class labels in [0, nClass - 1].
If you multiply them by weights, they won’t be class labels
anymore.

Let predict_1 be the first of four prediction vectors (output
of your network) for your example with a batch size of four. target_1 is the first class label in your batch of four. You say
that your output is a vector of length 100, so this means that
you have 100 classes, so your class labels are in [0, 99],
and your weight vector is of length 100.

Then, for example, weight[17] is the weight for class 17.

Now, because these are class weights instead of sample weights,
we have for your example with a batch size of four:

loss = weight[target_1] * loss (predict_1, target_1) + weight[target_2] * loss (predict_2, target_2) + weight[target_3] * loss (predict_3, target_3) + weight[target_4] * loss (predict_4, target_4)

Remember, your targets are class-label integers in [0, 99]. So
they are valid indices into the weight vector of length 100.
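
Here is a minimal sketch of the class-weight formula above, using made-up logits, targets, and class weights. The plain weighted sum matches reduction="sum". With the default reduction="mean", cross_entropy additionally divides by the sum of the weights that were applied, that is, by weight[target_1] + ... + weight[target_4].

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_class = 100
logits = torch.randn(4, n_class)            # four prediction vectors (raw scores)
targets = torch.tensor([17, 3, 99, 17])     # integer class labels in [0, 99]
weight = torch.rand(n_class) + 0.5          # hypothetical class weights, length 100

# Unweighted per-sample losses and the class weight picked out by each target
per_sample = F.cross_entropy(logits, targets, reduction="none")
w_per_sample = weight[targets]              # weight[target_i] for each sample
manual_sum = (w_per_sample * per_sample).sum()

loss_sum = F.cross_entropy(logits, targets, weight=weight, reduction="sum")
loss_mean = F.cross_entropy(logits, targets, weight=weight)  # reduction="mean"

print(torch.allclose(loss_sum, manual_sum))
print(torch.allclose(loss_mean, manual_sum / w_per_sample.sum()))
```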

Yes. I would probably phrase it as “it adjusts the contribution
to the loss of those samples with the specified label.”

You can’t. Again, the labels themselves are integer class labels.
They don’t really have sizes, and you can’t change them (without
breaking things).

(If, for some reason, you don’t want to use the weight parameter
to accomplish this reweighting, you would have to write your own
custom loss function.)

I don’t know why the weight parameter is used differently in binary_cross_entropy and CrossEntropyLoss. It is
somewhat inconsistent.

There are legitimate use cases for both sample weights and
class weights.

I tend to think the class-weight use case is more common,
even for a binary-classification problem. You can certainly
have unbalanced training data in the binary case, and it
would be convenient to be able to pass a weight parameter
to the binary loss function that reweights your class-0 samples
relative to your class-1 samples. (You can do this, of course,
using sample weights, but for every batch, you have to construct
your own sample-weight vector (of length batch-size) based on
the class (“0” or “1”) of each sample. This is straightforward,
but a bit of a nuisance.)
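
As a sketch of the per-batch construction just described: the helper name weighted_bce and the class weights (1.0 for class 0, 4.0 for class 1) are hypothetical, chosen only for illustration.

```python
import torch
import torch.nn.functional as F

# Hypothetical class weights: index 0 -> class "0", index 1 -> class "1"
class_weights = torch.tensor([1.0, 4.0])

def weighted_bce(probs, targets):
    """Binary cross entropy with sample weights derived from each sample's class."""
    # Targets are 0.0 / 1.0 floats; cast to integers to index the class weights
    sample_weights = class_weights[targets.long()]
    return F.binary_cross_entropy(probs, targets, weight=sample_weights)

probs = torch.tensor([0.8, 0.3, 0.6])       # made-up predicted probabilities
targets = torch.tensor([1.0, 0.0, 1.0])     # made-up binary targets
loss = weighted_bce(probs, targets)
print(loss)
```

The sample-weight vector is rebuilt inside the helper on every call, so it works for any batch size.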

Note that sample weights are more general than class weights,
because, as outlined above, you can use sample weights to
weight samples based on their class. So this mechanism lets
you use class weights with BCELoss. In contrast, there is
no way to get sample weights using CrossEntropyLoss
(other than writing a custom loss function).
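
One possible shape for such a custom loss, as a sketch only: compute the unreduced per-sample losses with reduction="none" and apply the sample weights yourself. The function name and the choice to normalize by the sum of the weights are assumptions, not something prescribed by PyTorch.

```python
import torch
import torch.nn.functional as F

def sample_weighted_cross_entropy(logits, targets, sample_weights):
    """Cross entropy with one weight per sample (hypothetical helper)."""
    # One unweighted loss value per sample in the batch
    per_sample = F.cross_entropy(logits, targets, reduction="none")
    # Assumption: normalize by the total weight, so that all-ones weights
    # reproduce the ordinary mean-reduced loss
    return (sample_weights * per_sample).sum() / sample_weights.sum()

torch.manual_seed(0)
logits = torch.randn(4, 5)                        # made-up batch: 4 samples, 5 classes
targets = torch.tensor([0, 4, 2, 2])
sample_weights = torch.tensor([1.0, 0.5, 2.0, 1.0])

loss = sample_weighted_cross_entropy(logits, targets, sample_weights)
print(loss)
```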

Class weights (if that’s what you want) are more convenient.
You only need one weight vector that you can reuse for all
of your batches.

You can pass in a weight vector when constructing a CrossEntropyLoss loss-function object. This is convenient
and makes sense because you will typically use the same
class-weight vector for all of your batches.
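
For instance, a minimal sketch with a made-up three-class problem and hypothetical class weights. The same loss object is reused for batches of different sizes, because the weight vector has one entry per class rather than one per sample.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_class = 3
class_weights = torch.tensor([1.0, 2.0, 0.5])    # hypothetical, one weight per class

# Constructed once, reused for every batch
criterion = nn.CrossEntropyLoss(weight=class_weights)

# Two made-up batches of different sizes: (logits, integer class labels)
batch_a = (torch.randn(4, n_class), torch.tensor([0, 1, 2, 1]))
batch_b = (torch.randn(2, n_class), torch.tensor([2, 0]))

loss_a = criterion(*batch_a)
loss_b = criterion(*batch_b)
print(loss_a, loss_b)
```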

You can also pass in a weight vector when constructing a BCELoss (or BCEWithLogitsLoss). This is less likely to
make sense, because you are not so likely to want to use
the same sample-weight vector for each batch.
(And you can’t if the batch sizes are not the same.) So
if you want to use weights for binary classification, you
will generally have to use the function form of the loss, binary_cross_entropy (or binary_cross_entropy_with_logits).