"pos_weight" and "weight" parameters in BCEWithLogitsLoss

Hi everyone,

I am confused about the “pos_weight” and “weight” parameters in BCEWithLogitsLoss. I have read all the relevant discussions I could find, but I still do not understand them completely.

Imagine that I have a multi-class, multi-label classification problem; my imbalanced one-hot coded dataset includes 1000 images with 4 labels with the following frequencies: class 0: 600, class 1: 550, class 2: 200, class 3: 100.
As I said, the targets are in a one-hot coded structure.
For instance, the target [0, 1, 1, 0] means that classes 1 and 2 are present in the corresponding image.

Well, in order to calculate the BCEWithLogitsLoss concerning the data imbalance, one way is that suggested in this post: Multi-Label, Multi-Class class imbalance - #2 by ptrblck. Here, we calculate the class weights by inverting the frequencies of each class, i.e., the class weight tensor in my example would be: torch.tensor ([1/600, 1/550, 1/200, 1/100]). After that, the class weight tensor will be multiplied by the unreduced loss and the final loss would be the mean of this tensor.

However, as far as I know, the pos_weight parameter of BCEWithLogitsLoss could also be used in this case. Here is my question:
To my knowledge, the two tensors (the class weight tensor in the previous paragraph and the pos_weight tensor) are totally different. For each class, the number of positive and negative samples should be counted, and num_negative / num_positive would be the pos_weight for that particular class. In my example, the pos_weight of class 0 would be (1000-600)/600 = 0.67, the pos_weight of class 1 would be (1000-550)/550 = 0.82, the pos_weight of class 2 would be (1000-200)/200 = 4, and the pos_weight of class 3 would be (1000-100)/100 = 9. Thus, the pos_weight tensor would be torch.tensor ([0.67, 0.82, 4., 9.]). Is this the right way to calculate the pos_weight tensor? If the answer is yes, I think that the previous method (calculating the class weights and multiplying them with the unreduced loss) would be more convenient for a dataset with a large number of labels, since we only need to invert the frequencies. Am I right?
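For reference, the negative-to-positive ratio described above can be computed directly from the label matrix. This is only a minimal sketch: the helper name pos_weight_from_targets and the toy targets tensor are made up for illustration, and the clamp that guards against classes with no positives is an added assumption, not something from the post.

```python
import torch


def pos_weight_from_targets(targets: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: per-class num_negative / num_positive.

    targets: float tensor of 0/1 labels with shape [num_samples, num_classes].
    """
    num_samples = targets.shape[0]
    pos_counts = targets.sum(dim=0)          # positives per class
    neg_counts = num_samples - pos_counts    # negatives per class
    # Assumption: clamp so a class with zero positives does not divide by zero.
    return neg_counts / pos_counts.clamp(min=1)


# Toy label matrix that reproduces the class counts from the post
# (600, 550, 200, and 100 positives out of 1000 samples).
counts = [600, 550, 200, 100]
targets = torch.zeros(1000, len(counts))
for class_idx, n_pos in enumerate(counts):
    targets[:n_pos, class_idx] = 1.0

pos_weight = pos_weight_from_targets(targets)
print(pos_weight)

# A tensor of shape [num_classes] broadcasts against [batch, num_classes] inputs.
criterion = torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight)
```

Written this way, the per-class computation is a single reduction over the sample dimension, so it is not noticeably more work than inverting the frequencies, even with many labels.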

My other question is about the weight parameter of BCEWithLogitsLoss. As shown in the formula, the weight parameter is a tensor that multiplies the whole loss, not merely the positive-target term (as opposed to pos_weight). How is the weight tensor different from the class weights tensor, considering that the class weights tensor is similarly multiplied by the whole loss? The documentation says that the weight tensor is of size nbatch, and I do not understand what its function is.

I deeply appreciate your consideration.

Hi Ali!

Let me answer a couple of your specific questions first and then explain
how I look at it.

In this case the current (1.9.0) documentation for BCEWithLogitsLoss
is wrong (or at least misleading). Quoting:

weight (Tensor, optional) – a manual rescaling weight given to the loss of each batch element. If given, has to be a Tensor of size nbatch.

On the contrary, weight can have other shapes – for example, it can
have the same shape as input. (As near as I can tell, the precise
requirement is that the shape of weight and input need to be
broadcastable.)

The documentation for the functional version,
binary_cross_entropy_with_logits(), comes closer to being correct:

weight (Tensor, optional) – a manual rescaling weight if provided it’s repeated to match input tensor shape

I think that what you are calling the “class weights tensor” is the
weight tensor. (Any imagined difference would arise from how
you interpret them based on your use case.)

Please note that BCEWithLogitsLoss takes four tensor arguments:
weight, pos_weight, input, and target. The first two are passed
in when BCEWithLogitsLoss's constructor is called to instantiate a
loss-function object, and the second two are passed in when the
resulting loss-function object is called. There is no separate “class
weights” argument.

Now to explain my understanding:

As far as I can tell, weight, pos_weight, input, and target need
only be broadcastable to one another. To simplify the discussion, let’s
assume that they are all of the same shape.

BCEWithLogitsLoss doesn’t make any distinction, for example,
between labels / predictions for a specific class and specific samples
within a batch. It simply applies the BCEWithLogitsLoss formula
on an element-wise basis, including the weight and pos_weight
weightings, also on an element-wise basis. This produces a tensor
of element-wise loss values of the same shape as input (and the
other arguments) that is then reduced (or not) according to the value
of reduction.

Consider the following:

>>> import torch
>>> torch.__version__
>>> _ = torch.manual_seed (2021)
>>> nBatch = 2
>>> nClass = 3
>>> nSomethingElse = 5
>>> weight = torch.rand (nBatch, nClass, nSomethingElse)
>>> pos_weight = torch.rand (nBatch, nClass, nSomethingElse)
>>> input = torch.randn (nBatch, nClass, nSomethingElse)
>>> target = torch.rand (nBatch, nClass, nSomethingElse)
>>> torch.nn.BCEWithLogitsLoss (weight = weight, pos_weight = pos_weight) (input, target)
>>> torch.nn.BCEWithLogitsLoss (weight = weight.flatten(), pos_weight = pos_weight.flatten()) (input.flatten(), target.flatten())

BCEWithLogitsLoss doesn’t care about any particular dimensions or
assign them meanings like “batch” or “class” or “height” or “width” – it
simply performs the (weighted) element-wise loss computation and
then reduces the result.
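To make the element-wise picture concrete, here is a rough sketch that writes the weighted formula out by hand and compares it with the built-in loss. It assumes the formula given in the BCEWithLogitsLoss documentation; the variable names are arbitrary.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
shape = (2, 3, 5)  # no dimension has a special meaning here
weight = torch.rand(shape)
pos_weight = torch.rand(shape)
logits = torch.randn(shape)
target = torch.rand(shape)

# logsigmoid(x) = log(sigmoid(x)) and logsigmoid(-x) = log(1 - sigmoid(x)).
# Element-wise loss: -weight * [pos_weight * y * log(s) + (1 - y) * log(1 - s)]
manual = -weight * (
    pos_weight * target * F.logsigmoid(logits)
    + (1.0 - target) * F.logsigmoid(-logits)
)

builtin = torch.nn.BCEWithLogitsLoss(
    weight=weight, pos_weight=pos_weight, reduction="none"
)(logits, target)

print(torch.allclose(manual, builtin, atol=1e-6))
```

With the default reduction = 'mean', the built-in loss is simply the mean of this element-wise tensor.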

Where does the notion of a “class weights tensor” come from?
Consider a use case where we are performing a multi-label,
nClass-class loss calculation for a batch of nBatch samples:

Let input and target both have shape [nBatch, nClass]. If weight
has shape [nClass] it will be broadcast to match the shape of input,
and, indeed, the elements of weight will be class weights in the loss
calculation – but not because BCEWithLogitsLoss knows or cares
about what you might interpret as a “class” dimension. Rather, the
elements of weight become class weights just because that’s how
the tensor elements line up in the element-wise computation after
broadcasting.

Note, if you modify this example to try to pass in a 1d tensor of sample
weights (of shape [nBatch]), as suggested by the documentation quoted
above, it won’t work.


>>> import torch
>>> torch.__version__
>>> _ = torch.manual_seed (2021)
>>> nBatch = 2
>>> nClass = 3
>>> sample_weights = torch.rand (nBatch)
>>> class_weights = torch.rand (nClass)
>>> input = torch.randn (nBatch, nClass)
>>> target = torch.rand (nBatch, nClass)
>>> torch.nn.BCEWithLogitsLoss (weight = class_weights) (input, target)
>>> torch.nn.BCEWithLogitsLoss (weight = sample_weights) (input, target)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<path_to_pytorch>\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "<path_to_pytorch>\torch\nn\modules\loss.py", line 716, in forward
  File "<path_to_pytorch>\torch\nn\functional.py", line 2960, in binary_cross_entropy_with_logits
    return torch.binary_cross_entropy_with_logits(input, target, weight, pos_weight, reduction_enum)
RuntimeError: The size of tensor a (3) must match the size of tensor b (2) at non-singleton dimension 1

If you want sample weights (rather than class weights) in this use
case, you would have to unsqueeze() sample_weights so that
broadcasting lines the weights up the way you want:

>>> torch.nn.BCEWithLogitsLoss (weight = sample_weights.unsqueeze (1)) (input, target)
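As a sanity check on the unsqueeze() approach, the sketch below compares it with weighting each row of the unreduced loss by hand. The tensors are random placeholders, and the [nBatch, nClass] layout is the same assumption as in the example above.

```python
import torch

torch.manual_seed(0)
n_batch, n_class = 2, 3
sample_weights = torch.rand(n_batch)
logits = torch.randn(n_batch, n_class)
target = torch.rand(n_batch, n_class)

# Unreduced loss has shape [n_batch, n_class].
unweighted = torch.nn.BCEWithLogitsLoss(reduction="none")(logits, target)

# Scale every element in row i by sample_weights[i], then average.
manual = (unweighted * sample_weights[:, None]).mean()

# Shape [n_batch, 1] broadcasts across the class dimension.
builtin = torch.nn.BCEWithLogitsLoss(weight=sample_weights.unsqueeze(1))(
    logits, target
)

print(torch.allclose(manual, builtin))
```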

One last clarifying comment:

Although we often use BCEWithLogitsLoss with target values
(ground-truth labels) that are binary no-yes labels (expressed as
0.0-1.0 floating-point numbers), BCEWithLogitsLoss is more
general and accepts a probabilistic target whose elements are
floating-point values that run from 0.0 to 1.0 and represent the
probability that the sample in question is in class-“1”.

In this more general case we don’t have samples that are purely
“negative” or “positive,” so, strictly speaking, pos_weight doesn’t
weight the “positive” samples. Rather, it weights the “positive” part of
the binary-cross-entropy formula used for each individual element-wise
loss computation.
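A small illustration of that point, as a sketch with made-up numbers: for a single element with a soft target of 0.5, pos_weight scales only the target * log(sigmoid(input)) term and leaves the other term untouched.

```python
import torch
import torch.nn.functional as F

logit = torch.tensor([0.3])
target = torch.tensor([0.5])      # probabilistic ("soft") label
pos_weight = torch.tensor([4.0])

# The two parts of the binary-cross-entropy formula for this element.
positive_part = -target * F.logsigmoid(logit)
negative_part = -(1.0 - target) * F.logsigmoid(-logit)

plain = F.binary_cross_entropy_with_logits(logit, target)
weighted = F.binary_cross_entropy_with_logits(
    logit, target, pos_weight=pos_weight
)

# Only the "positive" part picks up the factor of pos_weight.
print(torch.allclose(plain, (positive_part + negative_part).mean()))
print(torch.allclose(weighted, (pos_weight * positive_part + negative_part).mean()))
```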

An aside about terminology: This is not “one-hot” encoding (and, as a
rule of thumb, there’s never really any reason to use one-hot encoding
with pytorch). You have a multi-label use case and your sample labels
are “multi-hot encoded,” if you will.

The term “one-hot encoding” is often used imprecisely, but doing so
can be quite misleading. A one-hot encoded single-label class label
is a vector where exactly one element is 1 and all the others are 0.


K. Frank


Hi @KFrank,

I am deeply grateful for your kind and clear responses and sincerely appreciate your help.

Your comments clarified most of my problems. I understand now that as long as the pos_weight and weight tensors are broadcastable to the inputs and targets, there is no stricter requirement on their size. What the PyTorch documentation says about the size of the weight parameter does indeed seem to be wrong!

So, as you said, the tensor given to the weight parameter is the same as class_weights tensor in my case. Actually, final_loss1 and loss2 are the same here:

import torch

num_batch = 2
num_class = 3
# targets should lie in [0, 1], so use rand() rather than randn()
targets = torch.rand (num_batch, num_class)
inputs = torch.randn (num_batch, num_class)
class_weights = torch.tensor ([1., 2., 3.])
criterion1 = torch.nn.BCEWithLogitsLoss (reduction = 'none')
criterion2 = torch.nn.BCEWithLogitsLoss (weight = class_weights)

# weight the unreduced loss by hand, then take the mean
loss1 = criterion1 (inputs, targets)
final_loss1 = (loss1 * class_weights).mean ()

# let BCEWithLogitsLoss apply the same weights
loss2 = criterion2 (inputs, targets)
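The claimed equality can be checked numerically. The following is a self-contained sketch of the same comparison, assuming targets drawn from [0, 1] and the default reduction = 'mean' for the weighted criterion.

```python
import torch

torch.manual_seed(0)
inputs = torch.randn(2, 3)
targets = torch.rand(2, 3)  # assumed to lie in [0, 1]
class_weights = torch.tensor([1.0, 2.0, 3.0])

# Manual route: unreduced loss, multiply by class weights, then average.
unreduced = torch.nn.BCEWithLogitsLoss(reduction="none")(inputs, targets)
final_loss1 = (unreduced * class_weights).mean()

# Built-in route: pass the same tensor as the weight argument.
loss2 = torch.nn.BCEWithLogitsLoss(weight=class_weights)(inputs, targets)

print(torch.allclose(final_loss1, loss2))
```

Note that the mean here divides by the number of elements, not by the sum of the weights, which is why the two routes agree.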

It was an interesting comment which I have not considered, but I have a question in this regard.
In this more general case of BCEWithLogitsLoss, how can I calculate the pos_weight tensor exactly?

Also, regarding my first post, was the way of calculating the pos_weight tensor in the case of a binary target correct?

Yes. You are right and I made a mistake here.

You mentioned that you have 1000 images. My training data have around 10,000 images and my validation data have around 5,000. Would the calculation of pos_weight and the class weights differ between the training and validation datasets?

Or do the 1000 images refer to the number of samples present in a batch?

Did you use any other approach for calculating the pos_weights?

Did you find another way to calculate pos_weight or the one you already mentioned is correct?