Question about BCE* losses interface and features

Dear pytorch guys :slight_smile: ,
first of all thanks for developing such a conceptually smart, easy to use and nice ML framework.

I have been a PyTorch user since the first versions and I have a few questions about how the NN losses are conceptually implemented:

  1. BCELoss and BCEWithLogitsLoss require Float labels (also called targets, if you prefer). Why is that, precisely? CrossEntropyLoss, for example, requires Long labels (integers) and it's a more general case than BCE, so why do the BCE* losses need floats? This choice also influences the second point I am going to raise.

  2. BCELoss and BCEWithLogitsLoss don't (currently) support ignore_index, which is IMO an extremely useful feature. Implementing ignore_index is easy, but the fact that these losses take Float labels makes it a bit awkward, because due to numerical issues the float representation of the ignore_index value could be very close to the intended value, yet different enough to fail the

mask = labels.ne(ignore_index)

clause that I could use to implement such a feature. (e.g. -1.0000000001 instead of -1).

  3. Minor issue: in the PyTorch 1.1.0 BCEWithLogitsLoss documentation, I think that there is a problem in the pos_weight example: the (10,64) tensor is said to contain 10 batches of 64 classes, but from what I see, if there are 64 classes, then 10 must be the number of samples (with the batch dimension omitted). Conversely, if 10 is the batch size, then 64 should be the samples per batch (with only 1 binary classification going on).

To wrap up my comments:
Would it be possible to make the "interface" of the loss functions a bit more uniform, ensuring that:
- losses for classification use integer labels/targets
- all the losses for classification provide ignore_index
- all the losses for classification provide class_weights

Thanks in advance and keep up the good work!
cheers,
Daniele


Hi Daniele,

I think the reason behind using a floating point target is given in the docs:

This is used for measuring the error of a reconstruction in for example an auto-encoder. Note that the targets t[i] should be numbers between 0 and 1.

I stumbled upon this issue a while ago and mistakenly thought it was only used for binary or multi-label classification. :wink:

This would also answer the second point. Since there are no discrete classes, the implementation of an ignore_index wouldn't really make sense.

The documentation mentions

target = torch.ones([10, 64], dtype=torch.float32) # 64 classes, batch size = 10

which seems correct.
Note that nn.BCEWithLogitsLoss can be used in a multi-label classification use case, i.e. a single sample has more than a single correct (active) class.
In this example all 64 classes would be active for all 10 samples in the batch.
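As a rough illustration of the shapes involved, here is a minimal sketch modeled on the documentation example. The logit value 1.5 and the all-ones pos_weight are placeholder values, not recommendations:

```python
import torch
import torch.nn as nn

batch_size, num_classes = 10, 64

# Multi-label targets: one float in [0, 1] per (sample, class) pair.
target = torch.ones([batch_size, num_classes], dtype=torch.float32)

# Raw model outputs (logits); 1.5 is just a placeholder value.
output = torch.full([batch_size, num_classes], 1.5)

# One positive-class weight per class, broadcast over the batch dimension.
pos_weight = torch.ones([num_classes])

criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
loss = criterion(output, target)  # scalar: mean over all 10 * 64 entries
```

With all targets equal to 1 and all weights equal to 1, every entry contributes -log(sigmoid(1.5)), so the mean is that same value.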

Thanks for your kind reply.

Yes, I noticed that the doc mentions the autoencoder use, but then what should I (we) use as a loss for "vanilla" classification? What is the "official" recommendation?

I ask because the fact that the BCE* losses are "open to use" in autoencoders lowers their usefulness for classic classification tasks (e.g. no ignore_index, and float labels as input).

Wouldn't it be better to "specialize" these functions for autoencoder and classification use? I suppose that the number of times they are used for vanilla classification >> the number of times they are used in autoencoders, since the first use case is likely the more frequent one.

This is my personal opinion and others might have different preferences, so take it with a grain of salt :wink:

For multi-class classification, I would use nn.CrossEntropyLoss, which also provides the ignore_index argument. This makes sense, as e.g. if I'm dealing with 1000 classes, I might just want to ignore a certain one.

In binary classification, you could still use nn.CrossEntropyLoss with two outputs (possibly more, if you ignore a class) or alternatively nn.BCE(WithLogits)Loss.
An ignore_index argument doesn't really make sense in the latter case, since we are dealing with float values and are using just a single output neuron, which should give us the probability (logit) of the positive class. Ignoring a class in a binary setup seems a bit strange, and it might be simpler to just calculate the loss of a single class instead (if that's the use case).

For multi-label classification, I would also use nn.BCE(WithLogits)Loss, where each neuron corresponds to the probability (logit) of the corresponding class.
Ignoring certain classes in this use case could in fact make sense.
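To make the options above a bit more concrete, here is a small sketch with made-up tensors (the shapes, the random logits and the variable names are assumptions for illustration only). It uses Long class indices with nn.CrossEntropyLoss and ignore_index, and float targets with nn.BCEWithLogitsLoss for the binary case:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Multi-class: Long targets hold class indices; -100 marks samples to skip.
mc_logits = torch.randn(4, 3)                # 4 samples, 3 classes
mc_targets = torch.tensor([0, 2, -100, 1])   # third sample is ignored
mc_loss = nn.CrossEntropyLoss(ignore_index=-100)(mc_logits, mc_targets)

# Binary with a single output neuron: one logit per sample, float targets.
bin_logits = torch.randn(4)
bin_targets = torch.tensor([1.0, 0.0, 0.0, 1.0])
bce_loss = nn.BCEWithLogitsLoss()(bin_logits, bin_targets)

# The same binary problem with two outputs: logits [0, z] give
# softmax probability sigmoid(z) for class 1, so the losses agree.
two_logits = torch.stack([torch.zeros_like(bin_logits), bin_logits], dim=1)
ce_loss = nn.CrossEntropyLoss()(two_logits, bin_targets.long())
```

In this sketch mc_loss is averaged over the three non-ignored samples only, and bce_loss matches ce_loss up to floating point error.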

I have a similar question and I was wondering if you could help me. Assume that I have a multi-label classification task, but for each sample some of the labels are unknown (NaN). Having ignore_index in this case would help me mask the loss for each binary classification. Since there is no ignore_index for BCELoss, what do you suggest I do to handle the missing labels?

Thank you so much!

You could create the unreduced loss via reduction='none', create a mask with 1s for all valid outputs/targets and 0s for the invalid ones, multiply the loss with it, reduce it (e.g. via mean), and calculate the gradients via backward().
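A minimal sketch of these steps, assuming the missing labels are stored as NaN in a float target tensor as described in the question. The function name masked_bce_with_logits is hypothetical, and averaging over the valid entries only (rather than over all entries) is one possible choice of reduction. Note that NaN * 0 is still NaN, so the NaN targets are replaced with a dummy value before the loss is computed:

```python
import torch
import torch.nn.functional as F


def masked_bce_with_logits(logits, targets):
    """BCE-with-logits loss that skips entries whose target is NaN.

    Hypothetical helper for illustration, not part of PyTorch.
    """
    valid = ~torch.isnan(targets)  # True where a label is known
    # Replace NaN targets with a dummy 0 so the unreduced loss stays finite.
    safe_targets = torch.where(valid, targets, torch.zeros_like(targets))
    loss = F.binary_cross_entropy_with_logits(
        logits, safe_targets, reduction="none"
    )
    loss = loss * valid.to(loss.dtype)  # zero out the invalid entries
    # Average over the valid entries only; guard against an all-NaN batch.
    return loss.sum() / valid.sum().clamp(min=1)


# Usage: 2 samples, 2 labels each, one label missing.
logits = torch.tensor([[0.0, 2.0], [1.0, -1.0]], requires_grad=True)
targets = torch.tensor([[1.0, float("nan")], [0.0, 1.0]])
loss = masked_bce_with_logits(logits, targets)
loss.backward()  # the masked entry receives a zero gradient
```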


Great idea! Thanks for your help!