I wrote a train routine that takes in an arbitrary model, data loader, and loss criterion, and which contains the following code:

for batch_idx, (data, target) in enumerate(train_loader):
    data, target = data.to(device), target.to(device)
    model.zero_grad()
    output = model(data)
    loss = criterion(output, target) * datasize
    loss.backward()
    optimizer.step()

Here, output is a 2D tensor and target is a 1D tensor. The problem is that when using NLL loss as the criterion (say, for MNIST classification), everything works fine as is, but if I use BCE loss (for some binary classification task), Torch complains that the criterion requires both tensors to be of the same shape.

Since, for binary classification, output will be a 2D tensor of size (B, 1), where B is the batch size, calling a simple squeeze() would be enough. But I want my code to work in both cases, i.e. with BCE loss and with NLL loss. I already tried calling squeeze() on the return value of the forward() pass inside my model, but that did not work either. What is the usual way to do this?
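To make the mismatch concrete, here is a minimal sketch (with made-up shapes and dummy data) of why a squeeze fixes the BCE case:

```python
import torch
import torch.nn as nn

criterion = nn.BCELoss()
output = torch.sigmoid(torch.randn(8, 1))   # (B, 1) model output, probabilities
target = torch.randint(0, 2, (8,)).float()  # (B,) binary labels as floats

# criterion(output, target) would fail: shapes (8, 1) vs (8,) don't match.
loss = criterion(output.squeeze(1), target)  # both shapes are now (8,)
```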

Yes, for BCE loss the model output and the target tensor need to be of the same shape.

Please note that BCE loss is appropriate for a multi-class multi-label setting while NLLLoss is appropriate for a multi-class single-label setting.

So, as I see it, only one of these will be appropriate for the problem at hand. Is there anything I am missing?
Could you elaborate on your task and why you'd like to use both?

Well, I would like to write a routine that works on supervised classification tasks. I don't understand what you mean by "multi-label". If I have N classes, I have N labels, one for each class.

For binary classification (2 classes and labels 0 or 1), I need to use BCE. For something like MNIST (N classes, N labels with N>2), I need to use NLL.

You could always add conditions and transform the outputs and targets to the desired shape and dtype if needed. Usually, you don't care about this kind of abstraction, since your use case defines the actual criterion, and allowing other loss functions to work with your data often doesn't make sense.
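One way to sketch such a condition (a minimal example, not the only approach; the helper name compute_loss is made up) is to branch on the criterion type and reshape accordingly:

```python
import torch
import torch.nn as nn

# Hypothetical helper: adapt output/target shapes and dtypes to the criterion.
def compute_loss(criterion, output, target):
    if isinstance(criterion, (nn.BCELoss, nn.BCEWithLogitsLoss)):
        # BCE variants expect output and target of identical shape, float targets.
        return criterion(output.squeeze(-1), target.float())
    # NLLLoss / CrossEntropyLoss expect (B, C) output and (B,) long targets.
    return criterion(output, target)

# Quick check with dummy data for both cases:
bce = nn.BCEWithLogitsLoss()
nll = nn.NLLLoss()

out_bin = torch.randn(4, 1)          # (B, 1) binary logits
tgt_bin = torch.randint(0, 2, (4,))  # (B,) 0/1 labels

out_mc = torch.log_softmax(torch.randn(4, 10), dim=1)  # (B, C) log-probs
tgt_mc = torch.randint(0, 10, (4,))                    # (B,) class indices

loss_bin = compute_loss(bce, out_bin, tgt_bin)
loss_mc = compute_loss(nll, out_mc, tgt_mc)
```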

A multi-label multi-class problem is one where there are more than two (multiple) classes and a data point can belong to more than one class at a time. BCEWithLogitsLoss (with no sigmoid() or softmax() applied to the model output) is the right loss function for such tasks.
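As a small sketch of the multi-label case (shapes and class count are made up), the model emits one raw logit per class and the target is a multi-hot float tensor of the same shape:

```python
import torch
import torch.nn as nn

# Multi-label: 3 classes, each sample can have several active labels at once.
logits = torch.randn(4, 3)              # (B, C) raw outputs, no sigmoid applied
targets = torch.tensor([[1., 0., 1.],
                        [0., 1., 0.],
                        [1., 1., 0.],
                        [0., 0., 1.]])  # (B, C) multi-hot float targets

criterion = nn.BCEWithLogitsLoss()      # applies sigmoid internally
loss = criterion(logits, targets)
```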

A single-label multi-class problem is the one where there are more than two (multiple) classes and a data point can belong to only one class. CrossEntropyLoss is the right loss function for such tasks.
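And the single-label counterpart (again with made-up shapes): the output stays (B, C), but the target is a 1D tensor of class indices rather than a one-hot tensor:

```python
import torch
import torch.nn as nn

# Single-label: 5 classes, each sample belongs to exactly one class.
logits = torch.randn(4, 5)            # (B, C) raw scores, no softmax applied
targets = torch.tensor([2, 0, 4, 1])  # (B,) class indices, dtype long

criterion = nn.CrossEntropyLoss()     # log_softmax + NLLLoss internally
loss = criterion(logits, targets)
```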