`Softmax` before or after loss calculation

I have a basic question: should softmax be applied before or after the loss calculation? I have seen many threads discussing this for Softmax and CrossEntropy Loss specifically, but my question is about using Softmax with any loss function in general. So:

  1. Is it a rule of thumb that softmax, if used, should only be applied before (or after) the loss calculation?
  2. If it is not a rule of thumb, which gives better results: applying it before or after the loss calculation?

For example, in the code below, should the softmax be applied at Line 1 or at Line 2?

import numpy
import torch
from tqdm import tqdm

_softmax = torch.nn.Softmax(dim=1)
for epoch in range(num_epochs):
    for phase in ['train', 'val']:
            if phase == 'train':
                model.train()  # Set model to training mode
            else:
                model.eval()   # Set model to evaluate mode
            num_data = 0
            num_corrects = 0
            _loss = []
            for i, data in enumerate(tqdm(dataloader[phase])):
                # Every data instance is an input + label pair
                inputs, true_labels = data
                # Move the batch to the same device as the model
                inputs, true_labels = inputs.to(device), true_labels.to(device)

                # Zero your gradients for every batch!
                optimizer.zero_grad()

                with torch.set_grad_enabled(phase == 'train'):
                    # Make predictions for each batch
                    predictions = model(inputs)

    Line 1: ------- predictions = _softmax(predictions) ---------------       
                    # Compute the loss for each batch
                    loss = loss_fn(predictions, true_labels)
    Line 2: ------- predictions = _softmax(predictions) ----------------

                    # Calculate predicted labels for each batch
                    _, pred_labels = torch.max(predictions, 1)
            
                    if phase == 'train':
                        #Compute loss gradients
                        loss.backward()
                        # Adjust learning weights
                        optimizer.step()

                _loss.append(loss.item())

                # Calculate samples for each epoch
                num_data += true_labels.size(0)

                # Calculate number of correctly predicted labels for each epoch
                num_corrects += torch.sum(pred_labels == true_labels).item()
            epoch_loss = numpy.mean(_loss)
            epoch_accuracy = 100 * num_corrects / num_data

(Please also tell me if any step in calculating epoch_loss or epoch_accuracy is wrong.)

  1. It depends on the loss function and where it defines the softmax operation. The issue with F.softmax and nn.CrossEntropyLoss in PyTorch is that nn.CrossEntropyLoss applies F.log_softmax internally, so no preceding F.softmax operation should be used. It’s thus not the user’s choice if and where to use the softmax; the loss function’s definition determines it (see the sketch after this list).
  2. Same as before: applying the unnecessary F.softmax before the internal F.log_softmax of nn.CrossEntropyLoss is wrong. Some users have reported beneficial results from doing so, but these could likely also have been achieved by e.g. lowering the learning rate.
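
To make this concrete, here is a minimal sketch (the logits and targets below are made up for illustration): nn.CrossEntropyLoss on raw logits gives the same result as F.log_softmax followed by nn.NLLLoss, while inserting an extra F.softmax changes the loss.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Made-up logits for a batch of 4 samples and 3 classes
logits = torch.randn(4, 3)
targets = torch.tensor([0, 2, 1, 2])

# Correct: pass raw logits; nn.CrossEntropyLoss applies log_softmax internally
criterion = nn.CrossEntropyLoss()
loss = criterion(logits, targets)

# Equivalent: explicit log_softmax followed by nn.NLLLoss
loss_equivalent = nn.NLLLoss()(F.log_softmax(logits, dim=1), targets)
print(torch.allclose(loss, loss_equivalent))  # True

# Wrong: an extra softmax before the criterion distorts the loss and its gradients
loss_wrong = criterion(F.softmax(logits, dim=1), targets)
print(torch.isclose(loss, loss_wrong))  # generally False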

Thanks @ptrblck, you are the best. Clear-cut answer.

Sure, happy to help!
One addition: you are of course totally free to still apply F.softmax on the model’s output in case you want to use the probabilities, e.g. in a debugging step or to visualize them. You should, however, not pass these probabilities to e.g. nn.CrossEntropyLoss.
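
For example (a small illustrative snippet, not from the original thread), the probabilities can be computed on the side for logging or visualization while the raw logits go to the criterion:

import torch
import torch.nn.functional as F

# Made-up logits for one batch
logits = torch.randn(4, 3)
targets = torch.tensor([0, 2, 1, 2])

# The raw logits go to the criterion ...
loss = F.cross_entropy(logits, targets)

# ... while probabilities are computed separately, e.g. for debugging or visualization
probs = F.softmax(logits, dim=1)
print(probs.sum(dim=1))  # each row sums to 1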
