Torch.max and softmax confusion

I am quite new to PyTorch, and I am going through the [transfer learning tutorial on ResNet](https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html), and I notice these lines (which are also there in inference mode):

outputs = model(inputs)
_, preds = torch.max(outputs, 1)

which, when I go through the docs, suggests that it is simply the maximum along dimension 1 of the tensor, together with the indices of those maxima. So preds just contains the indices, which indirectly represent the classes.
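
For example, on a small made-up batch (my own toy numbers, not from the tutorial), this is what I understand it to do:

import torch

# toy "outputs": 3 samples, 4 classes (made-up scores)
outputs = torch.tensor([[0.1, 2.0, -1.0, 0.3],
                        [1.5, 0.2,  0.0, 0.9],
                        [-0.4, 0.1, 3.2, 0.8]])

values, preds = torch.max(outputs, 1)
print(values)  # the largest score in each row
print(preds)   # tensor([1, 0, 2]) -- the index (class) of that largest score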

Now, when I look at the MobileNet tutorial (I adapted it to output 20 classes and retrained the final layer for feature extraction), I see these lines:

# The output has unnormalized scores. To get probabilities, you can run a softmax on it.
print(torch.nn.functional.softmax(output[0], dim=0))

Why is there this difference between the ResNet and MobileNet outputs? Strangely enough, when I simply use:

_, preds = torch.max(outputs, 1)

in my MobileNet transfer learning code, I seem to get the correct answers (i.e. the classification is done correctly). So I am now seriously confused as to why I would need the probabilities as described in the MobileNet tutorial, when I do not seem to need them with ResNet.

Also, when I use these probabilities via softmax and train, like so:

outputs = model(inputs)
outputs = torch.nn.functional.softmax(outputs, dim=1)
_, preds = torch.max(outputs, 1)

I get the same results as without this softmax, so I am not entirely sure what the correct thing to do is.

Any pointers would really help.

Thank you.

Hello John!

To answer your most concrete question first:

In this case preds will be the same whether you include softmax()
or remove it. This is because softmax() maps its (algebraically)
largest input value to the largest output value. Therefore the index
of the largest value won’t change.
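
As a quick (made-up) illustration of this:

import torch

logits = torch.tensor([[1.0, 3.0, -2.0],
                       [0.5, 0.1,  2.2]])
probs = torch.nn.functional.softmax(logits, dim=1)

print(probs)                    # the values change ...
print(torch.max(logits, 1)[1])  # tensor([1, 2])
print(torch.max(probs, 1)[1])   # tensor([1, 2]) -- ... but the argmax does not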

(It’s not clear to me what you mean by “train.” If you pass outputs
to a loss function, call loss.backward(), and then take an optimizer
step, you will get different results if you leave out the softmax().)

As an aside, with recent versions of pytorch you can call
preds = torch.argmax (outputs, 1), if you prefer.

To comment a little more on this: If your model returns the output of a
Linear layer (without passing it through something like softmax()),
the values returned should be understood as raw-score logits that
run, in principle, from -inf to inf. Such logits are what is expected
by some loss functions, such as CrossEntropyLoss.
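
For example (a minimal sketch with made-up shapes, not your actual model):

import torch

logits = torch.randn(4, 20)          # raw logits for 4 samples, 20 classes
labels = torch.randint(0, 20, (4,))  # integer class labels

loss_fn = torch.nn.CrossEntropyLoss()
loss = loss_fn(logits, labels)       # no softmax() applied to the logits
print(loss)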

softmax() converts a set of logits to probabilities that run from
0.0 to 1.0 and sum to 1.0. If you wish to work with probabilities for
some reason, for example, if your loss function expects probabilities,
then you would pass your logits through softmax(). But, conceptually,
they’re just different ways of representing the same thing – the logits
and probabilities contain (almost) the same information.
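
For instance (again just toy numbers):

import torch

logits = torch.tensor([2.0, -1.0, 0.5])
probs = torch.nn.functional.softmax(logits, dim=0)

print(probs)        # each between 0.0 and 1.0
print(probs.sum())  # tensor(1.)

# log() recovers the logits up to an additive constant, which is the sense
# in which the two representations carry (almost) the same information
print(torch.log(probs) - logits)  # the same constant in every position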

Best.

K. Frank

Thanks a tonne for this Frank - it really helps my understanding of things.

Could you please elaborate on this? In fact, this is what I am doing, and I am not sure what the correct value to pass to the loss function is: the raw logits, or the values obtained by passing them through softmax(). What should the expected difference be? In the little tests I am doing, I do not actually see a difference, hence my question.

Hello John!

It depends on what you are trying to do, and what your loss function is.

It sounds as if you might be working on a twenty-class classification
problem. In such a case, you would typically use CrossEntropyLoss
as your loss function. Your model would output directly the results of
its last fully-connected Linear layer as logits. CrossEntropyLoss
would take these logits and a batch of integer class labels as its
inputs. You would not pass anything through softmax() (although
a softmax() is, in effect, built into CrossEntropyLoss).
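
As a rough sketch of what I mean (the shapes and the stand-in Linear model are just assumptions based on your description):

import torch

num_classes = 20
model = torch.nn.Linear(512, num_classes)   # stand-in for your final fully-connected layer
loss_fn = torch.nn.CrossEntropyLoss()

features = torch.randn(8, 512)              # made-up batch of 8 feature vectors
labels = torch.randint(0, num_classes, (8,))

logits = model(features)                    # raw logits -- no softmax() here
loss = loss_fn(logits, labels)

# CrossEntropyLoss is, in effect, log_softmax() followed by nll_loss(),
# so this reproduces the same value
loss_check = torch.nn.functional.nll_loss(
    torch.nn.functional.log_softmax(logits, dim=1), labels)
print(loss, loss_check)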

I’m not sure exactly what you are doing, what your tests are, and how
you are looking for a difference.

But here is a (pytorch version 0.3.0) sample script that shows the
effect of adding softmax():

import torch
torch.__version__
torch.manual_seed (2020)

some_numbers = torch.autograd.Variable (torch.randn (5), requires_grad = True)
some_numbers

some_outputs = some_numbers
loss = some_outputs.sum()
loss
loss.backward()
some_numbers.grad

some_numbers.grad.zero_()
some_outputs = some_numbers
some_outputs = torch.nn.functional.softmax (some_outputs, 0)
loss = some_outputs.sum()
loss
loss.backward()
some_numbers.grad

And here is the output:

>>> import torch
>>> torch.__version__
'0.3.0b0+591e73e'
>>> torch.manual_seed (2020)
<torch._C.Generator object at 0x000001BAF61C6630>
>>>
>>> some_numbers = torch.autograd.Variable (torch.randn (5), requires_grad = True)
>>> some_numbers
Variable containing:
 1.2372
-0.9604
 1.5415
-0.4079
 0.8806
[torch.FloatTensor of size 5]

>>>
>>> some_outputs = some_numbers
>>> loss = some_outputs.sum()
>>> loss
Variable containing:
 2.2911
[torch.FloatTensor of size 1]

>>> loss.backward()
>>> some_numbers.grad
Variable containing:
 1
 1
 1
 1
 1
[torch.FloatTensor of size 5]

>>>
>>> some_numbers.grad.zero_()
Variable containing:
 0
 0
 0
 0
 0
[torch.FloatTensor of size 5]

>>> some_outputs = some_numbers
>>> some_outputs = torch.nn.functional.softmax (some_outputs, 0)
>>> loss = some_outputs.sum()
>>> loss
Variable containing:
 1
[torch.FloatTensor of size 1]

>>> loss.backward()
>>> some_numbers.grad
Variable containing:
 0
 0
 0
 0
 0
[torch.FloatTensor of size 5]

As you can see, adding the line

some_outputs = torch.nn.functional.softmax (some_outputs, 0)

causes the loss to change to 1 (from 2.2911), and the gradient to
change to [0, 0, 0, 0, 0] (from [1, 1, 1, 1, 1]).
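
For what it's worth, on a recent version of pytorch a rough equivalent of this demonstration (without Variable) might look like:

import torch

torch.manual_seed(2020)
some_numbers = torch.randn(5, requires_grad=True)

# without softmax(): the gradient of sum() is all ones
loss = some_numbers.sum()
loss.backward()
print(some_numbers.grad)   # tensor([1., 1., 1., 1., 1.])

# with softmax(): the loss is identically 1, so the gradient is all zeros
some_numbers.grad.zero_()
loss = torch.nn.functional.softmax(some_numbers, dim=0).sum()
loss.backward()
print(some_numbers.grad)   # tensor([0., 0., 0., 0., 0.]) (up to round-off)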

Typically you would call loss.backward() to calculate the gradient
of your loss function with respect to your model parameters. Then you
would call optim.step() to adjust your model parameters a little bit in
the direction opposite to your gradient. So a different gradient means
different updated model parameters, and this means different results.
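
Put differently, in a typical training step (a minimal sketch; the model, shapes, and optimizer here are just placeholders):

import torch

model = torch.nn.Linear(512, 20)            # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()

inputs = torch.randn(8, 512)                # placeholder batch
labels = torch.randint(0, 20, (8,))

optimizer.zero_grad()
outputs = model(inputs)                     # raw logits -- no softmax()
loss = loss_fn(outputs, labels)
loss.backward()                             # gradient of the loss w.r.t. the parameters
optimizer.step()                            # adjust the parameters using that gradient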

Best.

K. Frank

Hello Frank,

Thanks so much for your detailed explanation. I completely understand what you are saying. Indeed, I should not be passing my outputs through softmax() before computing the loss that I call loss.backward() on.

Thank you again for your patience in explaining a NN to a beginner like me. Can't thank you enough.