Dimension for logsoftmax

NOP · June 26, 2019, 5:50pm

class M(nn.Module):
    def __init__(self):
        super().__init__()
        self.l1 = nn.Linear(50*50,100)
        self.l2 = nn.ReLU()
        self.l3 = nn.Linear(100,100)
        self.l4 = nn.Tanh()        
        self.l5 = nn.Linear(100,10) 
        self.l6 = nn.LogSoftmax()

Having module M, if I don’t set
self.l6 = nn.LogSoftmax(dim=0)
or
self.l6 = nn.LogSoftmax(dim=1)

I get the warning:

UserWarning: Implicit dimension choice for log_softmax has been deprecated. Change the call to include dim=X as an argument.

What does it mean to set dim=0, and what does dim=1 mean?

ptrblck · June 26, 2019, 8:03pm

The dim argument defines which dimension should be used to calculate the log softmax, i.e. in which dimension the class logits are located.
Have a look at this small example using softmax:

import torch
import torch.nn.functional as F

x = torch.randn(5, 3)
x0 = F.softmax(x, dim=0)
print(x0)
> tensor([[0.1313, 0.0170, 0.4122],
        [0.0167, 0.6336, 0.0440],
        [0.1764, 0.0804, 0.3689],
        [0.4540, 0.0501, 0.0967],
        [0.2217, 0.2189, 0.0782]])
print(x0.sum(0))
> tensor([1.0000, 1.0000, 1.0000])

x1 = F.softmax(x, dim=1)
print(x1)
> tensor([[0.2528, 0.0482, 0.6990],
        [0.0169, 0.9438, 0.0393],
        [0.2847, 0.1908, 0.5245],
        [0.7409, 0.1202, 0.1389],
        [0.3620, 0.5255, 0.1124]])
print(x1.sum(1))
> tensor([1.0000, 1.0000, 1.0000, 1.0000, 1.0000])

As you can see, the probabilities sum to 1 along the specified dimension.
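The example above uses softmax, while the question is about log softmax. A minimal sketch of how the two relate, assuming the same [5, 3] input shape (the variable names are only illustrative): F.log_softmax returns the log of the softmax probabilities, so exponentiating its output gives values that sum to 1 along the chosen dim.

```python
import torch
import torch.nn.functional as F

x = torch.randn(5, 3)

# log_softmax over dim=1 matches log(softmax) over dim=1, up to floating-point error
log_probs = F.log_softmax(x, dim=1)
reference = F.softmax(x, dim=1).log()
same = torch.allclose(log_probs, reference, atol=1e-5)

# exponentiating the log probabilities recovers probabilities that sum to 1 per row
row_sums = log_probs.exp().sum(dim=1)
print(same)
print(row_sums)
```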

In your use case, you should use dim=1 to calculate the log probabilities for each sample in the batch over all classes (which are in dim1).
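A minimal sketch of that fix, assuming the module from the question plus a hypothetical forward method (the original post does not show one) that simply chains the layers in order. The batch size of 4 is arbitrary.

```python
import torch
import torch.nn as nn


class M(nn.Module):
    def __init__(self):
        super().__init__()
        self.l1 = nn.Linear(50 * 50, 100)
        self.l2 = nn.ReLU()
        self.l3 = nn.Linear(100, 100)
        self.l4 = nn.Tanh()
        self.l5 = nn.Linear(100, 10)
        # dim=1: normalize over the 10 classes of each sample
        self.l6 = nn.LogSoftmax(dim=1)

    # Assumed forward pass; not part of the original post.
    def forward(self, x):
        for layer in (self.l1, self.l2, self.l3, self.l4, self.l5, self.l6):
            x = layer(x)
        return x


model = M()
batch = torch.randn(4, 50 * 50)        # [batch_size, features]
log_probs = model(batch)               # [batch_size, 10]
row_sums = log_probs.exp().sum(dim=1)  # one sum per sample
print(log_probs.shape)
print(row_sums)  # each entry is approximately 1
```

With dim=1 set explicitly, the implicit-dimension warning from the question should no longer be triggered.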

NOP · June 27, 2019, 5:18am

OK, just to clarify,

the dimension 0 would be the number of batches (5 in your case),
dimension 1 is number of samples in the batch (3 in your case)
so I set dim=1 to calculate the logits (log probabilities) of each sample in the batch.

ptrblck · June 27, 2019, 8:49am

Dimension 0 is the batch dimension and gives the number of samples in the current batch.
You can consider x to be a single batch of samples.
Dimension 1 is the feature dimension in my example and gives the number of different features.

ananda2020 · June 22, 2020, 5:52pm

Hi,
I have a question. What I understood from this thread is that the softmax layer is a classifier over 10 classes. You said that dim=1 should be used for this case. Can I generalize that dim=1 should be used for classification problems, or am I misunderstanding?
Could you please give an example where dim=0 can be used?
Thanks in advance.

ptrblck · June 23, 2020, 1:18am

The majority of PyTorch layers use tensors with the batch dimension in dim0.
The typical multi-class classification output would have a shape of [batch_size, nb_classes], and you would calculate the probability for each class in each sample:

import torch
import torch.nn.functional as F

batch_size = 2
nb_classes = 3
x = torch.randn(batch_size, nb_classes)
prob = F.softmax(x, dim=1)
print(prob)
> tensor([[0.6935, 0.1843, 0.1223],
          [0.8212, 0.0705, 0.1083]])

Here you can see that for sample0, class0 has a probability of 69.35%, class1 18.43%, and class2 12.23%.

If you are using F.softmax or F.log_softmax with dim=0, you would calculate the (log) probability in the batch dimension.

prob = F.softmax(x, dim=0)
print(prob)
> tensor([[0.2748, 0.5397, 0.3364],
          [0.7252, 0.4603, 0.6636]])

Now you are looking at: for class0, sample0 has a probability of 27.48%, while sample1 has 72.52%.

RNNs are an exception: by default they use the temporal dimension in dim0, so whether you want to apply the (log) softmax in this dimension depends on your use case.
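A rough sketch of the RNN case, under a few assumptions that are not from the thread: the sizes are arbitrary, and the hidden size is set equal to the number of classes so that the RNN output can stand in for class scores directly. With the default batch_first=False, the output has shape [seq_len, batch_size, hidden_size], so the class scores are in the last dimension rather than in dim1.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sizes only
seq_len, batch_size, input_size, nb_classes = 7, 2, 5, 3

# Default batch_first=False: input is [seq_len, batch_size, input_size]
rnn = nn.RNN(input_size=input_size, hidden_size=nb_classes)
x = torch.randn(seq_len, batch_size, input_size)
out, h_n = rnn(x)  # out: [seq_len, batch_size, nb_classes]

# The class scores sit in the last dimension here, so normalize over dim=-1
log_probs = F.log_softmax(out, dim=-1)
sums = log_probs.exp().sum(dim=-1)  # [seq_len, batch_size]
print(out.shape)
print(sums.shape)
```

Here dim=0 would normalize across time steps and dim=1 across samples, which is usually not what you want for per-step classification.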

ananda2020 · June 23, 2020, 3:16am

Thank you ptrblck for such a clear explanation. I am learning so much from this forum.
Regards,
ananda2020