How to implement exactly the same softmax as F.softmax in PyTorch

Junwu_Weng · May 3, 2019, 9:25am

Hi everyone,

I recently needed to re-implement the softmax function in order to design my own variant. I referred to code on GitHub and implemented the version shown below.

def own_softmax(self, x):
    # Subtract the per-row max before exponentiating, for numerical stability
    maxes = torch.max(x, 1, keepdim=True)[0]
    x_exp = torch.exp(x - maxes)
    x_exp_sum = torch.sum(x_exp, 1, keepdim=True)

    return x_exp / x_exp_sum

However, after implementing it I found that the results are not as good as with the original (F.softmax). So I am asking here: what is the difference between my implementation and the built-in function?

Thank you so much!

Junwu_Weng · May 3, 2019, 1:52pm

Can anyone help here?

Junwu_Weng · May 4, 2019, 2:33am

I have also tried another implementation, shown below,

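def own_softmax(self, x):
    # torch.mean returns a tensor (not a (values, indices) tuple like torch.max),
    # so no [0] indexing is needed here
    means = torch.mean(x, 1, keepdim=True)
    x_exp = torch.exp(x - means)
    x_exp_sum = torch.sum(x_exp, 1, keepdim=True)

    return x_exp / x_exp_sum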

and found that this implementation achieves better accuracy. However, it is still not as good as F.softmax. Can anyone help?

Junwu_Weng · May 4, 2019, 1:43pm

Can anyone help?

Junwu_Weng · May 5, 2019, 6:59am

so sad, so sad …

ptrblck · May 6, 2019, 12:48pm

Your custom function returns the same output as F.softmax:

import torch
import torch.nn.functional as F

x = torch.randn(5, 10)
output = F.softmax(x, 1)

maxes = torch.max(x, 1, keepdim=True)[0]
x_exp = torch.exp(x-maxes)
x_exp_sum = torch.sum(x_exp, 1, keepdim=True)
output_custom = x_exp/x_exp_sum

print(torch.allclose(output, output_custom))
> True
print(torch.sum(torch.abs(output-output_custom)))
> tensor(2.3108e-7)

nutszebra · May 6, 2019, 2:42pm

The output of your own_softmax is slightly different from that of torch.nn.functional.softmax.
This may be why your own_softmax degrades performance.

import torch

x = torch.randn(2, 10)
h_own = own_softmax(x)  # own_softmax as defined above, without the self argument
h = torch.nn.functional.softmax(x, 1)
print(h - h_own)

Junwu_Weng · May 7, 2019, 7:07am

Thank you so much. It seems that the method used to center the network's output scores influences the softmax output a lot. I have tested centering with the max value and with the mean value, and their outputs are quite different. I am wondering whether the mean version is more stable?
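
On the stability question, here is a minimal sketch, not taken from the thread, of one case where the two shifts behave differently in float32. The helper name softmax_shifted and the input values are made up for illustration. After subtracting the max, the largest exponent is always exp(0) = 1, so nothing can overflow. After subtracting the mean, a single large logit can still push exp past the float32 range.

```python
import torch
import torch.nn.functional as F

def softmax_shifted(x, shift):
    # Hypothetical helper: softmax along dim 1 after subtracting a per-row shift.
    x_exp = torch.exp(x - shift)
    return x_exp / torch.sum(x_exp, 1, keepdim=True)

# One row containing a single very large logit (float32 by default).
x = torch.tensor([[0.0, 0.0, 0.0, 200.0]])

ref = F.softmax(x, 1)
max_shift = torch.max(x, 1, keepdim=True)[0]   # largest shifted value is 0
mean_shift = torch.mean(x, 1, keepdim=True)    # largest shifted value is 150

out_max = softmax_shifted(x, max_shift)
out_mean = softmax_shifted(x, mean_shift)

# Max shift agrees with the built-in; mean shift overflows to inf / inf = nan.
print(torch.allclose(out_max, ref))
print(torch.isfinite(out_mean).all())
```

For small, well-scaled logits such as torch.randn outputs, both shifts should give nearly identical results. The difference only shows up when some logit sits far above the row mean.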

Junwu_Weng · May 7, 2019, 7:08am

Thank you so much for your reply.

ptrblck · May 7, 2019, 10:31am

If I swap torch.max for torch.mean, I get approx. the same accuracy.
Could you post the code you’ve used to check the behavior?
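
For reference, a short derivation of why the choice of shift should not matter in exact arithmetic. The symbol c stands for any per-row constant (the max, the mean, or anything else); this is standard softmax algebra rather than something stated in the thread:

```latex
\operatorname{softmax}(x - c)_i
  = \frac{e^{x_i - c}}{\sum_j e^{x_j - c}}
  = \frac{e^{-c}\, e^{x_i}}{e^{-c} \sum_j e^{x_j}}
  = \frac{e^{x_i}}{\sum_j e^{x_j}}
  = \operatorname{softmax}(x)_i
```

Any remaining difference between the max and mean versions would therefore come from floating-point effects or from the surrounding code, not from the formula itself.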

omri123 · January 16, 2024, 9:56am

Hi Junwu!
For a specific optimization, I need to re-implement softmax, and I am worried about accuracy degradation.
Did you see the performance degradation in training, or do you use your softmax for inference only? I think training is more sensitive to numerical problems.
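
One way to probe the training side of this question is to compare gradients as well as outputs. The following is a rough sketch, assuming a free-function version of the max-shifted own_softmax from the first post; the weight tensor w and the fixed seed are arbitrary choices for illustration. Weighting the output before summing matters, because the plain sum of a softmax row is always 1 and would give a zero gradient.

```python
import torch
import torch.nn.functional as F

def own_softmax(x):
    # Free-function version of the max-shifted softmax from the first post.
    maxes = torch.max(x, 1, keepdim=True)[0]
    x_exp = torch.exp(x - maxes)
    return x_exp / torch.sum(x_exp, 1, keepdim=True)

torch.manual_seed(0)
x_ref = torch.randn(5, 10, requires_grad=True)
x_own = x_ref.detach().clone().requires_grad_(True)
w = torch.randn(5, 10)  # arbitrary weights so the scalar loss has a non-trivial gradient

# Backpropagate the same weighted-sum loss through both implementations.
(F.softmax(x_ref, 1) * w).sum().backward()
(own_softmax(x_own) * w).sum().backward()

forward_diff = (F.softmax(x_ref, 1) - own_softmax(x_own)).abs().max().item()
grad_diff = (x_ref.grad - x_own.grad).abs().max().item()
print(forward_diff, grad_diff)
```

If both differences are on the order of float32 rounding error, the custom softmax is unlikely to be the cause of a training gap on inputs of this scale, and the rest of the training code would be the next place to look.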