It might have to do with the log-sum-exp trick applied in F.log_softmax.
I carried out the following experiment:
import torch.nn as nn, torch, torch.nn.functional as F
from math import exp, log
torch.set_printoptions(precision=15)
x = torch.randn(5); x
tensor([ 2.229876756668091, 0.264560282230377, -0.100190632045269,
0.228291451931000, -0.119905993342400])
a = torch.softmax(x, dim=0); a
tensor([0.681239962577820, 0.095449574291706, 0.066277280449867,
0.092049762606621, 0.064983405172825])
torch.log(a)
tensor([-0.383840680122375, -2.349157094955444, -2.713908195495605,
-2.385426044464111, -2.733623266220093])
while
F.log_softmax(x, dim=0)
gives
tensor([-0.383840620517731, -2.349157094955444, -2.713907957077026,
-2.385425806045532, -2.733623266220093])
We see a small difference in the values obtained:
torch.log(a) - F.log_softmax(x, dim=0)
tensor([-5.960464477539062e-08, 0.000000000000000e+00, -2.384185791015625e-07,
-2.384185791015625e-07, 0.000000000000000e+00])
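For scale, these differences are on the order of float32 machine epsilon, i.e. ordinary rounding error coming from two different evaluation orders rather than a real mathematical difference:
torch.finfo(torch.float32).eps
1.1920928955078125e-07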
The first case (for the 3rd value) is equivalent to:
log(exp(x[2])/(exp(x[0]) + exp(x[1]) + exp(x[2]) + exp(x[3]) + exp(x[4])))
-2.7139080429149542
while the second case is equivalent to (and matches the 3rd value of F.log_softmax exactly):
x[2] - x[0] - log((exp(0) + exp(x[1] - x[0]) + exp(x[2] - x[0]) + exp(x[3] - x[0]) + exp(x[4] - x[0])))
-2.713907957077026
I think this is done to avoid taking the exponential of a large number, which would overflow.
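For illustration, here is a minimal sketch of that stabilized formula as a standalone function (the name log_softmax_stable is my own, and this is not PyTorch's actual kernel), plus an input where the naive log(softmax(x)) breaks down: torch.softmax itself subtracts the max internally so it does not overflow, but the tiny probability underflows to zero and its log becomes -inf, while working directly in log space keeps the result finite.
import torch
import torch.nn.functional as F

def log_softmax_stable(x, dim=0):
    # shift by the max so exp() never sees a large positive argument;
    # the shift cancels mathematically but avoids overflow/underflow
    m = x.max(dim=dim, keepdim=True).values
    z = x - m
    return z - z.exp().sum(dim=dim, keepdim=True).log()

big = torch.tensor([1000.0, 0.0])
torch.log(torch.softmax(big, dim=0))   # tensor([0., -inf])  -- exp(-1000) underflows, log(0) = -inf
F.log_softmax(big, dim=0)              # tensor([0., -1000.])
log_softmax_stable(big, dim=0)         # tensor([0., -1000.])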