Explicit softmax calculation faster than built-in method?

Hi,

I need to compute the softmax of a large, high-rank tensor, and the built-in method seems significantly slower than computing it explicitly. I measured the timings with torch.autograd.profiler, running on a single K80.

Built-in:

import torch

X = torch.randn((1000, 10, 100, 100)).to('cuda')
with torch.autograd.profiler.profile(use_cuda=True) as prof:
    S = X.softmax(-1)  # built-in softmax over the last dimension

Explicit:

X = torch.randn((1000, 10, 100, 100)).to('cuda')
with torch.autograd.profiler.profile(use_cuda=True) as prof:
    S = X.exp()                      # elementwise exponential
    S = S / S.sum(-1, keepdim=True)  # normalize over the last dimension
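
In both cases I then print the profiler summary, roughly like this (the exact call doesn't affect the measurements):

print(prof)                    # full per-op table
# print(prof.key_averages())   # aggregated by op name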

Here’s the profiler output for both (top: built-in, bottom: explicit):

The results match (up to floating-point precision) for the same input.
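
For example, a quick check along these lines (my sketch; torch.allclose with a loose float32 tolerance) shows the two outputs agree:

import torch

X = torch.randn((1000, 10, 100, 100)).to('cuda')
builtin = X.softmax(-1)
explicit = X.exp() / X.exp().sum(-1, keepdim=True)
print(torch.allclose(builtin, explicit, atol=1e-6))  # expect True for inputs in this range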

Am I doing something wrong (either with the calculation or with the profiling)? Or is there some inefficiency in the built-in softmax?

System:

  • Ubuntu 16.04
  • PyTorch 0.4.1
  • CUDA 9.2 / cuDNN 7
  • 1x K80

cc @apaszke for softmax perf.

The naive implementation is numerically unstable. Our built-in version involves, e.g., computing the max and subtracting it before exponentiating (so that exp() cannot overflow), which might end up being slightly more expensive.
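
To make the instability concrete, a minimal example: for large inputs exp() overflows to inf, so the naive normalization returns nan, while subtracting the max first keeps everything finite:

import torch

x = torch.tensor([1000.0, 1000.0, 1000.0])

# naive: exp(1000) overflows to inf, and inf / inf gives nan
print(x.exp() / x.exp().sum(-1, keepdim=True))              # tensor([nan, nan, nan])

# stable: shifting by the max keeps the exponents in range
shifted = x - x.max(-1, keepdim=True)[0]
print(shifted.exp() / shifted.exp().sum(-1, keepdim=True))  # tensor([0.3333, 0.3333, 0.3333])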

Thanks, that makes sense.

I just did the test again, subtracting the max from X:

X = X - X.max(-1, keepdim=True)[0]  # shift by the per-row max for numerical stability
X = X.exp()
X = X / X.sum(-1, keepdim=True)

and I get the same time for the explicit calculation as for the built-in version.
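
For anyone who wants to reproduce the comparison end to end, here is a self-contained sketch (the helper names are mine; absolute timings will vary with GPU and PyTorch version):

import torch

X = torch.randn((1000, 10, 100, 100)).to('cuda')

def naive_softmax(x):
    e = x.exp()
    return e / e.sum(-1, keepdim=True)

def stable_softmax(x):
    x = x - x.max(-1, keepdim=True)[0]  # shift by the max for numerical stability
    e = x.exp()
    return e / e.sum(-1, keepdim=True)

for name, fn in [('built-in', lambda t: t.softmax(-1)),
                 ('naive', naive_softmax),
                 ('stable', stable_softmax)]:
    with torch.autograd.profiler.profile(use_cuda=True) as prof:
        fn(X)
    print(name)
    print(prof)  # per-op CPU/CUDA timings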