Hi,

I need to compute the softmax of a large, high-rank tensor, and the built-in method seems significantly slower than computing it explicitly. I measured the speeds using `torch.autograd.profiler`, running on a single K80.

Built-in:

```
import torch

X = torch.randn((1000, 10, 100, 100)).to('cuda')
with torch.autograd.profiler.profile(use_cuda=True) as prof:
    S = X.softmax(-1)
```

Explicit:

```
import torch

X = torch.randn((1000, 10, 100, 100)).to('cuda')
with torch.autograd.profiler.profile(use_cuda=True) as prof:
    S = X.exp()
    S = S / S.sum(-1, keepdim=True)
```
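One thing worth noting about the explicit version: the built-in softmax typically subtracts the per-row maximum before exponentiating so that `exp()` cannot overflow, which the explicit version above skips. That extra work may account for part of the timing gap. A minimal NumPy sketch of the two variants (the function names are mine, and I'm assuming the softmax axis is the last one):

```python
import numpy as np

def softmax_naive(x, axis=-1):
    # direct exp-then-normalize, like the explicit PyTorch version above
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def softmax_stable(x, axis=-1):
    # subtract the per-slice max first so exp() cannot overflow
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

x = np.random.randn(2, 3, 4)
# for moderate inputs both agree
print(np.allclose(softmax_naive(x), softmax_stable(x)))  # True

# for large inputs the naive version overflows to inf/inf = nan
big = x + 1000.0
print(np.isnan(softmax_naive(big)).any())   # True
print(np.isnan(softmax_stable(big)).any())  # False
```

So the comparison isn't quite apples-to-apples: the stable variant does an extra max-reduction and subtraction per row.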

Here’s the profiler output for both (top: built-in, bottom: explicit)

For the same input, the two methods produce the same result (up to floating-point precision).

Am I doing something wrong (either with the calculation or with the profiling)? Or is there some inefficiency in the built-in softmax?

System:

- Ubuntu 16.04
- PyTorch 0.4.1
- CUDA 9.2 / cuDNN 7
- 1x K80