Hi,
I need to compute the softmax of a large, high-rank tensor, and the built-in method seems significantly slower than computing it explicitly. I measured the speeds with torch.autograd.profiler on a single K80.
Built-in:

X = torch.randn((1000, 10, 100, 100)).to('cuda')
with torch.autograd.profiler.profile(use_cuda=True) as prof:
    S = X.softmax(-1)
Explicit:

X = torch.randn((1000, 10, 100, 100)).to('cuda')
with torch.autograd.profiler.profile(use_cuda=True) as prof:
    S = X.exp()
    S = S / S.sum(-1, keepdim=True)
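One caveat worth noting: the explicit version above exponentiates the raw inputs, which can overflow for large values, whereas the built-in softmax subtracts the per-row maximum first for numerical stability (and so does extra work). A stable explicit version would look like this (a sketch on a smaller CPU tensor for illustration; the max subtraction cancels out mathematically, so the result is unchanged):

```python
import torch

X = torch.randn((8, 4, 16, 16))

# Subtract the per-row max before exponentiating; this prevents
# overflow in exp() without changing the softmax result.
M, _ = X.max(-1, keepdim=True)
S = (X - M).exp()
S = S / S.sum(-1, keepdim=True)
```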
Here’s the profiler output for both (top: built-in, bottom: explicit)
The results are the same (up to floating-point precision) for the same input.
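For reference, this is how I check that the two agree (shown here on a small CPU tensor; the shape is illustrative, not the one I benchmark):

```python
import torch

X = torch.randn((8, 4, 16, 16))

# Built-in vs. explicit softmax over the last dimension
S_builtin = X.softmax(-1)
S_explicit = X.exp() / X.exp().sum(-1, keepdim=True)

# The two should match up to floating-point error
print(torch.allclose(S_builtin, S_explicit, atol=1e-6))
```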
Am I doing something wrong (in the calculation or in the profiling), or is there some inefficiency in the built-in softmax?
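As a sanity check independent of the profiler, here is a wall-clock measurement with explicit synchronization (a sketch; `time_fn` is a helper I made up for this post). CUDA kernels launch asynchronously, so timing without synchronizing before and after the loop can be misleading:

```python
import time
import torch

def time_fn(fn, X, iters=100):
    # Warm up so one-time initialization isn't counted
    for _ in range(10):
        fn(X)
    if X.is_cuda:
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        fn(X)
    if X.is_cuda:
        torch.cuda.synchronize()
    return (time.time() - start) / iters

# Falls back to CPU if no GPU is available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
X = torch.randn((100, 10, 100, 100)).to(device)

builtin_s = time_fn(lambda t: t.softmax(-1), X)
explicit_s = time_fn(lambda t: t.exp() / t.exp().sum(-1, keepdim=True), X)
print(builtin_s, explicit_s)
```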
System:
- Ubuntu 16.04
- PyTorch 0.4.1
- CUDA 9.2 / cuDNN 7
- 1x K80