There is a performance advantage (but no substantive numerical
advantage). The pow() (**) function is modestly more expensive
than exp(). (While log() is expensive, you only have to compute
it for a single scalar, and not the whole arange() tensor.)
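To make the identity explicit, here is a minimal sketch (d_model = 512 and the variable names are just example choices of mine): 1 / 10000**(i / d_model) is the same number as exp (i * (-log (10000.0) / d_model)), so the exp-log form needs only one scalar log() plus an elementwise exp():

import math
import torch

d_model = 512                                                  # arbitrary example width
i = torch.arange (0, d_model, 2, dtype = torch.float32)

pow_form = 1.0 / 10000.0 ** (i / d_model)                      # elementwise pow plus a reciprocal
exp_form = torch.exp (i * (-math.log (10000.0) / d_model))     # one scalar log, then elementwise exp

print (torch.allclose (pow_form, exp_form))                    # expected: True (up to float32 rounding)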
For timing purposes I have made a modest performance improvement
to your ** version by getting rid of the reciprocal (1 / x) computation.
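Sketched in isolation (same example d_model as above), that tweak just replaces 1 / b**x with b**(-x), which produces the same values but skips one elementwise division:

import torch

d_model = 512                                                  # same arbitrary example width
i = torch.arange (0, d_model, 2, dtype = torch.float32)

with_reciprocal    = 1.0 / 10000.0 ** (i / d_model)
without_reciprocal = 10000.0 ** (-i / d_model)

print (torch.allclose (with_reciprocal, without_reciprocal))   # expected: True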
On my system, for large tensors, the “exp-log” version shows a systematic
performance advantage on both the CPU and the GPU.
Here is a timing script:
import time
import math
import torch
print (torch.__version__)
print (torch.version.cuda)
print (torch.cuda.get_device_name())
n = 10
for device in ('cpu', 'cuda'):
    #
    # exp-log version
    #
    for d_model in (10000000, 100000000):
        rng = torch.arange (0, d_model, 2).to (device)
        # warm up
        torch.cuda.synchronize()
        for i in range (3):
            xExp = torch.exp (rng * (-math.log (10000.0) / d_model))
        # time
        torch.cuda.synchronize()
        start = time.time()
        for i in range (n):
            xExp = torch.exp (rng * (-math.log (10000.0) / d_model))
        torch.cuda.synchronize()
        t = (time.time() - start) / n
        print ('exp-log, %s: d_model = %d, t = %f' % (device, d_model, t))
    #
    # pow version (avoid reciprocal)
    #
    for d_model in (10000000, 100000000):
        rng = torch.arange (0, d_model, 2).to (device)
        # warm up
        torch.cuda.synchronize()
        for i in range (3):
            xPow = 10000.0 ** (-rng / d_model)
        # time
        torch.cuda.synchronize()
        start = time.time()
        for i in range (n):
            xPow = 10000.0 ** (-rng / d_model)
        torch.cuda.synchronize()
        t = (time.time() - start) / n
        print ('pow, %s: d_model = %d, t = %f' % (device, d_model, t))
    #
    print ('torch.allclose (xExp, xPow) =', torch.allclose (xExp, xPow))
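As an aside, the same comparison could also be sketched with torch.utils.benchmark, which is designed to take care of warm-up and cuda synchronization itself; the sizes and the 10000.0 constant below simply mirror the script above, and the labels are my own:

import math
import torch
import torch.utils.benchmark as benchmark

for device in ('cpu', 'cuda'):
    for d_model in (10000000, 100000000):
        rng = torch.arange (0, d_model, 2).to (device)
        for label, stmt in (
            ('exp-log', 'torch.exp (rng * (-math.log (10000.0) / d_model))'),
            ('pow', '10000.0 ** (-rng / d_model)'),
        ):
            timer = benchmark.Timer (
                stmt = stmt,
                globals = {'torch': torch, 'math': math, 'rng': rng, 'd_model': d_model},
                label = label,
                sub_label = '%s, d_model = %d' % (device, d_model),
            )
            print (timer.timeit (10))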
And here is its output:
1.10.0
10.2
GeForce GTX 1050 Ti
exp-log, cpu: d_model = 10000000, t = 0.011005
exp-log, cpu: d_model = 100000000, t = 0.108074
pow, cpu: d_model = 10000000, t = 0.035099
pow, cpu: d_model = 100000000, t = 0.365038
torch.allclose (xExp, xPow) = True
exp-log, cuda: d_model = 10000000, t = 0.001014
exp-log, cuda: d_model = 100000000, t = 0.010088
pow, cuda: d_model = 10000000, t = 0.002260
pow, cuda: d_model = 100000000, t = 0.020973
torch.allclose (xExp, xPow) = True
Wow! Thank you very much, K. Frank, for taking the time to explain the background and to put together the demonstration.
My assumption was that it might be a numerical-stability improvement, but as your answer shows, it is performance related.
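Out of curiosity I also took a quick look at how close the two results actually are, beyond torch.allclose() (just a sketch, with an arbitrary d_model = 512):

import math
import torch

d_model = 512
rng = torch.arange (0, d_model, 2)

xExp = torch.exp (rng * (-math.log (10000.0) / d_model))
xPow = 10000.0 ** (-rng / d_model)

print ((xExp - xPow).abs().max())   # expected: a tiny value on the order of float32 epsilon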
Attaching my results:
1.11.0+cu113
11.3
Quadro P2000
exp-log, cpu: d_model = 10000000, t = 0.015999
exp-log, cpu: d_model = 100000000, t = 0.107793
pow, cpu: d_model = 10000000, t = 0.029426
pow, cpu: d_model = 100000000, t = 0.278817
torch.allclose (xExp, xPow) = True
exp-log, cuda: d_model = 10000000, t = 0.001003
exp-log, cuda: d_model = 100000000, t = 0.012005
pow, cuda: d_model = 10000000, t = 0.003000
pow, cuda: d_model = 100000000, t = 0.022995
torch.allclose (xExp, xPow) = True