# First logarithm and then exponential, what's the reason?

Hi all,
I have a question regarding the PositionalEncoder implementation in the official transformers tutorial.
It contains the following line:

``````
torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
``````

However, I’d expect to see something like

``````
1 / (10000.0 ** (torch.arange(0, d_model, 2) / d_model))
``````

What’s the advantage of taking the log first and then exponentiating?

Hi Ali!

There is a performance advantage (but no substantive numerical
advantage). The `pow()` (`**`) function is modestly more expensive
than `exp()`. (While `log()` is expensive, you only have to compute
it for a single scalar, and not the whole `arange()` tensor.)
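
The two forms compute the same values because of the identity `b ** x == exp(x * ln(b))`. As a quick self-contained check (plain Python here, no torch, with `d_model = 512` chosen just for illustration), the two forms agree to floating-point precision:

``````python
import math

d_model = 512  # a typical embedding size, chosen only for this check

for i in range(0, d_model, 2):
    x_pow = 10000.0 ** (-i / d_model)                     # direct pow form
    x_exp = math.exp(i * (-math.log(10000.0) / d_model))  # exp-log form
    assert math.isclose(x_pow, x_exp, rel_tol=1e-12)

print("both forms agree")
``````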

For timing purposes I have made a modest performance improvement
to your `**` version by getting rid of the reciprocal (`1 / x`) computation.
On my system, for large tensors, the “exp-log” version shows a
systematic performance advantage on both the CPU and the GPU.

Here is a timing script:

``````
import time
import math

import torch
print (torch.__version__)
print (torch.version.cuda)
print (torch.cuda.get_device_name())

n = 10

for device in ('cpu', 'cuda'):
    #
    # exp-log version
    #
    for d_model in (10000000, 100000000):
        rng = torch.arange (0, d_model, 2).to (device)
        # warm up
        torch.cuda.synchronize()
        for i in range (3):
            xExp = torch.exp (rng * (-math.log (10000.0) / d_model))
        # time
        torch.cuda.synchronize()
        start = time.time()
        for i in range (n):
            xExp = torch.exp (rng * (-math.log (10000.0) / d_model))
        torch.cuda.synchronize()
        t = (time.time() - start) / n
        print ('exp-log, %s:  d_model = %d, t = %f' % (device, d_model, t))
    #
    # pow version (avoid reciprocal)
    #
    for d_model in (10000000, 100000000):
        rng = torch.arange (0, d_model, 2).to (device)
        # warm up
        torch.cuda.synchronize()
        for i in range (3):
            xPow = 10000.0 ** (-rng / d_model)
        # time
        torch.cuda.synchronize()
        start = time.time()
        for i in range (n):
            xPow = 10000.0 ** (-rng / d_model)
        torch.cuda.synchronize()
        t = (time.time() - start) / n
        print ('pow, %s:  d_model = %d, t = %f' % (device, d_model, t))
    #
    print ('torch.allclose (xExp, xPow) =', torch.allclose (xExp, xPow))
``````

And here is its output:

``````
1.10.0
10.2
GeForce GTX 1050 Ti
exp-log, cpu:  d_model = 10000000, t = 0.011005
exp-log, cpu:  d_model = 100000000, t = 0.108074
pow, cpu:  d_model = 10000000, t = 0.035099
pow, cpu:  d_model = 100000000, t = 0.365038
torch.allclose (xExp, xPow) = True
exp-log, cuda:  d_model = 10000000, t = 0.001014
exp-log, cuda:  d_model = 100000000, t = 0.010088
pow, cuda:  d_model = 10000000, t = 0.002260
pow, cuda:  d_model = 100000000, t = 0.020973
torch.allclose (xExp, xPow) = True
``````

Best.

K. Frank


Wow! Thank you very much, K. Frank, for taking the time to explain the background and to implement the proof.
My assumption was that it might be a numerical-stability improvement, but as your answer shows, it is performance related.
Attaching my results:

``````
1.11.0+cu113
11.3