First logarithm and then exponential, what's the reason?

Hi all,
I have a question regarding the PositionalEncoder implementation in the official transformer tutorial.
It contains the following line:

torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))

However I’d expect to see something like

1 / (10000.0 ** (torch.arange(0, d_model, 2) / d_model))

What’s the advantage of first taking the log and then exponentiating?

Hi Ali!

There is a performance advantage (but no substantive numerical advantage). The pow() (**) function is modestly more expensive than exp(). (While log() is expensive, you only have to compute it for a single scalar, not for the whole arange() tensor.)
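
To spell out why the two versions agree, write $i$ for the values produced by arange(0, d_model, 2); then

$$\exp\!\left(i \cdot \frac{-\ln 10000}{d_{\text{model}}}\right) \;=\; 10000^{-i/d_{\text{model}}} \;=\; \frac{1}{10000^{\,i/d_{\text{model}}}},$$

so the two formulations compute exactly the same values and differ only in which elementwise kernel does the work.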

For timing purposes I have made a modest performance improvement to your ** version by getting rid of the reciprocal (1 / x) computation. On my system, for large tensors, the “exp-log” version shows a systematic performance advantage on both the CPU and the GPU.

Here is a timing script:

import time
import math

import torch
print (torch.__version__)
print (torch.version.cuda)
print (torch.cuda.get_device_name())

n = 10

for  device in ('cpu', 'cuda'):
    #    
    # exp-log version
    #
    for  d_model in (10000000, 100000000):
        rng = torch.arange (0, d_model, 2).to (device)
        # warm up
        torch.cuda.synchronize()
        for  i in range (3):
            xExp = torch.exp (rng * (-math.log (10000.0) / d_model))
        # time
        torch.cuda.synchronize()
        start = time.time()
        for  i in range (n):
            xExp = torch.exp (rng * (-math.log (10000.0) / d_model))
        torch.cuda.synchronize()
        t = (time.time() - start) / n
        print ('exp-log, %s:  d_model = %d, t = %f' % (device, d_model, t))
    #
    # pow version (avoid reciprocal)
    #
    for  d_model in (10000000, 100000000):
        rng = torch.arange (0, d_model, 2).to (device)
        # warm up
        torch.cuda.synchronize()
        for  i in range (3):
            xPow = 10000.0 ** (-rng / d_model)
        # time
        torch.cuda.synchronize()
        start = time.time()
        for  i in range (n):
            xPow = 10000.0 ** (-rng / d_model)
        torch.cuda.synchronize()
        t = (time.time() - start) / n
        print ('pow, %s:  d_model = %d, t = %f' % (device, d_model, t))
    #
    print ('torch.allclose (xExp, xPow) =', torch.allclose (xExp, xPow))

And here is its output:

1.10.0
10.2
GeForce GTX 1050 Ti
exp-log, cpu:  d_model = 10000000, t = 0.011005
exp-log, cpu:  d_model = 100000000, t = 0.108074
pow, cpu:  d_model = 10000000, t = 0.035099
pow, cpu:  d_model = 100000000, t = 0.365038
torch.allclose (xExp, xPow) = True
exp-log, cuda:  d_model = 10000000, t = 0.001014
exp-log, cuda:  d_model = 100000000, t = 0.010088
pow, cuda:  d_model = 10000000, t = 0.002260
pow, cuda:  d_model = 100000000, t = 0.020973
torch.allclose (xExp, xPow) = True

Best.

K. Frank


Wow! Thank you very much, K. Frank, for taking the time to explain the background and to put together the demonstration.
My assumption was that it might be a numerical stability improvement, but as your answer shows, it is performance related.
Attaching my results:

1.11.0+cu113
11.3
Quadro P2000
exp-log, cpu:  d_model = 10000000, t = 0.015999
exp-log, cpu:  d_model = 100000000, t = 0.107793
pow, cpu:  d_model = 10000000, t = 0.029426
pow, cpu:  d_model = 100000000, t = 0.278817
torch.allclose (xExp, xPow) = True
exp-log, cuda:  d_model = 10000000, t = 0.001003
exp-log, cuda:  d_model = 100000000, t = 0.012005
pow, cuda:  d_model = 10000000, t = 0.003000
pow, cuda:  d_model = 100000000, t = 0.022995
torch.allclose (xExp, xPow) = True
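
As an extra check on the “no numerical advantage” point, here is a minimal sketch (d_model = 512 is just an arbitrary, realistic embedding size, not a value from the timing script) comparing the two formulations in float32:

import math

import torch

d_model = 512  # arbitrary but realistic embedding size
i = torch.arange(0, d_model, 2, dtype=torch.float32)

# exp-log formulation from the tutorial
x_exp = torch.exp(i * (-math.log(10000.0) / d_model))

# direct pow formulation
x_pow = 1.0 / (10000.0 ** (i / d_model))

# the two should agree to within float32 rounding error
print(torch.allclose(x_exp, x_pow))
print((x_exp - x_pow).abs().max().item())

If allclose returns True and the maximum difference is on the order of float32 epsilon, that confirms the choice really is about speed rather than accuracy.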