OverflowError: (34, 'Numerical result out of range')

I get the following error (see the stack trace) when I run my code on a different GPU (Tesla K-20, CUDA 7.5 installed, 6 GB memory). The code works fine on a GeForce 1080 or Titan X GPU.

Stacktrace:

  File "code/source/main.py", line 68, in <module>
    train.train_epochs(train_batches, dev_batches, args.epochs)
  File "/gpfs/home/g/e/geniiexe/BigRed2/code/source/train.py", line 34, in train_epochs
    losses = self.train(train_batches, dev_batches, (epoch + 1))
  File "/gpfs/home/g/e/geniiexe/BigRed2/code/source/train.py", line 76, in train
    self.optimizer.step()
  File "/gpfs/home/g/e/geniiexe/BigRed2/anaconda3/lib/python3.5/site-packages/torch/optim/adam.py", line 70, in step
    bias_correction1 = 1 - beta1 ** state['step']
OverflowError: (34, 'Numerical result out of range')

Do you have a question?

Yes. Why would this error occur on one GPU (Tesla K-20) when the same code works fine on a GeForce or Titan X GPU? Also, what does the error mean? I don't think it is related to running out of memory.

I got exactly the same error (with a K-40 and CUDA 7.5). Any suggestions?

Workaround for this:

Replace the following lines in adam.py:

bias_correction1 = 1 - beta1 ** state['step']
bias_correction2 = 1 - beta2 ** state['step']

with:

bias_correction1 = 1 - beta1 ** min(state['step'], 1022)
bias_correction2 = 1 - beta2 ** min(state['step'], 1022)

Why does this work? What is the reason for taking the min of state['step'] and 1022?

It is just for numerical reasons. In practice beta1 and beta2 lie between 0 and 1 (for example 0.9).
And 0.9^1000 = 1.7478713e-46, which is zero for all practical purposes, so capping the exponent leaves the result unchanged. (You can take 1000 instead of 1022 and nothing will change.)
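
A quick way to check that claim is to evaluate the capped bias correction directly. The sketch below is only an illustration: `capped_bias_correction` is a hypothetical helper name, not something in adam.py, and the second value, 0.999, is an assumption taken from the default `betas=(0.9, 0.999)` of `torch.optim.Adam` rather than from the posts above.

```python
def capped_bias_correction(beta: float, step: int, cap: int = 1022) -> float:
    """Hypothetical helper mirroring the patched line: 1 - beta ** min(step, cap)."""
    return 1 - beta ** min(step, cap)


for beta in (0.9, 0.999):
    # A step count far beyond the cap, as in a long training run.
    print(beta, capped_bias_correction(beta, 1_000_000))
```

For 0.9 the capped power is far below float precision, so the correction is exactly 1.0 and the cap is harmless. For a beta much closer to 1, such as 0.999, the power at step 1022 is still roughly 0.36, so the capped correction stays near 0.64 instead of approaching 1. If your beta2 is that close to 1, a much larger cap may be the safer choice.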

So here state['step'] is the number of training steps taken so far.
The total number of training steps is the number of epochs * the number of batches per epoch.
At some point, when step is very large, beta1 ** step becomes too small to represent as a float (it underflows), and that is what raises the OverflowError.
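
To see roughly when a run reaches that point, the step count can be compared against an estimate of where beta ** step rounds to zero. This is a minimal sketch assuming IEEE 754 double-precision floats. `first_underflow_step`, `epochs`, and `batches_per_epoch` are hypothetical names and values chosen for illustration, not taken from the code in the traceback.

```python
import math


def first_underflow_step(beta: float) -> int:
    """Estimate the smallest step at which beta ** step rounds to 0.0 as a double."""
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must be strictly between 0 and 1")
    # The smallest positive double is 2**-1074; results below half of that
    # (2**-1075) round to zero. Solve beta ** step < 2**-1075 for step.
    return math.floor(-1075 * math.log(2) / math.log(beta)) + 1


epochs = 50              # hypothetical value
batches_per_epoch = 200  # hypothetical value
total_steps = epochs * batches_per_epoch

threshold = first_underflow_step(0.9)
print(total_steps, threshold, total_steps >= threshold)
```

With beta1 = 0.9 the estimate is step 7073, so any run longer than that many optimizer steps will hit the underflow.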

For me, changing the Python version solved the issue.

With Python 3.7.4, once the exponent is large enough, I get:

>>> 0.9 ** 7073
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OverflowError: (34, 'Numerical result out of range')

Whereas with Python 3.8 or Python 3.10, I get:

>>> 0.9 ** 7073
0.0
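
If switching the interpreter is not an option, one possible alternative is to catch the error and treat the underflow as zero, which matches what the newer Python versions above return. This is only a sketch: `safe_pow` is a hypothetical helper, not part of PyTorch, and it assumes 0 < beta < 1 and a non-negative step, as in Adam's bias correction.

```python
def safe_pow(beta: float, step: int) -> float:
    """Return beta ** step, mapping an underflow-triggered OverflowError to 0.0."""
    try:
        return beta ** step
    except OverflowError:
        # With 0 < beta < 1 and step >= 0 the true result lies in (0, 1],
        # so a range error can only mean the value was too small to represent.
        if 0.0 < beta < 1.0 and step >= 0:
            return 0.0
        raise  # a genuine overflow (e.g. beta > 1) is re-raised unchanged


bias_correction1 = 1 - safe_pow(0.9, 7073)
print(bias_correction1)
```

Unlike capping the exponent, this leaves every step below the underflow point untouched.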