Out of memory when optimizer.zero_grad() is called


(David Leon) #1

I’m wondering why .zero_grad() would cause an out-of-memory error. As I understand it, this op just sets param.grad.data to zero, so why would it need extra memory?


(colesbury) #2

That’s strange. I can’t think of why that would happen.


#3

Is there a small script you can give to reproduce this? I am happy to look into what’s happening.


(David Leon) #4

I may need a while to reduce the code to a minimal example. The error traceback is as follows:

THCudaCheck FAIL file=/data/users/soumith/builder/wheel/pytorch-src/torch/lib/THC/generic/THCStorage.cu line=66 error=2 : out of memory
Warning: out of memory
Warning: out of memory

epoch = 0, loss =2.78738046, ER_train = 100.00, ER_batch = 100.00, time = 2.90s(2.90|0.00), progress = 0.00%, time remained = 1781.43h
epoch = 0, loss =2.77562714, ER_train = 98.44, ER_batch = 96.88, time = 0.73s(0.73|0.00), progress = 0.00%, time remained = 1983.91h
epoch = 0, loss =2.74634695, ER_train = 97.40, ER_batch = 95.31, time = 1.40s(1.40|0.00), progress = 0.04%, time remained = 5.93h

Warning: out of memory

Traceback (most recent call last):
  File "DIC_train_pytorch.py", line 397, in <module>
    optimizer.zero_grad()
  File "/home/David/App/anaconda3/lib/python3.5/site-packages/torch/optim/optimizer.py", line 136, in zero_grad
    param.grad.data.zero_()
RuntimeError: cuda runtime error (2) : out of memory at /data/users/soumith/builder/wheel/pytorch-src/torch/lib/THC/generic/THCTensorMath.cu:35

In the traceback above, “Warning: out of memory” is printed by my own code whenever it catches an out-of-memory exception (the same exception shown in the last line of the log). PyTorch raises this exception when the input training batch is too big. After catching it, I reduce the batch size and retry the training step. The corresponding code snippet is:

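    optimizer.zero_grad()
    try:
        # Forward pass; move the batch to the GPU when a device is selected
        if device >= 0:
            score = model(Variable(torch.from_numpy(X)).cuda(device))
        else:
            score = model(Variable(torch.from_numpy(X)))
    except RuntimeError as e:
        # On CUDA OOM, split the batch and queue the pieces for a later retry
        if e.args[0].startswith('cuda runtime error (2) : out of memory'):
            print('Warning: out of memory')
            cached_data.extend(split_train_data([X, Y]))
            continue
        else:
            # Any other error: re-raise with the original traceback
            raise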
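
For readers who want to try the catch-split-retry pattern without a GPU, here is a minimal, framework-agnostic sketch. It is not the original training script: `run_with_oom_retry`, `split_batch`, `is_cuda_oom` and `fake_step` are hypothetical names, and `fake_step` only simulates the OOM by raising a RuntimeError with the same message as in the log.

```python
from collections import deque


def is_cuda_oom(exc):
    """True if exc looks like the out-of-memory RuntimeError from the traceback."""
    return isinstance(exc, RuntimeError) and "out of memory" in str(exc)


def split_batch(batch):
    """Hypothetical splitter: cut a sliceable batch into two halves."""
    mid = len(batch) // 2
    return [batch[:mid], batch[mid:]]


def run_with_oom_retry(step, batches):
    """Run step(batch) on every batch, splitting and re-queueing batches that hit OOM.

    Returns the sizes of the batches that were processed successfully, in order.
    """
    queue = deque(batches)
    done = []
    while queue:
        batch = queue.popleft()
        try:
            step(batch)
        except RuntimeError as e:
            if not is_cuda_oom(e):
                raise  # unrelated error: keep the original traceback
            if len(batch) <= 1:
                raise  # cannot split further, so give up
            print("Warning: out of memory")
            # Put both halves at the front of the queue so they are retried next.
            queue.extendleft(reversed(split_batch(batch)))
            continue
        done.append(len(batch))
    return done


def fake_step(batch, limit=4):
    """Stand-in for a training step: pretends batches larger than `limit` do not fit."""
    if len(batch) > limit:
        raise RuntimeError("cuda runtime error (2) : out of memory")


sizes = run_with_oom_retry(fake_step, [list(range(10)), list(range(3))])
print(sizes)
```

In a real loop, `step` would presumably wrap the zero_grad / forward / backward / optimizer step sequence; whether the memory of the failed attempt is actually released before the retry is a separate question that this sketch does not address.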

#5

It’s possible that the OOM occurs elsewhere but is only reported at zero_grad.
Run your program with:

CUDA_LAUNCH_BLOCKING=1 python script.py

and see if it still reports the OOM at zero_grad.
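
If changing the command line is inconvenient, the same variable can probably be set from inside the script. A minimal sketch, assuming these lines run before anything initializes CUDA (safest is before importing torch); the `launch_blocking` name is only for illustration:

```python
import os

# Assumption: the CUDA runtime reads this variable when it initializes, so it
# has to be set before the first CUDA call, ideally before `import torch`.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

launch_blocking = os.environ["CUDA_LAUNCH_BLOCKING"]
print("CUDA_LAUNCH_BLOCKING =", launch_blocking)
```

With synchronous launches the Python traceback should point at the call that actually failed, at the cost of slower execution.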


(David Leon) #6

The traceback is the same, though.


(Adam Paszke) #7

We’re probably missing a check somewhere so the error pops up only there. You’re likely working under a super heavy memory pressure, and the model doesn’t fit. What’s the last operation you do (loss fn + last op before)? Did you try reducing the batch size?


(Aditya Sarma) #8

I have the same error.

I tried running the model with CUDA_LAUNCH_BLOCKING=1, but the error still pops up at optimizer.zero_grad(). Can anyone help me out? I can post the model and training snippet if needed.

Thanks