I’m asking because I don’t understand why .zero_grad() would cause an out-of-memory error. From my understanding, this op just sets param.grad.data to zero, so why would any extra memory be required?
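As far as I can tell, zero_grad() is roughly equivalent to the following (a simplified sketch of my understanding, not the actual source):

    def zero_grad(optimizer):
        # Walk every parameter the optimizer manages and zero its gradient in place;
        # this should not allocate any new CUDA memory.
        for group in optimizer.param_groups:
            for param in group['params']:
                if param.grad is not None:
                    param.grad.data.zero_()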
That’s strange. I can’t think of why that would happen.
Is there a small script you can give to reproduce this? I am happy to look into what’s happening.
It may take me a while to reduce the code to a minimal example. The error traceback is as follows:
    THCudaCheck FAIL file=/data/users/soumith/builder/wheel/pytorch-src/torch/lib/THC/generic/THCStorage.cu line=66 error=2 : out of memory
    Warning: out of memory
    Warning: out of memory
    epoch = 0, loss =2.78738046, ER_train = 100.00, ER_batch = 100.00, time = 2.90s(2.90|0.00), progress = 0.00%, time remained = 1781.43h
    epoch = 0, loss =2.77562714, ER_train = 98.44, ER_batch = 96.88, time = 0.73s(0.73|0.00), progress = 0.00%, time remained = 1983.91h
    epoch = 0, loss =2.74634695, ER_train = 97.40, ER_batch = 95.31, time = 1.40s(1.40|0.00), progress = 0.04%, time remained = 5.93h
    Warning: out of memory
    Traceback (most recent call last):
      File "DIC_train_pytorch.py", line 397, in <module>
        optimizer.zero_grad()
      File "/home/David/App/anaconda3/lib/python3.5/site-packages/torch/optim/optimizer.py", line 136, in zero_grad
        param.grad.data.zero_()
    RuntimeError: cuda runtime error (2) : out of memory at /data/users/soumith/builder/wheel/pytorch-src/torch/lib/THC/generic/THCTensorMath.cu:35
In the traceback above, “Warning: out of memory” is printed by my own code to tell me that an out-of-memory exception (exactly the one shown in the last line of the log) was caught. PyTorch raises this exception when the input training batch is too big. After catching it, I reduce the batch size and retry the training step. The corresponding code snippet is:
    optimizer.zero_grad()
    try:
        if device >= 0:
            score = model(Variable(torch.from_numpy(X)).cuda(device))
        else:
            score = model(Variable(torch.from_numpy(X)))
    except RuntimeError as e:
        # e.args is a tuple, so match on the message string instead
        if str(e).startswith('cuda runtime error (2) : out of memory'):
            print('Warning: out of memory')
            cached_data.extend(split_train_data([X, Y]))
            continue
        else:
            raise e
Since CUDA calls are asynchronous, it’s possible that the OOM occurs elsewhere but is reported at zero_grad().
Run your program with:
CUDA_LAUNCH_BLOCKING=1 python script.py
and see if it still reports the OOM at zero_grad().
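Another way to narrow it down is to add explicit synchronization points around the op you suspect, so that a pending error surfaces where it is actually caused. A rough sketch (inputs stands in for your input batch):

    import torch

    torch.cuda.synchronize()  # flush queued kernels; any earlier asynchronous error surfaces here
    score = model(inputs)     # the op you suspect is really running out of memory
    torch.cuda.synchronize()  # if the OOM is raised at this point, the forward pass is the culprit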
The traceback is the same, though.
What's the best way to handle the "cuda runtime error (2) : out of memory" exception?
We’re probably missing a check somewhere, so the error only pops up there. You’re likely working under very heavy memory pressure, and the model doesn’t fit. What’s the last operation you do (the loss function, and the op just before it)? Did you try reducing the batch size?
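If you do decide to catch the OOM and retry with a smaller batch, as in your snippet above, something along these lines should work; a minimal sketch, with run_step standing in for your forward/backward/update code:

    def train_step_with_retry(run_step, batch, min_size=1):
        # batch is assumed to be a tensor whose first dimension is the batch dimension.
        while True:
            try:
                return run_step(batch)
            except RuntimeError as e:
                # Give up if it is not an OOM, or if the batch is already as small as allowed.
                if 'out of memory' not in str(e) or batch.size(0) <= min_size:
                    raise
                print('Warning: out of memory, halving the batch')
                batch = batch[:batch.size(0) // 2]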
I have the same error.
I tried running the model with CUDA_LAUNCH_BLOCKING=1, but the error still pops up at optimizer.zero_grad(). Can anyone help me out? I can post the model and training snippet if needed.