RuntimeError: CUDA out of memory + GPU memory management best practices? + Colab

Hello all, can anyone explain to me what is happening here? I am a novice when it comes to memory management and Colab.
I am training a pre-trained Inception V3 model on CIFAR-10 for 10 epochs on Colab.
The following is the high-level structure of the script:

for epoch in range(10):
    model.train()
    for data, target in train_loader:
        ...  # forward pass, loss, backward, optimizer step

    model.eval()
    with torch.no_grad():
        for data, target in test_loader:
            ...  # forward pass, compute test loss/accuracy
After the first epoch of training, Colab throws a RuntimeError: CUDA out of memory for every batch size I tried ([8, 16, 32, 64]), just before the evaluation on the test set starts.
How should I fix this if I want to run training and evaluation in one script? What are the best practices for GPU memory management here? Any answers would be appreciated.

Thanks,
Prachi

YOLOv5 supports what you are looking for (yolov5/autobatch.py at master · ultralytics/yolov5 · GitHub)

Hi Suho, thanks for your prompt reply.
If I understand correctly, the code estimates a suitable batch size so that only a fraction of the available CUDA memory is used, presumably to avoid running out of memory. Is that right? It is helpful in a way. However, do you know whether, within a single script, I can run training and evaluation one after the other for each epoch?
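My rough mental model is something like the sketch below (not the actual autobatch code, just how I picture the idea; I am assuming torch.cuda.mem_get_info is available in my PyTorch version and that the per-sample memory can be measured somehow, e.g. from a single trial batch):

import torch

def estimate_batch_size(per_sample_bytes, fraction=0.8):
    # Free and total memory on the current CUDA device, in bytes.
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    # Only spend a fraction of the free memory to leave some headroom.
    budget = free_bytes * fraction
    # Rough estimate of how many samples fit into that budget.
    return max(1, int(budget // per_sample_bytes))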
Prachi

If the validation loop raises the out of memory error, you are either using too much memory in the validation loop directly (e.g. the validation batch size might be too large) or you are holding references to the previously executed training run.
Python frees variables once they leave a function scope, so you could either wrap the training and validation loops in their own functions or del unneeded tensors manually if you are using the global scope.
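Something along these lines would work (a minimal sketch assuming a plain classification setup where model, train_loader, test_loader, criterion, optimizer, and device are already defined and the model returns plain logits):

import torch

def train_one_epoch(model, loader, criterion, optimizer, device):
    model.train()
    for data, target in loader:
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        loss = criterion(model(data), target)
        loss.backward()
        optimizer.step()
    # Tensors created here go out of scope once the function returns.

@torch.no_grad()
def evaluate(model, loader, criterion, device):
    model.eval()
    total_loss = 0.0
    for data, target in loader:
        data, target = data.to(device), target.to(device)
        # .item() returns a Python float, so no tensor or graph references are kept.
        total_loss += criterion(model(data), target).item()
    return total_loss / len(loader)

for epoch in range(10):
    train_one_epoch(model, train_loader, criterion, optimizer, device)
    val_loss = evaluate(model, test_loader, criterion, device)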


Hi @ptrblck, big fan of your answers on PyTorch.

Yes, you are right, I had overlooked the validation batch size, which is 1000 in my case. For training I am using 64 (the largest the current GPU can handle), so that is a big difference. I tried running with test_batch_size = 250 and it worked fine.
What other suggestions would you recommend, besides putting the training and evaluation loops into separate functions?

Thanks,
Prachi


Besides making sure unneeded references are freed, I would check that the validation loop is already wrapped in a with torch.no_grad() guard to avoid storing intermediate activations.
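I.e. something like this (just a sketch, reusing the hypothetical test_loader, criterion, and device names from above):

model.eval()
with torch.no_grad():
    for data, target in test_loader:
        data, target = data.to(device), target.to(device)
        output = model(data)
        # No computation graph is recorded inside the guard, so the
        # intermediate activations are freed as soon as each batch is done.
        loss = criterion(output, target)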


Yes, that helps. Thanks!!