Hi guys,
I am learning about the DeepLabV3+ model these days.
I've run into a strange phenomenon: using the same batch size that works fine in training triggers "RuntimeError: CUDA out of memory." during evaluation.
Yet the inference speed seems quite a bit faster than training.
Are you seeing the OOM error right away or only after a few iterations? Also, how did you measure that the validation step is faster?
Since Python uses function scoping, you might want to wrap parts of your code in dedicated functions, so that the local tensors are freed once the function returns, as explained here.
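Here is a minimal sketch of what I mean, assuming a standard classification-style loop (the `model`/`loader` names are placeholders, not your actual code):

```python
import torch
import torch.nn as nn

def validate(model, loader, device):
    # Tensors created here are local to this function, so once it returns,
    # Python drops the references and their memory can be reused.
    model.eval()
    criterion = nn.CrossEntropyLoss()
    total_loss = 0.0
    with torch.no_grad():  # also avoids storing activations for backward
        for data, target in loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            total_loss += criterion(output, target).item()  # .item() detaches to a Python float
    return total_loss / len(loader)
```

Accumulating the loss via `.item()` instead of the tensor itself is also important, since keeping the tensor alive would keep the whole computation graph alive in training mode.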
Could you post a code snippet to reproduce this issue?
Instead of your real data, you could initialize the input and target with random tensors, so that we can debug this issue.
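Something along these lines would already be enough (the shapes, the number of classes, and the single-conv "model" are just placeholders standing in for your DeepLabV3+ setup):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"

# Placeholder model; swap in your DeepLabV3+ instance here.
model = nn.Conv2d(3, 10, kernel_size=3, padding=1).to(device)

# Random stand-ins for real images and segmentation targets:
# batch of 4, 3x64x64 inputs, per-pixel class labels in [0, 10).
data = torch.randn(4, 3, 64, 64, device=device)
target = torch.randint(0, 10, (4, 64, 64), device=device)

output = model(data)
loss = F.cross_entropy(output, target)
loss.backward()
print(loss.item())
```

With a self-contained snippet like this we could run the same batch size in training and eval mode and compare the memory usage directly.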
You could use torch.cuda.memory_allocated(), torch.cuda.memory_reserved() (formerly memory_cached()), etc. in your script to check the memory usage. Also, nvidia-smi will give you the overall memory usage (including the CUDA context).
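A small helper you could sprinkle between the suspicious lines of your script (the function name and MiB formatting are just my choices):

```python
import torch

def print_memory_stats(prefix=""):
    # memory_allocated: memory currently occupied by tensors.
    # memory_reserved: memory held by PyTorch's caching allocator.
    # nvidia-smi will report more than these, since it also counts the CUDA context.
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / 1024**2
        reserved = torch.cuda.memory_reserved() / 1024**2
        print(f"{prefix} allocated: {allocated:.1f} MiB, reserved: {reserved:.1f} MiB")
    else:
        print(f"{prefix} CUDA not available")
```

Calling it e.g. before and after the forward pass should show where the allocation jumps.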