CUDA out of memory error

I ran my net with a large minibatch on the GPU without a problem, and then Ctrl-C'd out of it. However, when I try to re-run my script I get:

RuntimeError: cuda runtime error (2) : out of memory at /home/soumith/local/builder/wheel/pytorch-src/torch/lib/THC/generic/THCStorage.cu:66

Nothing changed, and it worked just before. Is the memory on the GPU not being released? Is there a way to force the GPU to release its memory before a run?

Thanks.


What is reported by nvidia-smi? Are you using multiprocess data loading?

nvidia-smi shows me using 8 GB out of the 12 GB…

What puzzles me is that I had literally just stopped a run with a large minibatch size that was training just fine, but when I re-start the Python script it craps out.

I am not using multiprocess data loading to my knowledge; I am basically using PyTorch's torch.utils.data.DataLoader, where I have set the number of workers to 2.

A DataLoader with 2 workers will spawn 2 subprocesses, so you are using it. There's a problem with Python's multiprocessing where it doesn't always clean up the child processes properly. If you don't have any other Python jobs running and it's your private computer, you might try killall python; if not, you have to look for the worker processes and kill them…
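If you'd rather not run a blanket killall python, something like the sketch below finds the leftover processes and kills them. It assumes the third-party psutil package is installed and that your training script is called train.py - substitute your own script name:

import os
import psutil  # third-party package: pip install psutil

me = os.getpid()
for proc in psutil.process_iter(["pid", "cmdline"]):
    try:
        cmdline = " ".join(proc.info["cmdline"] or [])
        # "train.py" is a placeholder - match on your own script's name.
        if proc.info["pid"] != me and "train.py" in cmdline:
            print("killing", proc.info["pid"], cmdline)
            proc.kill()
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        pass  # the process already exited, or we are not allowed to touch it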

Yes, it's my private computer - ok, I'll try that and hope I don't break anything… will report back in a bit… fingers crossed

@smth @apaszke This is very strange - there are literally no other Python processes, nvidia-smi shows nearly 0% GPU-RAM usage, and yet I get the same error… it's also non-deterministic: as I keep retrying the same command, I either luck out or I don't…


Hi,

When training a CNN, I use multiple subprocesses to load data (num_workers = 8), and as the epochs go on I notice that the (RAM, but not GPU) memory usage keeps increasing.
I thought maybe I could kill the subprocesses after a few epochs and then spawn new subprocesses to continue training the network, but I don't know how to kill the subprocesses from the main process.

Can you give me some suggestions?

Thank you so much.

And when I set num_workers = 0, the (RAM, but not GPU) memory does not increase much as the epochs go on…

Can you give me some suggestions or instructions about the problem?

Thank you so much.
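For what it's worth, with the standard (non-persistent) DataLoader the worker subprocesses only live as long as the loader's iterator: they are spawned when you start iterating and shut down when the iterator is exhausted or garbage-collected. So one way to force a fresh set of workers every few epochs is simply to rebuild the loader. A minimal sketch - the dummy TensorDataset and the every-5-epochs schedule are just placeholders:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy data standing in for your real dataset.
train_dataset = TensorDataset(torch.randn(1000, 10), torch.randint(0, 2, (1000,)))

def make_loader(dataset):
    # Each loader spawns its own pool of worker subprocesses when iterated.
    return DataLoader(dataset, batch_size=64, shuffle=True, num_workers=8)

loader = make_loader(train_dataset)

for epoch in range(20):
    for inputs, targets in loader:
        pass  # training step goes here

    # Drop the loader (and with it the worker processes) every few epochs
    # and rebuild it, so training continues with freshly started subprocesses.
    if (epoch + 1) % 5 == 0:
        del loader
        loader = make_loader(train_dataset)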

I face the same error… is there a way to find the variables which are getting cached on the GPU?
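One rough way to see what is still alive on the GPU (a sketch for recent PyTorch versions) is to walk the objects the garbage collector knows about and print the CUDA tensors, then compare the memory held by live tensors with what the caching allocator has reserved:

import gc
import torch

# Print every tensor the garbage collector can see that lives on the GPU.
for obj in gc.get_objects():
    try:
        if torch.is_tensor(obj) and obj.is_cuda:
            print(type(obj), tuple(obj.size()))
    except Exception:
        pass  # some tracked objects raise when inspected

# Memory occupied by live tensors vs. memory held by the caching allocator.
print("allocated:", torch.cuda.memory_allocated())
print("reserved: ", torch.cuda.memory_reserved())

# Returns cached, unused blocks to the driver; it does not free live tensors.
torch.cuda.empty_cache()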
