I ran my net with a large minibatch on the GPU without a problem, and then Ctrl-C'd out of it. However, when I try to re-run my script I get:
RuntimeError: cuda runtime error (2) : out of memory at /home/soumith/local/builder/wheel/pytorch-src/torch/lib/THC/generic/THCStorage.cu:66
Nothing changed, and it worked just before. Is the memory on the GPU not being released? Is there a way to force the GPU to release its memory before a run?
What puzzles me is that I literally just stopped a run with a large minibatch size that was training just fine, but when I restart the Python script it craps out.
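Regarding the "force the GPU to release its memory" question above, here is a hedged sketch, assuming a reasonably recent PyTorch: within a still-running process you can ask the caching allocator to return unused cached blocks to the driver. It will not help when the memory is still mapped by a stray or killed process, which is the case discussed in the replies below.

```python
import torch

# What this process itself currently holds on the GPU.
print("allocated:", torch.cuda.memory_allocated())
print("reserved :", torch.cuda.memory_reserved())

# Ask the caching allocator to hand unused cached blocks back to the driver.
# Note: this only affects the current process; memory still mapped by another
# (possibly orphaned) process has to be freed by killing that process.
torch.cuda.empty_cache()
print("reserved after empty_cache:", torch.cuda.memory_reserved())
```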
I am not using multiprocess data loading to my knowledge; I am basically using PyTorch's torch.utils.data.DataLoader, with the number of workers set to 2.
A DataLoader with 2 workers will spawn 2 subprocesses, so you are using it. There's a problem with Python's multiprocessing where it doesn't always clean up the child processes properly. If you don't have any other Python jobs running and it's your private computer, you might try killall python; if not, you have to look for the worker processes and kill them…
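A minimal sketch of the "look for the worker processes and kill them" step, assuming psutil is installed and that, as the reply says, no other Python jobs you care about are running (the filter is deliberately coarse and matches every Python process except the current one, much like killall python):

```python
import os
import psutil  # assumed available: pip install psutil

me = os.getpid()
for proc in psutil.process_iter(["pid", "name", "cmdline"]):
    try:
        # Skip ourselves; kill every other python process (coarse, like `killall python`).
        if proc.info["pid"] != me and "python" in (proc.info["name"] or "").lower():
            print("killing", proc.info["pid"], " ".join(proc.info["cmdline"] or []))
            proc.kill()
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        pass
```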
@smth @apaszke This is very strange: there are literally no other Python processes, nvidia-smi shows near-0% GPU-RAM usage, and yet I get the same error… It's also non-deterministic; as I keep retrying the same command, sometimes I luck out and sometimes I don't…
While training a CNN, I use multiple subprocesses to load data (num_workers=8), and as the epochs go on I notice that memory usage (RAM, not GPU memory) increases.
I thought maybe I could kill the subprocesses after a few epochs and then start new ones to continue training the network, but I don't know how to kill the subprocesses from the main process.
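Not an official recipe, just a hedged sketch of one way to get the effect asked about above: by default, DataLoader workers live only as long as the loader/iterator that spawned them, so rebuilding the loader every few epochs tears the old subprocesses down and spawns fresh ones. The dataset, batch size, and epoch counts below are hypothetical placeholders.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy data standing in for the real CNN training set (shapes are hypothetical).
train_dataset = TensorDataset(torch.randn(512, 3, 32, 32), torch.randint(0, 10, (512,)))

def make_loader(dataset):
    # A fresh DataLoader means a fresh pool of 8 worker subprocesses on the next epoch.
    return DataLoader(dataset, batch_size=64, shuffle=True, num_workers=8)

RESET_EVERY = 5  # rebuild the loader (and therefore its workers) every 5 epochs
loader = make_loader(train_dataset)

for epoch in range(20):
    if epoch > 0 and epoch % RESET_EVERY == 0:
        del loader                    # old workers exit once nothing references the loader
        loader = make_loader(train_dataset)
    for images, labels in loader:     # each epoch's iterator (re)spawns the workers
        pass                          # training step would go here
```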