I used the distributed launcher to train models on multiple GPUs. When training finishes, the program does not exit, and nvidia-smi shows that all GPUs are released except the first one, which still holds around 1 GB of memory. If I stop the program with Ctrl+C, the memory on the first GPU is released.
After I press Ctrl+C, the log shows the following:
File "/miniconda/envs/py36/lib/python3.6/site-packages/torch/distributed/launch.py", line 246, in main
process.wait()
File "/miniconda/envs/py36/lib/python3.6/subprocess.py", line 1477, in wait
(pid, sts) = self._try_wait(0)
File "/miniconda/envs/py36/lib/python3.6/subprocess.py", line 1424, in _try_wait
(pid, sts) = os.waitpid(self.pid, wait_flags)
KeyboardInterrupt
What causes this, and how can I make the program exit correctly?
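For reference, the traceback above comes from the torch.distributed.launch helper, so the job was presumably started with an invocation along these lines (the script name train.py and the GPU count are assumptions for illustration, not taken from the post):

```shell
# Launch one training process per GPU via the torch.distributed.launch helper
# (script name and --nproc_per_node value are illustrative).
python -m torch.distributed.launch --nproc_per_node=4 train.py
```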
Is this behavior reproducible?
Do you only encounter it in a distributed setup or also with a single GPU?
Could you post a code snippet that reproduces the issue, along with some information about your system and setup?
Yes, it may be a memory-leak problem related to the DataLoader. The problem occurs when I add the line torch.multiprocessing.set_sharing_strategy('file_system'). Without that line, my program exits correctly.
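To make the trigger concrete, here is a minimal sketch (not the author's code; the dataset and loader settings are illustrative) of the pattern described: switching the sharing strategy to 'file_system' before iterating a DataLoader with worker processes.

```python
import torch
import torch.multiprocessing
from torch.utils.data import DataLoader, TensorDataset

# The line in question; the default strategy on Linux is 'file_descriptor'.
torch.multiprocessing.set_sharing_strategy('file_system')

# A tiny illustrative dataset so the loop below actually runs.
dataset = TensorDataset(torch.arange(8).float().unsqueeze(1))
# num_workers > 0 spawns worker processes, which is where the hang
# described above reportedly appears in the distributed setup.
loader = DataLoader(dataset, batch_size=4, num_workers=2)

total = 0.0
for (batch,) in loader:
    total += batch.sum().item()
print(total)  # sums 0..7 -> 28.0
```

In the single-process case this runs and exits normally; per the report, the hang only shows up when the same pattern is combined with the distributed launcher.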
I only encounter it in a distributed setup; running the code on either two or four GPUs hits the same problem, while a single GPU works fine. A quick snippet to reproduce is below: