Multi-GPU training does not exit correctly

Hi,

I used the distributed method to train models on multiple GPUs. When training finishes, the program does not exit, and I can see from nvidia-smi that all the GPUs are released except the first one, which still occupies around 1 GB of memory. When I stop the program with Ctrl+C, the memory on the first GPU is released.

The traceback printed after pressing Ctrl+C looks like this:

File "/miniconda/envs/py36/lib/python3.6/site-packages/torch/distributed/launch.py", line 246, in main
process.wait()
File "/miniconda/envs/py36/lib/python3.6/subprocess.py", line 1477, in wait
(pid, sts) = self._try_wait(0)
File "/miniconda/envs/py36/lib/python3.6/subprocess.py", line 1424, in _try_wait
(pid, sts) = os.waitpid(self.pid, wait_flags)
KeyboardInterrupt

What is the cause of this, and how can I make the program exit correctly?


I have exactly the same issue. Have you found a solution so far?

Is this behavior reproducible?
Do you only encounter it in a distributed setup or also with a single GPU?
Could you post a code snippet to reproduce this issue and post some information about your current system and setup?

CC @yikaiw

Yes, it may be a memory-leak problem related to the DataLoader. I see this problem when I add the line torch.multiprocessing.set_sharing_strategy('file_system'). If I remove that line, my program exits correctly.
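For context, the sharing strategy mentioned above is switched with torch.multiprocessing.set_sharing_strategy, and the current value can be read back with get_sharing_strategy. A minimal sketch (assuming only that PyTorch is installed):

```python
import torch.multiprocessing as mp

# On Linux the default strategy is 'file_descriptor'. The report above
# is that switching to 'file_system' can leave the process hanging at
# exit, so it is worth checking whether your code (or a dependency)
# sets this anywhere.
mp.set_sharing_strategy('file_system')
print(mp.get_sharing_strategy())  # 'file_system'
```

If you do not need 'file_system' (it is mainly a workaround for running out of file descriptors), leaving the default in place avoids this class of shutdown problem.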


I only encounter it in a distributed setup. Running the code on either two or four GPUs hits the same problem; with a single GPU there is no problem. A quick snippet to reproduce is below:

import torch
torch.cuda.set_device(0)
torch.distributed.init_process_group(backend="nccl", init_method="env://")
torch.distributed.barrier()

And the running command is: CUDA_VISIBLE_DEVICES=3,4 python -m torch.distributed.launch --nproc_per_node=2 quick_code.py

where quick_code.py contains the four-line code.

The problem appears on V100 GPUs. However, the same code is fine when I run it on GeForce GTX-1080 or RTX-2080 GPUs.
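One workaround worth trying (an assumption on my part, not something confirmed in this thread) is to tear the process group down explicitly with torch.distributed.destroy_process_group() before the script returns, so the launcher's process.wait() can complete. A minimal self-contained sketch, using the gloo backend so it also runs on CPU (the repro above uses nccl):

```python
import os
import torch.distributed as dist

def main():
    # Single-process defaults so this sketch runs standalone; under
    # torch.distributed.launch these environment variables are set for you.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    os.environ.setdefault("RANK", "0")
    os.environ.setdefault("WORLD_SIZE", "1")

    # 'gloo' here so the sketch runs without GPUs; swap in 'nccl' for
    # the actual multi-GPU setup.
    dist.init_process_group(backend="gloo", init_method="env://")
    dist.barrier()

    # ... training loop would go here ...

    # Explicitly tear down the process group so all ranks release their
    # communicator resources and the launcher can exit cleanly.
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

If the hang persists even with an explicit destroy_process_group(), that would point more strongly at something backend- or driver-specific on the V100 machines.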

I don’t have this line in my code. :confused: