Multi-GPU training does not exit correctly


I used the distributed method to train models on multiple GPUs. When training finishes, the program does not exit, and nvidia-smi shows that all GPUs are released except the first one, which still holds around 1 GB of memory. When I stop the program with Ctrl+C, the memory on the first GPU is released.

After pressing Ctrl+C, I see the following in the log:

File "/miniconda/envs/py36/lib/python3.6/site-packages/torch/distributed/", line 246, in main
File "/miniconda/envs/py36/lib/python3.6/", line 1477, in wait
(pid, sts) = self._try_wait(0)
File "/miniconda/envs/py36/lib/python3.6/", line 1424, in _try_wait
(pid, sts) = os.waitpid(, wait_flags)

What is the cause of this, and how can I make the program exit correctly?
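The traceback above suggests the launcher is blocked waiting on a child process, which would mean at least one worker never exited. As a rough way to find out which worker is stuck, here is a minimal standard-library sketch. The helper names `wait_for_workers` and `demo`, the placeholder commands, and the 5-second timeout are all hypothetical and are not part of torch.distributed.launch.

```python
import subprocess
import sys


def wait_for_workers(procs, timeout_s):
    """Wait for each worker in turn; return the indices of workers that
    were still running after timeout_s seconds (they are terminated)."""
    stuck = []
    for idx, proc in enumerate(procs):
        try:
            proc.wait(timeout=timeout_s)
        except subprocess.TimeoutExpired:
            # This worker did not exit on its own: record it and stop it.
            stuck.append(idx)
            proc.terminate()
            proc.wait()
    return stuck


def demo():
    # Placeholder workers: one exits immediately, one keeps running.
    quick = subprocess.Popen([sys.executable, "-c", "pass"])
    slow = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    return wait_for_workers([quick, slow], timeout_s=5.0)
```

In a real run, the placeholder commands would be replaced by the per-rank training commands, and the returned indices would show which rank fails to exit.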


I have exactly the same issue. Have you found a solution yet?

Is this behavior reproducible?
Do you only encounter it in a distributed setup or also with a single GPU?
Could you post a code snippet to reproduce this issue and post some information about your current system and setup?

CC @yikaiw

Yes. It may be a memory leak related to the dataloader. I see this problem when I add the line torch.multiprocessing.set_sharing_strategy('file_system'). Without this line in my code, my program exits correctly.


I only encounter it in a distributed setup. Running the code on either two or four GPUs shows the same problem. On a single GPU there is no problem. A short snippet to reproduce it is shown below:

import torch
torch.distributed.init_process_group(backend="nccl", init_method="env://")

And the running command is: CUDA_VISIBLE_DEVICES=3,4 python -m torch.distributed.launch --nproc_per_node=2

where the launched script contains the four-line code.

The problem appears on V100 GPUs. However, the same code works fine when I run it on GeForce GTX 1080 or RTX 2080 GPUs.
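One thing that may be worth trying with the reproduction above is tearing the process group down explicitly before the script ends. The sketch below wraps the same init call in a hypothetical `run` function and calls `torch.distributed.destroy_process_group()` in a `finally` block. Whether this resolves the hang on V100 is an assumption and has not been confirmed in this thread.

```python
import os

import torch.distributed as dist


def run(backend="nccl"):
    # env:// reads MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE from the
    # environment; the launcher is expected to set them for every worker.
    dist.init_process_group(backend=backend, init_method="env://")
    try:
        # Training code would go here.
        return dist.get_rank()
    finally:
        # Explicit teardown before the interpreter exits (assumption: this
        # may help the worker exit cleanly; untested on the reported setup).
        dist.destroy_process_group()


# Only start when a launcher has provided a rank (assumes an env:// launch).
if __name__ == "__main__" and "RANK" in os.environ:
    run()
```

If a worker still hangs with the explicit teardown in place, that would at least point away from process-group cleanup as the cause.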

I don’t have this line in my code. :confused: