Process group doesn't terminate by itself?

Has anyone encountered a similar problem?

The Python script does not exit on its own; I have to press Ctrl+C. I'm trying to train a model on 2 GPUs on an Ubuntu server.

Training and evaluation run fine, but after main() finishes, the Python process won't terminate by itself. I checked GPU usage (nvidia-smi) and CPU usage (htop): the process still occupies those resources, which suggests the process group is never cleaned up. If I train with only one GPU, everything is fine.

Here are my thoughts:

  • The process group could not be terminated for some reason? Apart from initialization via torch.distributed.init_process_group(), I didn't find anything in the torch documentation for terminating the group.
  • Deleting variables such as the model doesn't help.
  • Is this a bug in torch or Python?
  • Did I make a mistake in how I use multiple GPUs?

Could you post some code snippets?

Also, how are you calling init_process_group()?

From what I remember, you have to destroy the distributed group, so at the end of your code:

import torch.distributed as dist
dist.destroy_process_group()
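
For example, a minimal sketch of that pattern (the init arguments here are placeholders, not your actual config), wrapping the work in try/finally so the group is destroyed even if training raises:

import torch.distributed as dist

dist.init_process_group(backend="nccl", init_method='tcp://localhost:23456',
                        rank=0, world_size=1)
try:
    pass  # train() / eval() ...
finally:
    # tear down the default process group so the process can exit cleanly
    dist.destroy_process_group()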

Thanks for your advice! :grimacing:
I tried destroy_process_group() but it didn’t work.

In the end, I fixed it by deleting the following potentially conflicting line:
torch.multiprocessing.set_sharing_strategy('file_system')

It seems that this line is compatible with

torch.distributed.init_process_group(backend="nccl", init_method='file:///home/xxx/sharedfile', rank=0, world_size=1)

but not with

torch.distributed.init_process_group(backend="nccl", init_method='tcp://localhost:32546', rank=0, world_size=1)

which is probably what caused the problem.
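
If it helps anyone debugging a similar hang: you can check which sharing strategy is in effect before initializing the group. A small sketch (nothing here is specific to my setup):

import torch.multiprocessing as mp

print(mp.get_all_sharing_strategies())  # e.g. {'file_descriptor', 'file_system'} on Linux
print(mp.get_sharing_strategy())        # strategy currently in use

# the combination that hung for me:
# mp.set_sharing_strategy('file_system') together with init_method='tcp://...'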

Sorry for the missing code snippets; this is my first time creating a topic in the community. Here is my code, and I have already marked the potentially conflicting line:

# import ...
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'


def main():
    train_loader = torch.utils.data.DataLoader(train_dir,
                                               batch_size=args.batch_size,
                                               shuffle=args.shuffle,
                                               **kwargs)

    # --------------------------- conflict line ---------------------------
    torch.distributed.init_process_group(backend="nccl",
                                         init_method='tcp://localhost:32546',
                                         rank=0,
                                         world_size=1)

    # ----------------------- line without the problem -----------------------
    torch.distributed.init_process_group(backend="nccl",
                                         init_method='file:///home/xxx/sharedfile',
                                         rank=0,
                                         world_size=1)

    model = DistributedDataParallel(model.cuda())

    # train()...
    # eval()...

if __name__ == '__main__':
    main()
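
For anyone who runs into the same thing, here is roughly what my working version looks like now (a sketch: the placeholder model and the port are not my real setup). The sharing-strategy line is gone, and I also kept the destroy_process_group() call suggested above, even though it was not enough on its own:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'


def main():
    # torch.multiprocessing.set_sharing_strategy('file_system') has been removed
    dist.init_process_group(backend="nccl",
                            init_method='tcp://localhost:32546',
                            rank=0,
                            world_size=1)

    model = torch.nn.Linear(10, 10)  # placeholder; build the real model here
    model = DistributedDataParallel(model.cuda())

    try:
        pass  # train() / eval() ...
    finally:
        # tear down the default process group so the process can exit
        dist.destroy_process_group()


if __name__ == '__main__':
    main()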

Thanks, anyway! :grimacing: