Has anyone encountered a similar problem?
The Python script could not exit on its own without Ctrl+C. I am trying to train a model on 2 GPUs on an Ubuntu server.
Training and evaluation work fine, but after main() finishes, the Python process does not terminate by itself. I checked GPU usage (nvidia-smi) and CPU usage (htop): the process still occupies those resources, which means the process group is not killed on exit. If I train with only one GPU, everything is fine.
Here are my thoughts:
- Could the process group fail to terminate for some reason? Apart from initialization via torch.distributed.init_process_group(), I could not find anything in the torch documentation about how to terminate the group.
- Deleting variables such as the model did not help.
- Is this a bug in torch or in Python?
- Did I make a mistake in using multiple GPUs?
Code snippets?
Also, why not use init_process_group()?
From what I remember, you have to destroy the distributed groups, so at the end of your code:
import torch.distributed as dist
dist.destroy_process_group()
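For example, one way to make sure the group always gets torn down is to wrap the call in a try/finally around your entry point. This is just a sketch; main() here stands for whatever function sets up the process group and runs training in your script:

import torch.distributed as dist

if __name__ == '__main__':
    try:
        main()  # your training / evaluation entry point
    finally:
        # tear the process group down even if training raises an exception
        if dist.is_initialized():
            dist.destroy_process_group()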
Thanks for your advice!
I tried destroy_process_group(), but it didn't work.
In the end, I fixed it by deleting the following potentially conflicting line:
torch.multiprocessing.set_sharing_strategy('file_system')
It seems that this line is compatible with
torch.distributed.init_process_group(backend="nccl", init_method='file:///home/xxx//sharedfile', rank=0, world_size=1)
but not with
torch.distributed.init_process_group(backend="nccl", init_method='tcp://localhost:32546', rank=0, world_size=1)
and that is probably what caused the problem.
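In case someone else runs into this, the sharing strategy can be inspected with the helpers in torch.multiprocessing. This is only a sketch of how I would check it, not part of my original script:

import torch.multiprocessing as mp

# list the strategies supported on this platform
# ('file_descriptor' and 'file_system' on Linux)
print(mp.get_all_sharing_strategies())

# show which strategy is currently active
# (the default on Linux is 'file_descriptor')
print(mp.get_sharing_strategy())

# only opt into 'file_system' if you actually need it; in my case the
# default worked fine together with the tcp:// init_method
# mp.set_sharing_strategy('file_system')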
Sorry for the missing code snippets. This is my first time creating a topic in the community. Here is my code; I have already marked the potentially conflicting line:
# import ...

os.environ['CUDA_VISIBLE_DEVICES'] = "0,1"

def main():
    train_loader = torch.utils.data.DataLoader(train_dir,
                                               batch_size=args.batch_size,
                                               shuffle=args.shuffle,
                                               **kwargs)

    # ----------------------------- conflicting line -----------------------------
    torch.distributed.init_process_group(backend="nccl",
                                         init_method='tcp://localhost:32546',
                                         rank=0,
                                         world_size=1)

    # ------------------------- line without the problem -------------------------
    torch.distributed.init_process_group(backend="nccl",
                                         init_method='file:///home/xxx/sharedfile',
                                         rank=0,
                                         world_size=1)

    model = DistributedDataParallel(model.cuda())

    # train()...
    # eval()...

if __name__ == '__main__':
    main()
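For anyone who finds this later, here is roughly the combination that ended up working for me, written as a self-contained sketch. The model is a placeholder and the port is arbitrary; my real data loading and training loops are omitted:

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

def main():
    # as in my script above: single process, tcp:// init_method,
    # and no torch.multiprocessing.set_sharing_strategy('file_system') call
    dist.init_process_group(backend="nccl",
                            init_method='tcp://localhost:32546',
                            rank=0,
                            world_size=1)

    model = torch.nn.Linear(128, 10)              # placeholder model
    model = DistributedDataParallel(model.cuda())

    # train() / eval() would go here

    # explicitly tear down the process group so the script can exit cleanly
    dist.destroy_process_group()

if __name__ == '__main__':
    main()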
Thanks anyway!