Has anyone encountered a similar problem?
The Python script could not exit on its own without Ctrl+C. I am trying to train a model on 2 GPUs on an Ubuntu server.
Training and evaluation work fine, but after main() finishes, the Python process does not terminate by itself. I checked GPU usage (nvidia-smi) and CPU usage (htop): the process still occupies those resources, which means the process group is not killed on exit. If I train with only one GPU, everything is fine.
Here are my thoughts:
- Could the process group fail to terminate for some reason? Apart from initialization via torch.distributed.init_process_group(), I could not find anything in the torch documentation about how to terminate the group.
- Deleting variables such as the model did not help.
- Is this a bug in torch or in Python?
- Did I make a mistake in using multiple GPUs?
Code snippets?
Also, why not use init_process_group()?
From what I remember, you have to destroy the distributed groups, so at the end of your code:
import torch.distributed as dist
dist.destroy_process_group()
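For example, one way to make sure the group always gets torn down is to wrap the call in a try/finally around your entry point. This is just a sketch; main() here stands for whatever function sets up the process group and runs training in your script:

import torch.distributed as dist

if __name__ == '__main__':
    try:
        main()  # your training / evaluation entry point
    finally:
        # tear the process group down even if training raises an exception
        if dist.is_initialized():
            dist.destroy_process_group()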
Thanks for your advice!
I tried destroy_process_group(), but it didn't work.
In the end, I fixed it by deleting the following potentially conflicting line:
torch.multiprocessing.set_sharing_strategy('file_system')
It seems that this line is compatible with
torch.distributed.init_process_group(backend="nccl", init_method='file:///home/xxx//sharedfile', rank=0, world_size=1)
but not with
torch.distributed.init_process_group(backend="nccl", init_method='tcp://localhost:32546', rank=0, world_size=1)
and that is probably what caused the problem.
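In case someone else runs into this, the sharing strategy can be inspected with the helpers in torch.multiprocessing. This is only a sketch of how I would check it, not part of my original script:

import torch.multiprocessing as mp

# list the strategies supported on this platform
# ('file_descriptor' and 'file_system' on Linux)
print(mp.get_all_sharing_strategies())

# show which strategy is currently active
# (the default on Linux is 'file_descriptor')
print(mp.get_sharing_strategy())

# only opt into 'file_system' if you actually need it; in my case the
# default worked fine together with the tcp:// init_method
# mp.set_sharing_strategy('file_system')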
Sorry for the missing code snippets. This is my first time creating a topic in the community. Here is my code; I have already marked the potentially conflicting line:
# import ...

os.environ['CUDA_VISIBLE_DEVICES'] = "0,1"

def main():
    train_loader = torch.utils.data.DataLoader(train_dir,
                                               batch_size=args.batch_size,
                                               shuffle=args.shuffle,
                                               **kwargs)

    # ----------------------------- conflicting line -----------------------------
    torch.distributed.init_process_group(backend="nccl",
                                         init_method='tcp://localhost:32546',
                                         rank=0,
                                         world_size=1)

    # ------------------------- line without the problem -------------------------
    torch.distributed.init_process_group(backend="nccl",
                                         init_method='file:///home/xxx/sharedfile',
                                         rank=0,
                                         world_size=1)

    model = DistributedDataParallel(model.cuda())

    # train()...
    # eval()...

if __name__ == '__main__':
    main()
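For anyone who finds this later, here is roughly the combination that ended up working for me, written as a self-contained sketch. The model is a placeholder and the port is arbitrary; my real data loading and training loops are omitted:

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

def main():
    # as in my script above: single process, tcp:// init_method,
    # and no torch.multiprocessing.set_sharing_strategy('file_system') call
    dist.init_process_group(backend="nccl",
                            init_method='tcp://localhost:32546',
                            rank=0,
                            world_size=1)

    model = torch.nn.Linear(128, 10)              # placeholder model
    model = DistributedDataParallel(model.cuda())

    # train() / eval() would go here

    # explicitly tear down the process group so the script can exit cleanly
    dist.destroy_process_group()

if __name__ == '__main__':
    main()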
Thanks anyway!