The Python script does not exit on its own; I have to press Ctrl+C. I am trying to train a model on 2 GPUs on an Ubuntu server.
Training and evaluation run fine, but after main() finishes, the Python process does not terminate by itself. I checked GPU usage (nvidia-smi) and CPU usage (htop): the process still holds those resources, which suggests the process group is not being shut down. If I train with only one GPU, the script exits normally.
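To narrow down what keeps the interpreter alive after main() returns, a small stdlib-only check like the one below could be called at the very end of the script. It is only a sketch: report_lingering is a hypothetical helper name, and it sees only Python-level threads and multiprocessing children. Native threads started inside a C++ extension will not show up here.

```python
import multiprocessing
import threading


def report_lingering():
    """Return (thread_names, child_names) for things that can block interpreter exit.

    thread_names: live non-daemon threads other than the main thread.
    child_names:  multiprocessing children that are still alive.
    """
    main_thread = threading.main_thread()
    thread_names = [
        t.name
        for t in threading.enumerate()
        if t is not main_thread and not t.daemon and t.is_alive()
    ]
    child_names = [p.name for p in multiprocessing.active_children()]
    return thread_names, child_names


# Intended use (at the end of the script, after main() has returned):
#     threads, children = report_lingering()
#     print("non-daemon threads:", threads)
#     print("child processes:", children)
```

If both lists come back empty while the process still hangs, the blocker is likely outside Python's own bookkeeping.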
Here are my thoughts:
The process group cannot be terminated for some reason? Apart from initialization via torch.distributed.init_process_group(), I didn't find anything in the torch documentation for terminating the group.
Deleting variables such as model does not help.
Sorry for not including code snippets earlier. This is my first time creating a topic in the community. Here is my code. I have already found the line that seems to cause the conflict:
# import ...
os.environ['CUDA_VISIBLE_DEVICES'] = "0,1"


def main():
    train_loader = torch.utils.data.DataLoader(train_dir,
                                               batch_size=args.batch_size,
                                               shuffle=args.shuffle,
                                               **kwargs)
    # ---------------- conflict line: with this call the process hangs on exit ----------------
    torch.distributed.init_process_group(backend="nccl",
                                         init_method='tcp://localhost:32546',
                                         rank=0,
                                         world_size=1)
    # ---------------- line without the problem: use it instead of the call above ----------------
    # torch.distributed.init_process_group(backend="nccl",
    #                                      init_method='file:///home/xxx/sharedfile',
    #                                      rank=0,
    #                                      world_size=1)
    model = DistributedDataParallel(model.cuda())
    # train()...
    # eval()...


if __name__ == '__main__':
    main()
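One thing that may be worth trying for the group-termination question above: torch.distributed exposes destroy_process_group() alongside init_process_group() and is_initialized(). Below is a minimal sketch of wrapping the run so the group is always torn down, even if training raises. The process_group helper is a hypothetical name, and passing the module in as dist is just a choice to keep the sketch independent of an installed torch. I have not verified that this resolves the hang with the tcp:// init method.

```python
import contextlib


@contextlib.contextmanager
def process_group(dist, **init_kwargs):
    """Initialize a process group and always destroy it on exit.

    dist is assumed to be the torch.distributed module (or anything with
    the same init_process_group / is_initialized / destroy_process_group
    functions). init_kwargs are forwarded unchanged to init_process_group.
    """
    dist.init_process_group(**init_kwargs)
    try:
        yield
    finally:
        # Only tear down if the group is actually up.
        if dist.is_initialized():
            dist.destroy_process_group()


# Intended use with the script above (assumes torch is installed):
#     import torch.distributed as dist
#     with process_group(dist, backend="nccl",
#                        init_method='tcp://localhost:32546',
#                        rank=0, world_size=1):
#         main()  # main() would then skip its own init_process_group call
```

The try/finally means the teardown also runs when training or evaluation raises, so a crash does not leave the group initialized.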