Thanks for developing PyTorch for us!!
I am training my model on two GPUs and use
torch.distributed to coordinate them. At the end of the program, after training, I call
dist.all_reduce(hist, dist.ReduceOp.SUM) to merge the tensors from the two GPUs.
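Roughly, the end of my script does the following (reduced here to a single-process gloo sketch so it runs standalone; in the real script the backend is nccl over two GPUs, and the tensor name/size are illustrative):

```python
import os
import torch
import torch.distributed as dist

# Single-process stand-in for the two-GPU setup, so the sketch is runnable.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

hist = torch.ones(4)                         # per-rank histogram (illustrative)
dist.all_reduce(hist, op=dist.ReduceOp.SUM)  # sum the tensor across all ranks

dist.barrier()                # ensure every rank reaches the end together
dist.destroy_process_group()  # tear down the process group before exit
print(hist.tolist())
```

(With world_size=1 the sum is a no-op, so the printed values are unchanged.)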
My problem is that the program continues executing past this line, and I can see the subsequent log messages from the
rank=0 process. But after the last log message is printed, the program never exits. According to
nvidia-smi, the memory on the second GPU is released and the process on that GPU is cleaned up, but the memory on the first GPU is not fully released: around 2 GB remains allocated (GPU utilization is 0%). The relevant lines in the output of
ps aux look like this:
yz 330577 0.0 0.0 2945348 51632 ? S May12 0:00 /mnt/ai-vision/home/yz/miniconda3/envs/py37/bin/python -m torch.distributed.launch --nproc_per_node=2 train.py
yz 330599 354 0.8 30957028 2308304 ? Sl May12 3176:02 /mnt/ai-vision/home//miniconda3/envs/py37/bin/python -u train.py --local_rank=0
What could be the cause of this?