Is dist.all_reduce blocking or not, or is there a memory leak or something else?

Hi,

Thanks for developing PyTorch for us!!

I am using two GPUs to train my model, and I use torch.distributed to manage them. At the very end of the program, after training, I call dist.all_reduce(hist, dist.ReduceOp.SUM) to merge the tensors from the two GPUs.
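For context, a simplified sketch of the relevant parts of my train.py is below (the tensor shape, backend, and init call are placeholders; the all_reduce line is the one in question):

import argparse
import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # set by torch.distributed.launch
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend="nccl", init_method="env://")

# ... training loop; each process accumulates its own histogram on its GPU ...
hist = torch.zeros(256, device="cuda")

# At the very end of the program: sum the per-GPU histograms across both processes.
dist.all_reduce(hist, dist.ReduceOp.SUM)

if dist.get_rank() == 0:
    # the later log messages I mention are printed from rank 0 after this point
    print("merged histogram ready")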

My problem is this: the program seems to continue executing after this line, and I can see the subsequent log messages from the rank=0 process. But after the last log message is printed to the screen, the program does not exit. From nvidia-smi I can see that the memory on the second GPU is released and the process on that GPU is cleaned up, but the memory on the first GPU is not fully released, with around 2 GB still occupied (GPU utilization is 0%). The associated lines in the output of ps aux look like this:

yz    330577  0.0  0.0 2945348 51632 ?       S    May12   0:00 /mnt/ai-vision/home/yz/miniconda3/envs/py37/bin/python -m torch.distributed.launch --nproc_per_node=2 train.py
yz    330599  354  0.8 30957028 2308304 ?    Sl   May12 3176:02 /mnt/ai-vision/home//miniconda3/envs/py37/bin/python -u train.py --local_rank=0

What is the cause of this, please?