Training hangs if any specific rank starts another process to do anything

Hello, a similar question has been asked here: https://discuss.pytorch.org/t/matplotlib-doesnt-work-in-distributed-training/65724, but it received no answer. The question can be summarized as below:

On a 4-GPU machine, all GPUs are used for training, and there is some code like this:

    if is_distributed() and distributed.get_rank() != 0:
        print('Only rank_0 will do plotting, this is rank_{}'.format(distributed.get_rank()))
        return  # in a parallel context, a single plot is enough
    print('this is rank_0 and it will do plotting')
    plotAccuracyAndLoss()

When execution reaches this code, three lines of

Only rank_0 will do plotting, this is rank_x

get printed out, but

print('this is rank_0 and it will do plotting')

never gets printed out; all 4 processes hang and NO exception is thrown.

watch -n0.1 nvidia-smi shows that:

before reaching this code, all GPUs have memory usage > 10341MB;
when hitting these lines, the first GPU's memory usage drops to 2387MB while the others remain unchanged.

Previously I thought it was matplotlib that caused the hang, but now I have found that any rank_0-only operation (plotting/checkpointing…) causes the hang; furthermore, any rank_x-only operation causes the hang. So, how can this problem be solved?

After adding:

distributed.barrier()

before any rank_x-specific operation, everything works fine.
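
For completeness, here is a minimal sketch of how the guarded code looks with the barrier added (assuming torch.distributed is imported as distributed; is_distributed() and plotAccuracyAndLoss() are the helpers from the snippet above, not library functions):

    import torch.distributed as distributed

    def is_distributed():
        # same helper as in the snippet above: true once the process group is initialized
        return distributed.is_available() and distributed.is_initialized()

    def maybe_plot():
        if is_distributed():
            # every rank enters this collective call before the rank-specific branch,
            # so no rank is left waiting while rank_0 is busy plotting
            distributed.barrier()
            if distributed.get_rank() != 0:
                print('Only rank_0 will do plotting, this is rank_{}'.format(distributed.get_rank()))
                return  # in a parallel context, a single plot is enough
        print('this is rank_0 and it will do plotting')
        plotAccuracyAndLoss()  # plotting helper from the training script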