Hello, a similar question was asked here: https://discuss.pytorch.org/t/matplotlib-doesnt-work-in-distributed-training/65724, but it got no answer. The question can be summarized as follows:

On a 4-GPU machine, all GPUs are used for training, and the code contains something like this:
```python
if is_distributed() and distributed.get_rank() != 0:
    print('Only rank_0 will do plotting, this is rank_{}'.format(distributed.get_rank()))
    return  # in parallel context, a single plot is enough
print('this is rank_0 and it will do plotting')
plotAccuracyAndLoss()
```
When execution reaches this code, three lines of

Only rank_0 will do plotting, this is rank_x

get printed out, but

this is rank_0 and it will do plotting

never does. All 4 processes hang, and NO exception is thrown.
watch -n0.1 nvidia-smi

shows that before this code runs, every GPU has memory usage > 10341MB; when these lines are hit, the first GPU's memory usage drops to 2387MB while the others stay unchanged.
Previously I thought it was matplotlib that caused the hang, but now I have found that any rank_0-only operation (plotting, checkpointing, ...) causes a hang; furthermore, any rank_x-only operation causes a hang. How can I solve this problem?
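My guess at the mechanism (an assumption, not confirmed): if the rank-restricted branch makes the processes execute different sequences of collective operations — e.g. the early-returning ranks skip a collective, or the rank_0-only work contains one — then whichever rank reaches the next collective waits for peers that never arrive, and with NCCL that wait never times out. A minimal stdlib-only analogue of that divergence (no torch; all names here are illustrative, and a Barrier with a timeout stands in for a collective so the script actually terminates):

```python
import multiprocessing as mp
from threading import BrokenBarrierError

def worker(rank, barrier, queue):
    if rank != 0:
        # mirrors the early `return` taken by every rank except rank_0
        queue.put('rank_{} skipped the sync point'.format(rank))
        return
    try:
        # rank_0 waits at the "collective" for peers that will never arrive;
        # a real NCCL collective would block here forever instead of timing out
        barrier.wait(timeout=1.0)
        queue.put('rank_0 passed the sync point')
    except BrokenBarrierError:
        queue.put('rank_0 blocked: peers never reached the sync point')

def run(world_size=4):
    barrier = mp.Barrier(world_size)  # stands in for a collective op
    queue = mp.Queue()
    procs = [mp.Process(target=worker, args=(r, barrier, queue))
             for r in range(world_size)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return sorted(queue.get() for _ in range(world_size))

if __name__ == '__main__':
    print('\n'.join(run()))
```

If this is indeed what happens in my training code, is the right fix to make every rank reach the same collectives (e.g. call dist.barrier() on all ranks around the rank_0-only work)?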