To get a precise result on the test dataset, I run evaluation only on the main process, using the following code:
val_stat = self.evaluate() if utils.is_main_process() else None
dist.barrier()
update(val_stat)
However, the program hangs.
Hey @ojipadeson, is there any collective communication in self.evaluate()? All processes must launch the same collective communications in the same order. If you are not sure whether self.evaluate() launches any collective, you can try removing the line val_stat = self.evaluate() if utils.is_main_process() else None and see whether running dist.barrier() alone passes.
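
To illustrate why a mismatch in collectives can hang, here is a toy simulation. It does not use torch.distributed: threads stand in for ranks, and threading.Barrier stands in for dist.barrier(). The names run, collective, and evaluate_has_collective are made up for this sketch, and whether your real self.evaluate() contains a collective is exactly what you would need to check.

```python
import threading


def run(world_size, evaluate_has_collective, timeout=0.5):
    """Simulate the snippet from the question; return 'ok' or 'stuck' per rank."""
    barrier = threading.Barrier(world_size)
    status = [None] * world_size

    def collective():
        # Stand-in for dist.barrier(): blocks until every rank has arrived.
        # The timeout only exists so the simulation ends instead of hanging.
        barrier.wait(timeout=timeout)

    def evaluate():
        # Hypothetical evaluate(): optionally hides a collective inside it.
        if evaluate_has_collective:
            collective()
        return {"acc": 0.9}  # placeholder metrics, not real results

    def worker(rank):
        try:
            # Same shape as the code in the question: only rank 0 evaluates.
            val_stat = evaluate() if rank == 0 else None
            collective()
            status[rank] = "ok"
        except threading.BrokenBarrierError:
            # A rank waited for peers that never issued a matching collective.
            status[rank] = "stuck"

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return status


if __name__ == "__main__":
    print(run(2, evaluate_has_collective=False))
    print(run(2, evaluate_has_collective=True))
```

In this simulation, when evaluate() has no collective, every rank issues exactly one barrier and all of them pass. When evaluate() hides a collective, rank 0 issues two while the other ranks issue one: the hidden one is matched against the other ranks' barrier, and rank 0 is then left waiting alone on its second one. If that turns out to be your case, the usual options are to call self.evaluate() on every rank, or to make sure the main-process-only path runs no collectives.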