All_reduce for TorchElastic

The TorchElastic code does NOT compute an all_reduce to aggregate performance metrics across all GPUs (please check the ImageNet example). How does TorchElastic gather values from all GPUs? Thanks in advance!!

TorchElastic treats the model mostly as a black box. There's no magic happening here: in that example there is no sync of the metrics between GPUs. If you did want that behavior, you should do something like the example in the core PyTorch repo.
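
For illustration, here is a minimal sketch of what that metric sync could look like, assuming the `torch.distributed` process group is already initialized (TorchElastic sets the necessary env vars for `init_process_group`). The `average_metric` helper name is hypothetical, not part of TorchElastic:

```python
import torch
import torch.distributed as dist

def average_metric(value: float, device: torch.device) -> float:
    """All-reduce a scalar metric across ranks and return the global mean.

    Assumes dist.init_process_group(...) has already been called,
    e.g. via the environment TorchElastic sets up for each worker.
    """
    tensor = torch.tensor([value], dtype=torch.float32, device=device)
    # Sum the metric across all ranks, then divide by world size.
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
    tensor /= dist.get_world_size()
    return tensor.item()

# Example usage at the end of a validation loop:
# val_loss = average_metric(local_val_loss, device)
```

Every rank must call `all_reduce` (it's a collective), and after the call each rank holds the same averaged value, which you'd typically log only on rank 0.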