I want to load a snapshot from a file on one of the machines running in a distributed setting. From what I can see, optimizers aren't broadcast among machines in that case. Is there an easy way to do it?
There is no common way to expose optimizer state, AFAIK. If you know how to access the state of your optimizer, you can synchronize it by using torch.distributed collectives directly, e.g.
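A minimal sketch of one way this could look, assuming the optimizer implements state_dict() and load_state_dict() (the built-in torch.optim optimizers do), that the process group is already initialized, and that every rank built its optimizer over the same parameters in the same order. The helper name broadcast_optimizer_state is made up for illustration; it is not part of torch.distributed.

```python
import torch
import torch.distributed as dist


def broadcast_optimizer_state(optimizer, src=0):
    """Copy the optimizer state held by rank `src` to every other rank.

    Hypothetical helper: assumes dist.init_process_group() has already been
    called and that the optimizer supports state_dict()/load_state_dict().
    """
    rank = dist.get_rank()
    # Only the source rank supplies a real payload; the others pass a placeholder.
    payload = [optimizer.state_dict() if rank == src else None]
    # broadcast_object_list pickles the object on `src` and fills in
    # payload[0] on every other rank.
    dist.broadcast_object_list(payload, src=src)
    if rank != src:
        optimizer.load_state_dict(payload[0])
```

You would call this on every rank right after rank 0 loads the snapshot, e.g. broadcast_optimizer_state(optimizer, src=0). Since the state is pickled, this is probably fine for a one-off restore but not something to run every step. With the NCCL backend, each process should have called torch.cuda.set_device() beforehand.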