I have tried this, yes! It gives a pretty good speedup for the test script (and a smaller one for my actual code), but not enough to offset the growth in communication cost. There may be some other comms operations I could make async, though; I’ll definitely look into it.
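For reference, the kind of change I mean looks roughly like this (just a sketch with placeholder tensors, assuming the process group is set up by torchrun or similar): issue the collective with `async_op=True` and only wait on the returned handle right before the result is needed, so other work can overlap with the communication.

```python
import torch
import torch.distributed as dist

# Assumes MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE are set, e.g. by torchrun.
dist.init_process_group(backend="gloo")

grad_buffer = torch.randn(1024)  # stand-in for a real gradient tensor

# Kick off the all-reduce without blocking.
work = dist.all_reduce(grad_buffer, op=dist.ReduceOp.SUM, async_op=True)

# ... other computation that doesn't depend on grad_buffer can run here ...

# Block only when the reduced values are actually needed.
work.wait()
grad_buffer /= dist.get_world_size()
```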
I’d definitely like to know if there are any environment variables I can tune for Gloo. I think I’ll need to dig in some more and see whether anything like that exists.
The library I’m using on top of PyTorch wraps a model and mostly tries to imitate PyTorch’s API for training. Unfortunately, the only API it currently exposes for updating parameters after backprop is a single function that loops over the model parameters and applies one step of SGD to each.
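To give a sense of the shape of it (these names are made up, not the library’s actual API), the update function is essentially:

```python
import torch

def apply_sgd_step(model: torch.nn.Module, lr: float) -> None:
    # Hypothetical sketch: one fixed loop over parameters, one SGD step each,
    # with no hook points in between for anything like DDP's gradient handling.
    with torch.no_grad():
        for param in model.parameters():
            if param.grad is not None:
                param -= lr * param.grad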
I decided after some experimentation that the amount of effort required to either (a) modify the library to be able to wrap DDP or (b) create a modified version of DDP that could wrap the library was nontrivial. I might end up having a flash of inspiration that’ll help me figure out how to do one or the other, but for now it’s not possible to use DDP for my purposes.
Thank you, that’s good to know. If I can’t figure out how to speed up Gloo, I’ll see whether I can modify my code to use NCCL instead.
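As I understand it, the switch mostly comes down to initializing the process group with the NCCL backend and keeping the communicated tensors on the GPU (sketch, assuming one GPU per process and the usual torchrun environment variables):

```python
import os
import torch
import torch.distributed as dist

# One GPU per process; LOCAL_RANK / RANK / WORLD_SIZE set by torchrun or similar.
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

dist.init_process_group(backend="nccl")  # was backend="gloo"

# NCCL only operates on CUDA tensors, so buffers must live on the GPU.
grad_buffer = torch.randn(1024, device=f"cuda:{local_rank}")
dist.all_reduce(grad_buffer, op=dist.ReduceOp.SUM)
```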