I have been working on implementing distributed training for NER. Along the way I implemented both a Horovod version and a DistributedDataParallel version, because I initially suspected my issues were implementation-specific. Both behave as expected on a public dataset: I can scale the learning rate by the number of processes (or the batch size) and get results very close to non-distributed training, just faster.

With my private dataset, which I used for testing along the way, the behavior is different: distributed training on e.g. 4 processes performs almost exactly like training a single process on 1/4 of the data with the scaled learning rate. Debugging showed that the different processes have different losses and that the gradients are correctly synchronized in the backward pass.
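For context on what "correctly synchronized" should mean here: with a mean-reduced loss and equal shard sizes, averaging the per-process gradients reproduces the full-batch gradient exactly, which is why distributed training with a scaled setup can match single-process results at all. A minimal sketch of that identity (a hand-derived gradient for a toy scalar linear model; all names are illustrative, nothing here is from my actual NER code):

```python
# Toy check: with mean-reduced loss and equal shard sizes, the average of
# per-shard gradients equals the full-batch gradient (as DDP/Horovod assume).

def grad_mse_linear(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to the scalar weight w."""
    n = len(xs)
    return sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / n

w = 0.3
xs = [0.5 * i for i in range(16)]
ys = [1.7 * x + 0.1 for x in xs]  # synthetic targets

full_grad = grad_mse_linear(w, xs, ys)

# Split the batch across 4 "processes", round-robin, as a distributed sampler would.
shards = [(xs[r::4], ys[r::4]) for r in range(4)]
shard_grads = [grad_mse_linear(w, sx, sy) for sx, sy in shards]
avg_grad = sum(shard_grads) / len(shard_grads)

print(abs(full_grad - avg_grad))  # should be ~0 up to float rounding
```

So if synchronization is really working, the averaged gradient step on 4 processes is mathematically the same as a full-batch step, not a 1/4-data step.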
The only two explanations I can come up with are: 1) there is still something wrong in my code; 2) the gradients computed by the individual processes are so similar to each other that averaging them adds little or nothing, so the result ends up resembling training on 1/4 of the data with a scaled learning rate.
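Explanation 2) is something one could probe directly by comparing per-process gradient directions before averaging, e.g. via cosine similarity. A toy sketch of the idea (synthetic, highly redundant data where every shard sees nearly the same distribution; all names and numbers are illustrative assumptions, not from my setup):

```python
import math

def grad_vec(w, xs, ys):
    # Gradient of mean((w . x - y)^2) for a 2-dim weight vector w.
    n = len(xs)
    g = [0.0, 0.0]
    for x, y in zip(xs, ys):
        err = w[0] * x[0] + w[1] * x[1] - y
        g[0] += 2 * err * x[0] / n
        g[1] += 2 * err * x[1] / n
    return g

def cosine(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    return dot / (math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b)))

# Redundant data: each round-robin shard sees almost identical samples
# (same base values plus a tiny per-shard shift).
xs = [(0.1 * (1 + i // 4) + 0.001 * (i % 4), 1.0) for i in range(32)]
ys = [2.0 * x[0] + 0.5 for x in xs]
w = [0.1, 0.0]

shards = [(xs[r::4], ys[r::4]) for r in range(4)]
grads = [grad_vec(w, sx, sy) for sx, sy in shards]
sims = [cosine(grads[0], grads[r]) for r in range(1, 4)]
print(sims)  # all very close to 1.0: averaging barely changes the direction
```

If per-process gradients on the real dataset showed similarities this high, averaging would mostly just reproduce each process's own gradient, which would be consistent with 2).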
This is my first experience with distributed training, so I can't tell whether 2) is plausible, and I'd be keen to hear about your experience with this.