Dear PyTorch Community,
When rewriting my library for training I stumbled upon the following problem: when I train with DistributedDataParallel (DDP) I get lower accuracy than without it. In particular, if I limit DDP to a single node and a single GPU, I would expect DDP and non-DDP training to give approximately the same result given the same code and hyperparameters. For me it does not.
I distilled the problem into two examples of the "Hello World" of deep learning: training a very simple feed-forward network on MNIST. Both scripts use exactly the same hyperparameters for distributed and non-distributed training, and essentially the same code apart from the adaptations needed for the DDP library. All seeds are fixed to make the runs reproducible. The most astonishing thing is: when I train the DDP model with only one node, it still gives about 10% lower accuracy. I tried this with different PyTorch versions (1.4, 1.8.1 and 1.9.0) on four different computers (a MacBook Pro on CPU and three Ubuntu machines: one with a 1080 Ti, one with a 2080 Ti, and a cluster with P100s). So regardless of OS, GPU or CPU, when I limit DDP to one node and compare it to plain training, I get drastically worse accuracy. The same happens if I, e.g., use two nodes and double the learning rate while keeping the batch size fixed.
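To make the expectation concrete, here is a minimal sketch (not the actual repo code; model, seed and hyperparameters are placeholders) of the single-process case: with world_size=1 on the gloo backend, one optimizer step on a DDP-wrapped copy of a model should produce exactly the same weights as the same step on the plain model, since the all-reduce over one rank is a no-op.

```python
import copy
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process "distributed" run: rank 0 of a world of size 1 (CPU/gloo).
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

torch.manual_seed(0)  # fix the seed so both models see identical data
plain = nn.Sequential(nn.Flatten(), nn.Linear(784, 10))
ddp = DDP(copy.deepcopy(plain))  # identical initial weights

opt_plain = torch.optim.SGD(plain.parameters(), lr=0.1)
opt_ddp = torch.optim.SGD(ddp.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(32, 1, 28, 28)  # stand-in for one MNIST batch
y = torch.randint(0, 10, (32,))

# One identical optimizer step on each model.
for model, opt in ((plain, opt_plain), (ddp, opt_ddp)):
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()

# With world_size=1 the gradient all-reduce changes nothing,
# so the two parameter sets should still agree.
max_diff = max((p - q).abs().max().item()
               for p, q in zip(plain.parameters(), ddp.module.parameters()))
print(max_diff)

dist.destroy_process_group()
```

If `max_diff` comes out near zero here but the full training runs still diverge, the discrepancy is likely coming from something outside the wrapped forward/backward, e.g. how the data is sampled or shuffled per rank.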
Now I wonder how that can be. I cannot imagine I am the only one to stumble upon this problem, but if it is a user error, it would be great to learn what it is. I have attached a link to a GitHub repo with both scripts.
Any suggestions would be awesome!