MNIST "Hello World" with a feed-forward network: lower accuracy with DistributedDataParallel (DDP) than with plain training, even on a single node

Dear PyTorch Community,

While rewriting my training library, I stumbled upon the following problem: when I train with DistributedDataParallel (DDP), I get lower accuracy than without it. In particular, if I limit DDP to a single node and a single GPU, I would expect DDP and non-DDP training to give approximately the same result, given the same code and hyperparameters. They do not for me.

I distilled the problem into two examples of the Hello World of deep learning: training a very simple feed-forward network on MNIST. The distributed and non-distributed versions use exactly the same hyperparameters and, apart from the adaptations needed to use DDP, the same code. All seeds are fixed to make the runs reproducible. The most astonishing thing is that when I train the DDP model with only one node, it still gives about 10% lower accuracy. I tried this with different PyTorch versions (1.4, 1.8.1 and 1.9.0) on four different computers (a MacBook Pro on CPU and three Ubuntu machines: one with a 1080 Ti, one with a 2080 Ti, and a cluster with P100s). So regardless of OS, GPU or CPU, when I limit DDP to one node and compare it to plain training, I get much worse accuracy. Also, if I use, for example, two nodes and double the learning rate with the same batch size, I get the same worse results.
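To make the expectation above concrete, here is a minimal sketch of the arithmetic behind it. It assumes that "node" means one DDP process, that the data is split with `DistributedSampler`, and that `drop_last` is left at its default of `False` everywhere. The helper name `ddp_epoch_stats` and the batch size of 128 are made up for illustration and are not taken from the repository.

```python
import math


def ddp_epoch_stats(num_samples, per_process_batch_size, world_size):
    """Return (samples per process, effective batch size, optimizer steps per epoch)."""
    # With drop_last=False, DistributedSampler gives every process
    # ceil(num_samples / world_size) samples (padding if necessary).
    samples_per_process = math.ceil(num_samples / world_size)
    # DDP averages gradients across processes, so one optimizer step
    # sees per_process_batch_size * world_size samples in total.
    effective_batch_size = per_process_batch_size * world_size
    # A DataLoader with drop_last=False yields a final, smaller batch.
    steps_per_epoch = math.ceil(samples_per_process / per_process_batch_size)
    return samples_per_process, effective_batch_size, steps_per_epoch


MNIST_TRAIN_SIZE = 60_000  # size of the MNIST training split
BATCH_SIZE = 128           # hypothetical value, not from the repository

one_process = ddp_epoch_stats(MNIST_TRAIN_SIZE, BATCH_SIZE, world_size=1)
two_processes = ddp_epoch_stats(MNIST_TRAIN_SIZE, BATCH_SIZE, world_size=2)

print(one_process)    # (60000, 128, 469)
print(two_processes)  # (30000, 256, 235)
```

With a single process the effective batch size and the number of steps are identical to plain training, so under these assumptions the two runs should behave very similarly. With two processes the effective batch size doubles and the number of steps roughly halves, which is the usual motivation for scaling the learning rate with the number of processes.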

Now I wonder how that can be. I cannot imagine I am the only one to stumble upon this problem, but if it is a user error, then it would be great to see what it is. I attached a link to a GitHub repo with both scripts.

This is the repository with the example: GitHub - joergsimon/mnist-distributed-problem. It is a very small repository demonstrating a problem with DistributedDataParallel. A three-layer feed-forward neural network is trained on MNIST with and without DDP using the same hyperparameters. If you configure DistributedDataParallel to use only one node, the model is considerably worse in accuracy. If you have any suggestions on how to make them equal besides tuning the learning rate, please comment or send a PR!

Any suggestions would be awesome!

Someone answered the question on Stack Overflow. Basically, while I was trying things out, a different batch_size apparently slipped into the two versions. Correcting that gives the same result… pytorch - Hello World aka. MNIST with feed forward gets less accuracy in comparison of plain with DistributedDataParallel (DDP) model with only one node - Stack Overflow
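A mismatch like this is easy to miss when two scripts are edited side by side. One way to guard against it, sketched here under assumptions, is to keep the hyperparameters of each script in a dictionary and compare the two before training. The function `diff_hyperparameters` and all values below are hypothetical; they are not the settings from the repository or from the Stack Overflow answer.

```python
def diff_hyperparameters(plain, distributed):
    """Return {name: (plain value, distributed value)} for every setting that differs."""
    names = sorted(set(plain) | set(distributed))
    return {
        name: (plain.get(name), distributed.get(name))
        for name in names
        # A setting missing from one side shows up as None and counts as a difference.
        if plain.get(name) != distributed.get(name)
    }


# Hypothetical settings for illustration only.
plain_config = {"batch_size": 64, "lr": 0.01, "epochs": 5, "seed": 0}
ddp_config = {"batch_size": 128, "lr": 0.01, "epochs": 5, "seed": 0}

mismatches = diff_hyperparameters(plain_config, ddp_config)
print(mismatches)  # {'batch_size': (64, 128)}
```

An empty result means both scripts are configured identically, so any remaining accuracy gap has to come from somewhere else.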