I have two nodes with 4 GPUs each: one with 4x NVIDIA A100, the other with 4x NVIDIA A40.
When running exactly the same neural network training on both nodes individually, the A100 run performs way better, despite all settings (batch size, learning rate, etc.) being identical (see the graphic below). I am using DataParallel since DistributedDataParallel is currently not an option. I verified this with multiple different initialization seeds, and the pattern is clear: runs on the A100 node consistently outperform runs on the A40 node. For some settings, the A40 runs were not even able to significantly change the loss.
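For reference, the relevant part of the training loop looks roughly like this. This is a simplified sketch, not my actual code: the model, batch size, and learning rate here are placeholders.

```python
import torch
import torch.nn as nn

# Placeholder model standing in for the actual network (assumption).
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))

# DataParallel replicates the model across all visible GPUs
# (4 per node in my case) and splits each batch along dim 0.
model = nn.DataParallel(model)
if torch.cuda.is_available():
    model = model.cuda()

# Identical hyperparameters on both nodes (values are placeholders).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One dummy batch; with 4 GPUs each replica sees 16 samples.
inputs = torch.randn(64, 128)
targets = torch.randint(0, 10, (64,))
if torch.cuda.is_available():
    inputs, targets = inputs.cuda(), targets.cuda()

optimizer.zero_grad()
loss = criterion(model(inputs), targets)
loss.backward()
optimizer.step()
print(loss.item())
```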
How is it possible that the compute infrastructure plays such a huge role? What kind of bug am I missing here?
Update: When restricting the number of GPUs per run to 2 instead of 4, all runs seem to work fine. What's the catch here? Why is PyTorch failing when using 4x A40 at the same time?
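For completeness, this is roughly how I restrict a run to 2 of the 4 GPUs (again a simplified sketch with a placeholder model): either pass an explicit `device_ids` list to DataParallel, or set `CUDA_VISIBLE_DEVICES=0,1` before launching the script.

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 10)  # placeholder model (assumption)

if torch.cuda.device_count() >= 2:
    # Replicate across only the first two GPUs instead of all four.
    # Equivalent alternative: launch with CUDA_VISIBLE_DEVICES=0,1
    model = nn.DataParallel(model, device_ids=[0, 1]).cuda()
else:
    # CPU / single-GPU fallback so the snippet runs anywhere.
    model = nn.DataParallel(model)

out = model(torch.randn(32, 128))
print(out.shape)  # torch.Size([32, 10])
```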