DataParallel: Different performance depending on the GPU Type

I have two nodes with 4 GPUs each: one with 4x Nvidia A100, the other with 4x Nvidia A40.

When running exactly the same neural network training on both nodes individually, the A100 node performs far better, despite all settings (batch size, learning rate, etc.) being identical (see the graphic below). I am using DataParallel, since DistributedDataParallel is currently not an option for me. I verified this with multiple different initialization seeds, and the pattern is clear: runs on the A100 node consistently outperform runs on the A40 node. For some settings, the A40 runs were not even able to change the loss significantly.
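For reference, here is a minimal sketch of the kind of setup I mean. The model and the data are just placeholders, not my actual network; only the DataParallel wrapping mirrors what I do:

```python
import torch
import torch.nn as nn

# Placeholder model standing in for the actual network (assumption).
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))

# DataParallel replicates the model on every visible GPU and splits each
# batch across them; on a CPU-only machine it simply runs the wrapped module.
model = nn.DataParallel(model)
if torch.cuda.is_available():
    model = model.cuda()

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One dummy training step with random data.
inputs = torch.randn(64, 128)
targets = torch.randint(0, 10, (64,))
if torch.cuda.is_available():
    inputs, targets = inputs.cuda(), targets.cuda()

optimizer.zero_grad()
loss = criterion(model(inputs), targets)
loss.backward()
optimizer.step()
print(f"loss: {loss.item():.4f}")
```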

How is it possible that the compute infrastructure plays such a huge role? What kind of bug am I missing here?

Update: When restricting the number of GPUs per run to 2 instead of 4, all runs work fine. What's the catch here? Why does PyTorch fail when using 4x A40 at the same time?
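For completeness, this is roughly how I restrict a run to 2 GPUs (the device indices are just examples; both options should be equivalent ways of limiting the run):

```python
import os
import torch
import torch.nn as nn

# Option 1: hide two of the four GPUs from the process entirely.
# This must be set before the first CUDA call in the process.
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0,1")

model = nn.Linear(128, 10)  # placeholder for the real network (assumption)

# Option 2: keep all GPUs visible but tell DataParallel which ones to use.
# On a machine without GPUs, DataParallel ignores device_ids and just
# runs the wrapped module, so this sketch also works on CPU.
device_ids = list(range(min(2, torch.cuda.device_count()))) or None
model = nn.DataParallel(model, device_ids=device_ids)
if torch.cuda.is_available():
    model = model.cuda()

out = model(torch.randn(32, 128).to(next(model.parameters()).device))
print(out.shape)
```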

Are you seeing any other issues in your current A40 setup when all GPUs are used, e.g. by running nccl-tests? I'm wondering if the communication itself fails between all 4 devices, since 2 devices seem to work fine.
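As a quick Python-side complement to nccl-tests, you could also check pairwise peer-to-peer (P2P) access between the devices; a pair that cannot reach each other directly is routed through host memory, which can surface exactly when all 4 GPUs start communicating. A minimal sketch:

```python
import torch

# Check every ordered GPU pair for direct peer-to-peer (P2P) access.
n = torch.cuda.device_count()
if n < 2:
    print("fewer than 2 GPUs visible, nothing to check")
else:
    for i in range(n):
        for j in range(n):
            if i != j:
                ok = torch.cuda.can_device_access_peer(i, j)
                print(f"GPU {i} -> GPU {j}: {'P2P ok' if ok else 'NO direct P2P'}")
```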

Thanks for the answer. I haven't tried nccl-tests yet and will see if I can get them to work. Are there any particular test configurations that would be interesting in this case?

Apart from that, I have been using these A40 nodes without DataParallel without problems, mostly using all 4 GPUs at once with one or two independent runs per GPU (i.e. no communication between the GPUs was necessary).

No, I would just run them all and check (a) whether any mismatches are detected and (b) whether the performance is as expected.