For curiosity’s sake, I ran a quick test on a machine that I recently bumped up to 3 Pascal GPUs. The previous comparison was made with 2 x RTX cards.
For ImageNet-style training @ 224x224, with a smaller model (something like an MnasNet/MobileNetV2) and an 8-physical-core CPU:
830 img/sec avg - single training process, 3 GPUs, torch.nn.DataParallel, 8 (or 9 for fairness) worker processes
1015 img/sec avg - 3 training processes, 1 GPU per process, apex.DistributedDataParallel, 3 workers per training process
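For reference, the two configurations look roughly like this. This is a minimal sketch, not the actual benchmark script: torchvision’s mobilenet_v2 stands in for the real model, the training loop is omitted, and the second setup assumes a launch via `python -m torch.distributed.launch --nproc_per_node=3 train.py`:

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from apex.parallel import DistributedDataParallel as ApexDDP
from torchvision.models import mobilenet_v2


def setup_dataparallel():
    # Run 1: one process drives all 3 GPUs. A single DataLoader with
    # num_workers=8 feeds GPU 0, which scatters inputs to the other
    # GPUs each step and gathers outputs back.
    return nn.DataParallel(mobilenet_v2().cuda(), device_ids=[0, 1, 2])


def setup_distributed(local_rank):
    # Run 2: 3 processes, each pinned to one GPU with its own
    # 3-worker DataLoader; gradients are all-reduced across processes
    # by apex's DistributedDataParallel during backward().
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend='nccl', init_method='env://')
    return ApexDDP(mobilenet_v2().cuda())
```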
Everything else in those two runs is the same: same preprocessing, same ‘fast’ preload + collation routines from NVIDIA’s examples (sketched below). So it looks like throwing in another GPU widens the DistributedDataParallel advantage to over 20% (1015 / 830 ≈ 1.22).
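The ‘fast’ routines are along the lines of the fast_collate + GPU-side prefetch pattern in NVIDIA’s apex ImageNet example. A rough sketch of the prefetcher half, assuming the DataLoader yields uint8 image tensors (as that example’s collate does) and was built with pin_memory=True:

```python
import torch


class Prefetcher:
    """Overlaps host-to-device copies with compute on a side CUDA stream,
    doing float conversion + normalization on the GPU instead of the CPU."""

    # ImageNet mean/std, scaled to the 0-255 uint8 range.
    MEAN = [0.485 * 255, 0.456 * 255, 0.406 * 255]
    STD = [0.229 * 255, 0.224 * 255, 0.225 * 255]

    def __init__(self, loader):
        self.loader = iter(loader)
        self.stream = torch.cuda.Stream()
        self.mean = torch.tensor(self.MEAN).cuda().view(1, 3, 1, 1)
        self.std = torch.tensor(self.STD).cuda().view(1, 3, 1, 1)
        self._preload()

    def _preload(self):
        try:
            self.next_input, self.next_target = next(self.loader)
        except StopIteration:
            self.next_input = self.next_target = None
            return
        with torch.cuda.stream(self.stream):
            # non_blocking copies can overlap with compute because the
            # loader's batches live in pinned memory.
            self.next_input = self.next_input.cuda(non_blocking=True).float()
            self.next_input = self.next_input.sub_(self.mean).div_(self.std)
            self.next_target = self.next_target.cuda(non_blocking=True)

    def __iter__(self):
        return self

    def __next__(self):
        if self.next_input is None:
            raise StopIteration
        # Make sure the copy/normalize for this batch has finished.
        torch.cuda.current_stream().wait_stream(self.stream)
        batch, target = self.next_input, self.next_target
        self._preload()
        return batch, target
```

In the training loop this replaces iterating the DataLoader directly, i.e. `for input, target in Prefetcher(train_loader): ...`.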