Slow distributed training

My network is 1 Gbit Ethernet and I am trying to use PyTorch distributed training on two 8-GPU servers. The training procedure is a simple classification objective with a feed-forward network. I see a significant slowdown compared with training on a single 8-GPU server. The `nload` tool also shows full bandwidth usage, even for a small model (ResNet-18).

Is my network too slow for distributed training? If so, what bandwidth (in Gbit/s) do I need to train heavier models like ResNet-101?

You can estimate it from your gradient size (roughly the size of your model checkpoint) and your step time. For instance, ResNet-50 takes about 160 ms per batch and has a ~50 MB checkpoint, so each worker needs to send and receive 50 / 0.16 ≈ 312 MB per second, which means you need ≥2.5 Gbit/s.
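A quick sketch of that back-of-the-envelope calculation (the helper name is mine; the 50 MB / 160 ms figures are the ResNet-50 numbers from the post):

```python
def required_gbps(checkpoint_mb: float, step_time_s: float) -> float:
    """Rough bandwidth needed for data-parallel training: each worker
    must send and receive its full gradient (~checkpoint size) every step."""
    mb_per_s = checkpoint_mb / step_time_s
    return mb_per_s * 8 / 1000  # MB/s -> Gbit/s

# ResNet-50 example: 50 MB checkpoint, 160 ms per batch
print(required_gbps(50, 0.16))  # ~2.5 Gbit/s
```

Compare the result against your actual link speed; a 1 Gbit line is well below this, which matches the slowdown you are seeing.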

What matters here is the ratio of compute time to parameter size. If you double the computation and double the parameter size, the network requirement is unchanged. Conv nets have a good compute-to-bandwidth ratio; transformers need more bandwidth because of the matmuls.
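To see why the ratio is what matters, here is a tiny sketch (hypothetical numbers, same estimation formula as above) showing that doubling both step time and model size leaves the bandwidth requirement unchanged:

```python
def required_gbps(checkpoint_mb: float, step_time_s: float) -> float:
    """Bandwidth requirement scales with checkpoint size / step time."""
    return checkpoint_mb / step_time_s * 8 / 1000  # MB/s -> Gbit/s

base = required_gbps(50, 0.16)      # 50 MB model, 160 ms/step
doubled = required_gbps(100, 0.32)  # twice the params AND twice the compute
print(base, doubled)  # the two requirements are equal
```

A model with twice the parameters but the same step time, by contrast, would double the required bandwidth.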
