How to use a different number of GPUs on 2 nodes?

How do I train on two machines, one with 4 GPUs and one with 8 GPUs? I see that the documentation for torch elastic run (torchrun (Elastic Launch) — PyTorch 1.10.0 documentation) states:

" 4. This module only supports homogeneous LOCAL_WORLD_SIZE. That is, it is assumed that all nodes run the same number of local workers (per role)."

Thanks!

cc @Kiuk_Chung

As of today, DDP does not officially support running jobs with a different number of GPUs on different machines. One workaround might be to use the CUDA_VISIBLE_DEVICES environment variable to split the 8 GPUs into two sets of 4 and then launch two DDP launch commands on that machine, so that every (logical) node runs the same number of local workers.
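For reference, a rough sketch of how that workaround could look with torchrun; this is just one way to wire it up, and `<machine-A-ip>`, port `29500`, and `train.py` are placeholders:

```bash
# Assumed layout: machine A has 4 GPUs, machine B has 8 GPUs.
# Machine B is treated as two "logical nodes" of 4 GPUs each, so every
# node runs the same number of local workers (LOCAL_WORLD_SIZE=4).

# Machine A (all 4 GPUs), node 0, also hosting the rendezvous:
torchrun --nnodes=3 --nproc_per_node=4 --node_rank=0 \
    --master_addr=<machine-A-ip> --master_port=29500 train.py

# Machine B, first half of its GPUs (node 1):
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nnodes=3 --nproc_per_node=4 --node_rank=1 \
    --master_addr=<machine-A-ip> --master_port=29500 train.py

# Machine B, second half of its GPUs (node 2):
CUDA_VISIBLE_DEVICES=4,5,6,7 torchrun --nnodes=3 --nproc_per_node=4 --node_rank=2 \
    --master_addr=<machine-A-ip> --master_port=29500 train.py
```

With this setup the job is homogeneous from torchrun's point of view (3 nodes × 4 workers, world size 12), even though physically there are only two machines.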
