Unequal number of GPUs per node for distributed training

Hypothetically, if I have 2 GPUs in node 0 and 3 GPUs in node 1, how would I configure it to support that? All the examples in the documentation, as well as the example code, compute world_size = gpus_per_node * num_nodes, which assumes from gpus_per_node that every node contributes the same number of GPUs.

That expectation is built into the torch.distributed.launch utility but not elsewhere. You can start 5 processes (1 per GPU) and use world_size=5, with 2 processes on one machine and 3 processes on the other. This situation isn't very common, so I'm not surprised that most of the examples you see assume a symmetric configuration across machines. That said, you can still make it work, but you will have to adapt those examples or start from scratch with torch.nn.parallel.DistributedDataParallel.
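Something like the following might illustrate the idea (a minimal sketch, not a complete training script): each node spawns one process per local GPU and passes an explicit global rank, so node 0 owns ranks 0-1 and node 1 owns ranks 2-4. The `--rank_offset` flag, the hostname, and the worker function are names made up for this example, not part of any PyTorch API.

```python
# Hypothetical sketch: launch one process per GPU yourself and compute the
# global rank explicitly instead of relying on torch.distributed.launch.
import argparse
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP


def worker(local_rank, rank_offset, world_size):
    # Global rank = this node's first rank + the local GPU index.
    rank = rank_offset + local_rank
    dist.init_process_group(
        backend="nccl",
        init_method="env://",
        world_size=world_size,
        rank=rank,
    )
    torch.cuda.set_device(local_rank)

    # Toy model; replace with your real model.
    model = torch.nn.Linear(10, 10).cuda(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])

    # ... training loop using ddp_model ...

    dist.destroy_process_group()


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--rank_offset", type=int, required=True,
                        help="global rank of this node's first process "
                             "(0 on node 0, 2 on node 1)")
    parser.add_argument("--world_size", type=int, default=5)
    args = parser.parse_args()

    # Rendezvous info for init_method="env://"; hostname is illustrative.
    os.environ.setdefault("MASTER_ADDR", "node0.example.com")
    os.environ.setdefault("MASTER_PORT", "29500")

    # Spawn one process per *local* GPU: 2 on node 0, 3 on node 1.
    mp.spawn(
        worker,
        args=(args.rank_offset, args.world_size),
        nprocs=torch.cuda.device_count(),
    )
```

With this layout you would run `python train.py --rank_offset 0` on node 0 and `python train.py --rank_offset 2` on node 1; the only asymmetry is the rank offset and the number of local GPUs each node happens to have.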
