Multi-GPU training to multi-node training

Rakshith_V · May 24, 2023, 6:09am

I have a training script that works well for multi-GPU training on a single machine. What changes should I make to the script to convert it to multi-node training? (I am aware of the changes needed in the launch command.)

I have used the local rank in my training script. Should this be changed for multi-node training?
model = DDP(model, device_ids=[local_rank], output_device=local_rank)
Is this correct for multi-GPU training, and what should be changed for multi-node training?

Manuel_Alejandro_Dia · May 24, 2023, 2:57pm

Please post some code to give context to your question.

In brief, for any DDP application you need to specify the world_size. On a single machine, this is usually the number of GPUs.

If you are using multiple machines, the world size is num_machines * gpus_per_machine, assuming that every machine has the same number of GPUs.
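To make the arithmetic concrete, here is a minimal pure-Python sketch of how ranks could be laid out across machines. The names `RankInfo` and `rank_layout` are hypothetical helpers made up for illustration, not part of PyTorch, and the contiguous numbering (global rank = node_rank * gpus_per_machine + local_rank) is an assumption about how the launcher assigns ranks when every machine has the same number of GPUs.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RankInfo:
    node_rank: int    # index of the machine (0 .. num_machines - 1)
    local_rank: int   # index of the GPU/process on that machine
    global_rank: int  # unique process index across all machines
    world_size: int   # total number of processes in the job


def rank_layout(num_machines: int, gpus_per_machine: int) -> List[RankInfo]:
    """Enumerate every process in the job (hypothetical helper).

    Assumes the same number of GPUs on each machine and contiguous
    rank numbering per machine.
    """
    world_size = num_machines * gpus_per_machine
    layout = []
    for node_rank in range(num_machines):
        for local_rank in range(gpus_per_machine):
            global_rank = node_rank * gpus_per_machine + local_rank
            layout.append(RankInfo(node_rank, local_rank, global_rank, world_size))
    return layout


if __name__ == "__main__":
    # Example: 2 machines with 4 GPUs each.
    for info in rank_layout(num_machines=2, gpus_per_machine=4):
        print(info)
```

Under these assumptions, the local rank still only identifies the GPU on the current machine, so it remains the right value for `device_ids` in the DDP call. The global rank and world size are what identify a process within the whole job.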

Have a look at this Medium post where they explain what to do when using multiple nodes.