If I have a training script which works well for multi-GPU training, what changes should I make in the training script to convert it to multi-node training? (I am aware of the changes needed in the launch command.)
I have used local_rank in my training script. Should this be changed for multi-node training?
model = DDP(model, device_ids=[local_rank], output_device=local_rank)
Is this correct for multi-GPU training, and what should be changed for multi-node training?
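For context, here is a minimal sketch of the kind of setup I mean. This is only an illustrative example, not my full script: it assumes a launcher such as torchrun exports `LOCAL_RANK`, `RANK`, `WORLD_SIZE`, `MASTER_ADDR`, and `MASTER_PORT` on every process (the defaults below are just so it can run as a single-process smoke test, with a CPU/gloo fallback when no GPU is present):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Defaults only so this snippet runs standalone as a 1-process smoke test;
# a real launcher (e.g. torchrun) sets these on every node.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")
os.environ.setdefault("LOCAL_RANK", "0")

local_rank = int(os.environ["LOCAL_RANK"])  # GPU index within this node only
use_cuda = torch.cuda.is_available()

# init_process_group reads RANK / WORLD_SIZE / MASTER_* from the environment
dist.init_process_group(backend="nccl" if use_cuda else "gloo")

model = torch.nn.Linear(4, 2)
if use_cuda:
    torch.cuda.set_device(local_rank)
    model = model.to(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank], output_device=local_rank)
else:
    ddp_model = DDP(model)  # CPU fallback just for this demo

out = ddp_model(torch.randn(3, 4))
print(out.shape)

dist.destroy_process_group()
```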
Please post some code to give context to your question.
In brief, for any DDP application you will need to specify the
world_size. On a single machine this is usually your number of GPUs.
If you are using multiple machines, the world size is determined by:
num_machines * gpus_per_machine, assuming that you have the same number of GPUs in each machine.
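To make that arithmetic concrete, here is a small sketch (the names num_machines, gpus_per_machine, node_rank are illustrative, not DDP API parameters). The key point for the original question is that local_rank stays a per-node GPU index even in a multi-node job; only the global rank and world size span all machines:

```python
def world_size(num_machines, gpus_per_machine):
    # total number of DDP processes across the whole job
    return num_machines * gpus_per_machine

def global_rank(node_rank, gpus_per_machine, local_rank):
    # unique process id across all nodes; local_rank remains
    # just the GPU index within its own node
    return node_rank * gpus_per_machine + local_rank

print(world_size(2, 4))      # 2 machines with 4 GPUs each -> 8
print(global_rank(1, 4, 2))  # GPU 2 on the second machine -> 6
```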
Have a look at this Medium post where they explain what to do when using multiple nodes.