Dynamically removing slow nodes from a DDP job

I have a few servers with different hardware configurations that I use to run a DDP training job. I also want to take networking into account: if one of the servers is slowing the job down, so that including it makes training no faster or even slower, I want to detect that. My idea is to evaluate each node's impact on the overall training time at runtime and, if a node has a negative impact, remove it from the job dynamically. Is there any way to do this automatically?

  • I use torchrun to run my code.
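
To make the idea concrete, here is a rough sketch of the kind of check I have in mind. It is plain Python and not part of any PyTorch or torchrun API: the function name `find_stragglers`, the `slowdown_threshold` value, and the timing numbers are all hypothetical. It assumes each rank's step times have already been collected somehow and only shows the decision step, not how a node would actually be removed from the job.

```python
from statistics import median


def find_stragglers(step_times, slowdown_threshold=1.5):
    """Return the ranks whose mean step time is far above the group median.

    step_times: dict mapping rank -> list of measured step durations (seconds).
    slowdown_threshold: hypothetical cutoff; a rank is flagged when its mean
        step time exceeds slowdown_threshold * median of all per-rank means.
    """
    # Mean step time per rank, skipping ranks with no measurements yet.
    means = {rank: sum(t) / len(t) for rank, t in step_times.items() if t}
    if not means:
        return []

    # The median is used as the baseline so one slow node does not skew it.
    baseline = median(means.values())
    return sorted(
        rank for rank, mean in means.items()
        if mean > slowdown_threshold * baseline
    )


# Made-up timings: rank 2 is much slower than ranks 0 and 1.
example = {0: [1.0, 1.1], 1: [1.0, 0.9], 2: [2.5, 2.7]}
print(find_stragglers(example))  # [2]
```

What I am missing is the second half: once a rank is flagged like this, how to drop its node from the running job without restarting everything by hand.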