SGE and distributed data parallel training

Hi guys,

Is there any tutorial that shows how we can use distributed model training with SGE (Sun Grid Engine)? In general, I'm wondering how multiple nodes can communicate with each other in a multi-node setup.

Cheers,

I’m not familiar with Sun Grid Engine, but if the nodes in the system can talk to each other over TCP, you can follow this tutorial: https://pytorch.org/tutorials/intermediate/dist_tuto.html. You probably want to use the TCP initialization method, as described here: https://pytorch.org/tutorials/intermediate/dist_tuto.html#initialization-methods.
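As a minimal sketch of TCP initialization: each process calls `init_process_group` with the address of rank 0 and its own rank. The `RANK` and `WORLD_SIZE` environment variables and port `29500` below are my own choices (not anything SGE-specific); how you export the rank and world size from your job script depends on your cluster setup. Run standalone, it defaults to a single-process group for testing.

```python
# Sketch: TCP initialization for torch.distributed.
# Assumptions: RANK/WORLD_SIZE are exported by the launcher (names are my
# own choice), and 127.0.0.1:29500 is only suitable for a one-machine test --
# in a real multi-node job, use an address of the rank-0 node that all
# other nodes can reach.
import os

import torch
import torch.distributed as dist

rank = int(os.environ.get("RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))

dist.init_process_group(
    backend="gloo",  # CPU-friendly backend; "nccl" is typical for multi-GPU
    init_method="tcp://127.0.0.1:29500",
    rank=rank,
    world_size=world_size,
)

# Sanity check that the group works: sum a tensor across all ranks.
t = torch.ones(1)
dist.all_reduce(t, op=dist.ReduceOp.SUM)
print(f"rank {rank}: all_reduce result = {t.item()}")

dist.destroy_process_group()
```

With `world_size` processes participating, each rank should see the all-reduce result equal to `world_size`, which is a quick way to confirm the nodes can actually reach each other before wrapping your model in `DistributedDataParallel`.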