I train my model across several machines. I have two machines, each with GPUs and InfiniBand cards. The Ethernet network is 1 Gbit; the InfiniBand is 2x40 Gbit. When I remove the InfiniBand cards and start training, everything works, though slower than on one machine. When I run with the InfiniBand setup, the system just hangs: GPU utilisation sits at 100%, power draw is about half of maximum, and there is very little network activity.
Do you have any hints on how to proceed with finding out what's wrong with the training?
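For reference, here is roughly what I'm planning to try next to isolate the problem. This assumes the NCCL backend (an assumption on my part, since I haven't said which framework/collectives library I use); the environment variables below are NCCL's documented debugging knobs, and the device/interface names are just examples from my setup:

```shell
# Assumption: training uses NCCL for collectives (framework not stated above).

# 1. Turn on NCCL logging to see which transport (IB vs. sockets) gets
#    picked and where initialisation or the first collective stalls:
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET

# 2. Force NCCL to fall back to TCP sockets. If training then runs over
#    the 1 Gbit link (just slowly), the hang is isolated to the IB path:
export NCCL_IB_DISABLE=1

# 3. If IB itself works but the wrong device/interface is chosen, pin
#    them explicitly (names below are examples, not my actual devices):
export NCCL_IB_HCA=mlx5_0
export NCCL_SOCKET_IFNAME=eth0
```

I'd also like to verify the IB fabric independently of training (e.g. with `ibstat` and a point-to-point bandwidth test between the two machines) before blaming the training code.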