So I don't have access to a GPU, but I do have access to a cluster of 25 Xeon CPU machines. Here are my questions:
If I set the number of nodes in qpub to 24, will PyTorch automatically use all the nodes? Someone earlier said that it relies on Intel MKL, and MKL should be able to detect all the available CPUs.
Also, does setting num_workers take care of automatically distributing the workload?
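For context on that last question: `num_workers` only controls how many subprocesses the `DataLoader` uses to load batches in parallel on a single machine; it does not distribute training across nodes. A minimal sketch (the dataset here is made-up random data just for illustration):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# A toy dataset of 100 samples with 4 features each.
dataset = TensorDataset(torch.randn(100, 4), torch.randint(0, 2, (100,)))

# num_workers=2 spawns two local subprocesses that prepare batches
# concurrently with the training loop -- still on ONE machine.
loader = DataLoader(dataset, batch_size=10, num_workers=2)

for x, y in loader:
    pass  # training step would go here

print(len(loader))  # 10 batches of 10 samples each
```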
PyTorch will use all the cores on a single machine.
If you want to use all 25 Xeon machines, then you will have to write some special logic using our torch.distributed functions: http://pytorch.org/docs/master/distributed.html
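A minimal sketch of what that torch.distributed logic looks like, assuming the CPU-friendly `gloo` backend. The two local processes here stand in for separate machines, and the `MASTER_ADDR`/`MASTER_PORT` values are placeholders; on a real cluster each machine would run this script with its own rank and point those variables at one reachable node:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def run(rank, world_size):
    # Every worker joins the same process group...
    dist.init_process_group(backend="gloo", init_method="env://",
                            rank=rank, world_size=world_size)
    # ...and participates in a collective op: sum a tensor across all ranks.
    t = torch.ones(1) * rank
    dist.all_reduce(t, op=dist.ReduceOp.SUM)  # after this, t = 0 + 1 on both ranks
    dist.destroy_process_group()

# Placeholder rendezvous address: in a real cluster, set these to one
# node that all the others can reach over the network.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

world_size = 2  # stand-in for the 25 machines
mp.start_processes(run, args=(world_size,), nprocs=world_size,
                   start_method="fork")
```

In real multi-machine use you would launch one copy of the script per node rather than forking locally, with `world_size=25` and ranks 0 through 24.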
Given that the torch.distributed module is still in beta, does torch.multiprocessing have any performance disadvantage, other than having to run each script separately?
Hi, thank you for asking this question, because I have a similar issue. Were you able to distribute your training over many machines? Could you tell me how you did it, please? I don't have any background in parallel or distributed computing.