Help running PyTorch on HPC clusters

So I don't have access to a GPU, but I do have access to a cluster of 25 Xeon machines. Here are my questions:

  1. If I set the number of nodes in qsub to 24, will PyTorch automatically use all the nodes? Someone said earlier that it relies on Intel MKL, and that MKL should be able to detect all the available CPUs.

  2. Also, does setting num_workers take care of automatically distributing the workload? (See the sketch just below.)
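For context, here is a minimal sketch of the kind of setup I mean; the toy dataset and num_workers=4 are just placeholders:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# A toy dataset standing in for the real one.
dataset = TensorDataset(torch.randn(1000, 10), torch.randint(0, 2, (1000,)))

# num_workers spawns that many subprocesses to load batches in parallel.
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=4)

for inputs, labels in loader:
    pass  # forward/backward pass would go here
```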

PyTorch will use all the cores on a single machine.
If you want to use all 25 Xeon machines, then you will have to write some special logic using our torch.distributed functions: http://pytorch.org/docs/master/distributed.html
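A minimal sketch of what that logic might look like, assuming one process per machine, the gloo backend, and RANK/WORLD_SIZE environment variables plus a reachable master hostname supplied by your job script (these names are conventions of the sketch, not requirements of the API):

```python
import os
import torch
import torch.distributed as dist

# Assumed: the job script exports RANK and WORLD_SIZE for each process,
# one process per machine.
rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])

# "master-node" is a hypothetical hostname that every node can reach.
dist.init_process_group(
    backend="gloo",  # CPU-friendly backend
    init_method="tcp://master-node:23456",
    rank=rank,
    world_size=world_size,
)

# Example collective: average a gradient tensor across all machines.
grads = torch.randn(10)
dist.all_reduce(grads, op=dist.ReduceOp.SUM)
grads /= world_size
```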

Given that the torch.distributed module is still in beta, does torch.multiprocessing have any performance disadvantages, other than having to run each script separately?
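For reference, the torch.multiprocessing pattern I have in mind is a single-machine, Hogwild-style setup along these lines (the model and worker count are just placeholders):

```python
import torch
import torch.multiprocessing as mp
import torch.nn as nn

def train(model):
    # Each worker updates the same shared-memory parameters (Hogwild-style).
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    for _ in range(100):
        optimizer.zero_grad()
        loss = model(torch.randn(32, 10)).pow(2).mean()  # dummy objective
        loss.backward()
        optimizer.step()

if __name__ == "__main__":
    model = nn.Linear(10, 1)
    model.share_memory()  # put the parameters in shared memory

    workers = [mp.Process(target=train, args=(model,)) for _ in range(4)]
    for p in workers:
        p.start()
    for p in workers:
        p.join()
```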

Hi, thank you for asking this question, because I have a similar issue. Were you able to distribute your training over many machines? Could you tell me how you did it, please? I don't have any background in parallel or distributed computing.
