I am using a cluster to train a recurrent neural network. Since PyTorch seems to thread automatically, it uses all the cores of one machine in parallel without my having to program for it explicitly. This is great!
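For reference, this is how I checked the threading behaviour — just the built-in `torch` calls, nothing custom (the cap of 4 is only an example value):

```python
import torch

# PyTorch's intra-op parallelism uses the machine's cores by default,
# so ordinary tensor ops already run multithreaded with no extra code.
print(torch.get_num_threads())  # defaults to the number of cores detected

# The thread pool can optionally be capped, e.g. to share a node:
torch.set_num_threads(4)
print(torch.get_num_threads())
```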
Now I try to use several nodes at the same time with a submission script like this one:
#$ -S /bin/bash
#$ -N comparison_of_architecture
#$ -pe mvapich2-rostam 32
#$ -tc 4
#$ -o /scratch04.local/cnelias/Deep-Jazz/logs/out_comparison_training.txt
#$ -e /scratch04.local/cnelias/Deep-Jazz/logs/err_comparison_training.txt
#$ -t 1
#$ -cwd
I see that 4 nodes are allocated, but only one of them is actually doing any work, so "only" 32 cores are in use.
I have no knowledge of parallel programming, and I don't understand a thing in the tutorial provided on PyTorch's website; I am afraid this is completely out of my depth.
Are you aware of a simple way to run a PyTorch program on several machines without having to explicitly program the message passing and the division of computation between them?
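From skimming the tutorial, `torch.distributed` with the `"gloo"` backend (which works on CPU) and `DistributedDataParallel` look like the relevant pieces, but I may well be misreading it. Here is roughly what I pieced together, reduced to a single process so it runs on its own — the `MASTER_ADDR`/`MASTER_PORT` values and the toy RNN sizes are just placeholders, not my real setup:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# In a real multi-node run each process would get its own RANK and a
# WORLD_SIZE > 1, and MASTER_ADDR would point at one of the nodes.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# "gloo" is the CPU backend, so no GPU is needed.
dist.init_process_group("gloo", rank=0, world_size=1)

# A toy recurrent model; DDP averages gradients across processes
# automatically during backward().
model = torch.nn.RNN(input_size=8, hidden_size=16, batch_first=True)
ddp_model = DDP(model)

x = torch.randn(4, 10, 8)        # (batch, seq_len, features)
out, h = ddp_model(x)
print(out.shape)                 # torch.Size([4, 10, 16])

dist.destroy_process_group()
```

If I understand correctly, each node would run one copy of this script, but I don't see how that interacts with the grid-engine script above.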
PS: I unfortunately don't have a GPU, and neither does the cluster I am using; otherwise I would have tried that.