I am trying to train a network in a distributed fashion across 2 compute nodes. Our lab uses Slurm for job scheduling, and I am not sure whether my Slurm script launches the training code correctly.
I tried
srun python train.py
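For reference, here is roughly what my submission script looks like (a sketch; the job name, node/task counts, and port are placeholders for my actual setup, and `train.py` reads `MASTER_ADDR`/`MASTER_PORT` plus the `SLURM_PROCID`/`SLURM_NTASKS` variables to call `init_process_group`):

```shell
#!/bin/bash
#SBATCH --job-name=ddp-train     # placeholder job name
#SBATCH --nodes=2                # one task per node, two nodes total
#SBATCH --ntasks-per-node=1

# Use the first allocated node as the rendezvous host for init_method="env://"
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500         # any free port

# srun starts one copy of train.py per task; each copy derives its rank
# from SLURM_PROCID and the world size from SLURM_NTASKS
srun python train.py
```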
Is this the right way to launch the job? I am using the gloo backend. How can I verify that DistributedDataParallel actually exchanges and averages gradients between the two nodes? I just want to make sure they are communicating correctly. Thanks!
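One check I was thinking of trying (a minimal self-contained sketch, run locally on CPU with two gloo processes rather than on the cluster; the model, inputs, and port are made up for the test): feed each rank a different input so the local gradients differ, then confirm every rank ends up with the same averaged gradient after `backward()`.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size):
    # Hypothetical local rendezvous; on the cluster these would come from Slurm
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29501"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Tiny linear model; DDP broadcasts rank 0's weights at construction
    model = DDP(torch.nn.Linear(4, 1, bias=False))

    # Deliberately different input per rank, so local gradients differ:
    # rank 0 sees all 1s, rank 1 sees all 2s
    x = torch.full((1, 4), float(rank + 1))
    model(x).sum().backward()

    # For loss = sum(x @ W^T), the local gradient w.r.t. W equals x,
    # so the all-reduce average across ranks 0 and 1 should be 1.5 everywhere
    grad = model.module.weight.grad.clone()
    gathered = [torch.zeros_like(grad) for _ in range(world_size)]
    dist.all_gather(gathered, grad)
    assert all(torch.allclose(g, gathered[0]) for g in gathered), \
        "ranks disagree on the gradient -> DDP is not averaging"
    assert torch.allclose(grad, torch.full_like(grad, 1.5)), \
        "gradient is not the cross-rank average of the local gradients"
    if rank == 0:
        print("gradients averaged across ranks:", grad)
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```

If the asserts pass, the ranks are communicating and DDP is averaging gradients; on the real cluster the same idea should work with the input keyed on the global rank. Does this look like a reasonable way to verify it?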