Distributed Training with Slurm

I am trying to train a network across 2 compute nodes using distributed training. Our lab uses Slurm for job scheduling, and I am not sure whether my Slurm script launches the training code correctly.

I tried

srun python train.py
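`srun` alone launches the tasks, but the two processes still need a rendezvous point before `init_process_group` can connect them. A minimal sbatch sketch under common assumptions (one task per node; `train.py` is assumed to read `SLURM_PROCID`/`SLURM_NTASKS` and pass them as `rank`/`world_size`; the port and node-name derivation may need adjusting for your cluster):

```shell
#!/bin/bash
#SBATCH --job-name=ddp-train
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1

# Use the first allocated node as the rendezvous host
# (assumption: the nodes can reach each other on this port).
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500

# srun starts one copy of train.py per node; inside train.py you would
# call init_process_group("gloo", rank=..., world_size=...) using the
# SLURM_* environment variables.
srun python train.py
```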

Is this the right way to launch the job? I am using the gloo backend. How can I verify that DistributedDataParallel actually exchanges and averages gradients between the two nodes? I just want to make sure they are communicating correctly. Thanks!
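One sanity check: after `loss.backward()`, DDP's all-reduce should leave every rank holding the elementwise mean of the per-rank gradients. A minimal pure-Python sketch of that invariant (no torch; `allreduce_mean` is a hypothetical stand-in for what DDP does during the backward pass):

```python
def allreduce_mean(per_rank_grads):
    """Stand-in for DDP's gradient all-reduce: every rank ends up
    holding the elementwise mean of all ranks' local gradients."""
    world_size = len(per_rank_grads)
    n = len(per_rank_grads[0])
    mean = [sum(g[i] for g in per_rank_grads) / world_size for i in range(n)]
    # After the all-reduce, every rank holds the same averaged gradient.
    return [list(mean) for _ in per_rank_grads]

# Two ranks computing different local gradients for the same parameter.
rank0_grad = [1.0, 2.0, 3.0]
rank1_grad = [3.0, 4.0, 5.0]
synced = allreduce_mean([rank0_grad, rank1_grad])
print(synced[0])  # [2.0, 3.0, 4.0] on every rank
```

In a real run, you can print something like `param.grad.sum()` for one parameter on each rank right after `loss.backward()`: with DDP the values should match across both nodes, whereas without synchronization each node would show its own local value.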


@heilaw did you have any success with this?

Here is a link from someone who claims to have gotten this working on Slurm: https://www.glue.umd.edu/hpcc/help/software/pytorch.html#distrib
