Distributed Training with Slurm

I am trying to train a network across 2 compute nodes using distributed training. Our lab uses Slurm for job scheduling, and I am not sure whether my Slurm script launches the training code correctly.

I tried

srun python train.py
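`srun` alone launches the tasks, but the two processes still need a rendezvous point before `init_process_group` can connect them. A minimal sbatch sketch under common assumptions (one task per node; `train.py` is assumed to read `SLURM_PROCID`/`SLURM_NTASKS` and pass them as `rank`/`world_size`; the port and node-name derivation may need adjusting for your cluster):

```shell
#!/bin/bash
#SBATCH --job-name=ddp-train
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1

# Use the first allocated node as the rendezvous host
# (assumption: the nodes can reach each other on this port).
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500

# srun starts one copy of train.py per node; inside train.py you would
# call init_process_group("gloo", rank=..., world_size=...) using the
# SLURM_* environment variables.
srun python train.py
```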

Is this the right way to launch the job? I am using the gloo backend. How can I verify that DistributedDataParallel actually exchanges and averages gradients between the two nodes? I just want to make sure they are communicating correctly. Thanks!
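One sanity check: after `loss.backward()`, DDP's all-reduce should leave every rank holding the elementwise mean of the per-rank gradients. A minimal pure-Python sketch of that invariant (no torch; `allreduce_mean` is a hypothetical stand-in for what DDP does during the backward pass):

```python
def allreduce_mean(per_rank_grads):
    """Stand-in for DDP's gradient all-reduce: every rank ends up
    holding the elementwise mean of all ranks' local gradients."""
    world_size = len(per_rank_grads)
    n = len(per_rank_grads[0])
    mean = [sum(g[i] for g in per_rank_grads) / world_size for i in range(n)]
    # After the all-reduce, every rank holds the same averaged gradient.
    return [list(mean) for _ in per_rank_grads]

# Two ranks computing different local gradients for the same parameter.
rank0_grad = [1.0, 2.0, 3.0]
rank1_grad = [3.0, 4.0, 5.0]
synced = allreduce_mean([rank0_grad, rank1_grad])
print(synced[0])  # [2.0, 3.0, 4.0] on every rank
```

In a real run, you can print something like `param.grad.sum()` for one parameter on each rank right after `loss.backward()`: with DDP the values should match across both nodes, whereas without synchronization each node would show its own local value.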


@heilaw did you have any success with this?

Here is a link from someone who claims to have gotten this working on Slurm: https://www.glue.umd.edu/hpcc/help/software/pytorch.html#distrib
