Distributed training on a Slurm cluster

The script looks fine, but you might want to replace the launch command with torchrun, since the former is (or will be) deprecated. Could you also check the GPU utilization on this node and confirm that all devices are actually being used?
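In case it helps, here is a rough sketch of one way to check utilization per device from Python while the job is running. It assumes `nvidia-smi` is on the `PATH` of the compute node and that the queried fields come back as plain integers (on some setups they are reported as `[N/A]`, which this sketch does not handle). The helper names and the 5% idle threshold are made up for illustration and are not part of any library.

```python
import shutil
import subprocess

# nvidia-smi query returning one CSV line per GPU: "index, utilization, memory used (MiB)"
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text):
    """Parse 'index, util, mem' CSV lines into a list of dicts (hypothetical helper)."""
    stats = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        index, util, mem = [field.strip() for field in line.split(",")]
        stats.append({"index": int(index), "util_pct": int(util), "mem_mib": int(mem)})
    return stats


def idle_gpus(stats, util_threshold=5):
    """Return indices of GPUs whose utilization is below the (arbitrary) threshold."""
    return [s["index"] for s in stats if s["util_pct"] < util_threshold]


def query_gpu_stats():
    """Run nvidia-smi once and parse its output; returns [] if nvidia-smi is not found."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_stats(result.stdout)


if __name__ == "__main__":
    stats = query_gpu_stats()
    for s in stats:
        print(f"GPU {s['index']}: {s['util_pct']}% util, {s['mem_mib']} MiB used")
    print("Possibly idle GPUs:", idle_gpus(stats))
```

Running this on the node a few times during training (for example from a second shell attached to the same Slurm allocation) should show whether any device stays near 0% utilization. A single sample can be misleading, since utilization drops briefly during data loading or checkpointing.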
