To launch your script on multiple devices you would use torchrun --nproc_per_node=8 ..., which starts one process per device; each process's index then corresponds to the --local-rank argument inside your script, as described here.
In your approach you are launching your script with torchrun only and are not using --local-rank at all, so again I'm unsure how this could ever have worked.
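As a rough sketch of what picking up the local rank inside the script could look like: torchrun exports a LOCAL_RANK environment variable to every process it starts, while the older torch.distributed.launch launcher passes a --local-rank argument (spelled --local_rank in older releases). The helper name get_local_rank is made up for illustration, and falling back to 0 when neither is present is my assumption, not something your script has to do.

```python
import argparse
import os


def get_local_rank(argv=None):
    """Return this process's local rank (hypothetical helper).

    Prefers the LOCAL_RANK environment variable set by torchrun and falls
    back to a --local-rank / --local_rank command-line argument.
    Assumption: default to 0 when neither is given (single-process run).
    """
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--local-rank", "--local_rank", dest="local_rank", type=int, default=0
    )
    # parse_known_args ignores any other arguments your script may define.
    args, _ = parser.parse_known_args(argv)
    return int(os.environ.get("LOCAL_RANK", args.local_rank))


if __name__ == "__main__":
    local_rank = get_local_rank()
    print(f"local rank: {local_rank}")
    # A training script would typically continue with something like
    # torch.cuda.set_device(local_rank) and
    # torch.distributed.init_process_group(backend="nccl").
```

You would then start it with something like torchrun --nproc_per_node=8 train.py, where train.py is just a placeholder for your script name.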
Alternatively, you can also use a multiprocessing approach inside your script which will spawn the processes there as described in this tutorial.
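A minimal sketch of that multiprocessing approach, assuming a CPU-only run with the gloo backend and two processes so it works without GPUs. The function name worker, the address 127.0.0.1, the port 29500, and the all_reduce sanity check are illustrative choices of mine, not taken from your code or the tutorial.

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank, world_size):
    """Entry point for each spawned process; mp.spawn passes rank first."""
    # Assumed rendezvous settings for a single-machine run.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Sanity check: sum each process's rank across all processes.
    t = torch.tensor([float(rank)])
    dist.all_reduce(t)
    print(f"rank {rank}/{world_size}: sum of ranks = {t.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2  # assumption: two processes; use your device count instead
    # Spawns world_size processes and calls worker(i, world_size) in each.
    mp.spawn(worker, args=(world_size,), nprocs=world_size, join=True)
```

With this variant you start the script with plain python, since the script itself spawns the processes and no launcher is involved.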