Hi,
I’m running distributed training with torchrun.
I have 8 V100 GPUs on a server and I'm trying to run two separate jobs (the same training with different hyperparameters) so that one job uses GPUs 0-3 and the other GPUs 4-7.
What is the right way to do this?
Thanks
You can launch the two jobs via torchrun by masking the corresponding devices with CUDA_VISIBLE_DEVICES:
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun ...
CUDA_VISIBLE_DEVICES=4,5,6,7 torchrun ...
and set --nproc_per_node 4 in both runs.
Thanks for the prompt response.
When following this, I'm getting the following error on the second run:
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:29500 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:29500 (errno: 98 - Address already in use).
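The "Address already in use" error happens because both launches try to bind the same default rendezvous port, 29500, on the same machine. One way to avoid the collision is to give each job its own port via torchrun's --master_port flag. A sketch under that assumption (train.py stands in for your actual training script):

```shell
# Job 1: GPUs 0-3, keeps the default rendezvous port
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node 4 --master_port 29500 train.py

# Job 2: GPUs 4-7, a different port so the server sockets don't collide
CUDA_VISIBLE_DEVICES=4,5,6,7 torchrun --nproc_per_node 4 --master_port 29501 train.py
```

Any free port works for the second job; the two values just need to differ on the same host.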