Hi,
I’m running distributed training with torchrun.
I have 8 V100 GPUs on a server and I'm trying to run two separate jobs (the same training with different hyperparameters) so that one job uses GPUs 0-3 and the other GPUs 4-7.
What is the right way to do this?
Thanks
You can launch the two jobs via torchrun by masking the corresponding devices with CUDA_VISIBLE_DEVICES:
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun ...
CUDA_VISIBLE_DEVICES=4,5,6,7 torchrun ...
and set --nproc_per_node 4 in both runs.
Thanks for the prompt response.
When following this, I'm getting the following error on the second run:
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:29500 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:29500 (errno: 98 - Address already in use).
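The "Address already in use" error happens because both launches try to bind the same default rendezvous port, 29500, on the same machine. One way to avoid the collision is to give each job its own port via torchrun's --master_port flag. A sketch under that assumption (train.py stands in for your actual training script):

```shell
# Job 1: GPUs 0-3, keeps the default rendezvous port
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node 4 --master_port 29500 train.py

# Job 2: GPUs 4-7, a different port so the server sockets don't collide
CUDA_VISIBLE_DEVICES=4,5,6,7 torchrun --nproc_per_node 4 --master_port 29501 train.py
```

Any free port works for the second job; the two values just need to differ on the same host.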