Run multiple distributed training jobs on one machine

I have a machine with 8 V100s, and a small job that only needs 4 of them, so I am trying to run two 4-V100 distributed training runs on the same machine at the same time. However, the second run I launch always fails with RuntimeError: Address already in use. It appears that one distributed training job running on a machine blocks any other distributed training runs. Is it possible to work around this so that I can launch two distributed jobs on one machine?

I’ve never tried this setup before so apologies if you have already considered it, but what happens when you change the port used for the 2nd distributed job?

That sounds like a reasonable solution! Dumb question here though - how does one change the port used by a distributed job?

It should be set close to where you specify the address (e.g., MASTER_PORT in the setup code from Getting Started with Distributed Data Parallel — PyTorch Tutorials 1.8.1+cu102 documentation).
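A minimal sketch of what that looks like, following the tutorial's setup function (the port number, backend, and the extra `port` parameter are just examples; the key point is that each job on the machine gets its own MASTER_PORT):

```python
import os
import torch.distributed as dist

def setup(rank, world_size, port="12355"):
    # Each distributed job running on the same machine must use a
    # distinct port, e.g. "12355" for the first job and "12356" for
    # the second, otherwise the rendezvous address collides.
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = port
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

def cleanup():
    dist.destroy_process_group()
```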

Thanks to @eqy for the suggestion! I found that when using the PyTorch distributed launch utility script, there is a --master_port argument (see here) one can use to set the port. Once different distributed jobs are configured to use different ports, the "Address already in use" problem goes away.
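For anyone else who hits this, a sketch of the two launch commands, assuming a hypothetical train.py training script and 4 GPUs per job (the port numbers and GPU assignments are just examples):

```
# Job 1 on GPUs 0-3, default-style port
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch \
    --nproc_per_node=4 --master_port=29500 train.py

# Job 2 on GPUs 4-7, launched with a different port
CUDA_VISIBLE_DEVICES=4,5,6,7 python -m torch.distributed.launch \
    --nproc_per_node=4 --master_port=29501 train.py
```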
