Run multiple distributed training jobs on one machine

I have a machine with 8 V100s, and a small job that only needs 4 of them, so I am trying to run two 4-V100 distributed training runs on the same machine at the same time. However, the second run I launch always fails with RuntimeError: Address already in use. It appears that one distributed training job running on a machine blocks any other distributed training runs. Is it possible to work around this so that I can launch two distributed jobs on one machine?

I’ve never tried this setup before so apologies if you have already considered it, but what happens when you change the port used for the 2nd distributed job?

That sounds like a reasonable solution! Dumb question here though - how does one change the port used by a distributed job?

It should be set close to where you specify the address (e.g., MASTER_PORT in the setup code from Getting Started with Distributed Data Parallel — PyTorch Tutorials 1.8.1+cu102 documentation).
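A minimal sketch of what that looks like, following the tutorial's setup function (the port number, backend, and the extra `port` parameter are just examples; the key point is that each job on the machine gets its own MASTER_PORT):

```python
import os
import torch.distributed as dist

def setup(rank, world_size, port="12355"):
    # Each distributed job running on the same machine must use a
    # distinct port, e.g. "12355" for the first job and "12356" for
    # the second, otherwise the rendezvous address collides.
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = port
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

def cleanup():
    dist.destroy_process_group()
```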

Thanks to @eqy for the suggestion! I found that when using the PyTorch distributed launch utility script, there is a --master_port argument (see here) one can use to set the port. Once different distributed jobs are configured to use different ports, the "Address already in use" problem goes away.
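For anyone else who hits this, a sketch of the two launch commands, assuming a hypothetical train.py training script and 4 GPUs per job (the port numbers and GPU assignments are just examples):

```
# Job 1 on GPUs 0-3, default-style port
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch \
    --nproc_per_node=4 --master_port=29500 train.py

# Job 2 on GPUs 4-7, launched with a different port
CUDA_VISIBLE_DEVICES=4,5,6,7 python -m torch.distributed.launch \
    --nproc_per_node=4 --master_port=29501 train.py
```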
