How to launch two distributed training programs on a single machine?

I'm trying to train ImageNet on an 8-GPU server. After starting distributed training of ResNet-50, I still have plenty of GPU memory left, so I want to launch a second training run. However, I get an error saying the address is already in use. Is it possible to run two distributed trainings on a single machine?

How are you launching the distributed training processes?

If you're initializing torch.distributed manually, try setting a different master port for the second run. The exact way to do this depends on how you initialize the process group. Refer here.
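For example, a minimal sketch of setting a different master port before initializing the process group. The `find_free_port` helper is an assumption for illustration; `MASTER_ADDR`/`MASTER_PORT` are the environment variables torch.distributed reads with the default `env://` init method:

```python
import os
import socket

def find_free_port() -> int:
    # Ask the OS for an unused TCP port by binding to port 0.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]

# The first run typically occupies the default port (29500), so give
# the second run a different, free one before calling
# torch.distributed.init_process_group() with init_method="env://".
os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = str(find_free_port())
```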

If you're using the torch.distributed.launch utility, try setting the --master_port flag.
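A sketch of what that could look like (`train.py` and the per-node GPU count are placeholders; the default master port is 29500, so the second run needs a different free one):

```shell
# First run, using the default port 29500:
python -m torch.distributed.launch --nproc_per_node=4 train.py

# Second run on the same machine, on a different port:
python -m torch.distributed.launch --nproc_per_node=4 --master_port=29501 train.py
```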

Finally, it isn't always about the unused GPU memory: check whether your data loader is fast enough, for example by monitoring CPU usage. If all your CPUs are maxed out, or stuck waiting on data I/O, and your GPU is sitting idle waiting for data, then launching another training run will only load the CPUs further. You'd end up with even lower GPU utilization, since both runs would spend most of their time waiting for data loading.
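A quick way to check for this kind of data-loading bottleneck with standard tools (not specific to any particular training script):

```shell
# Per-core CPU usage: if all cores are pegged during training,
# the input pipeline is likely the bottleneck.
htop

# GPU utilization, sampled every second: a low utilization.gpu value
# while training usually means the GPU is waiting on data.
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1
```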
