I'm trying to train ImageNet on an 8-GPU server. After I started distributed training of ResNet-50, there was still plenty of GPU memory left, so I wanted to launch a second training job. However, it fails with an error saying the address is already in use. Is it possible to run two distributed training jobs on a single machine?
How are you launching the distributed training processes?
If you are initializing torch.distributed manually, try setting a different master port for the second job. How you do this depends on the init method you use (environment variables, a tcp:// address, or file-based rendezvous).
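For example, a minimal sketch of picking an unused port for the second job's rendezvous (the `find_free_port` helper is hypothetical, not part of torch; the commented `init_process_group` call shows the assumed usage with a tcp:// init method):

```python
import socket

def find_free_port() -> int:
    # Hypothetical helper: ask the OS for an unused TCP port by binding to port 0.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]

# The second training job could then initialize its process group on this
# port instead of the default 29500, e.g. (not executed here):
#
#   import torch.distributed as dist
#   dist.init_process_group(
#       backend="nccl",
#       init_method=f"tcp://127.0.0.1:{find_free_port()}",
#       world_size=world_size,
#       rank=rank,
#   )
print(find_free_port())
```

Each process of the second job must agree on the same port, so in practice you would pick it once and pass it to all ranks.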
If you're launching with the torch.distributed.launch utility, then try setting the --master_port argument of the second job to a free port.
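A sketch of what the second job's launch command might look like (the script name `train.py` is a placeholder; the first job is assumed to be using the default port 29500):

```shell
# Second job: same machine, but a different rendezvous port so it does not
# collide with the first job's default --master_port (29500).
python -m torch.distributed.launch \
    --nproc_per_node=8 \
    --master_port=29501 \
    train.py
```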
Finally, it isn't always just about the unused GPU memory: check whether your data loader is fast enough. You can do this by monitoring your CPU usage. If your CPUs are already maxed out, or blocked waiting on data I/O, and your GPU sits idle waiting for data, then launching a second training job will only load the CPUs further. Both jobs would then use the GPU even less, since the GPU would spend most of its time waiting for data loading.
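One quick way to check this is to time how long the training loop waits on the loader versus how long the actual step takes. A minimal sketch (the loader and `train_step` here are stand-ins for your real dataloader and forward/backward/optimizer step):

```python
import itertools
import time

def measure_wait(loader, train_step, steps=100):
    # Hypothetical profiling helper: accumulate time spent fetching the next
    # batch (data_time) vs. time spent in the training step (step_time).
    # If data_time dominates, the input pipeline (CPU) is the bottleneck,
    # and a second job sharing the same CPUs will make things worse.
    data_time = step_time = 0.0
    it = iter(loader)
    for _ in range(steps):
        t0 = time.perf_counter()
        batch = next(it)
        data_time += time.perf_counter() - t0
        t0 = time.perf_counter()
        train_step(batch)
        step_time += time.perf_counter() - t0
    return data_time, step_time

# Demo with a stand-in loader and a no-op step (placeholders, not real training):
data_t, step_t = measure_wait(itertools.count(), lambda batch: None, steps=10)
print(f"data: {data_t:.6f}s, step: {step_t:.6f}s")
```

If the data time is a large fraction of the total, adding a second GPU job will not help; increasing loader workers or speeding up I/O would be the better move.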