When I used a single GPU, the training speed was 0.55 sec/batch with batch size = 8.
However, once I used 2 GPUs (nn.DataParallel()), the speed became 1.4 sec/batch with batch size = 16 (batch size 8 on each GPU). With 4 GPUs and batch size = 32 (batch size 8 on each GPU), the speed was 3.4 sec/batch. I checked that the utilization of all GPUs is above 90%.
I have read a lot about this and I know that multi-GPU training can be slower because of communication between the GPUs, but this is far slower than a single GPU. There must be a problem, possibly related to my networks.
In my code, there are 4 networks in total:
g = Generator()
d = Discriminator()
a = Detector()
b = Segment()
g = nn.DataParallel(g).to('cuda')
d = nn.DataParallel(d).to('cuda')
a = nn.DataParallel(a).to('cuda')
b = nn.DataParallel(b).to('cuda')
img = torch.rand([1, 3, 256, 245]).to('cuda')
temp1 = a(img)
temp2 = b(temp1)
result = g(temp2)
out_fake = d(result)
Do you think there may be too many separate networks, slowing down the communication between the GPUs?
This might be the case, as you would have to scatter and gather the results for each model.
A quick check would be to wrap all models in an nn.Sequential container and pass this to nn.DataParallel.
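A minimal sketch of that quick check, using a tiny placeholder module in place of the real Detector/Segment/Generator/Discriminator classes (which are not shown in the thread):

```python
import torch
import torch.nn as nn

# Placeholder stand-in for Detector/Segment/Generator/Discriminator
# (assumed shapes; the real classes are not shown in the thread).
class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 3, kernel_size=3, padding=1)

    def forward(self, x):
        return torch.relu(self.conv(x))

# The pipeline a -> b -> g -> d is purely sequential, so the four models
# can be fused into one container and wrapped by a single DataParallel,
# paying the scatter/gather cost once per iteration instead of four times.
a, b, g, d = Block(), Block(), Block(), Block()
pipeline = nn.Sequential(a, b, g, d)
if torch.cuda.is_available():
    pipeline = nn.DataParallel(pipeline).to('cuda')

device = 'cuda' if torch.cuda.is_available() else 'cpu'
img = torch.rand(8, 3, 256, 256, device=device)
out_fake = pipeline(img)
print(out_fake.shape)  # torch.Size([8, 3, 256, 256])
```

Note this only serves as a timing check: real GAN training usually also needs the intermediate tensors (e.g. feeding real images to the discriminator), which a plain nn.Sequential does not expose.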
We generally recommend using DistributedDataParallel with a single process per device, as it will give you the best speedup.
I am not sure what "single process per device" means…
There are two modes in DDP: Single-Process Multi-Device mode (which is being retired) and Single-Process Single-Device mode (which is recommended). Please see this issue for more context, and let us know if you have concerns.
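A sketch of Single-Process Single-Device mode: each process owns exactly one device and wraps its own model replica in DDP. Normally you would launch one process per GPU (e.g. with torchrun) and pass device_ids=[local_rank]; to keep the sketch runnable without GPUs, it uses a world_size=1 run with the gloo backend:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# In a real multi-GPU run, torchrun sets MASTER_ADDR/MASTER_PORT and
# the rank/world size; here they are set by hand for a one-process demo.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

model = torch.nn.Linear(16, 4)  # stand-in for one of the real networks
ddp_model = DDP(model)          # on GPU: DDP(model.to(rank), device_ids=[rank])

out = ddp_model(torch.randn(8, 16))
print(out.shape)  # torch.Size([8, 4])

dist.destroy_process_group()
```

Each process then loads its own shard of the data (typically via DistributedSampler), and gradients are all-reduced across processes during backward.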
Thanks.
Actually, I found that the dataloader could be the bottleneck.
The loading time is nearly proportional to the batch size.
I will open another thread to focus on the loading-time issue.
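One way to confirm this is to time the DataLoader by itself, with no model work, and compare num_workers settings; a small sketch with a toy dataset (the real dataset is not shown in the thread):

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy in-memory dataset standing in for the real one.
data = TensorDataset(torch.rand(256, 3, 64, 64), torch.zeros(256))

# Iterating over the loader without any model work isolates the loading
# cost; if this time grows with batch size, the input pipeline (not GPU
# communication) is the bottleneck. Worker processes overlap loading
# with compute during real training.
for workers in (0, 2):
    loader = DataLoader(data, batch_size=32, num_workers=workers)
    start = time.perf_counter()
    n_batches = sum(1 for _ in loader)
    elapsed = time.perf_counter() - start
    print(f"num_workers={workers}: {n_batches} batches in {elapsed:.3f}s")
```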
@ptrblck @Sergius_Liu
Could you please explain how you put 4 different models in an nn.Sequential container and then call the different submodules in the forward pass? I was trying to do that, but I am not sure about the forward part.
class GAN(nn.Module):
    def __init__(self, args):
        super().__init__()
        g1 = Generator()
        g2 = Generator()
        d1 = Discriminator()
        d2 = Discriminator()
        self.model = nn.Sequential(g1, g2, d1, d2)
        self.model.cuda(args.device)
        self.model = torch.nn.parallel.DistributedDataParallel(self.model, [args.gpu])
        init_weights(self.model)

    def forward(self, x):
        # not sure how I can access the individual models and perform their forward passes here
        return
Could you please share a minimal example? Also, can we achieve the same thing with nn.ModuleDict? Please see the code below; I was trying something like that:
class Network(nn.Module):
    def __init__(self, args):
        super().__init__()
        g1 = Generator()
        g2 = Generator()
        d1 = Discriminator()
        d2 = Discriminator()
        self.model = nn.ModuleDict()
        self.model["Gab"] = g1
        self.model["Gba"] = g2
        self.model["Da"] = d1
        self.model["Db"] = d2
        self.args = args
        self.model.cuda(args.device)

    def parallelize(self):
        if self.args.distributed:
            self.model = torch.nn.parallel.DistributedDataParallel(self.model, [self.args.gpu])
            init_weights(self.model)
            # for model_key in self.model:
            #     if self.model[model_key] is not None:
            #         self.model[model_key] = torch.nn.parallel.DistributedDataParallel(self.model[model_key], [self.args.gpu])
            #         init_weights(self.model[model_key])
        else:
            init_weights(self.model)
            # for model_key in self.model:
            #     if self.model[model_key] is not None:
            #         init_weights(self.model[model_key])

    def forward(self, x):
        # again, I am not sure about this part...
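A minimal sketch of one way the forward pass could route through the ModuleDict entries. The Generator/Discriminator classes are tiny stand-ins here, and the CycleGAN-style wiring (Gab translates A→B, Gba translates back, Db discriminates domain B) is an assumption, not something stated in the thread:

```python
import torch
import torch.nn as nn

# Tiny stand-ins for the real Generator/Discriminator (assumed shapes).
class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(3, 3, 3, padding=1)
    def forward(self, x):
        return torch.tanh(self.net(x))

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(3, 1, 3, padding=1)
    def forward(self, x):
        return self.net(x)

class Network(nn.Module):
    def __init__(self):
        super().__init__()
        self.model = nn.ModuleDict({
            "Gab": Generator(), "Gba": Generator(),
            "Da": Discriminator(), "Db": Discriminator(),
        })

    # One forward that internally calls the submodules: when the whole
    # Network is wrapped in DDP, all submodule computation happens inside
    # the single DDP forward, so gradient synchronization covers every
    # parameter that is used. Calling submodules from outside the DDP
    # wrapper would bypass its hooks.
    def forward(self, x):
        fake_b = self.model["Gab"](x)     # assumed A -> B translation
        rec_a = self.model["Gba"](fake_b) # assumed cycle back to A
        d_b = self.model["Db"](fake_b)    # discriminate translated image
        return fake_b, rec_a, d_b

net = Network()
fake_b, rec_a, d_b = net(torch.rand(2, 3, 32, 32))
print(fake_b.shape, rec_a.shape, d_b.shape)
```

If a training step does not touch every submodule (here "Da" is unused), DDP will complain about unused parameters unless you pass find_unused_parameters=True; the alternative is the commented-out loop above, wrapping each submodule in its own DDP instance so each can be stepped independently.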