Training slower when using multiple GPUs

Hi, I am using PyTorch to train a GAN.

When I used just one GPU, the training speed was 0.55 sec/batch with batch size = 8.
However, once I used 2 GPUs (nn.DataParallel()), the speed became 1.4 sec/batch with batch size = 16 (batch size 8 on each GPU). With 4 GPUs and batch size = 32 (batch size 8 per GPU), the speed was 3.4 sec/batch. I checked that the utilization of all GPUs was above 90%.

I have read a lot about this and I know that multi-GPU training can be slower because of communication between GPUs, but compared to a single GPU this is far too slow. There must be some problem, possibly related to my networks.

In my code, there are 4 networks in total:

import torch
import torch.nn as nn

g = Generator()
d = Discriminator()
a = Detector()
b = Segment()

g = nn.DataParallel(g).to('cuda')
d = nn.DataParallel(d).to('cuda')
a = nn.DataParallel(a).to('cuda')
b = nn.DataParallel(b).to('cuda')

img = torch.rand([1, 3, 256, 245]).to('cuda')
temp1 = a(img)        # detection
temp2 = b(temp1)      # segmentation
result = g(temp2)     # generation
out_fake = d(result)  # discrimination

Do you think there may be too many separate networks, which slows down the communication between the GPUs?

This might be the case, as you would have to scatter the inputs and gather the outputs for each model.
A quick check would be to wrap all models in an nn.Sequential container and pass this to nn.DataParallel.
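
Roughly like this (a minimal sketch, assuming each model's output can be fed directly into the next one, as in your snippet; the input shape is just an example):

model = nn.Sequential(Detector(), Segment(), Generator(), Discriminator())
model = nn.DataParallel(model).to('cuda')

img = torch.rand([16, 3, 256, 256]).to('cuda')
out_fake = model(img)  # one scatter/gather for the whole pipeline instead of four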

We generally recommend using DistributedDataParallel with a single process per device, as it will give you the best speedup.

I also ran into a similar issue: 4 GPUs with 4x batch size take almost 4 times the per-iteration processing time of 1 GPU with 1x batch size.

For this statement:

Could you give an example?
I am not very sure what "single process per device" means…

Hey @Sergius_Liu

Here is an example: Distributed Data Parallel — PyTorch 2.1 documentation

I am not very sure what "single process per device" means…

There are two modes in DDP: Single-Process Multi-Device mode (which is being retired) and Single-Process Single-Device mode (which is recommended). Please see this issue for more context, and let us know if you have concerns. 🙂
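
In Single-Process Single-Device mode, each spawned process drives exactly one GPU and passes a single device id to DDP. A minimal sketch (the NCCL backend, the port, and the mp.spawn launch are assumptions about your setup; Generator() stands in for your own model):

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def run(rank, world_size):
    # one process per GPU: this process only ever uses GPU `rank`
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    model = Generator().to(rank)
    model = DDP(model, device_ids=[rank])  # a single device id selects single-device mode

    # ... training loop over this process's shard of the data ...

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(run, args=(world_size,), nprocs=world_size)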

Thanks.
Actually, I found that the dataloader could be the potential bottleneck.
The loading time is nearly proportional to the batch size.
I will open another thread to focus on the loading-time issue.
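
(A minimal sketch of the usual first knob to check, in case it is useful here: loading batches with multiple DataLoader worker processes. The dataset, batch size, and worker count below are placeholders, not from this thread.)

from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,            # any map-style Dataset
    batch_size=32,
    shuffle=True,
    num_workers=4,      # load/preprocess samples in parallel worker processes
    pin_memory=True,    # faster host-to-GPU copies
)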

Sorry for the late response. I changed it to nn.Sequential and it is faster.

Thx!

@ptrblck @Sergius_Liu
could you please explain how you put 4 different models in an nn.Sequential container and then call the different submodules in the forward pass? I was trying to do that, but I am not sure about the forward part.

class GAN(nn.Module):

    def __init__(self, args):
        super().__init__()
        self.args = args

        g1 = Generator()
        g2 = Generator()
        d1 = Discriminator()
        d2 = Discriminator()

        self.model = nn.Sequential(g1, g2, d1, d2)
        init_weights(self.model)  # initialize before DDP wrapping, so rank 0's weights get broadcast
        self.model.cuda(args.device)
        self.model = torch.nn.parallel.DistributedDataParallel(self.model, device_ids=[args.gpu])

    def forward(self, x):
        ## not sure how I can access the individual models and perform the forward pass here

        return

Could you please share a minimal example? Also, can we achieve the same thing with nn.ModuleDict? Please see the code below; I was trying something like that:

class Network(nn.Module):

    def __init__(self, args):
        super().__init__()

        g1 = Generator()
        g2 = Generator()
        d1 = Discriminator()
        d2 = Discriminator()

        self.model = nn.ModuleDict()

        self.model["Gab"] = g1
        self.model["Gba"] = g2
        self.model["Da"] = d1
        self.model["Db"] = d2

        self.args = args
        self.model.cuda(args.device)

    def parallelize(self):

        if self.args.distributed:
            init_weights(self.model)  # initialize before wrapping in DDP
            self.model = torch.nn.parallel.DistributedDataParallel(self.model, device_ids=[self.args.gpu])

            # for model_key in self.model:
            #     if self.model[model_key] is not None:
            #         self.model[model_key] = torch.nn.parallel.DistributedDataParallel(self.model[model_key], device_ids=[self.args.gpu])
            #         init_weights(self.model[model_key])

        else:
            init_weights(self.model)
            # for model_key in self.model:
            #     if self.model[model_key] is not None:
            #         init_weights(self.model[model_key])

    def forward(self, x):
        # again, I am not sure about this part...
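
For what it is worth, one way that forward could look, indexing the ModuleDict entries directly (a minimal sketch of a CycleGAN-style pass; the argument names and the returned tuple are assumptions, not something from this thread):

    def forward(self, real_a, real_b):
        # translate in both directions, then score the generated images
        fake_b = self.model["Gab"](real_a)
        fake_a = self.model["Gba"](real_b)
        score_a = self.model["Da"](fake_a)
        score_b = self.model["Db"](fake_b)
        return fake_a, fake_b, score_a, score_b

Note that once self.model is wrapped in DistributedDataParallel it no longer supports indexing, so you would typically wrap the outer Network (the module whose forward you actually call) in DDP rather than the inner ModuleDict.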