[SOLVED] Can't tell if my data parallelism is working or not!

EDITED: You have to run more than 1 batch for the other GPU to kick in; otherwise it will idle. Gosh, I feel stupid…

I downloaded the tutorial Jupyter notebook on data parallelism, and I know it works because it gave me the correct outputs for a 2-GPU setup. I followed the same idea in my custom code, which builds on top of CycleGAN. The network itself has 2 generators and 2 discriminators, so I did:

import torch
import torch.nn as nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Two generators and two discriminators, as in CycleGAN
netG_A2B = Generator(opt.input_nc, opt.output_nc, opt.ngf, opt.residual_num)#.cuda()
netG_B2A = Generator(opt.output_nc, opt.input_nc, opt.ngf, opt.residual_num)#.cuda()
netD_A = Discriminator(opt.input_nc, opt.ndf)#.cuda()
netD_B = Discriminator(opt.output_nc, opt.ndf)#.cuda()

# Wrap each network in DataParallel
netG_A2B = nn.DataParallel(netG_A2B)
netG_B2A = nn.DataParallel(netG_B2A)
netD_A = nn.DataParallel(netD_A)
netD_B = nn.DataParallel(netD_B)


I have also moved my inputs with .to(device), as shown in the tutorial notebook. These inputs are later fed into the corresponding netGs:

for idx, batch in enumerate(dataloader):
    realA = batch['portrait'].to(device)
    realB = batch['lineart'].to(device)

I first get a warning saying:

Anaconda3\lib\site-packages\torch\cuda\nccl.py:24: UserWarning: PyTorch is not compiled with NCCL support
warnings.warn('PyTorch is not compiled with NCCL support')
A quick search suggests it's a warning related to multi-GPU support?
If I check my graphics card monitoring app, I see this:

Both GPU clocks are maxed out in the 1st panel on the right side. However, in the second panel on the left side, only one GPU's usage is at 99% while the other idles at 0%?
I am currently trying to test the speed on a single card, but I am running into an out-of-memory error…

I'm not sure if this will help or not, but I would try the following code:

use_cuda = torch.cuda.device_count() >= 1
device = torch.device("cuda" if use_cuda else "cpu")

netG_A2B = nn.DataParallel(netG_A2B.to(device))
netG_B2A = nn.DataParallel(netG_B2A.to(device))
netD_A = nn.DataParallel(netD_A.to(device))
netD_B = nn.DataParallel(netD_B.to(device))

Hi, the exact same thing happened: first the NCCL warning, and then everything runs slowly but correctly. My batch size is [1, 3, 256, 256] and it runs 80 times for 80 files. Setting the batch size higher triggers an out-of-memory error… and I set the total number of epochs to 1 for testing.
I recorded the time it took to complete 1 epoch both with DataParallel and with only .cuda().

DataParallel : Completed 1 epoch in 106.87 seconds, NCCL Warning: Yes
.cuda() : Completed 1 epoch in 105.28 seconds, NCCL Warning: No

As you can see, switching between .cuda() and DataParallel makes no difference for me. I did not run one model directly after the other; I made sure I logged out in between to clear the memory. If I look at EVGA Precision X, it logs the same thing in both cases as well: 99% usage on 1 card and 0% usage on the other. However, both cards' GPU clocks are maxed out, except maybe card 2, which runs at a slightly lower speed than card 1.
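
To put the two timings side by side, here is a quick calculation of the relative difference. The numbers are the ones reported above; the variable names are only for illustration.

```python
# Epoch times reported above, in seconds.
data_parallel_s = 106.87
single_cuda_s = 105.28

# Absolute and relative gap between the two runs.
diff_s = data_parallel_s - single_cuda_s
rel_diff = diff_s / single_cuda_s

print(f"difference: {diff_s:.2f} s ({rel_diff:.1%} of the .cuda() run)")
```

A gap of about 1.5% is consistent with the second GPU doing no useful work in the DataParallel run.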

Edited: I downloaded GPU-Z, which shows both GPUs running, but I am currently only using .cuda(), which by default is only supposed to use 1 GPU? I honestly don't know what's going on.

Final edit: SOLVED. I feel stupid, but… you need to run more than 1 batch for the other GPU to kick in. Now everything works.
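
For anyone landing here later, one likely explanation (assuming "more than 1 batch" means a batch size above 1) is that nn.DataParallel splits each input along the batch dimension and gives one chunk to each GPU. The sketch below is plain Python, not PyTorch's actual code: the helper name split_batch is made up, and the chunking rule (ceil-sized chunks, similar to torch.chunk) is only an approximation of the scatter step.

```python
import math

def split_batch(batch_size, num_gpus):
    """Return per-GPU chunk sizes for one batch (hypothetical helper).

    Approximates torch.chunk-style splitting: each chunk holds
    ceil(batch_size / num_gpus) samples until the batch runs out.
    """
    if batch_size <= 0 or num_gpus <= 0:
        return []
    chunk = math.ceil(batch_size / num_gpus)
    sizes = []
    remaining = batch_size
    while remaining > 0:
        take = min(chunk, remaining)
        sizes.append(take)
        remaining -= take
    return sizes

# Compare a batch size of 1 with larger batches on a 2-GPU machine.
for bs in (1, 2, 4):
    sizes = split_batch(bs, num_gpus=2)
    print(f"batch size {bs}: chunks {sizes}, GPUs with work: {len(sizes)}")
```

Under this rule, a [1, 3, 256, 256] input yields a single chunk, so only one of the two GPUs has anything to compute, while a batch size of 2 or more gives each GPU a share.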