Workload of 4 GPUs is not equally split

Hi everybody,
I use two models and initialized them like this:

if torch.cuda.device_count() > 1:
    model_cnn = nn.DataParallel(model_cnn)
    model_fc = nn.DataParallel(model_fc)

model_cnn = model_cnn.to(device)
model_fc = model_fc.to(device)

I trained my models like this:

for epoch in range(num_epochs):
    iter_loader_source = iter(source_loader)
    for _ in range(len(source_loader)):
        batch_data_source, labels_source = next(iter_loader_source)
        batch_data_source = batch_data_source.to(device)
        labels_source = labels_source.to(device)
        x_fc1_source = model_cnn(batch_data_source.float())
        x_fc3_source = model_fc(x_fc1_source)

        # loss = CE loss + MMD loss, computed from the outputs above (omitted here)
        optimizer1.zero_grad()
        loss.backward()
        optimizer1.step()

The loss is a combination of an MMD loss and a CE loss. The MMD loss is defined by myself and the CE loss comes from torch.nn. Both loss modules were also transferred to the GPU:

criterion = torch.nn.CrossEntropyLoss().to(device)
MMD_loss_calculator = MMD_loss_calculator.to(device)
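
For context, the total loss is put together roughly like this; the tensors passed to the MMD term and the plain sum are simplified placeholders rather than my exact code:

ce_loss = criterion(x_fc3_source, labels_source)
mmd_loss = MMD_loss_calculator(x_fc1_source, x_fc1_target)  # x_fc1_target is a placeholder argument
loss = ce_loss + mmd_loss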

The device is defined like this:

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

I use a batch size of 32. When I print the input size inside the CNN, I receive the following:
In Model: input size torch.Size([8, 1, 1024])
In Model: input size torch.Size([8, 1, 1024])
In Model: input size torch.Size([8, 1, 1024])
In Model: input size torch.Size([8, 1, 1024])

It therefore seems like the models are replicated across 4 different GPUs and the data is split equally among them.

When I check the GPU utilization with “nvidia-smi --query-gpu=utilization.gpu --format=csv -l 5”, I get the following:
84 %
1 %
1 %
2 %

It seems like only GPU 0 is really working. How can that be? How can I use all 4 GPUs equally? And how is the data split between the GPUs: randomly, or as data[:batch], data[batch:batch*2], data[batch*2:batch*3] and data[batch*3:]?

nn.DataParallel will use the default GPU (GPU0 in your case) to store the original model, the input data, the outputs, etc., and will thus add overhead to this device as well as imbalanced memory usage.
We thus recommend using DistributedDataParallel, which does not suffer from these issues and also shows better performance compared to DataParallel.
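
To answer the split question: DataParallel scatters the input along dim 0 into contiguous chunks, one per replica, so with a batch size of 32 on 4 GPUs each replica sees data[:8], data[8:16], data[16:24], and data[24:32], which matches your printed shapes. A small sketch to verify this yourself, assuming 4 visible GPUs:

import torch
import torch.nn as nn

class Probe(nn.Module):
    def forward(self, x):
        # each replica prints the chunk it received
        print("replica on", x.device, "received", x.shape)
        return x

probe = nn.DataParallel(Probe()).to("cuda:0")
data = torch.randn(32, 1, 1024, device="cuda:0")
out = probe(data)
# expected: four prints of torch.Size([8, 1, 1024]); the outputs are then
# gathered back onto cuda:0, which is part of why GPU0 carries the extra load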

I have problems initializing the torch.distributed package. I do not get any errors, but the training also does not start. Is this the correct way to use torch.distributed?

os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '6006'
num_gpu = torch.cuda.device_count()
for i in range(num_gpu):
    torch.distributed.init_process_group(backend='nccl', world_size=num_gpu, rank=i)

if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    model_cnn = nn.parallel.DistributedDataParallel(model_cnn)
    model_fc = nn.parallel.DistributedDataParallel(model_fc)

#model_cnn = nn.DataParallel(model_cnn)
model_cnn = model_cnn.to(device)

#model_fc = nn.DataParallel(model_fc)
model_fc = model_fc.to(device)

Try to execute this tutorial first and make sure it’s working correctly. Once this use case works fine and you are able to launch it via the spawn or torchrun approach, you could then compare it to your script.
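
As a reference, here is a minimal single-node DDP sketch launched via mp.spawn; the model, optimizer, and data are just placeholders, but note that each GPU gets its own process and init_process_group is called exactly once per process with that process's rank, rather than in a loop inside one process:

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def run(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '6006'
    # one process per GPU; each process initializes its own rank exactly once
    dist.init_process_group(backend='nccl', world_size=world_size, rank=rank)
    torch.cuda.set_device(rank)

    model = nn.Linear(1024, 10).to(rank)  # placeholder model
    ddp_model = DDP(model, device_ids=[rank])
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    criterion = nn.CrossEntropyLoss()

    # placeholder data; in practice use a DataLoader with a DistributedSampler
    data = torch.randn(8, 1024, device=rank)
    labels = torch.randint(0, 10, (8,), device=rank)

    optimizer.zero_grad()
    loss = criterion(ddp_model(data), labels)
    loss.backward()
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(run, args=(world_size,), nprocs=world_size, join=True)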