GPU memory is allocated on both GPUs, but utilization stays at 0% on the second GPU

Hello! I'm trying to train a U-Net on a dual-GPU setup using torch.nn.DataParallel. However, most of the time only the first GPU is utilized while the second one stays idle. On the other hand, if I look at memory usage with nvidia-smi, the load seems to be divided equally between the two GPUs:

Tue Aug  3 21:02:49 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:05:00.0 Off |                  N/A |
| 27%   66C    P2   228W / 250W |   3661MiB / 11178MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  On   | 00000000:86:00.0 Off |                  N/A |
| 20%   44C    P2    74W / 250W |   3517MiB / 11178MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      3023      C   python                           3639MiB |
|    1   N/A  N/A      3023      C   python                           3503MiB |
+-----------------------------------------------------------------------------+

This is how I load the model to the GPUs:

    if torch.cuda.is_available():
        device = torch.device('cuda')
        # Wrap the model in DataParallel and place it on the default GPU (cuda:0);
        # replicas on the remaining GPUs are created during each forward pass.
        model = torch.nn.DataParallel(model).to(device)
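
For reference, here is a minimal sketch of the same setup with the device list made explicit (the Linear layer is only a placeholder for the U-Net). DataParallel uses all visible GPUs by default, so device_count() should report 2 on this machine:

    import torch

    print(torch.cuda.device_count())     # should print 2 if both GPUs are visible

    model = torch.nn.Linear(8, 8)        # placeholder for the actual U-Net
    if torch.cuda.is_available():
        device = torch.device('cuda:0')  # primary/output device
        # Listing device_ids explicitly makes the number of replicas obvious
        model = torch.nn.DataParallel(model, device_ids=[0, 1]).to(device)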

The batch is the first dimension of my tensors, so DataParallel's default dim=0 is used for splitting. While training I transfer my tensors to the GPU like so:

    imgA = imgA.float().cuda()
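
For context, here is a sketch of the kind of training step assumed (criterion, optimizer, and train_loader are placeholders not shown in the post): the whole batch is moved to the default GPU, and DataParallel scatters it along dim 0 to both devices inside the forward call.

    for imgA, target in train_loader:
        # Move the full batch to cuda:0; non_blocking pairs with pin_memory=True
        imgA = imgA.float().cuda(non_blocking=True)
        target = target.long().cuda(non_blocking=True)

        optimizer.zero_grad()
        output = model(imgA)              # batch is split across cuda:0 and cuda:1 here
        loss = criterion(output, target)
        loss.backward()                   # gradients are gathered back on cuda:0
        optimizer.step()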

My model and its submodules are implemented as nn.Module subclasses.

This is my dataset:

    import numpy as np
    from PIL import Image
    from torchvision.datasets import Cityscapes


    class CityscapesExt(Cityscapes):

        voidClass = 19

        # Convert ids to train_ids
        id2trainid = np.array([label.train_id for label in Cityscapes.classes if label.train_id >= 0], dtype='uint8')
        id2trainid[np.where(id2trainid == 255)] = voidClass

        # Convert train_ids to colors
        mask_colors = [list(label.color) for label in Cityscapes.classes if label.train_id >= 0 and label.train_id <= 19]
        mask_colors.append([0, 0, 0])
        mask_colors = np.array(mask_colors)

        # List of valid class ids
        validClasses = np.unique([label.train_id for label in Cityscapes.classes if label.id >= 0])
        validClasses[np.where(validClasses == 255)] = voidClass
        validClasses = list(validClasses)

        # Create list of class names
        classLabels = [label.name for label in Cityscapes.classes if not (label.ignore_in_eval or label.id < 0)]
        classLabels.append('void')

        def __getitem__(self, index):

            filepath = self.images[index]
            image = Image.open(filepath).convert('RGB')

            targets = []
            for i, t in enumerate(self.target_type):
                if t == 'polygon':
                    target = self._load_json(self.targets[index][i])
                else:
                    target = Image.open(self.targets[index][i])

                targets.append(target)

            target = tuple(targets) if len(targets) > 1 else targets[0]

            if self.transforms is not None:
                if self.split == 'train':
                    image_A, image_B, affine2_to_1, target, flip = self.transforms(image, target)
                    target = self.id2trainid[target]
                    return image_A, image_B, affine2_to_1, target, flip

num_workers=8 and pin_memory=True are also set in the DataLoader.
Am I missing something here, or is this expected behavior?
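
For completeness, this is roughly the DataLoader setup assumed (the root path, batch size, and train_transforms are placeholders, not values from the post):

    from torch.utils.data import DataLoader

    train_set = CityscapesExt('./data/cityscapes', split='train',
                              target_type='semantic', transforms=train_transforms)
    train_loader = DataLoader(train_set,
                              batch_size=8,       # needs to be > 1 for DataParallel to split it
                              shuffle=True,
                              num_workers=8,
                              pin_memory=True)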

Could you check the batch size and make sure it's greater than 1, so that DataParallel can split the batch along dim 0 and each GPU gets at least one sample?
If that’s already the case, you could check the .device attribute of the input tensor inside the forward method of your model and verify that the tensors are placed on both GPUs.
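
For example, here is a self-contained sketch of that check, with a toy module standing in for the U-Net; if both GPUs receive work, you should see one print per replica, each with half the batch:

    import torch
    import torch.nn as nn

    class TinyNet(nn.Module):                     # stand-in for the actual U-Net
        def __init__(self):
            super().__init__()
            self.conv = nn.Conv2d(3, 3, 3, padding=1)

        def forward(self, x):
            # Expected with 2 GPUs and a batch of 8: one line per replica,
            # e.g. cuda:0 with shape [4, 3, 64, 64] and cuda:1 with [4, 3, 64, 64]
            print(x.device, x.shape)
            return self.conv(x)

    model = nn.DataParallel(TinyNet()).cuda()
    out = model(torch.randn(8, 3, 64, 64).cuda())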
Generally, I would also recommend using DistributedDataParallel instead of nn.DataParallel for the best performance.
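
A minimal sketch of the DistributedDataParallel version, assuming one process per GPU launched via torchrun (build_model and train_set are placeholders):

    # launch with: torchrun --nproc_per_node=2 train.py
    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.utils.data import DataLoader, DistributedSampler

    dist.init_process_group(backend='nccl')
    local_rank = int(os.environ['LOCAL_RANK'])
    torch.cuda.set_device(local_rank)

    model = DDP(build_model().cuda(local_rank), device_ids=[local_rank])

    sampler = DistributedSampler(train_set)       # each process sees a distinct shard
    loader = DataLoader(train_set, batch_size=8, sampler=sampler,
                        num_workers=8, pin_memory=True)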