GPU usage does not remain high for lightweight models when loading CIFAR-10 as a custom dataset

I am experimenting with the following repository: GitHub - Keiku/PyTorch-Lightning-CIFAR10 ("Not too complicated" training code for CIFAR-10 by PyTorch Lightning).

I have implemented two ways of loading the data: one loads CIFAR-10 from torchvision and the other loads CIFAR-10 as a custom dataset. I have also implemented two kinds of models: lightweight models (e.g. a from-scratch resnet18, timm MobileNetV3) and relatively heavy models (e.g. a from-scratch resnet50, timm resnet152).

After some experiments, I found the following.

  • GPU usage remains high (nearly 100%) for every model when loading CIFAR-10 with torchvision
  • When loading CIFAR-10 as a custom dataset, GPU usage remains relatively high for heavy models (though it still drops to zero temporarily)
  • When loading CIFAR-10 as a custom dataset, GPU usage stays low for lightweight models (resnet18, MobileNetV3), bouncing back and forth between 0% and 100%

In this situation, is there a problem with my custom dataset implementation? Also, please let me know if there is a way to increase GPU usage even for lightweight models.

I am experimenting in the following EC2 g4dn.xlarge environment.

⋊> ~ lsb_release -a                                                    (base) 21:45:51
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 18.04.5 LTS
Release:        18.04
Codename:       bionic
⋊> ~ nvidia-container-cli info                                         (base) 21:48:20
NVRM version:   450.80.02
CUDA version:   11.0

Device Index:   0
Device Minor:   0
Model:          Tesla T4
Brand:          Tesla
GPU UUID:       GPU-ba54be15-066e-e7e5-87d0-84b8ac2672c6
Bus Location:   00000000:00:1e.0
Architecture:   7.5

Your “lightweight models” need less GPU compute and thus shift the overall workload towards the CPU, which here is most likely dominated by the data loading.
In such use cases (i.e. with tiny models) you would have to make sure the data loading doesn’t become a bottleneck, since the GPU workload is small, as explained before.

Based on your observations it seems that the custom CIFAR dataset is slower in the data loading pipeline than the torchvision implementation, which starves the GPU, especially for tiny workloads.
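
As a first check I would look at the DataLoader configuration: more CPU workers and pinned memory usually help when the GPU is starved by data loading. A minimal sketch, where the dataset object and the concrete batch size are placeholders rather than the repository's exact values:

from torch.utils.data import DataLoader

train_loader = DataLoader(
    train_dataset,            # e.g. the CIFAR10Dataset shown below
    batch_size=256,           # placeholder batch size
    shuffle=True,
    num_workers=4,            # several CPU workers for reading/decoding/augmentation
    pin_memory=True,          # page-locked host memory speeds up host-to-GPU copies
    persistent_workers=True,  # keep workers alive between epochs (PyTorch >= 1.7)
    prefetch_factor=2,        # batches prefetched per worker (this is also the default)
)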

The implementation of my custom dataset is simple. I have implemented it as follows, but what is the likely bottleneck?

import re
from pathlib import Path

import numpy as np
import pandas as pd
import torch
from PIL import Image


class CIFAR10Dataset(torch.utils.data.Dataset):
    def __init__(self, cfg, train, transform=None):
        super().__init__()
        self.transform = transform
        self.cfg = cfg
        self.split_dir = "train" if train else "test"
        self.root_dir = Path(cfg.dataset.root_dir)
        self.image_dir = self.root_dir / "cifar" / self.split_dir
        # Collect every image file under the train/ or test/ directory.
        self.file_list = [p.name for p in self.image_dir.rglob("*") if p.is_file()]
        # Filenames are assumed to look like "<id>_<label>.<ext>"; split on "_" or "." to get the label.
        self.labels = [re.split(r"_|\.", l)[1] for l in self.file_list]
        self.targets = self.label_mapping(cfg)

    def label_mapping(self, cfg):
        labels = self.labels
        label_mapping_path = Path(cfg.dataset.root_dir) / "cifar/labels.txt"
        df_label_mapping = pd.read_table(label_mapping_path.as_posix(), names=["label"])
        df_label_mapping["target"] = range(cfg.train.num_classes)

        label_mapping_dict = dict(
            zip(
                df_label_mapping["label"].values.tolist(),
                df_label_mapping["target"].values.tolist(),
            )
        )

        targets = [label_mapping_dict[i] for i in labels]
        return targets

    def __getitem__(self, index):
        filename = self.file_list[index]
        targets = self.targets[index]
        image_path = self.image_dir / filename
        # Every access opens and decodes the image from disk.
        image = Image.open(image_path.as_posix())

        if self.transform is not None:
            image = self.transform(image)

        return image, targets

    def __len__(self):
        return len(self.file_list)
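
To check where the time goes, I can also time one pass over the DataLoader without any model, which measures pure data-loading throughput. A rough sketch (cfg and train_transform are assumed to come from the repository's config; batch size and worker count are placeholders):

import time
from torch.utils.data import DataLoader

# Hypothetical instantiation; cfg and train_transform come from the repository's config.
dataset = CIFAR10Dataset(cfg, train=True, transform=train_transform)
loader = DataLoader(dataset, batch_size=256, shuffle=True, num_workers=4)

start = time.perf_counter()
for images, targets in loader:
    pass  # no model and no GPU work: this isolates the data-loading pipeline
elapsed = time.perf_counter() - start
print(f"one epoch of data loading: {elapsed:.1f}s ({len(dataset) / elapsed:.0f} images/s)")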

I’m sorry, I figured out the cause. Loading the images from AWS EFS was the reason for the low GPU usage. GPU usage remained high (nearly 100%) when loading from AWS EBS.
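
For reference, since CIFAR-10 fits comfortably in memory (well under 200 MB as uint8 arrays), one possible workaround for a slow filesystem like EFS is to cache the decoded images after the first read. The CachedCIFAR10Dataset below is only a sketch of that idea built on the dataset class above, not code from the repository:

import numpy as np
from PIL import Image


class CachedCIFAR10Dataset(CIFAR10Dataset):
    """Same as the dataset above, but each image is read from disk only once."""

    def __init__(self, cfg, train, transform=None):
        super().__init__(cfg, train, transform)
        self._cache = [None] * len(self.file_list)

    def __getitem__(self, index):
        if self._cache[index] is None:
            image_path = self.image_dir / self.file_list[index]
            # Store a compact uint8 array so the whole split stays small in RAM.
            self._cache[index] = np.asarray(Image.open(image_path).convert("RGB"))
        image = Image.fromarray(self._cache[index])
        if self.transform is not None:
            image = self.transform(image)
        return image, self.targets[index]

Note that with num_workers > 0 each worker process keeps its own copy of the cache, so the benefit only carries across epochs when persistent_workers=True.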