I have implemented two ways of loading CIFAR-10: one via torchvision and one as a custom dataset. I have also implemented two models: a lightweight model (e.g., a from-scratch ResNet18 or timm MobileNetV3) and a relatively heavy model (e.g., a from-scratch ResNet50 or timm ResNet152).
After some experiments, I observed the following:

- GPU usage stays high (nearly 100%) for any model when CIFAR-10 is loaded via torchvision.
- When CIFAR-10 is loaded as a custom dataset, GPU usage stays relatively high for the heavy models, though it still drops to zero intermittently.
- When CIFAR-10 is loaded as a custom dataset, GPU usage stays low for the lightweight models (ResNet18, MobileNetV3), oscillating between 0% and 100%.
Given this situation, could there be a problem in the implementation of the custom dataset? Also, is there a way to increase GPU usage even for the lightweight models?
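For reference, since my custom dataset code isn't shown here, this is a minimal sketch of the pattern I'm following (class name and the synthetic in-memory tensors are placeholders — in the real code each `__getitem__` reads an image file from disk):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class CustomCIFAR(Dataset):
    """Sketch of a custom CIFAR-10-style dataset.

    Synthetic tensors stand in for images read from disk; the real
    implementation opens one image file per __getitem__ call.
    """
    def __init__(self, n=100, transform=None):
        self.data = torch.randn(n, 3, 32, 32)       # fake 32x32 RGB images
        self.targets = torch.randint(0, 10, (n,))   # fake labels, 10 classes
        self.transform = transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        x, y = self.data[idx], self.targets[idx]
        if self.transform is not None:
            x = self.transform(x)
        return x, y

loader = DataLoader(CustomCIFAR(), batch_size=32, shuffle=True)
images, labels = next(iter(loader))
```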
I am experimenting in the following EC2 g4dn.xlarge environment.
⋊> ~ lsb_release -a (base) 21:45:51
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 18.04.5 LTS
Release: 18.04
Codename: bionic
⋊> ~ nvidia-container-cli info (base) 21:48:20
NVRM version: 450.80.02
CUDA version: 11.0
Device Index: 0
Device Minor: 0
Model: Tesla T4
Brand: Tesla
GPU UUID: GPU-ba54be15-066e-e7e5-87d0-84b8ac2672c6
Bus Location: 00000000:00:1e.0
Architecture: 7.5
Your “lightweight models” need less GPU compute and thus shift the overall runtime towards the CPU workload, which is most likely dominated by the data loading.
In such use cases (i.e. using tiny models), you would have to make sure the data loading won’t be a bottleneck, since the GPU workload is tiny as explained before.
Based on your observations it seems that the custom CIFAR dataset is slower in the data loading pipeline than the torchvision implementation, which lets the GPU starve especially for tiny workloads.
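As a sketch of how to reduce the data-loading bottleneck, you could tune the `DataLoader` (worker count, pinned memory, persistent workers) and use asynchronous host-to-device copies; the `TensorDataset` here is just a stand-in for your custom dataset, and the specific values are starting points to experiment with, not tuned recommendations:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for the custom CIFAR-10 dataset.
ds = TensorDataset(torch.randn(512, 3, 32, 32), torch.randint(0, 10, (512,)))

loader = DataLoader(
    ds,
    batch_size=128,
    shuffle=True,
    num_workers=2,            # worker processes load/transform batches in parallel
    pin_memory=True,          # page-locked host memory speeds up H2D copies
    persistent_workers=True,  # keep workers alive across epochs
    prefetch_factor=2,        # batches each worker prepares ahead of time
)

device = "cuda" if torch.cuda.is_available() else "cpu"
for images, labels in loader:
    # non_blocking=True overlaps the copy with compute when pin_memory is set
    images = images.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    break  # one batch is enough for illustration
```

Increasing `num_workers` helps most when each sample requires disk I/O and decoding; on a g4dn.xlarge (4 vCPUs) there is little headroom beyond a few workers.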
Sorry, I found the cause. Loading images from AWS EFS was the reason for the low GPU usage. GPU usage remained high (nearly 100%) when the data was loaded from AWS EBS.