GPU not fully used with dataloader

I’m a beginner with PyTorch.
I’m using PyTorch to build a CNN for object detection.
I ran into something really strange when feeding data with DataLoader.

The size of my input images is [3, 640, 640].
Normally, with 2 GPUs of 12 GB each, I can only fit about 8 images at a time. However, when I used a DataLoader with batch_size = 8, I observed very low GPU memory usage (about 20%). When I increased batch_size to 40, I started to see 90% usage.

To figure out what’s wrong, I ran a simple example with a ResNet-50 backbone.

import torch
import torch.utils.data
import torchvision


net = torchvision.models.resnet50(pretrained=True)
# net = torch.nn.DataParallel(net, device_ids=range(torch.cuda.device_count()))
net.cuda()
net.eval()


class RandomDataset(torch.utils.data.Dataset):
    """Dataset that returns random tensors shaped like my [3, 640, 640] images."""

    def __getitem__(self, image_id):
        return torch.rand([3, 640, 640])

    def __len__(self):
        return 1000


ds = RandomDataset()

dl = torch.utils.data.DataLoader(ds, batch_size=15, shuffle=False, num_workers=8)

with torch.no_grad():
    for batch_idx, inputs in enumerate(dl):
        inputs = inputs.cuda()
        out = net(inputs)


# Peak GPU memory in GB: memory allocated by tensors plus memory held by the caching allocator.
# (In newer PyTorch versions, max_memory_cached has been renamed to max_memory_reserved.)
print((torch.cuda.max_memory_cached(0) + torch.cuda.max_memory_allocated(0)) / 1024 / 1024 / 1024)

So, a single GPU can normally only fit about 15 images (if I don’t use a DataLoader, GPU memory reaches 100% usage). With the DataLoader, though, it uses only about 2.5 GB, and it only reaches 100% when I set batch_size = 100, which makes no sense to me.
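To be concrete, by “not using a DataLoader” I mean something like this minimal sketch, where I build one random batch directly and feed it to the same model:

import torch
import torchvision

net = torchvision.models.resnet50(pretrained=True).cuda()
net.eval()

# One batch of 15 random "images", created on the CPU and moved to the GPU in one go.
inputs = torch.rand(15, 3, 640, 640).cuda()

with torch.no_grad():
    out = net(inputs)

print((torch.cuda.max_memory_cached(0) + torch.cuda.max_memory_allocated(0)) / 1024 ** 3)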

Can anyone help me with this? Am I making a mistake somewhere, or am I misunderstanding the meaning of batch_size?

The CPU is responsible for fetching data from the DataLoader, not the GPU. So with batch_size = 8 the GPU has less work to do than with batch_size = 40.
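One rough way to check where the time goes is to time the data-loading part and the GPU part of each step separately. This is just a sketch that takes your net and DataLoader as arguments; torch.cuda.synchronize() is needed because CUDA calls are asynchronous:

import time
import torch

def time_one_epoch(net, dl):
    """Print how long each step spends waiting for data vs. running the forward pass on the GPU."""
    net.eval()
    with torch.no_grad():
        t0 = time.time()
        for inputs in dl:
            t_data = time.time() - t0          # time spent waiting on the DataLoader workers
            inputs = inputs.cuda()
            net(inputs)
            torch.cuda.synchronize()           # wait for the GPU so the timing is accurate
            t_gpu = time.time() - t0 - t_data
            print("data: %.3fs  gpu: %.3fs" % (t_data, t_gpu))
            t0 = time.time()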

Thank you for your reply.
I have also monitored my CPU usage with top, and it never exceeds 15%.
So I don’t think that’s the main reason, unless there is some default limit on CPU usage in Python?

By the way, in my simple example code I just create random tensors and feed them into ResNet-50. Every component is stock PyTorch, so I can’t imagine it’s due to CPU capacity.

Hi,

the GPU usage you mention above doesn’t quite make sense to me.
If you mean the utilization reported by nvidia-smi, I think this thread can help you.

In my view, many factors can affect GPU utilization when you load data with a DataLoader, such as batch_size, pin_memory and num_workers. Generally, the larger the batch_size, the higher the utilization; setting pin_memory=True can also bring an improvement; and for num_workers you can experiment to find what fits your own dataset and hardware.
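For example, the DataLoader from your snippet above could be set up like this (just a sketch; the best num_workers value depends on your machine):

# ds and net are the dataset and model from your snippet above.
dl = torch.utils.data.DataLoader(
    ds,
    batch_size=15,
    shuffle=False,
    num_workers=8,        # worth experimenting with for your CPU and disk
    pin_memory=True,      # page-locked host memory speeds up host-to-GPU copies
)

with torch.no_grad():
    for inputs in dl:
        # non_blocking=True lets the copy overlap with compute when pin_memory=True
        inputs = inputs.cuda(non_blocking=True)
        out = net(inputs)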

I can’t promise these will help you, but they worked for me. :wink:

Thank you so much!
That thread helped me a lot; what I encountered is quite similar to your case.
So now I’m sure there is nothing wrong with my environment or my implementation.
However, in that case, how should I choose my batch_size? Previously I increased batch_size until GPU utilization hit its maximum in nvidia-smi, which turns out not to be a reliable guide.
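For now, the only workaround I can think of is to search for the largest batch size that actually fits in memory instead of watching nvidia-smi, something like this sketch (find_max_batch_size is just a hypothetical helper I made up):

import torch

def find_max_batch_size(net, image_shape=(3, 640, 640), start=8):
    """Double the batch size until the forward pass hits CUDA OOM; return the last size that fit."""
    net.eval()
    bs, last_ok = start, 0
    while True:
        try:
            with torch.no_grad():
                net(torch.rand(bs, *image_shape).cuda())
            last_ok = bs
            bs *= 2
        except RuntimeError:                # CUDA out of memory surfaces as a RuntimeError
            torch.cuda.empty_cache()        # release cached blocks from the failed attempt
            return last_ok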