DataLoader implicitly using CPU?

Hi all,

I’ve been noticing severe CPU bottlenecks in my code, where CPU utilization is almost always ~100% while GPU is consistently underutilized. I tried to send all data into GPU in the first place, and the code runs fine but didn’t give any noticeable speed boost, and autograd profiler shows only a 15% decrease in to(device) calls. Afterwards, I noticed that the total size of my data set is significantly larger than available memory on my current card (~34GB of data on a Tesla K80). So I’m wondering if DataLoader is doing some implicit calls moving data from CPU to GPU or vice versa. Another thing that I thought might be the problem is that two elements in collate_fn isn’t moved to GPU (evt_ids, evt_names), so DataLoader has a mix of GPU/CPU elements. Would this make a difference?

def collate_fn(samples):
    X = [s[0] for s in samples]
    y = [s[1] for s in samples]
    w = [s[2] for s in samples] 
    evt_ids   = [s[3] for s in samples]
    evt_names = [s[4] for s in samples]

    # custom function padding different sized inputs
    X, adj_mask, batch_nb_nodes = pad_batch(X)
    device = torch.device('cuda')

    X = torch.FloatTensor(X).to(device)
    y = torch.FloatTensor(y).to(device)
    w = torch.FloatTensor(w).to(device)
    adj_mask = torch.FloatTensor(adj_mask).to(device)
    batch_nb_nodes = torch.FloatTensor(batch_nb_nodes).to(device)
    return X, y, w, adj_mask, batch_nb_nodes, evt_ids, evt_names

# Loading in multiple files and switch every "epoch"
multi_train_loader = []
for file in args.train_file:
      loader = DataLoader(

for i in range(args.nb_epoch):
    train_loader = multi_train_loader[i % len(multi_train_loader)]
    for i, batch in enumerate(train_loader):
        X, y, w, adj_mask, batch_nb_nodes, _, _ = batch

No, the DataLoader will load each sample from Dataset.__getitem__ and use the collate_fn to create a batch out of these samples. It has no knowledge, if these tensors are on the CPU or GPU.

Your current approach of moving the data to the GPU in your collate_fn should have the same effect as moving the data to the device inside the training loop, since each batch will be pushed to the device, not the complete dataset.

For general information about data loading bottlenecks, I would recommend to have a look at this post.

1 Like