DataLoader implicitly using CPU?

Hi all,

I’ve been noticing severe CPU bottlenecks in my code: CPU utilization is almost always ~100% while the GPU is consistently underutilized. I tried to send all data to the GPU up front; the code runs fine but didn’t give any noticeable speed boost, and the autograd profiler shows only a 15% decrease in to(device) calls. Afterwards, I noticed that the total size of my dataset is significantly larger than the available memory on my current card (~34GB of data on a Tesla K80). So I’m wondering if DataLoader is doing some implicit calls moving data from CPU to GPU or vice versa. Another thing that I thought might be the problem is that two elements in collate_fn aren’t moved to the GPU (evt_ids, evt_names), so the DataLoader returns a mix of GPU/CPU elements. Would this make a difference?

import torch
from torch.utils.data import DataLoader

def collate_fn(samples):
    X = [s[0] for s in samples]
    y = [s[1] for s in samples]
    w = [s[2] for s in samples]
    evt_ids   = [s[3] for s in samples]
    evt_names = [s[4] for s in samples]

    # custom function padding different sized inputs
    X, adj_mask, batch_nb_nodes = pad_batch(X)
    device = torch.device('cuda')

    X = torch.FloatTensor(X).to(device)
    y = torch.FloatTensor(y).to(device)
    w = torch.FloatTensor(w).to(device)
    adj_mask = torch.FloatTensor(adj_mask).to(device)
    batch_nb_nodes = torch.FloatTensor(batch_nb_nodes).to(device)

    # evt_ids / evt_names stay as plain Python lists (not moved to the GPU)
    return X, y, w, adj_mask, batch_nb_nodes, evt_ids, evt_names

# Load multiple files and switch between them every "epoch"
multi_train_loader = []
for file in args.train_file:
    loader = DataLoader(dataset=file,
                        batch_size=args.batch_size,
                        shuffle=True,
                        collate_fn=collate_fn,
                        drop_last=True)
    multi_train_loader.append(loader)

for epoch in range(args.nb_epoch):
    train_loader = multi_train_loader[epoch % len(multi_train_loader)]
    for batch_idx, batch in enumerate(train_loader):
        X, y, w, adj_mask, batch_nb_nodes, _, _ = batch
No, the DataLoader will load each sample via Dataset.__getitem__ and use the collate_fn to create a batch out of these samples. It has no knowledge of whether these tensors are on the CPU or the GPU.
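Per batch it effectively boils down to something like this (a conceptual sketch; dataset and collate_fn are your objects, batch_indices is whatever the sampler yields):

samples = [dataset[idx] for idx in batch_indices]  # one Dataset.__getitem__ call per sample
batch = collate_fn(samples)                        # your collate_fn builds the batch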

Your current approach of moving the data to the GPU in your collate_fn should have the same effect as moving the data to the device inside the training loop, since each batch will be pushed to the device, not the complete dataset.
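If you would rather keep the collate_fn device-agnostic, the usual pattern is to return CPU tensors from it and push each batch to the device inside the training loop, optionally with pin_memory=True on the loader and non_blocking=True copies. A rough sketch, reusing the names from your snippet (dataset is a placeholder):

import torch
from torch.utils.data import DataLoader

device = torch.device('cuda')

# collate_fn returns CPU tensors here; pin_memory=True pins them so the
# host-to-device copies below can be asynchronous (non_blocking=True)
loader = DataLoader(dataset, batch_size=args.batch_size, shuffle=True,
                    collate_fn=collate_fn, pin_memory=True, drop_last=True)

for X, y, w, adj_mask, batch_nb_nodes, evt_ids, evt_names in loader:
    X = X.to(device, non_blocking=True)
    y = y.to(device, non_blocking=True)
    w = w.to(device, non_blocking=True)
    adj_mask = adj_mask.to(device, non_blocking=True)
    batch_nb_nodes = batch_nb_nodes.to(device, non_blocking=True)
    # evt_ids / evt_names can stay as plain Python lists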

For general information about data loading bottlenecks, I would recommend having a look at this post.


Hi,

so I would like to revisit this topic as I have run into a similar bottleneck recently.

My setup is a relatively small dataset which I push fully into GPU memory in my Dataset’s __init__ function. I too noticed really low GPU utilization when using a standard batched DataLoader.

After some digging and profiling I found that in the current _MapDatasetFetcher.fetch function the batch is constructed by a list comprehension. Note that self.auto_collation will always be True when passing a batch_size argument to the DataLoader constructor.
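For reference, the logic in question boils down to roughly this (a simplified sketch of _MapDatasetFetcher.fetch in torch/utils/data/_utils/fetch.py, not a verbatim copy of the source):

def fetch(self, possibly_batched_index):
    if self.auto_collation:
        # batch_size / batch_sampler given: one __getitem__ call per sample index
        data = [self.dataset[idx] for idx in possibly_batched_index]
    else:
        # no auto-collation: the sampler output is passed to __getitem__ as-is
        data = self.dataset[possibly_batched_index]
    return self.collate_fn(data)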

This seems to be the root of the bottleneck: in a profiling trace I recreated and compared constructing the batch either with a list comprehension or by accessing the whole batch at once from the Dataset instance:

Here is the corresponding code:

import torch
from torch.utils.data import RandomSampler, BatchSampler
from torch.utils.data.dataloader import default_collate

sampler = RandomSampler(train_dataset)
batch_sampler = BatchSampler(sampler, batch_size=cfg.params.batch_size, drop_last=False)

for epoch in range(cfg.params.n_epochs):
    for batch_idx, batch_indices in enumerate(batch_sampler):
        # per-sample indexing followed by collation, as in _MapDatasetFetcher.fetch
        torch.cuda.nvtx.range_push("start fetch style batch retrieval")
        fetch_batch = [train_dataset[idx] for idx in batch_indices]
        fetch_batch_sample = default_collate(fetch_batch)
        torch.cuda.synchronize()
        torch.cuda.nvtx.range_pop()

        # indexing the whole batch at once from the Dataset instance
        torch.cuda.nvtx.range_push("start indexed batch retrieval")
        batch_sample = train_dataset[batch_indices]
        torch.cuda.synchronize()
        torch.cuda.nvtx.range_pop()

So I’m wondering what the purpose of that list comprehension is in the first place? Looking at a git blame, it seems to have always been there, even before iterable-style datasets were introduced.
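For reference, the whole-batch access path can also be reached through the DataLoader itself: if the Dataset’s __getitem__ (here, the GPU tensors behind it) accepts a list of indices, passing the BatchSampler as the sampler together with batch_size=None keeps auto_collation off, so the full index list reaches __getitem__ in one call. A sketch under those assumptions:

from torch.utils.data import DataLoader, RandomSampler, BatchSampler

batch_sampler = BatchSampler(RandomSampler(train_dataset),
                             batch_size=cfg.params.batch_size, drop_last=False)

# batch_size=None disables auto-collation, so each element yielded by the
# sampler (here: a list of indices) is handed to Dataset.__getitem__ directly
loader = DataLoader(train_dataset, sampler=batch_sampler, batch_size=None)

for batch_sample in loader:
    ...  # batch_sample is whatever train_dataset[batch_indices] returns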
