I’ve been noticing severe CPU bottlenecks in my code: CPU utilization is almost always ~100% while the GPU is consistently underutilized. I tried sending all data to the GPU up front; the code runs fine but gave no noticeable speedup, and the autograd profiler shows only a 15% decrease in to(device) calls. Afterwards, I noticed that the total size of my dataset is significantly larger than the available memory on my current card (~34GB of data on a Tesla K80). So I’m wondering whether DataLoader is doing some implicit calls moving data from CPU to GPU or vice versa. Another thing I thought might be the problem is that two elements returned by collate_fn aren’t moved to the GPU (evt_ids, evt_names), so DataLoader has a mix of GPU/CPU elements. Would this make a difference?
def collate_fn(samples):
    # NOTE: the per-field indices were lost in the original post;
    # the order below is assumed to match the return statement.
    X = [s[0] for s in samples]
    y = [s[1] for s in samples]
    w = [s[2] for s in samples]
    evt_ids = [s[3] for s in samples]
    evt_names = [s[4] for s in samples]
    # custom function padding different sized inputs
    X, adj_mask, batch_nb_nodes = pad_batch(X)
    device = torch.device('cuda')
    X = torch.FloatTensor(X).to(device)
    y = torch.FloatTensor(y).to(device)
    w = torch.FloatTensor(w).to(device)
    adj_mask = torch.FloatTensor(adj_mask).to(device)
    batch_nb_nodes = torch.FloatTensor(batch_nb_nodes).to(device)
    # evt_ids and evt_names stay on the CPU as plain Python lists
    return X, y, w, adj_mask, batch_nb_nodes, evt_ids, evt_names
# Load multiple files and switch between them every "epoch"
multi_train_loader = []
for file in args.train_file:
    loader = DataLoader(...)  # arguments were truncated in the original post
    multi_train_loader.append(loader)
for epoch in range(args.nb_epoch):
    train_loader = multi_train_loader[epoch % len(multi_train_loader)]
    for i, batch in enumerate(train_loader):
        X, y, w, adj_mask, batch_nb_nodes, _, _ = batch
DataLoader will load each sample from Dataset.__getitem__ and use the collate_fn to create a batch out of these samples. It has no knowledge of whether these tensors are on the CPU or the GPU.
Your current approach of moving the data to the GPU in your collate_fn should have the same effect as moving the data to the device inside the training loop, since each batch is pushed to the device, not the complete dataset.
For general information about data loading bottlenecks, I would recommend having a look at this post.
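As a minimal sketch of the equivalence described above, the batches can stay on the CPU inside the DataLoader and be moved one at a time in the training loop. The toy TensorDataset, its shapes, and the batch size are made up for illustration; pin_memory and non_blocking are optional extras that only matter when CUDA is available.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy data standing in for the real dataset.
features = torch.randn(64, 8)
targets = torch.randn(64, 1)
dataset = TensorDataset(features, targets)

# Fall back to the CPU so the sketch also runs without a GPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Batches are assembled on the CPU; pinned memory is requested only with CUDA.
loader = DataLoader(
    dataset,
    batch_size=16,
    shuffle=True,
    pin_memory=torch.cuda.is_available(),
)

seen = 0
for X, y in loader:
    # Only the current batch is transferred, never the whole dataset.
    X = X.to(device, non_blocking=True)
    y = y.to(device, non_blocking=True)
    seen += X.shape[0]

print(seen)
```

Keeping the transfer in the loop also leaves the collate_fn free of CUDA calls, which matters if worker processes are enabled later.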
I would like to revisit this topic, as I have run into a similar bottleneck recently.
My setup is a relatively small dataset which I push to GPU memory in full in my Dataset __init__ function. I too noticed really low GPU utilization when using a standard batched DataLoader.
After some digging and profiling I found that in the current _MapDatasetFetcher.fetch function the batch is constructed by a list comprehension. Note that self.auto_collation will always be True when passing a batch_size argument to the DataLoader constructor.
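A simplified paraphrase of that fetch logic, written as a standalone sketch and not copied from the library (SketchMapFetcher and CountingDataset are made-up names, and the real source may differ between PyTorch versions). It only counts how often __getitem__ is called in each mode.

```python
class SketchMapFetcher:
    """Hypothetical stand-in for the map-style fetcher described above."""

    def __init__(self, dataset, auto_collation, collate_fn):
        self.dataset = dataset
        self.auto_collation = auto_collation
        self.collate_fn = collate_fn

    def fetch(self, possibly_batched_index):
        if self.auto_collation:
            # One __getitem__ call per index in the batch.
            data = [self.dataset[idx] for idx in possibly_batched_index]
        else:
            # A single __getitem__ call with whatever the sampler yielded.
            data = self.dataset[possibly_batched_index]
        return self.collate_fn(data)


class CountingDataset:
    """Toy dataset that records how often __getitem__ is called."""

    def __init__(self, values):
        self.values = values
        self.calls = 0

    def __getitem__(self, index):
        self.calls += 1
        if isinstance(index, list):
            return [self.values[i] for i in index]
        return self.values[index]


indices = [0, 2, 4, 6]

per_item = CountingDataset(list(range(10)))
batch_a = SketchMapFetcher(per_item, True, list).fetch(indices)

whole_batch = CountingDataset(list(range(10)))
batch_b = SketchMapFetcher(whole_batch, False, list).fetch(indices)

print(per_item.calls, whole_batch.calls)
```

Both modes return the same items here; the difference is the number of Python-level calls into the dataset per batch.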
This seems to be the root of the bottleneck: in a profiling trace, I recreated and compared constructing the batch either with a list comprehension or by accessing the whole batch at once from the Dataset instance.
Here is the corresponding code:
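import torch
from torch.utils.data import BatchSampler, RandomSampler
from torch.utils.data.dataloader import default_collate

# train_dataset and cfg are defined elsewhere
sampler = RandomSampler(train_dataset)
batch_sampler = BatchSampler(sampler, cfg.params.batch_size, False)  # drop_last=False
for epoch in range(cfg.params.n_epochs):
    for batch_idx, batch_indices in enumerate(batch_sampler):
        # Fetcher-style retrieval: one __getitem__ call per index, then collate
        torch.cuda.nvtx.range_push("start fetch style batch retrieval")
        fetch_batch = [train_dataset[idx] for idx in batch_indices]
        fetch_batch_sample = default_collate(fetch_batch)
        torch.cuda.nvtx.range_pop()  # close the range so the two do not nest
        # Indexed retrieval: a single __getitem__ call with the whole index list
        torch.cuda.nvtx.range_push("start indexed batch retrieval")
        batch_sample = train_dataset[batch_indices]
        torch.cuda.nvtx.range_pop()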
So I’m wondering what the purpose of that list comprehension is in the first place. Looking at a git blame, it seems to have always been there, even before iterable-style datasets were introduced.
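A minimal sketch of how the indexed retrieval could be used through DataLoader itself, assuming the Dataset's __getitem__ accepts a list of indices: passing batch_size=None disables automatic collation, and a BatchSampler passed as the sampler then hands the whole index list to the dataset in one call. InMemoryDataset and its sizes are made up for illustration, and this has not been profiled against the setup above.

```python
import torch
from torch.utils.data import BatchSampler, DataLoader, Dataset, RandomSampler


class InMemoryDataset(Dataset):
    """Hypothetical dataset that keeps all samples in two tensors."""

    def __init__(self, n=100):
        self.x = torch.arange(n, dtype=torch.float32).unsqueeze(1)
        self.y = torch.arange(n, dtype=torch.float32)

    def __len__(self):
        return self.x.shape[0]

    def __getitem__(self, index):
        # index may be a single int or a list of ints; tensor indexing
        # handles both, so a whole batch is gathered in one call.
        return self.x[index], self.y[index]


dataset = InMemoryDataset()
batch_sampler = BatchSampler(RandomSampler(dataset), batch_size=32, drop_last=False)

# batch_size=None turns off automatic collation, so each list of indices
# from the BatchSampler is passed to __getitem__ as-is.
loader = DataLoader(dataset, batch_size=None, sampler=batch_sampler)

shapes = [tuple(x.shape) for x, y in loader]
print(shapes)
```

The batching logic then lives entirely in the sampler and the dataset, so any padding or stacking that a collate_fn would normally do has to happen inside __getitem__.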