Thanks for your example, it worked perfectly and I managed to adapt it to my use case. My only problem is that I still can't infer from the docs that I should do that, but maybe I'm not reading them properly. According to the docs, I understand that I have to provide my `CustomBatchSampler` instance as `batch_sampler`, not `sampler`, because it yields batches of indices. Is my understanding wrong, or should the docs be improved (in which case an issue about it may be needed)?
Curiously, using a for loop over my `CustomBatchSampler` in the training loop takes almost the same amount of time as using the `DataLoader` adapted from your example. Is that expected? My guess is that it's possible because what I'm doing manually is the same as what PyTorch does internally (in a `DataLoader`). Also, my dataset is an embedding already on the GPU (in my model), and I transfer the indices through the forward method, so there's not much gain in using more workers.
- Train loop over `CustomBatchSampler` takes ~40s
- Train loop over `DataLoader(num_workers=0)` takes ~40s
- Train loop over `DataLoader(num_workers=2)` takes ~40s
- Train loop over `DataLoader(num_workers=4)` takes ~40s

(Machine has `os.cpu_count()` = 2)
Could it be that the bottleneck is in the single GPU I’m using? Here’s a simplified sample of the code I’m using. I accumulate both accuracy and loss in CUDA Tensors to avoid transfer to/from CPU:
```python
# CustomBatchSampler version
for data in train_batch_sampler:
    data = train_dataset[data]
    data_0 = torch.tensor(data[0], device=device)
    data_1 = torch.tensor(data[1], device=device)
    data_2 = torch.tensor(data[2], device=device)

    # Common section
    target = torch.ones(..., device=device)
    optimizer.zero_grad()
    with torch.set_grad_enabled(True):
        output = model(data_0, data_1, data_2)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
    running_acc.add_((output > 0).sum())
    running_loss.add_(loss.detach() * output.size(0))
```
```python
# DataLoader version
for data in train_dataloader:
    data_0 = data[0].to(device, non_blocking=True).squeeze(dim=0)
    data_1 = data[1].to(device, non_blocking=True).squeeze(dim=0)
    data_2 = data[2].to(device, non_blocking=True).squeeze(dim=0)

    # Common section
    target = torch.ones(..., device=device)
    optimizer.zero_grad()
    with torch.set_grad_enabled(True):
        output = model(data_0, data_1, data_2)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
    running_acc.add_((output > 0).sum())
    running_loss.add_(loss.detach() * output.size(0))
```
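One caveat when timing loops like the ones above: CUDA kernels launch asynchronously, so naive wall-clock timing can measure only kernel *launches* rather than the actual GPU work. A small helper (the name and structure are my own, not from your code) that synchronizes around the call gives a more honest per-step number:

```python
import time
import torch

def timed(fn, device):
    """Run fn() and return elapsed seconds, synchronizing before and
    after the call so queued CUDA work is included in the measurement."""
    if device.type == "cuda":
        torch.cuda.synchronize(device)
    start = time.perf_counter()
    fn()
    if device.type == "cuda":
        torch.cuda.synchronize(device)
    return time.perf_counter() - start

# Usage sketch: time a single GPU (or CPU) operation.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
a = torch.randn(256, 256, device=device)
b = torch.randn(256, 256, device=device)
elapsed = timed(lambda: a @ b, device)
print(f"one matmul: {elapsed:.6f}s")
```

If the per-step time stays flat as `num_workers` changes, as in your measurements, that points to the loop being GPU-bound (or launch-bound) rather than data-loading-bound.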
Thanks in advance @ptrblck