Force DataLoader to fetch batched index from custom batch sampler

Hi everyone,

I’m working with a custom Dataset and BatchSampler. Due to the nature of my data, I have to fetch batches of different sizes, which is why I’m using a CustomBatchSampler. Because of this, the DataLoader fetches items from my CustomDataset one at a time.

  • As you can see here, if I provide a batch_sampler to a DataLoader, self.auto_collation becomes True.
  • Then, because self.auto_collation is True, items are fetched one by one, as you can see here (a rough sketch of that internal logic follows right after this list).
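
For reference, the internal fetch logic I’m referring to looks roughly like this (paraphrased from torch.utils.data._utils.fetch; details vary between versions):

class _MapDatasetFetcher:
    def fetch(self, possibly_batched_index):
        if self.auto_collation:
            # batch_sampler provided: one __getitem__ call per index
            data = [self.dataset[idx] for idx in possibly_batched_index]
        else:
            # no auto-collation: the whole batch of indices is passed at once
            data = self.dataset[possibly_batched_index]
        return self.collate_fn(data)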

As a consequence, my code is faster if I iterate over my CustomBatchSampler and fetch multiple items manually instead of using a DataLoader, since I built my Dataset to support fetching multiple indices at once. However, I then lose the possibility of easily using multiple workers (which a DataLoader provides out of the box).

Is it possible to force a DataLoader to fetch multiple items at once, i.e. data = self.dataset[possibly_batched_index] (the path taken when self.auto_collation is False in the second link)? Or is there a different approach I should consider when using a CustomBatchSampler inside a DataLoader?

Thanks!

I think the BatchSampler will make sure to pass all batch indices to your Dataset's __getitem__ method as seen in this example:

import torch
from torch.utils.data import Dataset, DataLoader


class MyDataset(Dataset):
    def __init__(self):
        self.data = torch.arange(100).view(100, 1).float()

    def __getitem__(self, index):
        # index is the whole list of batch indices yielded by the BatchSampler
        print(index)
        x = self.data[index]
        return x

    def __len__(self):
        return len(self.data)


dataset = MyDataset()
sampler = torch.utils.data.sampler.BatchSampler(
    torch.utils.data.sampler.RandomSampler(dataset),
    batch_size=10,
    drop_last=False)

loader = DataLoader(
    dataset,
    sampler=sampler)

for data in loader:
    print(data)

If you run it, you’ll see that the index inside the Dataset.__getitem__ will contain 10 indices, which can be used to slice the data directly.
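
For example, a single iteration should print something like this (the indices are random, and the extra leading dimension of size 1 is discussed at the end of the thread):

[51, 3, 88, 17, 64, 29, 70, 5, 42, 96]   # the full batch of indices, printed inside __getitem__
# data in the loop then has shape torch.Size([1, 10, 1])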

Let me know if I misunderstood the question.


Thanks for your example, it worked perfectly and I managed to adapt it to my use case. My only problem is that I still can’t infer from the docs that I should do that, but maybe I’m not reading them properly. According to the docs, I understood that I had to pass my CustomBatchSampler instance as batch_sampler, not sampler, because it yields batches of indices. Is my understanding wrong, or should the docs be improved (in which case an issue might be needed)?

Curiously, using a for loop over my CustomBatchSampler in the training loop takes almost the same amount of time as using the DataLoader adapted from your example. Is that expected? My guess is that this happens because what I’m doing manually is the same as what PyTorch does internally (in a DataLoader). Also, my dataset is an embedding already present on the GPU (in my model), and I only pass the indices through the forward method, so there’s not much to gain from using more workers.

  • Train loop over CustomBatchSampler takes ~40s
  • Train loop over DataLoader(num_workers=0) takes ~40s
  • Train loop over DataLoader(num_workers=2) takes ~40s
  • Train loop over DataLoader(num_workers=4) takes ~40s

(The machine has os.cpu_count() == 2.)

Could it be that the bottleneck is the single GPU I’m using? Here’s a simplified sample of my code. I accumulate both accuracy and loss in CUDA tensors to avoid transfers to/from the CPU:

# CustomBatchSampler version
for data in train_batch_sampler:
    data = train_dataset[data]  # fetch the whole batch of indices at once
    data_0 = torch.tensor(data[0], device=device)
    data_1 = torch.tensor(data[1], device=device)
    data_2 = torch.tensor(data[2], device=device)

    # Common section
    target = torch.ones(..., device=device)
    optimizer.zero_grad()
    with torch.set_grad_enabled(True):
        output = model(data_0, data_1, data_2)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
    running_acc.add_((output > 0).sum())
    running_loss.add_(loss.detach() * output.size(0))

# DataLoader version
for data in train_dataloader:
    data_0 = data[0].to(device, non_blocking=True).squeeze(dim=0)
    data_1 = data[1].to(device, non_blocking=True).squeeze(dim=0)
    data_2 = data[2].to(device, non_blocking=True).squeeze(dim=0)

    # Common section
    target = torch.ones(..., device=device)
    optimizer.zero_grad()
    with torch.set_grad_enabled(True):
        output = model(data_0, data_1, data_2)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
    running_acc.add_((output > 0).sum())
    running_loss.add_(loss.detach() * output.size(0))

Thanks in advance @ptrblck

I think I might have misunderstood the use case and might have given a wrong example.

The batch_sampler argument in the DataLoader accepts a sampler which returns a batch of indices. Internally it will use the list comprehension (which you’ve linked to in the first post) and pass each index separately to __getitem__. This makes sure the behavior of your custom Dataset can stay the same whether you use the “standard” sampler or a BatchSampler.

However, we could just pass the BatchSampler as the sampler argument to the DataLoader (which I’ve done), so that we will get the complete batch of indices in __getitem__.
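
As a minimal sketch of the difference, using the dataset and sampler from the example above (the extra batch dimension mentioned at the end of the thread is ignored here):

# batch_sampler=...: the indices are unpacked and __getitem__ is called once per index
loader_per_item = DataLoader(dataset, batch_sampler=sampler)

# sampler=...: the whole list of indices reaches __getitem__ in a single call
loader_per_batch = DataLoader(dataset, sampler=sampler)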

Let me know if this is your use case or if I misunderstood it.

Besides the batching, shuffling, drop_last features etc., the DataLoader might speed up your data loading pipeline if you can load and process the data in the background while your GPU is busy.
If your data is already in system memory and you can just slice it to create batches, you most likely won’t see a speedup (and might even slow the code down due to some overhead in the DataLoader).
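
As a minimal sketch, a configuration that can overlap data loading with GPU work might look like this (whether it helps depends on how expensive __getitem__ is; device is assumed to be a CUDA device as in the training snippet above, and batch_size=None is explained further down):

loader = DataLoader(
    dataset,
    sampler=sampler,
    batch_size=None,   # keep the full batch returned by __getitem__ without an extra dim
    num_workers=2,     # prepare batches in background worker processes
    pin_memory=True)   # pinned host memory enables asynchronous copies to the GPU

for data in loader:
    data = data.to(device, non_blocking=True)  # the copy can overlap with GPU compute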


Yes, that was exactly the case and your example worked perfectly. The only difference from my initial code was that I passed my CustomBatchSampler instance as batch_sampler, not sampler, but as I said, that’s fixed now thanks to your example.

Well, that explains the timings I got. For the record, my runs took almost the same time iterating over either object (DataLoader or CustomBatchSampler). The difference became visible when I used num_workers > 0 and added torch.cuda.synchronize() at the end of each loop, which confirms that the DataLoader workers fetch the data while the GPU is busy.
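
For anyone who wants to reproduce this kind of measurement, a minimal sketch (train_step is a hypothetical placeholder for the training-loop body shown earlier):

import time
import torch

torch.cuda.synchronize()   # make sure pending GPU work is finished before timing
start = time.perf_counter()
for data in train_dataloader:
    train_step(data)       # placeholder for the training-loop body shown earlier
torch.cuda.synchronize()   # wait for the last kernels before stopping the clock
print(f"epoch time: {time.perf_counter() - start:.1f}s")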

Thanks a lot @ptrblck!


With this approach the batch_size in the DataLoader defaults to 1, so the DataLoader adds an extra dimension of size 1 to the loaded data.
I found you can remove this by passing batch_size=None to the DataLoader:

loader = DataLoader(
    dataset,
    sampler=sampler,
    batch_size=None)

The DataLoader then yields the same batches as when it does the batching itself while retrieving one item at a time from the dataset, i.e.:

loader = DataLoader(
    dataset,
    batch_size=10)
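
For reference, with the MyDataset example from above, the batches should come out as:

# sampler=sampler (default batch_size=1):  data.shape == torch.Size([1, 10, 1])
# sampler=sampler, batch_size=None:        data.shape == torch.Size([10, 1])
# batch_size=10 (no custom sampler):       data.shape == torch.Size([10, 1])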

@Antonio_Ossa I filed Clarify the behavior of DataLoader sampler and batch_sampler parameters · Issue #71872 · pytorch/pytorch · GitHub for the documentation issue. Feel free to comment there as needed.