Dataloader to load batches into predefined buffer directly

Can the pytorch dataloader be configured to write data into a predefined buffer? Instead of dynamically allocating memory for each batch, each batch could be written into a buffer and read from there. Is this possible currently? Can this be done even when num_workers > 0?

You could take a look at this post showing the usage of multiprocessing.Array, which might fit your use case.
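For reference, a minimal sketch of that pattern (the names, shapes, and the random "loading" below are made up for illustration, not taken from the linked post):

import ctypes
import multiprocessing as mp

import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader


class SharedArrayDataset(Dataset):
    def __init__(self, nb_samples=16, c=3, h=32, w=32):
        # back the dataset with a multiprocessing.Array, so every worker
        # operates on the same underlying memory (relies on the default
        # fork start method on Linux)
        shared_base = mp.Array(ctypes.c_float, nb_samples * c * h * w)
        shared_np = np.ctypeslib.as_array(shared_base.get_obj())
        self.data = torch.from_numpy(shared_np).view(nb_samples, c, h, w)

    def __len__(self):
        return self.data.size(0)

    def __getitem__(self, index):
        # each worker writes its sample into the shared array here
        self.data[index] = torch.randn_like(self.data[index])
        return self.data[index]


if __name__ == "__main__":
    loader = DataLoader(SharedArrayDataset(), batch_size=4, num_workers=2)
    for batch in loader:
        print(batch.shape)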

Thank you for your suggestion! I tried this, but it does not directly solve my use case. What I want is essentially for the DataLoader not to allocate a new tensor for each batch, but to write each batch into a predefined buffer.

If my loader looks like this:

loader = DataLoader(
    dataset,
    num_workers=7,
    shuffle=False
)
loader_iter = iter(loader)
buffer = ...  # preallocated; holds 2 * num_workers batches

next(loader_iter) # this should write the new batch into the buffer directly

Is it possible to have a predefined buffer where each worker writes a batch into a position in the buffer?

You could try to create the buffer externally, pass it to the Dataset, and write the data to it from each worker (making sure that the workers don’t overlap). I guess the difference between my approach and your suggestion would be to initialize the shared array outside of the Dataset and pass it as an argument to it.
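Something like this rough sketch (the class name and shapes are just placeholders):

import torch
from torch.utils.data import Dataset, DataLoader


class BufferBackedDataset(Dataset):
    def __init__(self, buffer):
        # the preallocated buffer is created outside and passed in
        self.buffer = buffer

    def __len__(self):
        return self.buffer.size(0)

    def __getitem__(self, index):
        # the sample index selects the slot, so workers never write into the same row
        self.buffer[index] = torch.randn_like(self.buffer[index])
        return self.buffer[index]


nb_samples, c, h, w = 16, 3, 32, 32
shared_buffer = torch.zeros(nb_samples, c, h, w)
shared_buffer.share_memory_()  # place it in shared memory so workers can write into it

loader = DataLoader(BufferBackedDataset(shared_buffer), batch_size=4, num_workers=2)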

My dataset is far too large to hold entirely in a buffer, so I would like to hold only the intermediate batches returned by the DataLoader in the buffer.

If I have a static buffer, can I provide a custom collate function to the DataLoader which writes each batch into the buffer directly?
If so, do I need any special handling for the case where num_workers > 0?

Would you like to create a caching mechanism for e.g. 10% of the data and reuse it in the next epoch? If you are filling (and clearing) the cache with new samples in each iteration, I would expect to see mostly misses, since you would basically be implementing a running window which doesn’t contain the next sample.

I think that’s possible, since my example does the same inside the Dataset.__getitem__.

Yes, I also think you would have to reuse my approach of using the shared array.

I have implemented a version of this:

import torch
from torch.utils.data import DataLoader


class MyCollator:
    def __init__(self, batch_size, height, width, channels):
        self.batch_size = batch_size
        self.height = height
        self.width = width
        self.channels = channels
        # preallocated ring buffer with 8 slots, one batch per slot
        self.buffer = torch.zeros(8, self.batch_size, self.channels, self.height, self.width, dtype=torch.bfloat16)
        self.curr_idx = -1

    def __call__(self, batch):
        elem = batch[0]

        if isinstance(elem, torch.Tensor):
            # advance to the next slot and stack the batch into it in-place
            self.curr_idx = (self.curr_idx + 1) % 8
            torch.stack(batch, 0, out=self.buffer[self.curr_idx])
            return self.buffer[self.curr_idx]


collator = MyCollator(self.batch_size, self.in_height, self.in_width, 3)
loader = DataLoader(
    dataset,
    batch_size=self.batch_size,
    drop_last=True,
    num_workers=self.num_workers,
    collate_fn=collator
)
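
For reference, a self-contained toy setup I can run this with looks roughly like the following (RandomImages and the sizes here are placeholders for my real pipeline):

import torch
from torch.utils.data import Dataset, DataLoader


class RandomImages(Dataset):
    # placeholder dataset returning bfloat16 samples
    def __init__(self, n, c, h, w):
        self.data = torch.randn(n, c, h, w, dtype=torch.bfloat16)

    def __len__(self):
        return self.data.size(0)

    def __getitem__(self, index):
        return self.data[index]


if __name__ == "__main__":
    collator = MyCollator(batch_size=4, height=32, width=32, channels=3)
    loader = DataLoader(
        RandomImages(64, 3, 32, 32),
        batch_size=4,
        drop_last=True,
        num_workers=2,
        collate_fn=collator
    )
    for batch in loader:
        print(batch.shape)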

This is, however, returning the same tensor over multiple batches. What could be the issue here?

Also, each worker gets its own copy of the collator, so there should not be any overwriting of the buffer, right?