Creating a Dataloader Buffer

As I see it, the standard use of the DataLoader class is a series of operations:

  1. Call the dataloader.
  2. It performs some loading operations and returns the result.
  3. The main code then uses the result of the dataloader for further operations.

My issue with this is that the loading operations are blocking and can take a significant amount of time. Conceivably, though, the loading could be performed ahead of time, with the result stored in memory associated with the dataloader. When the dataloader is called, the result could then be returned immediately, with no processing, since it is already in memory. Meanwhile the next batch of data could be processed and loaded into memory while the main code is executing, in parallel:

  1. Call the dataloader, and it immediately returns the result stored in the buffer.
  2. The main code uses the result of the dataloader, while a separate process simultaneously loads the next data into the buffer.

This should be possible using multiprocessing, essentially creating a buffer. However, I don’t think the built-in multiprocessing functionality of the DataLoader works this way.

Does anyone know of a way to do this cleanly using existing PyTorch libraries? I’m confident I can do it with Python’s multiprocessing module, but I’d prefer to avoid that since working with tensors across multiple processes can get a bit hairy.
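
To make the idea concrete, here is a rough sketch of the kind of wrapper I have in mind, using a background thread and a queue instead of a separate process (the `BufferedLoader` class is just illustrative, not an existing PyTorch API):

```python
import queue
import threading


class BufferedLoader:
    """Illustrative sketch: wraps an iterable loader and prefetches
    batches into a bounded queue from a background thread."""

    def __init__(self, loader, buffer_size=2):
        self.loader = loader
        self.buffer = queue.Queue(maxsize=buffer_size)
        self._sentinel = object()

    def _worker(self):
        for batch in self.loader:
            self.buffer.put(batch)       # blocks when the buffer is full
        self.buffer.put(self._sentinel)  # signal end of data

    def __iter__(self):
        thread = threading.Thread(target=self._worker, daemon=True)
        thread.start()
        while True:
            batch = self.buffer.get()    # returns immediately if a batch is ready
            if batch is self._sentinel:
                break
            yield batch
```

Then `for batch in BufferedLoader(loader): ...` would keep the buffer filled while the main loop runs.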

Why do you think it is not working in the current DataLoader implementation?
The num_workers argument specifies the number of worker processes, which load and process each batch in the background while the main training loop is busy.
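
For example, something along these lines already overlaps loading with the training loop (`my_dataset` and `train_step` are placeholders for your own dataset and training code):

```python
from torch.utils.data import DataLoader

# num_workers > 0 spawns background processes that load and collate
# batches ahead of time; prefetch_factor controls how many batches
# each worker keeps queued (default is 2).
loader = DataLoader(
    my_dataset,            # placeholder Dataset
    batch_size=32,
    num_workers=4,
    prefetch_factor=2,
    pin_memory=True,
)

for batch in loader:
    train_step(batch)      # placeholder training code; workers keep loading meanwhile
```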


So basically, there is already a buffer implementation when using the DataLoader, right?

Each worker process will load an entire batch and add it to a queue. If you view this queue as a buffer, then yes.


Maybe what you want is something like the @functools.lru_cache() decorator.

You could try putting it in your dataset, right above the __getitem__(self, index) method.

Here is the documentation for the decorator.

Here is a book with an example of how to use it. (Chapter 10.2 - Listing 10.2 - Book: Deep Learning with PyTorch)
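
A rough sketch of that idea, caching a standalone loading helper used by __getitem__ (the `load_sample` function and file paths are made up for illustration; you could also put the decorator directly on __getitem__ as mentioned above):

```python
import functools

import torch
from torch.utils.data import Dataset


@functools.lru_cache(maxsize=128)
def load_sample(path):
    # Placeholder for an expensive disk read / preprocessing step;
    # repeated calls with the same path return the cached result.
    return torch.load(path)


class CachedDataset(Dataset):
    def __init__(self, paths):
        self.paths = paths

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, index):
        return load_sample(self.paths[index])
```

Note that lru_cache keys on the function arguments, so it only helps when the same samples are requested repeatedly across epochs and they fit in memory.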