PyTorch Dataset/DataLoader classes


When training ResNet on the ImageNet dataset, I coded some data-loading functionality by hand, which served me very well. I am currently transitioning from TF2 to PyTorch and am very new to the PyTorch Dataset and DataLoader classes. I am wondering whether they provide the flow I coded by hand out of the box. I did read the PyTorch tutorials and API docs before posting this question.

Here is the problem:
ImageNet consists of ~1.3 million JPEG images, which take about 140 GB of disk space. When compiling a batch, one needs to read a batch_size number of image files from disk, and each of them needs to be pre-processed; this pre-processing is computationally expensive (load an image of size NxM, randomly choose an integer in the interval [256, 480], re-size the image so that the shortest side equals this integer, randomly crop a 224x224 square from it, apply a random color transformation to it, etc.). If this pre-processing could be done once and then reused for all epochs, it wouldn’t be a problem at all, but it needs to be re-done for each file in each epoch (that’s how data augmentation is achieved). And training requires a large number of epochs (50-120).
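For concreteness, the per-image augmentation described above could be sketched like this with Pillow (the brightness jitter stands in for the color transformation, and the exact jitter strength is my assumption, not part of the original recipe):

```python
import random
from PIL import Image, ImageEnhance

def augment(img: Image.Image) -> Image.Image:
    """One training-pass augmentation, per the scheme above (sketch only)."""
    # Resize so the shortest side equals a random integer in [256, 480].
    target = random.randint(256, 480)
    w, h = img.size
    scale = target / min(w, h)
    img = img.resize((round(w * scale), round(h * scale)), Image.BILINEAR)
    # Crop a random 224x224 square.
    w, h = img.size
    left, top = random.randint(0, w - 224), random.randint(0, h - 224)
    img = img.crop((left, top, left + 224, top + 224))
    # Random color transformation (brightness jitter as a simple stand-in).
    return ImageEnhance.Brightness(img).enhance(random.uniform(0.6, 1.4))
```

Because `augment` is called anew each time an image is visited, every epoch sees a different random variant of the same file.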

Here is how I solved it:
I borrowed 30 GB of my RAM and made a buffer disk out of it. This space is enough to comfortably accommodate more than 1,000 pre-processed batches. I created a text file that tracks training progress (two lines: current epoch number, current batch number); the training process updates this file every 200 batches (equivalent to 1% of an epoch with my batch size). Then I wrote a run-time pre-processing script (training and pre-processing run at the same time, in parallel), which checks:

  1. where the training process currently is
  2. where the buffer currently starts (at which batch number)
  3. where the buffer currently ends (at which batch number)

If the pre-processing script sees that the training process has moved ahead and it is now safe to delete some batches from the start of the buffer, it does so to free space. If it sees that fewer than 800 pre-processed batches are waiting to be fed to the training process, it jumps into action, pre-processes more batches, and appends them to the end of the queue. Then it waits. It checks every 100 seconds whether there is any work for it, does it if there is, and then waits again. Pre-processing runs 100% on the CPU, so I could use multiple threads. This is important: without it, pre-processing would not be able to work fast enough, and the main GPU training process would have to wait (which is unacceptable).
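One pass of that watcher loop could be sketched as below (the real script would call this every 100 seconds; every name here — `read_progress`, `buffer`, its methods — is hypothetical, meant only to mirror the checks just described):

```python
LOW_WATER = 800  # refill when fewer than this many batches are ready

def producer_step(read_progress, buffer, low_water=LOW_WATER):
    """One pass of the watcher: free consumed batches, then refill.
    `read_progress` returns the batch index the trainer has reached;
    `buffer` is any object exposing start(), end(), drop_before(i),
    and produce(n)."""
    current = read_progress()
    # 1) Free space: everything the trainer has consumed can be deleted.
    if buffer.start() < current:
        buffer.drop_before(current)
    # 2) Refill: keep at least `low_water` pre-processed batches queued.
    ready = buffer.end() - current
    if ready < low_water:
        buffer.produce(low_water - ready)
```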

Can the PyTorch Dataset/DataLoader classes provide the above functionality out of the box? If yes, I would appreciate a push in the right direction. There is no need even for a code example (although that would be nice); just tell me whether it is possible and, if so, where to look.

Thank you!!

Please correct me if my understanding is wrong.

Your current solution has a second script that monitors the training script. If the second script notices that the training script is starting to be bottlenecked by data loading, it jumps into action and begins pre-processing data for later use. Is this correct?

If this is the case, then yes, PyTorch does offer some functionality to make this easier. The argument I would point you towards is num_workers.

You can increase num_workers to spawn more worker processes that fetch the data (which involves pre-processing it). If you find that data loading cannot keep up with the training loop, increase the number of workers. Assuming your pre-processing script is able to keep up today, you should achieve a similar effect by increasing the number of workers.

Please let me know if there’s something I misunderstood about this problem.


Alex, thank you!

Yes, that’s right - the second script watches the training script and does whatever is required at the moment, which can be:

  1. add new data to the buffer by going through the following steps:
    a) fetch new data
    b) transform this data (heavy computation)
    c) write the transformed data to the buffer for the training script to consume
  2. delete used data from the buffer to free space

Can PyTorch DataLoader do all of the above? Including step 2?

Sort of. You mentioned quite a large buffer in your description (30 GB). In practice, PyTorch’s DataLoader mostly relies on having enough workers to fetch the data faster than your program can consume it. There’s still a buffer (by default, each worker prefetches 2 batches), but it’s much smaller, and that’s okay.

Here’s how the steps look instead:

  1. Initialize a Dataset whose item retrieval lazily loads an image, augments it, and returns the result.
  2. Initialize a DataLoader with num_workers=X.
    a) Each worker is allocated its own subset of the images to load.
    b) Each worker begins fetching data and building batches, loading prefetch_factor batches in advance.
    c) The workers return these batches to your main process to be used.
  3. If you find that data loading is still a bottleneck, increase num_workers further.
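The steps above can be sketched as follows (the file list, batch size, worker count, and the stand-in `load_and_augment` function are all placeholders you would replace with your own pipeline):

```python
import torch
from torch.utils.data import Dataset, DataLoader

paths = [f"img_{i}.jpg" for i in range(1024)]  # stand-in for your file list

def load_and_augment(path):
    # Stand-in for the real decode + augmentation pipeline.
    return torch.randn(3, 224, 224)

class ImageNetTrain(Dataset):
    """Step 1: item retrieval lazily loads and augments one image."""
    def __init__(self, paths, transform):
        self.paths, self.transform = paths, transform

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # Nothing touches the disk until a worker asks for this index.
        return self.transform(self.paths[idx])

# Step 2: the workers do the heavy pre-processing in parallel processes.
loader = DataLoader(
    ImageNetTrain(paths, load_and_augment),
    batch_size=32,
    shuffle=True,
    num_workers=2,      # step 3: raise this if loading is the bottleneck
    prefetch_factor=2,  # batches each worker keeps ready (the "buffer")
)
```

Iterating over `loader` in the training loop then yields ready-made batches while the workers keep pre-processing in the background.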

Each worker is its own independent process that goes and fetches data. Assuming you have the compute, you can always add more workers. (Normally you wouldn’t use more workers than you have CPU cores, but this requires a little bit of testing.)

Let’s assume that your program is consuming data points faster than we can load them. Then we increase the number of workers. If this fixes it, fantastic. Let’s suppose it doesn’t. Then you need to look for the bottleneck elsewhere. For example, disk reads could be your bottleneck. In that situation, increasing prefetch_factor or num_workers won’t really help, because all the workers are contending for the same resource. There are still ways to remedy this, but I won’t get into them, because it’s likely that just increasing the number of workers will be sufficient (let me know if you’re still encountering a bottleneck and we can go over other methods).
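If you want to check whether loading really is the bottleneck before tuning, a crude measurement of how long the training loop sits waiting on the loader can help (a hypothetical helper, not a PyTorch API):

```python
import time

def mean_loader_wait(loader, n_batches=50):
    """Average seconds the consumer spends blocked waiting for a batch.
    Works on any iterable of batches, including a DataLoader."""
    it = iter(loader)
    waits = []
    for _ in range(n_batches):
        t0 = time.perf_counter()
        next(it)  # blocks until the next batch is ready
        waits.append(time.perf_counter() - t0)
    return sum(waits) / len(waits)
```

If this number is near zero, the workers are already keeping up, and adding more of them won’t speed up training.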


Thanks a lot!!

That’s the kind of answer I was hoping for.

You were a great help.