Does DataLoader preload future batches? If yes, how to turn it off?

Hi! I am working on a simple classification problem. However, in my setup, I would like to create batches in a smarter way than plain uniform sampling. Namely, I am trying to mine hard batches as follows (a rough sketch is shown after the list):

  1. sample a big batch uniformly (e.g. 1024 samples)
  2. apply my model to the big batch and calculate losses
  3. sample a normal batch (e.g. 128 samples) out of the big batch using multinomial distribution parameterized by the losses from step 2
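
Roughly, something like this is what I mean (purely illustrative; model, dataset and criterion are placeholders, and criterion is assumed to return per-sample losses, e.g. via reduction='none'):

import torch
from torch.utils.data import DataLoader

big_batch_size = 1024   # step 1: size of the uniformly sampled big batch
batch_size = 128        # step 3: size of the hard batch used for training

big_loader = DataLoader(dataset, batch_size=big_batch_size, shuffle=True)

for data, target in big_loader:
    with torch.no_grad():
        # step 2: per-sample losses of the current model on the big batch
        losses = criterion(model(data), target)
    # step 3: sample a normal batch with probabilities proportional to the losses
    idx = torch.multinomial(losses, num_samples=batch_size, replacement=False)
    hard_batch = (data[idx], target[idx])
    # ... train on hard_batch, which updates the model before the next iteration ...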

This procedure depends on the model, and the model changes after every batch. Consequently, I do not want to preload future batches. In other words, I want the method __next__ of dataloader.batch_sampler to be called only when the same method of dataloader itself is explicitly called. For example, in the following loop I would like the next batch to be created only after do returns:

for batch in dataloader:
    do(batch)

I know that I can achieve that by setting num_workers=0. But what is the behavior if I set num_workers=some_positive_integer? Does DataLoader preload future batches? If yes, is it possible to avoid this?

P.S. My final goal is to create my own custom BatchSampler, and the answer to this question will help me understand whether it is possible to yield batches lazily instead of precalculating them for the whole epoch.
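
For reference, this is roughly the kind of BatchSampler I have in mind (purely a sketch; names and sizes are placeholders). Because __iter__ is a generator, each batch of indices would only be built when it is requested:

import torch
from torch.utils.data import Sampler

class LazyHardBatchSampler(Sampler):
    def __init__(self, dataset_len, big_batch_size=1024, batch_size=128):
        self.dataset_len = dataset_len
        self.big_batch_size = big_batch_size
        self.batch_size = batch_size

    def __len__(self):
        return self.dataset_len // self.batch_size

    def __iter__(self):
        for _ in range(len(self)):
            # step 1: uniformly sampled candidate indices (the big batch)
            candidates = torch.randperm(self.dataset_len)[:self.big_batch_size]
            # steps 2-3 would go here: score the candidates with the current
            # model and subsample batch_size of them via torch.multinomial
            yield candidates[:self.batch_size].tolist()

# intended usage:
# loader = DataLoader(dataset, batch_sampler=LazyHardBatchSampler(len(dataset)), num_workers=0)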

If you set num_workers > 0, each worker will load a batch in the background, and in-place modifications of your Dataset won’t be noticed until the next epoch.
As you said, num_workers=0 should work, and I can’t see any reason to use multiple workers, as your workflow seems to be sequential, i.e. you can just sample the next batch after do finishes.
So even if you used multiple workers, they would all have to wait until your loop body finishes.
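
In other words, keeping num_workers=0 in the DataLoader should give you exactly this lazy behavior (dataset, sampler, and do are placeholders for your own objects):

from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,                  # your Dataset
    batch_sampler=sampler,    # your custom BatchSampler
    num_workers=0,            # no workers, so nothing is prefetched in the background
)

for batch in loader:
    do(batch)                 # the indices for the next batch are only drawn after this returns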

Thank you for the response! I was not sure about batch-level parallelization; I thought that num_workers > 0 might speed up even the construction of a single batch. Now I understand that it works differently: the parallelism is across batches, not within a single batch.

@ptrblck could I get a little more info on how this works? I’m assuming __getitem__ eventually gets called on the dataset, so as long as what gets returned there is the desired result, the preprocessing is happening in those background workers.

One other question: the preparation step in my application is extensive and unavoidable. I’m thinking about using multiple threads to make the work go faster. Do I need to worry about how these threads/processes interact with what PyTorch is doing while preparing the data? (e.g. PyTorch thinks a thread isn’t being used when really I’ve just kicked off a process on it.)

Usually you would add the preprocessing to __getitem__ and return the already preprocessed batch of samples. Note that this thread had a specific requirement:

Consequently, I do not want to preload future batches. In other words, I want the method __next__ of dataloader.batch_sampler to be called only when the same method of dataloader itself is explicitly called.

So I’m unsure if you are working with a similar requirement or if it’s a general question.
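
For the general case, a minimal sketch of what preprocessing inside __getitem__ could look like (all names here are placeholders, not a specific API):

from torch.utils.data import Dataset

class PreprocessedDataset(Dataset):
    def __init__(self, samples, transform=None):
        self.samples = samples      # raw samples, e.g. file paths and labels
        self.transform = transform  # expensive preprocessing callable

    def __getitem__(self, index):
        x, y = self.samples[index]
        if self.transform is not None:
            x = self.transform(x)   # the preprocessing runs here, inside the worker
        return x, y

    def __len__(self):
        return len(self.samples)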

You should be able to use multi-threading inside a process, i.e. in each worker of your DataLoader. Could you describe your concern a bit more, please? I.e. are you seeing any specific issues in your current use case?
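
As a rough illustration of that idea (hypothetical names, not your actual setup), each sample’s expensive pieces could be prepared by a thread pool inside __getitem__, so the threads run inside the current worker process:

from concurrent.futures import ThreadPoolExecutor
from torch.utils.data import Dataset

class ThreadedPrepDataset(Dataset):
    def __init__(self, samples, prepare, num_threads=4):
        self.samples = samples      # each sample is a list of raw pieces, e.g. file paths
        self.prepare = prepare      # expensive, thread-safe preparation function
        self.num_threads = num_threads

    def __getitem__(self, index):
        # prepare the pieces of this sample concurrently inside the worker
        with ThreadPoolExecutor(max_workers=self.num_threads) as pool:
            prepared = list(pool.map(self.prepare, self.samples[index]))
        return prepared

    def __len__(self):
        return len(self.samples)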
