How to avoid sending unnecessary data to the dataloader's workers?

I have noticed that when using multiprocessing, the DataLoader seems to spend some time copying its Dataset’s data over to the worker processes. The following minimal script should make this pretty obvious:

import time
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(100))
dataset.dummy_data = torch.randn(1000000000).tolist()
dataloader = DataLoader(dataset, num_workers=5, batch_size=10)

now = time.perf_counter()
for i, t in enumerate(dataloader):
    print(f"{time.perf_counter()-now:8.4f}s: step {i}")
    now = time.perf_counter()

On my computer, the typical output would be:

  3.6879s: step 0
  0.0001s: step 1
  0.0010s: step 2
[...]

As you can see, the first data loading step is quite slow.

If the dataset.dummy_data line is commented out, the timings look like this instead:

  0.0716s: step 0
  0.0017s: step 1
  0.0000s: step 2
[...]

It makes sense that the DataLoader would need to distribute its Dataset out to the workers, but is there any way to exclude some fields of the Dataset from being copied if they don’t actually matter to the workers, such as dummy_data in the above case? (assuming those fields cannot just be deleted altogether due to other constraints)

I’m thinking that if the Dataset is being pickled, something like this might work?
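For example, something along these lines (just an untested sketch on my side; HeavyDataset is a made-up class for illustration, and I assume this only helps if the dataset actually gets pickled, e.g. with the spawn start method):

from torch.utils.data import Dataset


class HeavyDataset(Dataset):
    """Toy dataset carrying a large attribute that the workers never use."""

    def __init__(self, data, dummy_data):
        self.data = data              # actually used by __getitem__
        self.dummy_data = dummy_data  # large, irrelevant to the workers

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

    def __getstate__(self):
        # When the dataset is pickled (e.g. sent to spawned workers),
        # drop the heavy attribute so it is never serialized or copied.
        state = self.__dict__.copy()
        state['dummy_data'] = None
        return state
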
This issue seems related to my problem here, but I don’t think the copy-on-access explanation matches, since none of the workers should be trying to access dummy_data.

After further testing, I’ve noticed that the results are still the same, even when I replace

dataset.dummy_data = torch.randn(1000000000).tolist()
#   3.6879s: step 0

by:

dummy_data = torch.randn(1000000000).tolist()
#   3.6946s: step 0

To a lesser degree, the slowdown even happens when dummy_data is not a Python list but a tensor (the tensor needs to be much longer to observe a comparable slowdown, which I assume is due to how much more densely tensors are packed in RAM):

dummy_data = torch.full((20_000_000_000,), 1.5)
#   1.2990s: step 0

This is quite a big problem, since I definitely cannot delete all of the unrelated memory-intensive variables in my process.

If you are doing this:

dummy_data = torch.randn(1000000000).tolist()

It should mean dummy_data is not part of your Dataset, correct? Unless you are passing it to your Dataset object in a way that is not specified in your code.

In that case, the issue should not be related to Dataset.

That is exactly my point: the mere existence of dummy_data slows down the DataLoader’s startup, even though it is not attached to the dataset. I don’t think this should be happening.

For reference, this is my up-to-date reproduction script (I’ve added the if __name__ == '__main__' guard to avoid the setup getting executed again in the workers, and made a few other small changes to make the test run faster overall):

import time

import torch
from torch.utils.data import DataLoader, TensorDataset


if __name__ == '__main__':

    big_quantity = 1_000_000_000

    dataset = TensorDataset(torch.randn(30))
    dummy_data = torch.full((big_quantity,), 1.5).tolist()

    dataloader = DataLoader(dataset, num_workers=5, batch_size=10)
    # dataloader.dummy_data = dummy_data
    now = time.perf_counter()
    for i, t in enumerate(dataloader):
        print(f"{time.perf_counter()-now:8.4f}s: step {i}")
        now = time.perf_counter()

Output:

  3.6229s: step 0
  0.0000s: step 1
  0.0005s: step 2

the mere existence of dummy_data slows down the DataLoader’s startup

That is how Python garbage collection works. In particular, it happens because you aren’t using dummy_data afterwards: its ref count drops to zero and it can be GC’ed. Here is a pure Python illustration (without PyTorch):

import time
import random


if __name__ == "__main__":

    big_n = 300_000_000
    ls = [random.random() for _ in range(big_n)]
    now = time.perf_counter()
    for i, t in enumerate(range(3)):
        print(f"{i}: {time.perf_counter() - now:8.8f}s")
        now = time.perf_counter()

    # Having the following line can improve the performance of the 0th iteration,
    # because it keeps a reference to `ls` and prevents it from being GC'ed
    ls2 = ls

    # With `ls2 = ls`:
    # 0: 0.00006670s
    # 1: 0.00000180s
    # 2: 0.00000060s
    
    # Without `ls2 = ls`:
    # 0: 0.00250160s
    # 1: 0.00000140s
    # 2: 0.00000070s

Even in the absence of GC, it still makes sense for the first iteration to be slower, because you need to initialize the iterator and that has overhead. In the case of DataLoader, it involves creating new processes and copying datasets over, as well as many other things (you can read the source code here).
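
If you want to see how much of that step-0 time is iterator setup versus actually waiting for the first batch, you can time the two separately. Here is a small self-contained sketch (same toy sizes as your script; timings will of course vary per machine):

import time

import torch
from torch.utils.data import DataLoader, TensorDataset


if __name__ == '__main__':

    dataset = TensorDataset(torch.randn(30))
    dataloader = DataLoader(dataset, num_workers=5, batch_size=10)

    now = time.perf_counter()
    it = iter(dataloader)  # the worker processes are created here
    print(f"{time.perf_counter() - now:8.4f}s: iterator creation")

    now = time.perf_counter()
    batch = next(it)       # blocks until the workers deliver the first batch
    print(f"{time.perf_counter() - now:8.4f}s: first batch")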

Thanks for your help. Your example is quite eye-opening. I was not aware that the GC could cause this kind of slowdown (two orders of magnitude in your case).

However, I don’t think the GC accounts for the slowdown in my case, since it still happens even if I add a reference to dummy_data in my code after the time measuring sections (step 0 still takes 3.6228s).

I also completely understand and accept that the DataLoader needs time to initialize its iterator, and that this will always add overhead to the first step. What bothers me is how much the mere presence of dummy_data slows that initialization down (from 0.0780s to 3.6229s), even though dummy_data should not matter to the DataLoader.

It is a mixture of GC, memory allocation, and copying data to the new processes (especially if you are using fork, which is the default on Linux; try spawn instead).
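
To try spawn for just the DataLoader’s workers (without changing the start method globally), you can use the multiprocessing_context argument. A minimal sketch with the same toy sizes as your repro script:

import torch
from torch.utils.data import DataLoader, TensorDataset


if __name__ == '__main__':

    dataset = TensorDataset(torch.randn(30))

    # 'spawn' starts clean worker processes instead of forking the parent,
    # so they do not inherit the parent's whole address space; the dataset
    # is pickled over to them instead.
    dataloader = DataLoader(
        dataset,
        num_workers=5,
        batch_size=10,
        multiprocessing_context='spawn',
    )

    for batch in dataloader:
        pass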

Again, this is more related to Python than DataLoader.

I see.

I have tried out spawn and it was not able to give me satisfactory performance either (4.4960s: step 0).

I currently see 3 potential solutions to this problem:

  1. Stop using Python. I am not willing to do this at this stage, since that would mean abandoning my entire code base.
  2. Find a way to recycle the worker pool of a DataLoader so that it is initialized only once. I am aware that the DataLoader’s persistent_workers flag does that, but it does not seem to allow updating the underlying dataset. As I see it, the best way to enable updating the dataset would be to rewrite or extend the DataLoader class itself, which would be pretty complex.
  3. This one is more of an imperfect workaround: use a custom batch sampler, fed into the DataLoader’s batch_sampler argument, so that the first loaded batch is smaller, reducing the wait time for the first load operation (a rough sketch follows below this list). I’ve implemented this and seen some limited improvements in my real-life use case.
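
For 3, this is roughly what I mean; WarmupBatchSampler and first_batch_size are made-up names for illustration, and this is only a sketch of the idea rather than my actual implementation:

import time

import torch
from torch.utils.data import DataLoader, TensorDataset


class WarmupBatchSampler:
    """Hypothetical batch sampler: yields one small first batch, then
    regular-sized batches, so the slow first fetch has less work to do."""

    def __init__(self, dataset_len, batch_size, first_batch_size=1):
        self.dataset_len = dataset_len
        self.batch_size = batch_size
        self.first_batch_size = min(first_batch_size, dataset_len)

    def __iter__(self):
        indices = list(range(self.dataset_len))
        yield indices[:self.first_batch_size]          # small warm-up batch
        rest = indices[self.first_batch_size:]
        for start in range(0, len(rest), self.batch_size):
            yield rest[start:start + self.batch_size]  # regular batches

    def __len__(self):
        rest = self.dataset_len - self.first_batch_size
        return 1 + (rest + self.batch_size - 1) // self.batch_size


if __name__ == '__main__':

    dataset = TensorDataset(torch.randn(30))
    batch_sampler = WarmupBatchSampler(len(dataset), batch_size=10, first_batch_size=1)
    dataloader = DataLoader(dataset, num_workers=5, batch_sampler=batch_sampler)

    now = time.perf_counter()
    for i, t in enumerate(dataloader):
        print(f"{time.perf_counter()-now:8.4f}s: step {i}")
        now = time.perf_counter()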

On 2, what kind of updates to the dataset do you need between epochs?

I am pretty much feeding in a brand new, completely different dataset (technically I’m using the same dataset object, but I have actually changed both its length and content).
This is for an inference use case.