Overall memory usage using multi-process data loading with Subset

Hello,

From the warning note in the multi-process data loading section of the docs, the overall memory usage when using multiple workers is number of workers * size of parent process.

When using torch.utils.data.Subset, does size of parent process correspond to the original dataset or to a subset of the original dataset?

In other words, when splitting a Dataset into several Subsets and loading each one with its own DataLoader using number of workers workers, is the overall memory usage number of workers * size of parent process, or number of subsets * number of workers * size of parent process?

Each worker will create a copy of the Dataset. The size of the Dataset is defined by the __init__ method, and if you are lazily loading each sample (i.e. if you are only storing the paths in the __init__), the additional memory usage would be tiny. The Subset will only wrap the Dataset and pass the provided indices to it. It won’t change the actual Dataset.__init__ method, and the lazy/eager loading logic will still be applied.
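For illustration, a lazily loading Dataset along these lines keeps only the file paths in memory, and wrapping it in a Subset does not change that. This is just a sketch; the sample_*.pt file names are made up:

import torch
from torch.utils.data import Dataset, Subset

class LazyDataset(Dataset):
  def __init__(self, paths):
    # only the list of paths is stored in memory, so each worker's copy
    # of the dataset stays small
    self.paths = paths

  def __len__(self):
    return len(self.paths)

  def __getitem__(self, idx):
    # the actual sample is only read from disk when it is requested
    return torch.load(self.paths[idx])

# hypothetical file list, just for the sketch
dataset = LazyDataset([f"sample_{i}.pt" for i in range(1000)])

# Subset only stores the indices and forwards them to the wrapped dataset;
# the lazy __getitem__ above is unchanged
subset = Subset(dataset, indices=list(range(500)))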

Thank you for your answer.

So the overall memory would indeed scale linearly with the number of subsets, but with lazy loading the memory footprint of the original Dataset can be shrunk at the cost of reading from disk at each Dataset.__getitem__ call?

No, since a Subset only wraps a Dataset and passes the indices to it. The memory increase would come from storing the additional indices.

Yes, lazily loading the data will only use memory to store the paths etc. while initializing the dataset.
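For reference, Subset behaves roughly like the simplified sketch below (not the exact torch.utils.data implementation): it keeps a reference to the wrapped dataset plus the list of indices, so the only additional memory per subset is that indices list, and the lazy loading in the wrapped dataset stays untouched.

from torch.utils.data import Dataset

class SimplifiedSubset(Dataset):
  """Simplified sketch of what torch.utils.data.Subset does."""
  def __init__(self, dataset, indices):
    self.dataset = dataset  # reference to the wrapped dataset, not a copy
    self.indices = indices  # the only additional memory

  def __getitem__(self, idx):
    # remap the index and forward it to the wrapped dataset
    return self.dataset[self.indices[idx]]

  def __len__(self):
    return len(self.indices)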

Do you have a script, or ideas on how to write one, to monitor the memory of all the data loading processes for educational purposes?

I tried adapting this blog article and code:

import torch
from torch.utils.data import Dataset, Subset, DataLoader
from common import MemoryMonitor

class ToyDataset(Dataset):
  def __init__(self, shape):
    # eagerly allocates the full dataset in memory (float32 by default)
    self.data = torch.zeros(shape)

  def __len__(self):
    return self.data.shape[0]

  def __getitem__(self, idx):
    return self.data[idx]

shape = ... # tuple, (num samples, height, width)
dataset = ToyDataset(shape)

dataloader = DataLoader(dataset, batch_size=32, num_workers=4, persistent_workers=True)
it = iter(dataloader)
monitor = MemoryMonitor()
# register every worker process with the monitor
for w in it._workers:
  monitor.add_pid(w.pid)

print(f"Single Dataset, {dataloader.num_workers} workers\n", monitor.table())

From the blog above:

By definition, we should use total PSS to count the total RAM usage of N processes.

For the same number of workers, the memory used by each worker scales with the size of the data, as expected.

I’m not sure how to interpret the outputs when keeping the size of the dataset constant and varying the number of workers, e.g. going from num_workers=2 to num_workers=4:

# For a dataset of shape (512, 1024, 1024) and batch size 32

Single Dataset, 2 workers
   time    PID  rss    pss     uss     shared    shared_file
------  -----  -----  ------  ------  --------  -------------
 21097  98405  2.4G   930.6M  157.1M  2.3G      43.4M
 21097  98598  2.4G   889.7M  119.4M  2.3G      31.0M
 21097  98600  2.4G   906.9M  136.3M  2.3G      31.0M

# For a dataset of shape (512, 1024, 1024) and batch size 32

Single Dataset, 4 workers
   time     PID  rss    pss     uss     shared    shared_file
------  ------  -----  ------  ------  --------  -------------
 21996  102287  2.4G   620.7M  158.5M  2.2G      43.4M
 21996  102488  2.4G   583.5M  126.2M  2.2G      31.3M
 21996  102490  2.4G   598.1M  140.6M  2.2G      31.5M
 21996  102492  2.3G   547.9M  90.3M   2.2G      31.6M
 21996  102494  2.3G   566.1M  108.3M  2.2G      31.6M
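As a rough sanity check, the shared/rss columns are in the right ballpark for the dataset tensor itself:

# 512 x 1024 x 1024 float32 elements (torch.zeros default dtype) at 4 bytes each
512 * 1024 * 1024 * 4 / 2**30  # = 2.0 GiB, close to the ~2.2-2.4G rss/shared values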

The memory usage per worker seems to decrease with an increasing number of workers. Do you know if this is expected or if it’s an issue with my example?