[Data Loader] What is the appropriate way to load a dataset with >1M images?

I have a large dataset with more than 1M images, and I wrote a custom dataset like this:

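import numpy as np
from torchvision import datasets
# Assumed source of pil_loader: torchvision's PIL-based image loader.
from torchvision.datasets.folder import pil_loader


class MocoDataset(datasets.VisionDataset):

  def __init__(self, filepath_file=None, transform=None):
    # Read one image path per line from the text file.
    list_of_paths = []
    with open(filepath_file, "r") as f:
      for line in f:
        list_of_paths.append(line.rstrip("\n"))
    # Keep all image paths in a NumPy array rather than a Python list.
    self.list_of_paths = np.array(list_of_paths)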
    self.transform = transform
    self.loader = pil_loader

  def __len__(self):
    return len(self.list_of_paths)

  def __getitem__(self, idx):
    image_path = self.list_of_paths[idx]  # Gives the path to an image
    image = self.loader(image_path)
    # use 1 as a pseudo label for target
    return self.transform(image), 1

But my program errored out like this:

start training
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/workdir/moco/main_moco.py", line 671, in <module>
  File "/workdir/moco/main_moco.py", line 312, in main
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
  File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 140, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGKILL
...:/workdir$ /opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 68 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

I used 32 workers and a batch size of only 2. I am fairly sure the failure is due to the dataset size, because if I load only the first 10k image paths in my __init__ method, training starts without a problem.
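
As a rough sanity check on the dataset-size theory, here is a back-of-the-envelope sketch of how much memory the path array alone could take. It relies on np.array storing Python strings as fixed-width unicode at 4 bytes per character. The helper names are made up for illustration, the 100-character path length is a placeholder, and the total assumes the worst case where the main process and every worker each hold a private copy, which may not be what actually happens:

```python
def path_array_bytes(paths):
    """Bytes np.array(paths) needs: fixed-width unicode, 4 bytes per character."""
    width = max((len(p) for p in paths), default=0)
    return len(paths) * width * 4


def worst_case_bytes(array_bytes, num_workers, nprocs=1):
    """Upper bound if the main process and each worker keep a private copy."""
    return array_bytes * (num_workers + 1) * nprocs


# Hypothetical numbers: 1M paths, the longest being 100 characters.
paths = ["x" * 100] * 1_000_000
per_copy = path_array_bytes(paths)
total = worst_case_bytes(per_copy, num_workers=32)
print(f"{per_copy / 1e6:.0f} MB per copy, {total / 1e9:.1f} GB worst case")
```

With these placeholder numbers that is about 400 MB per copy and 13.2 GB in the worst case for a single training process, so the path array by itself could plausibly matter at 32 workers.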


  1. I am looking for suggestions on how to implement this efficiently.
  2. If my batch size is 2, will only 2 images be loaded onto the GPU at a time?
  3. Even if the loader is prefetching, it won’t try to load all the items for an epoch into CPU memory, right?
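
For question 3, the DataLoader documentation describes prefetch_factor as the number of batches each worker loads in advance (default 2 when num_workers > 0), so 2 * num_workers batches are prefetched in total. A minimal sketch of that arithmetic follows; the helper name is hypothetical, and it ignores anything else that may hold samples, such as pinned-memory buffers:

```python
def prefetched_samples(batch_size, num_workers, prefetch_factor=2):
    """Approximate number of samples buffered ahead of the training loop."""
    if num_workers == 0:
        # No worker processes: batches are loaded on demand in the main process.
        return 0
    return batch_size * num_workers * prefetch_factor


print(prefetched_samples(batch_size=2, num_workers=32))  # 128
print(prefetched_samples(batch_size=2, num_workers=2))   # 8
```

Both figures are tiny next to 1M images, which suggests that prefetching alone would not pull a whole epoch into CPU memory.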


Actually, it seems to work after I reduce the number of workers to 2. So can too many workers get the host process killed? And if I want to use more workers, do I need to use more GPUs?
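
One thing that may matter here: the traceback shows mp.spawn starting nprocs=ngpus_per_node processes, and each of those would presumably create its own DataLoader workers, so the total worker count could be the per-loader setting multiplied by the number of GPUs. Below is a small sketch of splitting a fixed worker budget across processes. The helper name and the numbers are assumptions for illustration, not something taken from main_moco.py:

```python
def workers_per_process(total_workers, nprocs):
    """Split a total worker budget evenly across processes, rounding up."""
    if nprocs < 1:
        raise ValueError("nprocs must be at least 1")
    return (total_workers + nprocs - 1) // nprocs


# Hypothetical setup: a budget of 32 workers shared by 8 GPU processes.
print(workers_per_process(32, 8))  # 4
print(workers_per_process(32, 3))  # 11
```

If that reading is right, more GPUs would mean more worker processes on the same host rather than more headroom per worker.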