Guidelines for assigning num_workers to DataLoader

An SSD will speed up disc I/O, but it won't solve your problem entirely, since at some point you still need to read from disc.

I have never come across this issue myself, but here are some pointers:

  1. Use one dataset and one dataloader over the entire data. Load samples only when needed (i.e., the current minibatch), and delete the minibatch once it has been processed. The risk is that the worker processes load minibatches faster than they are consumed and saturate the RAM. Moreover, you need to access the disc for every sample, every time. (A minimal sketch of such a lazy-loading dataset is given after this list.)
  2. Randomly split the entire data into k BIG chunks at every epoch. Then load one chunk into memory, wrap it in a torch.utils.data.Dataset and a torch.utils.data.DataLoader, and process this chunk as you would any dataset. Afterwards, delete the dataset and the dataloader. The advantage is that you control the number of samples loaded into memory, so you avoid overloading it. Something like this:
size_chunk = 10000
nbr_chunks = 70
for i in range(nbr_chunks):
    # Load the current chunk (size_chunk samples) from disc into memory.
    # This requires disc access, and your SSD can boost the speed.
    # However, this remains an issue since you will need to reload EVERY SAMPLE EVERY TIME.
    chunk_i = load_chunk(i, size_chunk)  # placeholder for however you read chunk i from disc
    # Wrap the chunk in a dataset. torch.utils.data.Dataset must be subclassed;
    # TensorDataset works if the chunk is a (data, labels) pair of tensors.
    dataset_i = torch.utils.data.TensorDataset(*chunk_i)
    # Create a dataloader that loads only this chunk, and splits it into minibatches.
    dataloader_chunk_i = torch.utils.data.DataLoader(dataset_i, batch_size=64, shuffle=True)
    # Do your training over the current minibatches:
    for j, (data, label) in enumerate(dataloader_chunk_i):
        # Process this minibatch: forward data, compute loss, update params, and all that.
        ...
    # Now you are done with this chunk. Delete it to free the memory.
    del dataset_i
    del dataloader_chunk_i
  3. As you can see, option 2 is problematic because of the limited space of the memory (RAM). Another way you can try, which avoids the annoying chunking process above and keeps the standard data loading, is to find a workaround to load the entire dataset into memory: compress the samples on disc, then load all of them into memory and keep them compressed. Decompress a sample ONLY when it is NEEDED. Once you are done with that sample, delete the decompressed version and keep only the compressed one. At any time, only a minibatch worth of samples is decompressed, while everything else stays compressed to preserve memory space. This is a good solution when reading from disc is slower than decompressing a sample in memory. (A sketch of such a compressed in-memory dataset is given below.)
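
Here is a minimal sketch of option 1, assuming each sample lives on disc as its own .pt file containing a (data, label) pair; the directory layout, the file format, and the LazyDiscDataset name are illustrative assumptions, not a fixed recipe:

import os
import torch
from torch.utils.data import Dataset, DataLoader

class LazyDiscDataset(Dataset):
    # Keeps only the file names in memory; each sample is read from disc on demand.
    def __init__(self, sample_dir):
        self.sample_dir = sample_dir
        self.files = sorted(os.listdir(sample_dir))

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        # Disc access happens here, once per requested sample.
        data, label = torch.load(os.path.join(self.sample_dir, self.files[idx]))
        return data, label

# The workers prefetch minibatches in the background; keep num_workers moderate
# so that the prefetched minibatches do not saturate the RAM.
dataloader = DataLoader(LazyDiscDataset("./samples"), batch_size=64, shuffle=True, num_workers=4)

With this setup, only the minibatches currently being prefetched sit in RAM; everything else stays on disc.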
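
And here is a rough sketch of option 3, using zlib as the compressor; the CompressedInMemoryDataset name and the serialization via torch.save are assumptions for illustration, and any compression scheme that beats your disc read time will do:

import io
import zlib
import torch
from torch.utils.data import Dataset

class CompressedInMemoryDataset(Dataset):
    # Keeps every sample compressed in RAM and decompresses it only when requested.
    def __init__(self, samples):
        # samples: a list of (data, label) tensor pairs; each pair is serialized
        # and compressed once, up front.
        self.blobs = []
        for pair in samples:
            buffer = io.BytesIO()
            torch.save(pair, buffer)
            self.blobs.append(zlib.compress(buffer.getvalue()))

    def __len__(self):
        return len(self.blobs)

    def __getitem__(self, idx):
        # Only this sample is decompressed; the decompressed copy is discarded
        # once the minibatch has been processed.
        data, label = torch.load(io.BytesIO(zlib.decompress(self.blobs[idx])))
        return data, label

This only pays off when decompressing a sample is faster than reading it from disc, and when the compressed dataset actually fits in RAM.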

Please let me know which option works best for you!
