Dataset of Many Small NumPy Arrays


I have a dataset that consists of many small NumPy arrays; the total size is around 300 GB.
Each call to the Dataset's `__getitem__` reads one file and returns its contents. However, training takes a long time: roughly every 15 batches, there is about a minute of GPU idle time.
How can I optimize this pipeline?
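For context, the access pattern described above might look like the sketch below. This is a hypothetical minimal reconstruction, not the asker's actual code; in practice the class would subclass `torch.utils.data.Dataset`, but the one-file-per-`__getitem__` pattern is the same.

```python
import os
import tempfile

import numpy as np


class ManyFilesDataset:
    """Hypothetical dataset with one small .npy file per sample.

    In a real pipeline this would subclass torch.utils.data.Dataset;
    only numpy is used here to keep the sketch self-contained.
    """

    def __init__(self, file_paths):
        self.file_paths = file_paths  # one .npy path per sample

    def __len__(self):
        return len(self.file_paths)

    def __getitem__(self, idx):
        # Each call opens and reads a separate small file. The per-call
        # cost is tiny, but the aggregate I/O latency across hundreds of
        # thousands of files can starve the GPU.
        return np.load(self.file_paths[idx])


# Tiny demo with throwaway files
tmp = tempfile.mkdtemp()
paths = []
for i in range(3):
    p = os.path.join(tmp, f"sample_{i}.npy")
    np.save(p, np.full((4,), i, dtype=np.float32))
    paths.append(p)

ds = ManyFilesDataset(paths)
print(len(ds), ds[1].tolist())  # 3 [1.0, 1.0, 1.0, 1.0]
```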

It seems like you may be running out of memory. Have you been tracking your memory usage?

  1. Are you storing one sample per file? If a file contains multiple samples, each `__getitem__` may be reading more data than it needs.
  2. You may also want to try a lower number of workers (if `num_workers` is not already 0), since memory consumption grows with the number of workers.
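To make the memory check concrete, here is one rough way to estimate the footprint. The sizes, worker counts, and `prefetch_factor` below are illustrative assumptions, not measured values from the asker's setup; `tracemalloc` is standard-library and tracks NumPy buffer allocations in recent NumPy versions.

```python
import tracemalloc

import numpy as np

# Snapshot Python-level allocations around one batch's worth of loads
# to see how much a single batch costs in memory.
tracemalloc.start()
batch = [np.zeros((256, 256), dtype=np.float32) for _ in range(16)]
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

per_sample = batch[0].nbytes  # 256 * 256 * 4 = 262144 bytes
print(f"peak traced: {peak / 2**20:.1f} MiB")
print(f"one sample:  {per_sample / 2**10:.0f} KiB")

# With a PyTorch DataLoader, each worker prefetches its own batches,
# so resident memory scales roughly with:
#   num_workers * prefetch_factor * batch_size * per_sample_bytes
# (prefetch_factor defaults to 2 when num_workers > 0)
num_workers, prefetch_factor, batch_size = 4, 2, 16
estimate = num_workers * prefetch_factor * batch_size * per_sample
print(f"rough prefetch footprint: {estimate / 2**20:.0f} MiB")
```

If that estimate approaches available RAM, lowering `num_workers` (or the batch size) is the first knob to turn.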