I am working on a project with histopathological images (Whole-Slide Images, or WSIs). Each of these images is ~1 GB, so they are really hard to handle.
In particular, I struggle when I use DataLoader(num_workers=N) with N > 1, because each worker loads and preprocesses (slight data augmentation in our case) its own batches in RAM, so several batches are held at once and the RAM fills up fast.
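To make the memory pressure concrete, here is a rough back-of-the-envelope sketch. It relies on the documented DataLoader behaviour that each worker prefetches prefetch_factor batches (default 2 when num_workers > 0), so about num_workers * prefetch_factor batches can be in flight. The function name, the batch size, and the assumption that a sample occupies about 1 GB once loaded are illustrative only; decoded images may well be larger, and this ignores any other per-worker overhead.

```python
def estimate_prefetch_ram_gb(num_workers, prefetch_factor, batch_size, sample_gb):
    """Rough estimate (in GB) of RAM held by prefetched batches.

    Hypothetical helper: assumes every sample occupies `sample_gb` in memory
    and that all workers have filled their prefetch queues.
    """
    batches_in_flight = num_workers * prefetch_factor
    return batches_in_flight * batch_size * sample_gb


# Illustrative numbers: 4 workers, default prefetch_factor of 2,
# batches of 4 slides, ~1 GB per slide.
print(estimate_prefetch_ram_gb(4, 2, 4, 1.0))  # 32.0
# Lowering prefetch_factor to 1 halves the estimate.
print(estimate_prefetch_ram_gb(4, 1, 4, 1.0))  # 16.0
```

So lowering prefetch_factor or the batch size reduces the peak, but the workers still each build a whole batch on their own.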
I wanted to know whether anyone else is working on an alternative DataLoader mechanism that would allow multiple workers to work on the same batch.
I would also like to know if you have any suggestions regarding this topic.
Since I have never opened a PyTorch PR, and since I noticed that worker shutdown/handling is a very delicate matter, do you think I could open a draft PR so that someone could provide suggestions and support?
This feature request is tracked here and I’m sure contributions are welcome, so please feel free to post your interest in the issue and code owners will follow up with you.
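In the meantime, one possible workaround is to parallelize the loading inside a single batch yourself, for example with a thread pool. The sketch below is not the DataLoader's own mechanism and not the design discussed in the feature request; load_batch_parallel and fake_load_tile are hypothetical names, and the loader is a stand-in for whatever reads and augments one slide or tile.

```python
from concurrent.futures import ThreadPoolExecutor


def load_batch_parallel(indices, load_sample, max_workers=4):
    """Load all samples of ONE batch concurrently, preserving index order.

    `load_sample` is any callable mapping an index to a loaded sample
    (hypothetical; replace it with your own slide/tile reader).
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the order of `indices`.
        return list(pool.map(load_sample, indices))


def fake_load_tile(index):
    # Stand-in for reading and augmenting one image.
    return {"index": index, "pixels": [index] * 3}


batch = load_batch_parallel([5, 2, 9], fake_load_tile, max_workers=2)
print([sample["index"] for sample in batch])  # [5, 2, 9]
```

A function like this could be called from the __getitem__ of a dataset that receives a list of indices, with the DataLoader created with batch_size=None and a batch sampler passed as the sampler, so that only one batch is built at a time. Threads only help if the loading code releases the GIL (file I/O and many C-backed decoders do), so this is worth measuring before relying on it.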