I've noticed that in some situations a smaller batch size decreases error, but I want to minimize the I/O overhead of an epoch with many iterations.
It seems that the `DataLoader(..., pin_memory=True)` option is designed to speed up batch loading. What if I know that most of my train/test datasets will fit on the device? Is there a way to cache the entire dataset on the GPU, so that I can tune the batch size as needed without suffering increased I/O overhead for small batches?
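
Concretely, is something like the following sketch workable? (The tensors and shapes here are placeholders for my real data; I just want to confirm the general approach of moving everything to the GPU once and batching from there.)

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

device = torch.device("cuda")

# Placeholder tensors standing in for my real train set.
X = torch.randn(50_000, 3, 32, 32)
y = torch.randint(0, 10, (50_000,))

# Move the whole dataset to the GPU once, up front.
X_gpu, y_gpu = X.to(device), y.to(device)
dataset = TensorDataset(X_gpu, y_gpu)

# num_workers must stay 0 and pin_memory False, since the tensors
# already live on the GPU (pinning only applies to CPU memory, and
# worker subprocesses can't hand out CUDA tensors).
loader = DataLoader(dataset, batch_size=64, shuffle=True,
                    num_workers=0, pin_memory=False)

for xb, yb in loader:
    # xb, yb are already on the GPU; no host-to-device copy per batch.
    ...
```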