I am loading images as follows:
train_data = datasets.ImageFolder(data_dir, transform=transform['train'])
n = len(train_data)
lengths = [int(n * 0.6), int(n * 0.3)]
lengths.append(n - sum(lengths))  # lengths must be ints that sum to the dataset size
train_data, val_data, test_data = random_split(train_data, lengths)
train_loader = torch.utils.data.DataLoader(train_data, shuffle=True, batch_size=batch_size, num_workers=4)
val_loader = torch.utils.data.DataLoader(val_data, batch_size=batch_size)
test_loader = torch.utils.data.DataLoader(test_data, shuffle=False, batch_size=batch_size)
How can I extract image names (e.g. ‘img1.jpg’) from such a dataloader structure?
You could write a custom Dataset and return the image names in the __getitem__. This tutorial might be a good starter.
I wrote a CustomImageDataset class, and it now allows me to access image file names, but I ran into another problem - it is now much slower than when I used datasets.ImageFolder.
Could you please advise what I should change to make it faster? (And one more problem: I can’t use a big batch size, as it doesn’t fit in GPU memory despite doing del my_vars and torch.cuda.empty_cache().)
class CustomImageDataset(Dataset):
    def __init__(self, annotations_file):
        self.img_labels = annotations_file

    def __getitem__(self, idx, idx2):
        img_paths = self.img_labels['img_paths'][idx:idx2]
        all_imgs = np.empty((0, 3, img_width, img_height), float)
        for imid in img_paths:
            image = Image.open(imid)
            image = transforms.Resize((img_width, img_height), interpolation=2)(image)
            image = transforms.ToTensor()(image)
            image = image.unsqueeze(0)
            all_imgs = np.append(all_imgs, image, axis=0)
        label = torch.tensor(self.img_labels['class'][idx:idx2].values)
        all_imgs = torch.tensor(all_imgs)
        return all_imgs, label, img_paths
df = pd.DataFrame(pics, columns=['img_paths'])
# class is derived from the folder name in the path (assumed: second-to-last component)
df['class'] = df['img_paths'].apply(lambda row: row.split('/')[-2]).astype('category').cat.codes.astype(int)
classes = np.unique(df['class'])
df = shuffle(df)
train_data = df[0:(len(df) // 10) * 7]
val_data = df[len(train_data):(len(df) // 10) * 9]
test_data = df[len(train_data) + len(val_data):]
train_data = CustomImageDataset(train_data)
val_data = CustomImageDataset(val_data)
test_data = CustomImageDataset(test_data)
I don’t know how your custom Dataset works exactly, as it seems a custom sampler would be needed, since __getitem__ expects two index values? If that’s the case, how many images are you loading?
ImageFolder loads a single sample in its __getitem__, and I don’t know how you let it load two images, so what’s the baseline you are comparing against?
To get some speedup you could use tensor = torch.from_numpy(arr) to share the underlying data instead of creating a copy via tensor = torch.tensor(arr).
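A quick illustration of the difference: torch.from_numpy reuses the NumPy buffer, while torch.tensor allocates new memory and copies.

```python
import numpy as np
import torch

arr = np.zeros(3, dtype=np.float32)

shared = torch.from_numpy(arr)  # shares memory with arr, no copy
copied = torch.tensor(arr)      # allocates new memory and copies the data

arr[0] = 42.0
print(shared[0].item())  # 42.0 - the view sees the change
print(copied[0].item())  # 0.0  - the copy does not
```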
I don’t know what my_vars is, but you might need to reduce the batch size to be able to train your model. Deleting unused tensors would reduce the memory requirement, but calling empty_cache() would slow down your code without avoiding the OOM issue.
I’m using two index values in __getitem__ to create a batch:
data, labels, img_path = train_data.__getitem__(i, i + batch_size)
so that I can access the image file names; the indices are just for batch slicing.
The total number of images is around 80k (currently the WikiArt dataset, but later I will have to use a 300k+ set of images), so I’m not able to keep the whole set of images in memory, and I’m trying to find the most efficient way to load them in batches.
my_vars is just all my variables, which I delete after use (but it’s still not enough).
And it seems I’m not able to use a batch size > 80 (but when I load images with datasets.ImageFolder, without image file names, I can easily use a batch size of 300, for example).
I measured the time and it turns out the slowness comes from loss.backward() - it takes up to 20 seconds per batch now, but when I was loading images via datasets.ImageFolder it was way faster.
The loss.backward() operation isn’t sped up or slowed down by the usage of another Dataset, but depends on the model architecture as well as the input shapes.
If you’ve increased the input shapes in the new workflow, a slowdown might be expected. Otherwise, I would guess you are timing your code wrong.
Also, your __getitem__ approach would not leverage the DataLoader, which could pre-load the batches using multiprocessing. If you want to load an entire batch in each worker, use a BatchSampler.
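A sketch of that pattern: passing a BatchSampler as sampler= together with batch_size=None disables automatic batching, so the DataLoader hands a whole list of indices to __getitem__ in one call and each worker loads an entire batch. The toy dataset below is illustrative; a real version would open the images and return their paths as well.

```python
import torch
from torch.utils.data import Dataset, DataLoader, BatchSampler, SequentialSampler

class BatchDataset(Dataset):
    """Toy dataset whose __getitem__ accepts a list of indices
    and returns the whole batch at once."""
    def __init__(self, n):
        self.data = torch.arange(n, dtype=torch.float32)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, indices):
        # indices is a list of ints when driven by a BatchSampler
        return self.data[indices]

dataset = BatchDataset(10)
sampler = BatchSampler(SequentialSampler(dataset), batch_size=4, drop_last=False)
# batch_size=None turns off automatic batching: the sampler's index lists
# are passed straight to __getitem__, one call per batch
loader = DataLoader(dataset, sampler=sampler, batch_size=None)

for batch in loader:
    print(batch.shape)  # torch.Size([4]), torch.Size([4]), torch.Size([2])
```

With num_workers > 0 this keeps the multiprocessing pre-fetching of the DataLoader while still loading whole batches, which the manual train_data.__getitem__(i, i + batch_size) calls give up.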