Dataloader: load image once but use multiple times

Ozan · December 8, 2018, 11:26pm

Hi everyone,

Let’s say I have a large image, and loading it is the bottleneck of my data loading pipeline. Normally you set up your dataset and specify in __getitem__ how each image is loaded. I want to load this image once but get 10 random crops from it. So I essentially want some kind of yield loop inside __getitem__, but in a way that works with the PyTorch DataLoader so I can use batch sizes > 1 and parallel workers. How can I do this?

Thank you!

vmirly1 · December 9, 2018, 4:58am

I don’t think you can do that inside the __getitem__ function. But if you create a list such as self.data, you can append multiple copies of this image to the list in __init__. Then, in __getitem__, you retrieve items from the list self.data.
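
A minimal sketch of this idea, under some assumptions: the class name RepeatedCropDataset, the load_fn callable, and the crop_size / crops_per_image parameters are all hypothetical, and the image is a plain nested list so the example runs without extra libraries. In a real pipeline the class would typically subclass torch.utils.data.Dataset and the image would be a tensor or array. The random crop is added here to match the original question; the post itself only describes the list of repeated entries.

```python
import random


class RepeatedCropDataset:
    """Map-style dataset: each image is loaded once in __init__ and
    exposed crops_per_image times through self.data."""

    def __init__(self, paths, load_fn, crop_size, crops_per_image=10):
        # load_fn is a hypothetical callable: path -> 2D image (list of rows).
        self.crop_h, self.crop_w = crop_size
        self.data = []
        for path in paths:
            image = load_fn(path)  # the expensive load happens once, here
            # Repeat the same object; these are references, not deep copies.
            self.data.extend([image] * crops_per_image)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        image = self.data[index]
        height, width = len(image), len(image[0])
        # Assumes the crop is no larger than the image.
        top = random.randint(0, height - self.crop_h)
        left = random.randint(0, width - self.crop_w)
        rows = image[top:top + self.crop_h]
        return [row[left:left + self.crop_w] for row in rows]
```

Since every index returns a fresh random crop, a DataLoader over this dataset would batch the crops as usual. The trade-off is that all images are held in memory from construction onward.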

SimonW · December 9, 2018, 6:42am

You can add a cache to your dataset class. However, do note that if you are using multiprocessing data loading (i.e., num_workers > 0), each worker process will get an independent cache, unless you share it across processes, for example through shared memory objects or a local database.
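
One way such a cache might look, as a rough sketch rather than a definitive implementation: the names CachedCropDataset, load_fn, and _cache are hypothetical, the cache is an unbounded dict, and the image is again a plain nested list. Each dataset index maps to an image via integer division, so 10 consecutive indices share one loaded image.

```python
import random


class CachedCropDataset:
    """Map-style dataset that loads each image lazily and keeps it
    in a per-process dict cache."""

    def __init__(self, paths, load_fn, crop_size, crops_per_image=10):
        # load_fn is a hypothetical callable: path -> 2D image (list of rows).
        self.paths = list(paths)
        self.load_fn = load_fn
        self.crop_h, self.crop_w = crop_size
        self.crops_per_image = crops_per_image
        self._cache = {}  # image index -> loaded image; unbounded in this sketch

    def __len__(self):
        return len(self.paths) * self.crops_per_image

    def _get_image(self, image_index):
        # Load only on the first request for this image in this process.
        if image_index not in self._cache:
            self._cache[image_index] = self.load_fn(self.paths[image_index])
        return self._cache[image_index]

    def __getitem__(self, index):
        image = self._get_image(index // self.crops_per_image)
        height, width = len(image), len(image[0])
        # Assumes the crop is no larger than the image.
        top = random.randint(0, height - self.crop_h)
        left = random.randint(0, width - self.crop_w)
        rows = image[top:top + self.crop_h]
        return [row[left:left + self.crop_w] for row in rows]
```

With num_workers > 0, each worker holds its own copy of the dataset object and therefore its own _cache, so the same image may still be loaded once per worker, as noted above. A bounded cache would also need an eviction policy, which this sketch leaves out.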