Best practices for network-bottlenecked image data loading + transforms

I am loading my image data so that the download and the transform both happen in the same Dataset __getitem__ call:

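import numpy as np
from PIL import Image
from torch.utils.data import Dataset


class MyImageSet(Dataset):
    ...

    def __getitem__(self, idx):
        file_key = self.file_keys[idx]
        # Blocking network download; the worker waits here.
        local_file_path = self.download_file(file_key)

        image = Image.open(local_file_path)
        image = np.array(image)
        image = self.image_transform(image)

        return image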

The thing is, I am bottlenecked by the network, so I feel I could speed things up if the worker did not have to finish running a transform before starting the next download.

Is there a convenient way to do this with the Dataset/DataLoader APIs directly?

I’d first recommend downloading the files once up front, in a separate method called when the Dataset is initialized, instead of doing it on the fly. Another workaround is to at least skip the separate download-to-disk step and read each image directly (this assumes file_key is a path or URL that skimage’s io.imread can open):

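import numpy as np
from skimage import io
from torch.utils.data import Dataset

...
class MyImageSet(Dataset):
    ...

    def __getitem__(self, idx):
        file_key = self.file_keys[idx]

        # local_file_path = self.download_file(file_key)
        # image = Image.open(local_file_path)

        # Read the image directly instead of saving it to disk first.
        image = io.imread(file_key)

        # imread normally returns a NumPy array already, so this line is redundant.
        image = np.array(image)
        image = self.image_transform(image)

        return image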

But if you still need to download the files, you could write them to the server’s local SSD and read them back from there, as that will be faster.
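
The two suggestions above (download up front when the Dataset is created, and keep the files on local disk) could be combined roughly as follows. This is only a sketch: prefetch_files, cache_dir, and the fetch callable are hypothetical names, with fetch standing in for whatever download_file does in the original class. It also assumes the keys are unique and that replacing "/" with "_" yields distinct file names.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, List, Sequence


def prefetch_files(
    file_keys: Sequence[str],
    cache_dir: str,
    fetch: Callable[[str, Path], None],
    max_workers: int = 16,
) -> List[Path]:
    """Download every key into cache_dir once and return the local paths.

    fetch(key, dest) is a hypothetical stand-in for the poster's
    download_file; it is expected to write the file for key to dest.
    """
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    # Flatten keys such as "a/b.png" into one file name (assumed key format).
    paths = [cache / key.replace("/", "_") for key in file_keys]

    def fetch_if_missing(key: str, dest: Path) -> None:
        if dest.exists():
            return  # already cached by an earlier run
        tmp = dest.with_name(dest.name + ".part")
        fetch(key, tmp)
        tmp.replace(dest)  # rename so a partial download never looks complete

    # Downloads are I/O-bound, so threads can overlap the network waits.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list(...) waits for all downloads and re-raises any error.
        list(pool.map(fetch_if_missing, file_keys, paths))
    return paths
```

Called once from __init__, the returned paths can be stored on the Dataset so that __getitem__ only opens a local file and runs the transform, with no network wait per sample.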

The image transforms shouldn’t be the bottleneck.
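
One way to check this on your own data is to time the two stages separately over a handful of samples. The helper below is a hypothetical sketch, not part of any library: download and transform are placeholders for your own download_file plus image loading and for image_transform.

```python
import time
from typing import Callable, Dict, Sequence


def time_stages(
    file_keys: Sequence[str],
    download: Callable[[str], object],
    transform: Callable[[object], object],
) -> Dict[str, float]:
    """Return the total seconds spent in each stage over the given keys."""
    totals = {"download": 0.0, "transform": 0.0}
    for key in file_keys:
        t0 = time.perf_counter()
        item = download(key)  # network-bound stage
        t1 = time.perf_counter()
        transform(item)  # CPU-bound stage
        t2 = time.perf_counter()
        totals["download"] += t1 - t0
        totals["transform"] += t2 - t1
    return totals
```

If the download total dominates, speeding up or overlapping the downloads is where the gain is; if the two are comparable, the transform is worth a second look.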

This post might be helpful in addition.