Creating custom image classes for .npy - massive data loading

I am trying to train on around 200 GB of .npy files. I have a custom image class:

class CustomImageFolder(ImageFolder):
    def __init__(self, root, transform=None):
        super(CustomImageFolder, self).__init__(str(root), transform)

    def __getitem__(self, index):
        path = self.imgs[index][0]
        img = np.load(path)
        img /= 255  # normalization
        return img

root = Path(dset_dir).joinpath('ZebraFish/train/')
transform = None
train_kwargs = {'root': root, 'transform': transform}
dset = CustomImageFolder


train_dataset = dset(**train_kwargs)
train_loader = DataLoader(dataset=train_dataset,
                          batch_size=batch_size,
                          shuffle=True,
                          num_workers=num_workers,
                          pin_memory=True,
                          drop_last=True)

I’m getting the following error:

RuntimeError: Found 0 files in subfolders of: data/ZebraFish/train
Supported extensions are: .jpg,.jpeg,.png,.ppm,.bmp,.pgm,.tif

I see that the default loader function creates a PIL object. Since I’m working with .npy files, is there a simple way around this?

Is there a way to get this same functionality from the DataLoader for massive datasets of .npy files?
All the best

Sure! You don’t need to inherit from ImageFolder.
Just create your own Dataset and load your numpy arrays as you want:

import glob
import os
import numpy as np
from torch.utils.data import Dataset

class MyDataset(Dataset):
    def __init__(self, root, transform=None):
        # get your numpy array paths here (recursive, to cover class subfolders)
        self.image_paths = sorted(glob.glob(os.path.join(str(root), '**', '*.npy'), recursive=True))
        self.transform = transform

    def __getitem__(self, index):
        # load the array and normalize; cast to float32 so the division works on integer data
        img = np.load(self.image_paths[index]).astype(np.float32) / 255.
        if self.transform is not None:
            img = self.transform(img)
        return img

    def __len__(self):
        return len(self.image_paths)
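
Then you can pass it to the DataLoader just like before. The root path, batch size, and worker count below are placeholder values mirroring your post, so adjust them to your setup:

from torch.utils.data import DataLoader

train_dataset = MyDataset(root='data/ZebraFish/train', transform=None)
train_loader = DataLoader(dataset=train_dataset,
                          batch_size=64,     # placeholder value
                          shuffle=True,
                          num_workers=4,     # placeholder value
                          pin_memory=True,
                          drop_last=True)

# each batch is a stacked FloatTensor (default collate; assumes all arrays share a shape)
for batch in train_loader:
    ...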