Dataloader resets dataset state

@ptrblck I am trying to use this caching method, but my images are now coming up blank, so I am obviously munging something somewhere. I am caching the images as NumPy arrays, since I need to run torchvision transforms on them, which means they can't be tensors before that pre-processing happens. My understanding is that, because we pre-allocate the entire array up front, we can use as many workers as we like right from the start.
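
To convince myself the sharing itself works, I ran a tiny probe like the one below. It is a stripped-down sketch (CacheProbe is just a throwaway name), and it assumes the default fork start method on Linux, since an mp.Array is shared through inheritance rather than by pickling:

import ctypes
import multiprocessing as mp

import numpy as np
from torch.utils.data import Dataset, DataLoader

class CacheProbe(Dataset):
    def __init__(self, n):
        base = mp.Array(ctypes.c_float, n)  # shared memory, inherited by workers
        self.arr = np.ctypeslib.as_array(base.get_obj())
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, i):
        self.arr[i] = float(i) + 1.0  # this write happens inside a worker process
        return i

if __name__ == "__main__":
    ds = CacheProbe(8)
    for _ in DataLoader(ds, batch_size=2, num_workers=2):
        pass
    print(ds.arr)  # all entries non-zero -> worker writes are visible in the parent

That prints the filled array for me, so I believe the shared allocation itself is not the problem.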

The default use_cache is False, and I am just trying to get things working in that state, where __getitem__ writes each image into the cache and then immediately reads it back. I realize that in the normal flow I would leave use_cache as False for the first epoch, to populate the cache, and only set it to True afterwards, once the cache has been built; roughly the loop sketched below.
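
For reference, the flow I have in mind looks roughly like this (a sketch only: train_df and param['epochs'] are stand-ins, and the training step is a placeholder):

dataset = MDataset(df=train_df,
                   imfolder=f"{param['image_dir']}train",
                   transforms=transform,
                   meta_features=meta_features)
loader = DataLoader(dataset, batch_size=12, shuffle=True,
                    num_workers=param['num_workers'])

for epoch in range(param['epochs']):
    for (x, meta), y in loader:
        pass  # training step goes here
    if epoch == 0:
        # After one full pass every index has been visited, so the cache is
        # populated. Workers are re-created each epoch (persistent_workers is
        # off by default), so they pick up the flag on the next epoch.
        dataset.set_use_cache(True)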

import ctypes
import multiprocessing as mp
import os

import cv2
import numpy as np
import pandas as pd
from torch.utils.data import Dataset

class MDataset(Dataset):
    def __init__(self, df: pd.DataFrame, imfolder: str, train: bool = True,
                 transforms=None, meta_features=None, use_cache: bool = False):
        self.df = df
        self.imfolder = imfolder
        self.transforms = transforms
        self.train = train
        self.meta_features = meta_features

        c = 3
        h = param['image_size'][0]
        w = param['image_size'][1]

        # Pre-allocate one float32 slot per image in shared memory, so every
        # DataLoader worker reads and writes the same buffer as the parent.
        shared_array_base = mp.Array(ctypes.c_float, len(self.df) * c * h * w)
        shared_array = np.ctypeslib.as_array(shared_array_base.get_obj())
        self.shared_array = shared_array.reshape(len(self.df), h, w, c)
        self.use_cache = use_cache
    def __getitem__(self, index):
        if not self.use_cache:
            # Cache-miss path: decode from disk and store the pixels in the
            # shared array. Note the cache is float32 while cv2 returns uint8.
            im_path = os.path.join(self.imfolder, self.df.iloc[index]['image_name'] + '.jpg')
            x = cv2.imread(im_path)
            x = cv2.cvtColor(x, cv2.COLOR_BGR2RGB)
            self.shared_array[index] = x
        x = self.shared_array[index]
        
        meta = np.array(self.df.iloc[index][self.meta_features].values, dtype=np.float32)

        if self.transforms:
            x = self.transforms(x)
            
        if self.train:
            y = self.df.iloc[index]['target'].astype("float32")
            return (x, meta), y
        else:
            return (x, meta)

    def set_use_cache(self, use_cache):
        self.use_cache = use_cache
    
    def __len__(self):
        return len(self.df)

Here are some other parts of my code so you can see what I am doing:

pytorch_dataset = MDataset(df=test_df,
                           imfolder=f"{param['image_dir']}test",
                           train=False,
                           transforms=transform,
                           meta_features=meta_features)
pytorch_dataloader = DataLoader(dataset=pytorch_dataset,
                                batch_size=12,
                                shuffle=True,
                                pin_memory=param['pin_memory'],
                                num_workers=param['num_workers'])

images, meta = next(iter(pytorch_dataloader))

show_transform(torchvision.utils.make_grid(images, nrow=6), title="Random Images")

If I comment out x = self.shared_array[index], everything works fine, since it is using the x read straight from disk. But if I grab it from the cache, I don't get any errors; I just get blank boxes where my images should be.
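
To narrow down where the pixels go wrong, here is a minimal check (a sketch, using the first test image and assuming it already matches param['image_size']) that round-trips one image through the shared array. One thing it highlights: the cache is float32 (ctypes.c_float), while cv2 decodes to uint8, so anything downstream that assumes uint8 input would instead see 0-255 floats:

im_path = os.path.join(pytorch_dataset.imfolder,
                       test_df.iloc[0]['image_name'] + '.jpg')
x_disk = cv2.cvtColor(cv2.imread(im_path), cv2.COLOR_BGR2RGB)
print(x_disk.dtype, x_disk.min(), x_disk.max())     # uint8, 0 .. 255

pytorch_dataset.shared_array[0] = x_disk            # write into the cache
x_cache = pytorch_dataset.shared_array[0]           # read back out
print(x_cache.dtype, x_cache.min(), x_cache.max())  # float32, 0.0 .. 255.0
print(np.array_equal(x_disk, x_cache.astype(np.uint8)))  # pixel values intact?

If the dtype turns out to be the culprit, one variant I am considering (a sketch of just the allocation, inside __init__) is to build the cache as uint8 so cached images keep the dtype cv2 produced:

shared_array_base = mp.Array(ctypes.c_uint8, len(self.df) * c * h * w)
shared_array = np.ctypeslib.as_array(shared_array_base.get_obj())
self.shared_array = shared_array.reshape(len(self.df), h, w, c)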