@ptrblck I am trying to use this caching method, but my pictures are now coming up blank, so I am obviously munging something somewhere. I am caching the images as NumPy arrays, since I still need to apply torchvision transforms, which means the data can't already be a tensor before that pre-processing. My understanding is that, because we pre-allocate the entire shared array up front, we can use as many workers as we like right from the beginning.
The default for use_cache is False, and I am just trying to get things working in that state, where __getitem__ places the image into the cache and then reads it back from the cache. I realize that in the normal flow I would keep use_cache as False for the first epoch, to build the cache, and then afterwards set it to True once the cache has been filled.
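In other words, the flow I have in mind is roughly the sketch below; the training loop is just a placeholder (train_df, num_epochs, and the training step are not my real code), it only shows when I would flip the flag:

    pytorch_dataset = MDataset(df=train_df,                      # placeholder dataframe
                               imfolder=f"{param['image_dir']}train",
                               train=True,
                               transforms=transform,
                               meta_features=meta_features,
                               use_cache=False)                   # first epoch fills the cache
    pytorch_dataloader = DataLoader(dataset=pytorch_dataset, batch_size=12, shuffle=True,
                                    pin_memory=param['pin_memory'], num_workers=param['num_workers'])

    for epoch in range(num_epochs):                               # num_epochs is a placeholder
        for (x, meta), y in pytorch_dataloader:
            ...  # training step goes here
        if epoch == 0:
            # After one full pass every image has been written to the shared array,
            # so later epochs can read straight from the cache.
            pytorch_dataset.set_use_cache(True)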
import os
import ctypes
import multiprocessing as mp

import cv2
import numpy as np
import pandas as pd
from torch.utils.data import Dataset


class MDataset(Dataset):
    def __init__(self, df: pd.DataFrame, imfolder: str, train: bool = True,
                 transforms=None, meta_features=None, use_cache=False):
        self.df = df
        self.imfolder = imfolder
        self.transforms = transforms
        self.train = train
        self.meta_features = meta_features
        # Pre-allocate one shared array big enough for every image so that
        # all DataLoader workers write into / read from the same cache.
        c = 3
        h = param['image_size'][0]
        w = param['image_size'][0]
        shared_array_base = mp.Array(ctypes.c_float, len(self.df) * c * h * w)
        shared_array = np.ctypeslib.as_array(shared_array_base.get_obj())
        self.shared_array = shared_array.reshape(len(self.df), h, w, c)
        self.use_cache = use_cache

    def __getitem__(self, index):
        if not self.use_cache:
            # Cache not built yet: read from disk and store into the shared array
            im_path = os.path.join(self.imfolder, self.df.iloc[index]['image_name'] + '.jpg')
            x = cv2.imread(im_path)
            x = cv2.cvtColor(x, cv2.COLOR_BGR2RGB)
            self.shared_array[index] = x
        # Read the image back out of the shared cache
        x = self.shared_array[index]
        meta = np.array(self.df.iloc[index][self.meta_features].values, dtype=np.float32)
        if self.transforms:
            x = self.transforms(x)
        if self.train:
            y = self.df.iloc[index]['target'].astype("float32")
            return (x, meta), y
        else:
            return (x, meta)

    def set_use_cache(self, use_cache):
        self.use_cache = use_cache

    def __len__(self):
        return len(self.df)
Here are some other parts of my code so you can see what I am doing:
pytorch_dataset = MDataset(df=test_df,
                           imfolder=f"{param['image_dir']}test",
                           train=False,
                           transforms=transform,
                           meta_features=meta_features)
pytorch_dataloader = DataLoader(dataset=pytorch_dataset, batch_size=12, shuffle=True,
                                pin_memory=param['pin_memory'], num_workers=param['num_workers'])
images, meta = next(iter(pytorch_dataloader))
show_transform(torchvision.utils.make_grid(images, nrow=6), title="Random Images")
If I comment out x = self.shared_array[index], then everything works fine, since it is using the x read straight from disk. But if I try to grab it from the cache, I don't get any errors; I just get blank boxes where my images should be.
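Just to be clear about what I mean by commenting it out, this is the variant of __getitem__ that displays correctly (only that one line changes):

    def __getitem__(self, index):
        if not self.use_cache:
            im_path = os.path.join(self.imfolder, self.df.iloc[index]['image_name'] + '.jpg')
            x = cv2.imread(im_path)
            x = cv2.cvtColor(x, cv2.COLOR_BGR2RGB)
            self.shared_array[index] = x
        # x = self.shared_array[index]   # <-- with this commented out the images show up fine
        meta = np.array(self.df.iloc[index][self.meta_features].values, dtype=np.float32)
        if self.transforms:
            x = self.transforms(x)
        if self.train:
            y = self.df.iloc[index]['target'].astype("float32")
            return (x, meta), y
        else:
            return (x, meta)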