I’m using a custom `Dataset` where the path to the images is created from a Pandas DataFrame. Out of curiosity I added a `print(idx)` to the `__getitem__` function and noticed that it’s called twice (it prints two different indices) if I use 1 worker. If I use multiple workers, it’s called even more times. The batch size is 1, though.
Am I missing something? Shouldn’t I get just one image? The loader does return just one image, regardless of the number of workers (as it should).
Could you please share the code?
It’s rather difficult to understand what it does without having the Pandas DataFrame (which I cannot share, I guess). But here’s the class:
```python
import os

import numpy as np
from PIL import Image
from torch.utils.data import Dataset


class Data(Dataset):
    def __init__(self, mode, df, img_dir, site, transform):
        self.mode = mode
        self.df = df
        self.img_dir = img_dir
        self.site = site
        self.transform = transform

    def path_channel(self, channel, idx):
        # Build the file path for one channel of the image at row `idx`
        experiment = self.df.loc[self.df.index[idx], 'experiment']
        plate = self.df.loc[self.df.index[idx], 'plate']
        well = self.df.loc[self.df.index[idx], 'well']
        path = os.path.join(self.img_dir, experiment, f'Plate{plate}',
                            f'{well}_s{self.site}_w{channel}.png')
        return path

    def __getitem__(self, idx):
        print(idx)  # With 1 process and batch size 1, printed twice (different items)
        # Iterate over channels of one image (from file)
        all_channels = [np.array(Image.open(self.path_channel(ch, idx)),
                                 dtype=np.float32) for ch in range(1, 7)]
        img = np.stack(all_channels, axis=2)
        if self.mode == 'train':
            label = self.df.loc[self.df.index[idx], 'label'].astype('int32')
            return img, label
        elif self.mode == 'test':
            return img

    def __len__(self):
        return self.df.shape[0]
```
Each worker creates a batch by calling into your `Dataset`'s `__getitem__`.
With `num_workers=0`, the main process is used to create the batch. With `num_workers=1`, one additional process is used, and it already fetches the next batch in the background.
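To make the "fetch the next batch" part concrete, here is a minimal sketch in plain Python, without PyTorch. It only imitates the idea of loading ahead; `ToyDataset`, `toy_loader` and the `prefetch` parameter are made-up names for illustration, not part of the `DataLoader` API, and the real worker processes run asynchronously rather than in this fixed order.

```python
from collections import deque


class ToyDataset:
    """Hypothetical stand-in for a Dataset that records every index requested."""

    def __init__(self, n):
        self.n = n
        self.calls = []  # every idx passed to __getitem__, in order

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        self.calls.append(idx)
        return idx * 10  # stand-in for a loaded image


def toy_loader(dataset, prefetch):
    """Yield items one by one while keeping up to `prefetch` items loaded.

    prefetch=1 imitates on-demand loading; prefetch=2 imitates a loader
    that already has the next item ready when the current one is returned.
    """
    pending = deque()
    next_idx = 0
    # Load the first `prefetch` items before anything is handed out.
    while next_idx < len(dataset) and len(pending) < prefetch:
        pending.append(dataset[next_idx])
        next_idx += 1
    while pending:
        yield pending.popleft()
        # Once the caller asks for more, top the buffer up again.
        if next_idx < len(dataset):
            pending.append(dataset[next_idx])
            next_idx += 1


ds = ToyDataset(5)
loader = toy_loader(ds, prefetch=2)
first = next(loader)
print(first)     # 0
print(ds.calls)  # [0, 1]: two indices were loaded, but only one item was returned
```

So seeing two different indices printed while receiving only one image is expected: the second index belongs to the batch that is being prepared for the next iteration. In recent PyTorch versions the amount loaded ahead per worker is set by the `prefetch_factor` argument of `DataLoader`.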