Reading image data through dataset class

israrbacha · July 8, 2022, 12:10pm

import glob
import os

import torchvision.transforms.functional as T  # assumed source of T.to_tensor
from PIL import Image
from torch.utils.data import Dataset


class RDataset(Dataset):
    def __init__(self, data_path, data_name, data_type, patch_size=None, length=None):
        super().__init__()
        self.data_name, self.data_type, self.patch_size = data_name, data_type, patch_size
        self.A_images = sorted(glob.glob('{}/{}/{}/rain/*.png'.format(data_path, data_name, data_type)))
        self.B_images = sorted(glob.glob('{}/{}/{}/norain/*.png'.format(data_path, data_name, data_type)))
        # training uses a user-chosen length; testing uses the actual number of images
        self.num = len(self.A_images)
        self.sample_num = length if data_type == 'train' else self.num

    def __len__(self):
        return self.sample_num

    def __getitem__(self, idx):
        # idx % self.num keeps the index valid when sample_num > num
        image_name = os.path.basename(self.A_images[idx % self.num])
        A = T.to_tensor(Image.open(self.A_images[idx % self.num]))
        B = T.to_tensor(Image.open(self.B_images[idx % self.num]))
        h, w = A.shape[1:]
        return A, B

I went through the above code and it works fine, but my confusion is: after wrapping the dataset in DataLoader(RDataset, batch_size=10), how is the value of idx generated? My understanding is that idx will take values between 0 and self.num, so idx % self.num will always be zero, which means it will always refer to the data sample at index 0?

ptrblck · July 9, 2022, 12:44am

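It’s generated by the sampler in the range [0, len(dataset) - 1]. E.g. RandomSampler (used when shuffle=True) creates the indices in its __iter__ method, while SequentialSampler (used when shuffle=False, the default) yields them in order.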
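
To make the index generation visible, here is a minimal sketch. It assumes a toy dataset, the hypothetical `IndexEcho` (not part of the original code), whose `__getitem__` simply returns the `idx` it receives, so each batch shows exactly which indices the sampler produced.

```python
from torch.utils.data import DataLoader, Dataset, RandomSampler, SequentialSampler


class IndexEcho(Dataset):
    """Hypothetical toy dataset: returns the index it was asked for."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        return idx


ds = IndexEcho(6)

# Samplers are plain iterables over indices in [0, len(ds) - 1].
sequential = list(SequentialSampler(ds))  # in order
shuffled = list(RandomSampler(ds))        # a random permutation

# With shuffle left at its default (False), the DataLoader draws indices
# sequentially and groups them into batches.
batches = [batch.tolist() for batch in DataLoader(ds, batch_size=4)]

print(sequential)
print(sorted(shuffled))
print(batches)
```

The upper bound of the indices comes from `__len__`, so in `RDataset` it is `self.sample_num`, not `self.num`, that decides how large `idx` can get.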

No, that should not be the case: idx % self.num equals idx itself whenever idx < self.num, so it is not always zero. The modulo seems to be used if self.sample_num (and thus the length of the dataset) doesn’t match the actual number of samples, to avoid an out-of-bounds indexing operation:

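import torch

A_images = torch.randn(10, 1)  # 10 actual samples
sample_num = 20  # reported dataset length

for idx in range(sample_num):
    # prints 0..9 twice: indices past the end wrap around
    print(idx % len(A_images))

so it can be used for a repeated sampling procedure, where each sample is visited more than once per epoch.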
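
The same wrap-around can be sketched at the dataset level. The class below is a hypothetical, torch-free stand-in for RDataset that stores file names instead of loading images; `RepeatingDataset` and the file names are illustrative assumptions, not part of the original code.

```python
class RepeatingDataset:
    """Hypothetical stand-in for RDataset: holds names instead of images."""

    def __init__(self, items, length):
        self.items = items
        self.num = len(items)     # actual number of samples
        self.sample_num = length  # reported length, may exceed num

    def __len__(self):
        return self.sample_num

    def __getitem__(self, idx):
        # Wrap idx so any value in [0, sample_num - 1] maps to a valid item.
        return self.items[idx % self.num]


ds = RepeatingDataset(["a.png", "b.png", "c.png"], length=7)

# One pass over all indices the sampler could produce.
epoch = [ds[idx] for idx in range(len(ds))]
print(epoch)
```

With three files and a reported length of seven, one pass visits the files cyclically, so the "epoch" is longer than the number of images on disk.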