Loading Data without a separate TRAIN/TEST Directory: PyTorch (ImageFolder)

My data is not distributed across train and test directories, only across classes:

image-folders/
├── class_0/
│   ├── 001.jpg
│   └── 002.jpg
├── class_1/
│   ├── 001.jpg
│   └── 002.jpg
└── class_2/
    ├── 001.jpg
    └── 002.jpg

Is this the right way to approach the problem? (What this code does is take the data folder and then divide it into train, valid, and test sets. However, I am worried whether this is really the same thing as a valid/dev set, even though the "test set" will not go through the training and validation loop.)

    data = datasets.ImageFolder('PATH', transform)
    indices = list(range(len(data)))
    np.random.shuffle(indices)
    split = int(np.floor(valid_size * len(data)))
    train_idx, valid_idx = indices[split:], indices[:split]

    split = int(np.floor(test_size * len(valid_idx)))
    valid_idx, test_idx = indices[split:], indices[:split]

    # define samplers for obtaining training, validation, and test batches
    train_sampler = SubsetRandomSampler(train_idx)
    valid_sampler = SubsetRandomSampler(valid_idx)
    test_sampler = SubsetRandomSampler(test_idx)

    # prepare data loaders (combine dataset and sampler)
    train_loader = torch.utils.data.DataLoader(data, batch_size=batch_size, sampler=train_sampler, num_workers=num_workers)
    valid_loader = torch.utils.data.DataLoader(data, batch_size=batch_size, sampler=valid_sampler, num_workers=num_workers)
    test_loader = torch.utils.data.DataLoader(data, batch_size=batch_size, sampler=test_sampler, num_workers=num_workers)

The creation of the indices looks wrong, and you can check it by running:

data = np.zeros(100)
valid_size = 0.2
test_size = 0.1
indices = np.arange(len(data))

split = int(np.floor(valid_size * len(data)))
train_idx, valid_idx = indices[split:], indices[:split]

split = int(np.floor(test_size * len(valid_idx)))
valid_idx, test_idx = indices[split:], indices[:split]

Here you can see that train_idx and valid_idx overlap, since the second split slices indices again instead of valid_idx.
The common approach would be to split the dataset into training and validation indices first, and then split the validation indices again into the final validation and test indices.
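For example, a corrected split along those lines (a standalone sketch with made-up sizes, using a dummy array in place of the dataset) could look like this; note that the second split slices `valid_idx`, not `indices`:

```python
import numpy as np

# toy stand-in for the dataset; only its length matters here
data = np.zeros(100)
valid_size = 0.2
test_size = 0.1

indices = np.arange(len(data))
np.random.shuffle(indices)

# first split: train vs. (valid + test)
split = int(np.floor(valid_size * len(data)))
train_idx, valid_idx = indices[split:], indices[:split]

# second split: carve the test set out of valid_idx (not indices!)
split = int(np.floor(test_size * len(valid_idx)))
valid_idx, test_idx = valid_idx[split:], valid_idx[:split]

# the three index sets are now pairwise disjoint
assert not set(train_idx) & set(valid_idx)
assert not set(train_idx) & set(test_idx)
assert not set(valid_idx) & set(test_idx)
print(len(train_idx), len(valid_idx), len(test_idx))
```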


Thanks for pointing out the mistake. I have fixed it below:

# creating a train / valid split
# valid set will be further divided into valid and test sets
indices = list(range(len(data)))
np.random.shuffle(indices)
split = int(np.floor(valid_size * len(data)))
train_idx, valid_idx = indices[split:], indices[:split]

# Creating a valid and test set
# (slice valid_idx once, before reassigning it, so the two sets stay disjoint)
split = int(np.floor(0.2 * len(valid_idx)))
valid_idx, test_idx = valid_idx[split:], valid_idx[:split]

# define samplers for obtaining training and validation batches
train_sampler = SubsetRandomSampler(train_idx)
valid_sampler = SubsetRandomSampler(valid_idx)
test_sampler = SubsetRandomSampler(test_idx)
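A quick way to convince yourself the samplers behave as intended is to run the three loaders over a dummy dataset (hypothetical sizes, standing in for the ImageFolder) and check that every sample is drawn by exactly one loader per epoch:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, SubsetRandomSampler

# dummy dataset of 100 samples standing in for the ImageFolder
data = TensorDataset(torch.arange(100).float().unsqueeze(1), torch.zeros(100))

indices = torch.randperm(100).tolist()
train_idx, valid_idx, test_idx = indices[:80], indices[80:96], indices[96:]

loaders = {
    name: DataLoader(data, batch_size=10, sampler=SubsetRandomSampler(idx))
    for name, idx in [("train", train_idx), ("valid", valid_idx), ("test", test_idx)]
}

# collect every sample id drawn across the three loaders in one epoch
seen = [int(x) for loader in loaders.values() for xb, _ in loader for x in xb]
assert sorted(seen) == list(range(100))  # each sample seen exactly once
```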

To clarify my doubt about creating sets like this in PyTorch: since the test set will not go through the training and validation cycle, is it perfectly fine to create the test set this way?

Thanks!

Yes, I don’t see why you couldn’t create it like this. Your approach looks fine, as you are no longer reusing any indices.
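As an aside (not part of the original exchange): `torch.utils.data.random_split` produces disjoint subsets directly, so it can replace the manual index bookkeeping. A minimal sketch with a dummy dataset and made-up split sizes:

```python
import torch
from torch.utils.data import TensorDataset, random_split, DataLoader

# hypothetical dataset of 100 samples standing in for ImageFolder
dataset = TensorDataset(torch.randn(100, 3, 8, 8), torch.randint(0, 3, (100,)))

# 70% train, 20% valid, 10% test; the subsets are disjoint by construction
train_set, valid_set, test_set = random_split(dataset, [70, 20, 10])

train_loader = DataLoader(train_set, batch_size=16, shuffle=True)
valid_loader = DataLoader(valid_set, batch_size=16)
test_loader = DataLoader(test_set, batch_size=16)
```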

1 Like