Train and validate simultaneously on two datasets

I have two datasets of 142,000 images each, one for training and one for validation. What I have implemented so far is a loop that goes through the training dataset (batch size 11) and trains.
But what I now want to do is the following:

  • train on the first 11 images of the train set
  • validate the model on the first 11 images from the validation set
  • then train on the second 11 images of the train set
  • then validate the model on the second 11 images from the validation set
  • etc.
training_data_loader = DataLoader(dataset=train_set, num_workers=4, batch_size=11)
val_data_loader = DataLoader(dataset=val_set, num_workers=4, batch_size=11)

Here is my training code, where I enumerate over the training_data_loader. Is there an efficient way to get the corresponding batch from the val_data_loader?
I tried a for loop that iterates through the val_data_loader and returns the matching batch, but that is too time consuming. I need to access the correct batch from the val_data_loader directly. I tried it with __getitem__, but I didn't find a solution that works.

for iteration, batch in enumerate(training_data_loader, 0):
    input_train, target_train = Variable(batch[0]), Variable(batch[1])

    if cuda:
        input_train = input_train.cuda()
        target_train = target_train.cuda()

    optimizer.zero_grad()

    output_model_train = model(input_train)
    loss = criterion(output_model_train, target_train)

    loss.backward()
    optimizer.step()

First of all: do you really need this? Usually the validation is done once after each epoch (epoch = one whole pass over the dataset) on the entire validation dataset.
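For reference, here is a minimal sketch of that usual per-epoch validation, reusing the model, criterion, cuda and val_data_loader names from your snippet (assumed to be defined as above; torch.no_grad() needs a reasonably recent PyTorch version):

import torch

# Minimal sketch: run once per epoch, after the training loop.
# `model`, `criterion`, `cuda` and `val_data_loader` are assumed to exist
# as in the question's code.
model.eval()
val_loss = 0.0
with torch.no_grad():  # no gradients needed during validation
    for input_val, target_val in val_data_loader:
        if cuda:
            input_val = input_val.cuda()
            target_val = target_val.cuda()
        output_val = model(input_val)
        val_loss += criterion(output_val, target_val).item()
val_loss /= len(val_data_loader)
model.train()  # switch back to training mode for the next epoch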

If you really want to do it that way, you could define a Dataset which handles both the train and the validation data. With that you could use a single DataLoader and would always get batches with the same indices from the train and the validation set (a rough sketch follows the note below).

NOTE: This could lead to problems if the datasets have different sizes, and you could end up validating the same images multiple times if you use shuffle=True (as is usually done during training).
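A rough sketch of such a combined Dataset (the class name PairedDataset is made up for illustration; it assumes both sets have the same length, as discussed in the note above, and reuses train_set and val_set from your code):

from torch.utils.data import DataLoader, Dataset

class PairedDataset(Dataset):
    """Illustrative wrapper returning the train and val sample for the same index."""
    def __init__(self, train_set, val_set):
        # this sketch assumes equal sizes; see the note above about different sizes
        assert len(train_set) == len(val_set)
        self.train_set = train_set
        self.val_set = val_set

    def __getitem__(self, index):
        # the same index is used for both sets, so the batches stay aligned
        return self.train_set[index], self.val_set[index]

    def __len__(self):
        return len(self.train_set)

# a single loader now yields ((input_train, target_train), (input_val, target_val))
paired_loader = DataLoader(PairedDataset(train_set, val_set), num_workers=4, batch_size=11)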

If your Datasets have the same size, which seems to be the case, you could also just iterate both DataLoaders in the same for loop (zip stops at the end of the shorter loader).
Note that the indices for each loader might differ if you use shuffling.

import torch
from torch.utils.data import DataLoader, TensorDataset

train_dataset = TensorDataset(torch.randn(100, 3, 24, 24))
val_dataset = TensorDataset(torch.randn(100, 3, 24, 24))

train_loader = DataLoader(train_dataset, batch_size=11, shuffle=False)
val_loader = DataLoader(val_dataset, batch_size=11, shuffle=False)

for batch_idx, (train_data, val_data) in enumerate(zip(train_loader, val_loader)):
    print('Batch idx {}\nTrain data shape {}\nVal data shape {}'.format(
        batch_idx, train_data[0].shape, val_data[0].shape))

Yes, in the paper I'm implementing they calculate the loss after each iteration over the dataset. If I did it only once per epoch, it would take too long.
The train and validation datasets have the same size. Thanks for the tip about shuffle.

Thanks. That is what I was looking for :-).

Sorry to revive this old thread, but what is the solution if the Datasets don’t have the same size?

Currently I’m using this approach:

import itertools

for epoch in range(epochs):
    val_iter = itertools.cycle(validation_dataset_loader)
    for i_batch, data in enumerate(training_dataset_loader, 0):
        model.train()
        # training happens here

        model.eval()
        val_sample = next(val_iter)
        # evaluate performance on validation data

However, I'm not sure if using the iterator here for the validation dataset is the best option compared to the enumerate approach, especially because the validation data appears shuffled even though I explicitly set shuffle=False when creating the validation DataLoader.

Thanks!


This approach seems to be valid for your use case.
The shuffling behavior should not change if you wrap the DataLoader in itertools.cycle:

import itertools

import torch
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    def __init__(self):
        self.data = torch.arange(100).view(100, 1)
        
    def __getitem__(self, index):
        x = self.data[index]
        return x
    
    def __len__(self):
        return len(self.data)

dataset = MyDataset()
loader = DataLoader(
    dataset,
    batch_size=5,
    shuffle=False
)

my_iter = itertools.cycle(loader)

for _ in range(100):
    x = next(my_iter)
    print(x)