Hello, I think this question will be quite basic, but I need some help to clarify it. I am learning how people do training in machine learning. They specifically divide the training data into a train set and a validation set. Then I saw this transfer learning tutorial Tutorial, which provides code like this:
import copy
import time

import torch
from torch.autograd import Variable  # pre-0.4 PyTorch API, as used in the tutorial

# dset_loaders, dset_sizes and use_gpu are defined earlier in the tutorial.
def train_model(model, criterion, optimizer, lr_scheduler, num_epochs=25):
    since = time.time()

    best_model = model
    best_acc = 0.0

    for epoch in range(num_epochs):
        print('Epoch {}/{}'.format(epoch, num_epochs - 1))
        print('-' * 10)

        # Each epoch has a training and validation phase
        for phase in ['train', 'val']:
            if phase == 'train':
                optimizer = lr_scheduler(optimizer, epoch)
                model.train(True)  # Set model to training mode
            else:
                model.train(False)  # Set model to evaluate mode

            running_loss = 0.0
            running_corrects = 0

            # Iterate over data.
            for data in dset_loaders[phase]:
                # get the inputs
                inputs, labels = data

                # wrap them in Variable
                if use_gpu:
                    inputs, labels = Variable(inputs.cuda()), \
                        Variable(labels.cuda())
                else:
                    inputs, labels = Variable(inputs), Variable(labels)

                # zero the parameter gradients
                optimizer.zero_grad()

                # forward
                outputs = model(inputs)
                _, preds = torch.max(outputs.data, 1)
                loss = criterion(outputs, labels)

                # backward + optimize only if in training phase
                if phase == 'train':
                    loss.backward()
                    optimizer.step()

                # statistics
                running_loss += loss.data[0]
                running_corrects += torch.sum(preds == labels.data)

            epoch_loss = running_loss / dset_sizes[phase]
            epoch_acc = running_corrects / dset_sizes[phase]

            print('{} Loss: {:.4f} Acc: {:.4f}'.format(
                phase, epoch_loss, epoch_acc))

            # deep copy the model
            if phase == 'val' and epoch_acc > best_acc:
                best_acc = epoch_acc
                best_model = copy.deepcopy(model)

        print()

    time_elapsed = time.time() - since
    print('Training complete in {:.0f}m {:.0f}s'.format(
        time_elapsed // 60, time_elapsed % 60))
    print('Best val Acc: {:4f}'.format(best_acc))
    return best_model
In this tutorial I realized that the validation data is only used to monitor accuracy on “unseen” data. So would it be acceptable to use the test data as the validation data?
I am also tracking an issue from Keras here. Most people said that because the validation set is not used for gradient updates during training and only monitors accuracy to pick the best model, some test data can be used as validation. Is that correct? And what is the best way to do it in PyTorch?
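For reference, one common way to carve a validation set out of the training data in PyTorch is `torch.utils.data.random_split`. A minimal sketch with a toy tensor dataset (the names, sizes, and the 80/20 ratio here are illustrative, not from the tutorial):

```python
import torch
from torch.utils.data import TensorDataset, random_split, DataLoader

# Toy dataset standing in for the real training set (hypothetical data).
full_train = TensorDataset(torch.randn(100, 3), torch.randint(0, 2, (100,)))

# Hold out 20% of the *training* data for validation; the test set stays untouched.
n_val = int(0.2 * len(full_train))
train_set, val_set = random_split(full_train, [len(full_train) - n_val, n_val])

dset_loaders = {
    'train': DataLoader(train_set, batch_size=16, shuffle=True),
    'val': DataLoader(val_set, batch_size=16),
}
dset_sizes = {'train': len(train_set), 'val': len(val_set)}
```

The resulting `dset_loaders` / `dset_sizes` dicts can then be fed straight into a training loop like the one above.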
@QuantScientist Thank you so much, it really helps me a lot. So why must the validation data come from the training data instead of taking the test data as validation data?
In most cases, test data is not labeled (this is usually the case for Kaggle competitions). Therefore, training and, more importantly, cross-validation (CV) have to be conducted solely on the training set.
Read about k-fold CV here:
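As a sketch of the idea: k-fold CV partitions the n training samples into k folds, and each fold plays the validation role exactly once while the rest is used for training. A pure-Python index generator (the helper name is hypothetical; libraries like scikit-learn provide this out of the box):

```python
# Generate (train_indices, val_indices) pairs for k-fold cross-validation.
def k_fold_indices(n, k):
    # Distribute the remainder so fold sizes differ by at most one.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    indices = list(range(n))
    folds = []
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]           # this fold validates
        train_idx = indices[:start] + indices[start + size:]  # the rest trains
        folds.append((train_idx, val_idx))
        start += size
    return folds

splits = k_fold_indices(10, 5)  # 5 folds of 2 samples each
```

In practice you would shuffle the indices first; this sketch keeps them ordered so the split is easy to inspect.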
Your model adapts to the training data and might overfit, so we test it on validation data in order to check whether it is overfitting.
So we train, then we validate, then we revise the model and we repeat the process. But it is possible for the train-validation-revision loop to pick up an undiscovered bias present in both the training and validation datasets that allows good performance on both, but bad generalisation.
That is why it is always a good idea to have a test dataset distinct from the training and validation datasets in order to infrequently verify whether our models generalise beyond the training and validation sets.
@QuantScientist Kaggle must have labelled test data, because otherwise they would be unable to calculate the test leaderboard. They just don’t show us the test labels.
I used the method you suggested. I am able to split the dataset but not able to use the DataLoaders. I tried using them as follows:

for i_batch, sampled_batch in enumerate(dataloaders['train']):
    print(i_batch)

This code generally works with a single dataloader but throws an error with dataloaders['train']. The error is shown below:

TypeError: len() of unsized object
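For comparison, here is a minimal setup where the subsets returned by random_split are passed directly to DataLoader and iterated. If this pattern works on your PyTorch version, the error most likely comes from how the split objects were constructed (e.g. wrapping them in something without a len()) rather than from the loader itself. The toy tensors below stand in for the real dataset:

```python
import torch
from torch.utils.data import TensorDataset, random_split, DataLoader

# Hypothetical stand-in dataset; replace with your ImageFolder / custom dataset.
full_dataset = TensorDataset(torch.randn(50, 4), torch.randint(0, 2, (50,)))
train_subset, val_subset = random_split(full_dataset, [40, 10])

# Build DataLoaders from the Subset objects themselves, not from raw indices.
dataloaders = {
    'train': DataLoader(train_subset, batch_size=8, shuffle=True),
    'val': DataLoader(val_subset, batch_size=8),
}

# Iterating works exactly like with a plain DataLoader.
seen = 0
for i_batch, sampled_batch in enumerate(dataloaders['train']):
    inputs, labels = sampled_batch
    seen += inputs.size(0)
```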
But how exactly do we apply separate data transforms to the training and validation sets in your approach? And should we do so?
What I am trying to achieve is to find a proper way to make the data transforms below* “visible” to the subsets generated by random_split. (*transforms taken from PyTorch’s transfer learning tutorial)
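One common workaround, assuming it matches what you are after, is a small wrapper dataset that applies a split-specific transform on top of each Subset produced by random_split. The class name and toy data below are illustrative; with images you would pass torchvision transforms instead of the lambdas:

```python
import torch
from torch.utils.data import Dataset, random_split

# Wrapper that applies a transform to samples of an underlying (sub)dataset,
# so train and val subsets can each get their own transform pipeline.
class TransformedSubset(Dataset):
    def __init__(self, subset, transform=None):
        self.subset = subset
        self.transform = transform

    def __getitem__(self, idx):
        x, y = self.subset[idx]
        if self.transform is not None:
            x = self.transform(x)
        return x, y

    def __len__(self):
        return len(self.subset)

# Toy base dataset of (tensor, label) pairs standing in for an ImageFolder.
base = [(torch.ones(3), i % 2) for i in range(10)]
train_part, val_part = random_split(base, [8, 2])

train_ds = TransformedSubset(train_part, transform=lambda x: x * 2)  # e.g. augmentation
val_ds = TransformedSubset(val_part)  # e.g. only resize/normalize
```

Both wrapped datasets can then be handed to DataLoader as usual, and each split sees only its own transform.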