Best way to handle training data in PyTorch


(Herleeyandi Markoni) #1

Hello, I think this question is very basic, but I need some help to clarify it. I am learning how people train models in machine learning. They specifically divide the training data into a train set and a validation set. Then I saw this transfer learning tutorial, which provides code like this:

def train_model(model, criterion, optimizer, lr_scheduler, num_epochs=25):
    since = time.time()

    best_model = model
    best_acc = 0.0

    for epoch in range(num_epochs):
        print('Epoch {}/{}'.format(epoch, num_epochs - 1))
        print('-' * 10)

        # Each epoch has a training and validation phase
        for phase in ['train', 'val']:
            if phase == 'train':
                optimizer = lr_scheduler(optimizer, epoch)
                model.train(True)  # Set model to training mode
            else:
                model.train(False)  # Set model to evaluate mode

            running_loss = 0.0
            running_corrects = 0

            # Iterate over data.
            for data in dset_loaders[phase]:
                # get the inputs
                inputs, labels = data

                # wrap them in Variable
                if use_gpu:
                    inputs, labels = Variable(inputs.cuda()), \
                        Variable(labels.cuda())
                else:
                    inputs, labels = Variable(inputs), Variable(labels)

                # zero the parameter gradients
                optimizer.zero_grad()

                # forward
                outputs = model(inputs)
                _, preds = torch.max(outputs.data, 1)
                loss = criterion(outputs, labels)

                # backward + optimize only if in training phase
                if phase == 'train':
                    loss.backward()
                    optimizer.step()

                # statistics
                running_loss += loss.data[0]
                running_corrects += torch.sum(preds == labels.data)

            epoch_loss = running_loss / dset_sizes[phase]
            epoch_acc = running_corrects / dset_sizes[phase]

            print('{} Loss: {:.4f} Acc: {:.4f}'.format(
                phase, epoch_loss, epoch_acc))

            # deep copy the model
            if phase == 'val' and epoch_acc > best_acc:
                best_acc = epoch_acc
                best_model = copy.deepcopy(model)

        print()

    time_elapsed = time.time() - since
    print('Training complete in {:.0f}m {:.0f}s'.format(
        time_elapsed // 60, time_elapsed % 60))
    print('Best val Acc: {:4f}'.format(best_acc))
    return best_model

From this tutorial I realized that the validation data is only used to monitor accuracy on “unseen” data. So if I use the test data as validation data, is that a good idea?

I also followed a Keras issue about this here. Most people there said that, because the validation data is not used to update the model during training and only monitors accuracy to pick the best model, the test data can be used as validation data. Is that correct? What is the best way to do this in PyTorch?

-Thank you-


(Solomon K ) #2

No, it's not good.
You should split the training set into train and validation sets. I show how to do it here:
(https://github.com/QuantScientist/Deep-Learning-Boot-Camp/blob/master/day%2002%20PyTORCH%20and%20PyCUDA/PyTorch/21-PyTorch-CIFAR-10-Custom-data-loader-from-scratch.ipynb)

import torch

# TEST_RATIO, BATCH_SIZE, PIN_MEMORY and dset_train are assumed to be defined elsewhere
# (e.g. in the linked notebook).

class FullTrainingDataset(torch.utils.data.Dataset):
    # A view onto the contiguous slice [offset, offset + length) of a parent dataset
    def __init__(self, full_ds, offset, length):
        super(FullTrainingDataset, self).__init__()
        assert len(full_ds) >= offset + length, "Parent Dataset not long enough"
        self.full_ds = full_ds
        self.offset = offset
        self.length = length

    def __len__(self):
        return self.length

    def __getitem__(self, i):
        return self.full_ds[i + self.offset]

def trainTestSplit(dataset, val_share=TEST_RATIO):
    # The first (1 - val_share) of the samples go to train, the rest to validation
    val_offset = int(len(dataset) * (1 - val_share))
    return (FullTrainingDataset(dataset, 0, val_offset),
            FullTrainingDataset(dataset, val_offset, len(dataset) - val_offset))

train_ds, val_ds = trainTestSplit(dset_train)

train_loader = torch.utils.data.DataLoader(train_ds, batch_size=BATCH_SIZE,
                                           shuffle=True, num_workers=1, pin_memory=PIN_MEMORY)

val_loader = torch.utils.data.DataLoader(val_ds, batch_size=BATCH_SIZE,
                                         shuffle=True, num_workers=1, pin_memory=PIN_MEMORY)


(Herleeyandi Markoni) #3

@QuantScientist Thank you so much, it really helps me a lot. So why must the validation data come from the training data, instead of using the test data as validation data?


(Solomon K ) #4

In most cases, test data is not labeled (this is usually the case for Kaggle competitions). Therefore training and, more importantly, cross-validation (CV) have to be conducted solely on the training set.
Read about k-fold CV here:
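
As a rough illustration of the idea (a minimal sketch, not taken from the linked material): k-fold CV cuts the training indices into k folds, and each fold serves once as the validation set while the remaining folds are used for training. The helper name `k_fold_indices` and its parameters are hypothetical; the index lists it yields could be wrapped with `torch.utils.data.Subset` to build per-fold loaders.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs; each sample is used for validation exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    # Fold sizes differ by at most one when n_samples is not divisible by k
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


# Hypothetical usage: 10 training samples, 3 folds
folds = list(k_fold_indices(n_samples=10, k=3))
for fold, (train_idx, val_idx) in enumerate(folds):
    print("fold", fold, "train size", len(train_idx), "val size", len(val_idx))
```

Only the training set is touched here; the test set stays out of the loop entirely.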


(Deepak Sharma) #5

The train set can be divided into train and validation sets using the random_split function from torch.utils.data.dataset.

Importing the random_split function (plus the modules used below)

import os
import torch
from torch.utils.data.dataset import random_split
from torchvision import datasets

Providing train/test set dir paths for creating dataset objects and applying transformations (data_dir and data_transforms are assumed to be defined already)

image_datasets = {x: datasets.ImageFolder(os.path.join(data_dir, x), data_transforms[x])
                  for x in ['train', 'test']}

Creating a variable to hold the original train dataset length

train_dataset_len = len(image_datasets['train'])

Creating a 50/50 train/validation split out of the original train set

image_datasets['train'], image_datasets['val'] = random_split(
    image_datasets['train'],
    [train_dataset_len // 2, train_dataset_len - train_dataset_len // 2])

Creating DataLoader objects for the splits so that they can be used for training, validation and testing

dataloaders = {x: torch.utils.data.DataLoader(image_datasets[x], batch_size=100,
                                              shuffle=True, num_workers=3)
               for x in ['train', 'val', 'test']}


(jpeg729) #6

Your model adapts to the training data and might overfit, so we test it on validation data in order to check whether it is overfitting.
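
One possible way to make that check concrete (a sketch under assumptions, not a standard API): record the validation loss after every epoch and flag the point where it has stopped improving for a few epochs, even though the training loss keeps falling. The function name `best_epoch_and_stop`, the `patience` parameter and the loss values below are all made up for illustration.

```python
def best_epoch_and_stop(val_losses, patience=2):
    """Return (best_epoch, stop_epoch).

    best_epoch is the epoch with the lowest validation loss seen before stopping;
    stop_epoch is the first epoch at which the loss has failed to improve for
    `patience` epochs in a row, or None if that never happens.
    """
    best_epoch = 0
    best_loss = float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, None


# Made-up loss curves: training loss keeps dropping, validation loss turns upward
train_losses = [1.00, 0.70, 0.50, 0.35, 0.25, 0.18]
val_losses = [1.05, 0.80, 0.65, 0.66, 0.70, 0.78]

best_epoch, stop_epoch = best_epoch_and_stop(val_losses, patience=2)
print("best epoch:", best_epoch, "stop at epoch:", stop_epoch)
```

With these numbers the validation loss bottoms out at epoch 2, which is the sort of signal the validation set is there to provide.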

So we train, then we validate, then we revise the model and we repeat the process. But it is possible for the train-validation-revision loop to pick up an undiscovered bias present in both the training and validation datasets that allows good performance on both, but bad generalisation.

That is why it is always a good idea to have a test dataset distinct from the training and validation datasets in order to infrequently verify whether our models generalise beyond the training and validation sets.
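
For a labelled dataset that has no separate test split, a three-way split could be sketched with `random_split` as below. The synthetic `TensorDataset`, the 70/15/15 proportions and the seed are assumptions chosen purely for illustration.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Synthetic stand-in for a labelled dataset: 100 samples, 8 features, 2 classes
features = torch.randn(100, 8)
labels = torch.randint(0, 2, (100,))
full_ds = TensorDataset(features, labels)

# Assumed 70/15/15 proportions; the test part takes the remainder so sizes sum to len(full_ds)
n_train = len(full_ds) * 70 // 100
n_val = len(full_ds) * 15 // 100
n_test = len(full_ds) - n_train - n_val

# A seeded generator makes the split reproducible across runs
generator = torch.Generator().manual_seed(0)
train_ds, val_ds, test_ds = random_split(
    full_ds, [n_train, n_val, n_test], generator=generator)

print(len(train_ds), len(val_ds), len(test_ds))
```

The test subset would then be evaluated only occasionally, after model selection on the validation subset is finished.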

@QuantScientist Kaggle must have labelled test data because otherwise they would be unable to calculate the test leaderboard. They just don’t show us the test dataset.