Pretrained model is giving high accuracy on test set

Hi pytorch,
I have a question regarding a project that I am working on. Basically, I am using this dataset (Common Voice) to classify age.
I trained on the provided train set, validated on the dev set, and then tested on the test set, following the split provided on the Kaggle website.
I am using pretrained models, so I had to convert the audio files to WAV, generate spectrograms, and pass them to a CNN.

Overall, I am getting good results, and I am not really sure whether I have done something wrong.

I also tried a newer version of the dataset to test my model (I took random samples of 200 per class); however, the result is not similar to what I got on the test set after training and validation.

Can someone point out whether I have done something wrong?
Please see my code below:

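import os
import time

import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

def train_model(model, criterion, optimizer, num_epochs=30):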
    since = time.time()
    best_acc = 0.0

    for epoch in range(num_epochs):
        print(f'Epoch {epoch}/{num_epochs - 1}')
        print('-' * 10)

        # Each epoch has a training and validation phase
        for phase in ['train', 'valid']:
            if phase == 'train':
                model.train()  # Set model to training mode
            else:
                model.eval()   # Set model to evaluate mode

            running_loss = 0.0
            running_corrects = 0
            # Iterate over data.
            for inputs, labels in dataloaders[phase]:
                inputs = inputs.to(device)
                labels = labels.to(device)
                # zero the parameter gradients
                optimizer.zero_grad()
                # forward
                # track history if only in train
                with torch.set_grad_enabled(phase == 'train'):
                    outputs = model(inputs)
                    loss = criterion(outputs, labels)
                    _, preds = torch.max(outputs, 1)
                    # backward + optimize only if in training phase
                    if phase == 'train':
                        loss.backward()
                        optimizer.step()
                # statistics
                running_loss += loss.item() * inputs.size(0)
                running_corrects += (preds == labels).sum().item()
            epoch_loss = running_loss / len(dataloaders[phase].dataset)
            epoch_acc = running_corrects / len(dataloaders[phase].dataset)

            print(f'{phase} Loss: {epoch_loss:.4f} Acc: {epoch_acc:.4f}')
            # deep copy the model
            if phase == 'valid' and epoch_acc > best_acc:
                best_acc = epoch_acc
                # save the best model so far (saving the state_dict is assumed here)
                torch.save(model.state_dict(), 'models/final.pth')

    time_elapsed = time.time() - since
    print(f'Training complete in {time_elapsed // 60:.0f}m {time_elapsed % 60:.0f}s')
    print(f'Best val Acc: {best_acc:.4f}')

if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"

    data_dir = 'input/50k_images_stft'
    # transforms = models.EfficientNet_V2_M_Weights.IMAGENET1K_V1.transforms()
    data_transforms = {
        'train': transforms.Compose([
            transforms.ToTensor(),  # ImageFolder yields PIL images; Normalize needs tensors
            transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
        ]),
        'valid': transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
        ]),
    }

    image_datasets = {x: datasets.ImageFolder(os.path.join(data_dir, x), data_transforms[x]) for x in ['train', 'valid']}

    dataloaders = {x: DataLoader(image_datasets[x], batch_size=16, shuffle=True, num_workers=3) for x in ['train', 'valid']}

    dataset_sizes = {x: len(image_datasets[x]) for x in ['train', 'valid']}
    print('Dataset sizes:', dataset_sizes)
    class_names = image_datasets['train'].classes
    print('Classes:', class_names)

    # xtest, ytest = create_dataset(test_df, image_dir)
    # features_test_tensor = torch.tensor(xtest)
    # target_test_tensor = torch.tensor(ytest)
    # test_set = TensorDataset(features_test_tensor, target_test_tensor)
    # test_loader = DataLoader(dataset= test_set, batch_size=64, shuffle=True)

    criterion = nn.CrossEntropyLoss(label_smoothing = 0.11)
    optimizer_ft = torch.optim.Adam(model_ft.parameters(), lr=0.0001)
    train_model(model_ft, criterion, optimizer_ft, num_epochs=50)

    #create_dataset(df, image_dir)

@ptrblck I would really appreciate any updates from your side.

Your code looks alright and I cannot see any obvious issues. I’m not sure if I understand the issue correctly, but are you seeing a worse test accuracy than what your training and validation set reports?

@ptrblck Thanks for the reply.
My training accuracy is high, and so is the validation accuracy.

However, when I used a separate, balanced test set that does not follow the same distribution as the training, validation, and test splits provided with the Common Voice dataset on Kaggle, the accuracy is different. Could the distribution of the population affect the final test results?

Yes, the distribution will certainly affect the final accuracy, and often you would try to make the training and validation distributions match the “real world” set (i.e. the test set in your case).
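
To make the effect of the class distribution concrete, here is a minimal sketch in plain Python. The class names and all numbers are hypothetical and chosen only for illustration; they are not taken from the Common Voice results above. It assumes the model's per-class recall stays the same and only the class mix of the evaluation set changes.

```python
from collections import Counter


def per_class_recall(labels, preds):
    """Fraction of correctly predicted samples for each true class."""
    total = Counter(labels)
    correct = Counter(l for l, p in zip(labels, preds) if l == p)
    return {c: correct[c] / total[c] for c in total}


def expected_accuracy(recall, class_counts):
    """Overall accuracy implied by per-class recall under a given class mix."""
    n = sum(class_counts.values())
    return sum(recall[c] * class_counts[c] for c in class_counts) / n


# Hypothetical per-class recall: strong on the frequent class, weaker on the rare one.
recall = {"twenties": 0.95, "fifties": 0.60}

skewed = {"twenties": 900, "fifties": 100}    # imbalanced split (assumed)
balanced = {"twenties": 200, "fifties": 200}  # 200 random samples per class

acc_skewed = expected_accuracy(recall, skewed)
acc_balanced = expected_accuracy(recall, balanced)
print(f"skewed:   {acc_skewed:.3f}")
print(f"balanced: {acc_balanced:.3f}")
```

With these made-up numbers the same model scores about 0.92 on the skewed set and about 0.78 on the balanced one, without any change to the model itself. Comparing the per-class recall on both test sets (for example with `per_class_recall` applied to the collected labels and predictions) is one way to check whether the gap comes from the class mix or from a real drop in performance on the newer data.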