Validation and test results not the same for same data

I’m new to the joys of PyTorch and am using someone else’s code, so forgive what may be a naive question. I am training an image recognition model (based on Inception v3) using the code from GitHub - macaodha/inat_comp_2018: CNN training code for iNaturalist 2018 image classification competition. I’ve only run a few epochs while I get a sense of how the code works.

As the code trains the model it checks it against a validation set of images. Once training is complete you can run a test set of images against the model to get predictions.

My expectation was that if I used the validation data set as “test” data, I would get the same set of predictions as the last round of training generated (because the “test” and validation data are identical). However, the test run produces very poor scores and predicts identities for the images that do not even exist in the training data.

I suspect that the code is not correctly loading the saved model when it does the testing. But can anyone confirm that my assumption (validation results for the last training round = test results, if test data = validation data) is correct?

To add some detail, this is the code to load the model:

import os

import torch
import torch.nn as nn
from torch.optim import SGD
from torchvision.models import inception_v3

def build_model_and_optim():
    global device, args, best_prec3  # best_prec3 is restored from the checkpoint below
    # load pretrained model
    print("Using pre-trained inception_v3")
    # use this line instead if you want to train another model
    #model = models.__dict__[args.arch](pretrained=True)
    model = inception_v3(pretrained=True)
    # replace the classifier head to match the number of target classes
    model.fc = nn.Linear(2048, args.num_classes)
    model.aux_logits = False  # disable the auxiliary classifier output
    model = model.to(device)

    optimizer = SGD(model.parameters(), args.lr,
                    momentum=args.momentum,
                    weight_decay=args.weight_decay)
    # optionally resume from a checkpoint
    if args.resume:
        if os.path.isfile(args.resume):
            print("=> loading checkpoint '{}' for inaturalist-inception".format(
                args.resume))
            checkpoint = torch.load(args.resume)
            args.start_epoch = checkpoint['epoch']
            best_prec3 = checkpoint['best_prec3']
            model.load_state_dict(checkpoint['state_dict'], strict=False) # https://stackoverflow.com/questions/63057468/how-to-ignore-and-initialize-missing-keys-in-state-dict
            optimizer.load_state_dict(checkpoint['optimizer'])
            print("=> loaded checkpoint '{}' (epoch {})".format(
                args.resume, checkpoint['epoch']))
        else:
            print("=> no checkpoint found at '{}'".format(args.resume))

    return model, optimizer
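For context, my understanding is that the test path reuses this function with args.resume pointing at the saved checkpoint, roughly like the hypothetical sketch below (test_loader and criterion are the script’s existing objects; the checkpoint filename is an assumption):

args.resume = 'model_best.pth.tar'  # assumed checkpoint filename
model, optimizer = build_model_and_optim()

model.eval()  # disable dropout and use running batch-norm statistics
with torch.no_grad():
    prec3, preds, im_ids = validate(test_loader, model, criterion, True)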

If training, the following loop is invoked:

    for epoch in range(args.start_epoch, args.epochs):
        adjust_learning_rate(optimizer, epoch)

        # train for one epoch
        train(train_loader, model, criterion, optimizer, epoch)

        # evaluate on validation set
        if 1:
            prec3, preds, im_ids = validate(val_loader, model, criterion, True)
            with open('predictions-epoch-' + str(epoch) + '.csv', 'w') as opfile:
                opfile.write('id,predicted\n')
                for ii in range(len(im_ids)):
                    opfile.write(str(im_ids[ii]) + ',' + ' '.join(str(x) for x in preds[ii,:])+'\n')            
        else:
            prec3 = validate(val_loader, model, criterion, False)

        # remember best prec@3 and save checkpoint
        is_best = prec3 > best_prec3
        best_prec3 = max(prec3, best_prec3)
        save_checkpoint({
             'epoch': epoch + 1,
             #'arch': args.arch,
             'state_dict': model.state_dict(),
             'best_prec3': best_prec3,
             'optimizer' : optimizer.state_dict(),
        }, is_best)

This seems to work: the model improves over time and the checkpoints are saved. But testing fails (even though validation works).
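For completeness, save_checkpoint follows the usual PyTorch ImageNet-example pattern, roughly like this (a sketch, assuming the default filenames):

import shutil

def save_checkpoint(state, is_best, filename='checkpoint.pth.tar'):
    # persist the full training state (epoch, weights, optimizer)
    torch.save(state, filename)
    if is_best:
        # keep a separate copy of the best-scoring epoch
        shutil.copyfile(filename, 'model_best.pth.tar')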

Using strict=False in:

model.load_state_dict(checkpoint['state_dict'], strict=False)

is generally not recommended unless you know why it’s used and why mismatches are expected.
Did you check that the returned mismatched keys are expected?
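For example, load_state_dict returns the mismatches when strict=False, and both lists should be empty if the checkpoint really matches the model:

incompatible = model.load_state_dict(checkpoint['state_dict'], strict=False)
print('missing keys:', incompatible.missing_keys)
print('unexpected keys:', incompatible.unexpected_keys)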

If so, you could check the output of the model before and after storing the state_dict, using a static input (e.g. torch.ones), and see if the results are equal (up to the expected floating point precision noise).
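Something like this sketch (adjust the input shape to your model; Inception v3 expects 299x299 inputs):

model.eval()
x = torch.ones(1, 3, 299, 299, device=device)

with torch.no_grad():
    out_before = model(x)

# round-trip the weights through disk
torch.save(model.state_dict(), 'tmp_state.pth')
model.load_state_dict(torch.load('tmp_state.pth'))

with torch.no_grad():
    out_after = model(x)

print(torch.allclose(out_before, out_after))  # should be True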

If this test passes, you could then compare the data loading pipelines between the validation and test runs to see where the difference in the data comes from.
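E.g. compare the transformations and a raw batch from both loaders (a sketch, assuming the datasets expose their preprocessing as .transform and both loaders use shuffle=False):

print(val_loader.dataset.transform)
print(test_loader.dataset.transform)

# with shuffle=False the first batches should contain the same samples
val_batch = next(iter(val_loader))
test_batch = next(iter(test_loader))
print(torch.allclose(val_batch[0], test_batch[0]))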


Yes, this was a case of me trying anything to get the code to “work” (having also dealt with out-of-date libraries and getting things to run on Apple’s M1). I followed the suggestion you made elsewhere (RuntimeError: Error(s) in loading state_dict for Inception3: - #2 by ptrblck) and that seems to help with a trivial test case. I’m now running a larger dataset to see what happens. Thank you for your help.

If anyone has the same issue, the two things I did to fix the problem were:

  1. Restore the original line model.load_state_dict(checkpoint['state_dict']), as suggested by @ptrblck above.

  2. In the epoch loop, just before calling save_checkpoint, I added the following (usage sketch after the snippet):

try:
    # nn.DataParallel wraps the network, so its weights live under model.module
    model_state_dict = model.module.state_dict()
except AttributeError:
    # plain (unwrapped) model
    model_state_dict = model.state_dict()

Based on a suggestion by @alex.veuthey here.
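The try/except is needed because nn.DataParallel stores the network under model.module, which prefixes every state_dict key with module.; unwrapping before saving keeps the keys compatible with a plain model at load time. The unwrapped weights then go into the checkpoint:

save_checkpoint({
    'epoch': epoch + 1,
    'state_dict': model_state_dict,  # unwrapped weights, no 'module.' prefix
    'best_prec3': best_prec3,
    'optimizer': optimizer.state_dict(),
}, is_best)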