Predictions run on the same data differ a lot

I have recently been trying to fit a model for cat/dog recognition and noticed some strange behaviour. At the end of every training epoch I ran validation. When the shuffle flag in my validation DataLoader was set to True, the validation loss was close to the training loss. However, when I switched shuffle to False, the validation loss suddenly became much worse (though still better than random guessing). While investigating, I reduced it to a simpler problem (at least in terms of a minimal example), which I describe below.

I have trained a model and now I'm trying to use it for prediction. The code is below.

import numpy as np
import pandas as pd
from PIL import Image
from typing import Callable, List
from pathlib import Path

from torchvision import transforms
from torch.utils.data import Dataset, DataLoader
from torch.nn.functional import softmax
import torch

from imgrec.models import get_model, LoadParams
from imgrec.utils import ProblemType


class ImageBag(Dataset):
    def __init__(self, paths: List[str], transform: Callable):
        self.paths = paths
        self.transform = transform

    def __getitem__(self, i):
        img = Image.open(str(self.paths[i])).convert('RGB')
        return self.transform(img), self.paths[i]

    def __len__(self):
        return len(self.paths)


def load_model():
    ...  # skipped for brevity

def main():
    transform = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ])
    paths = [
        "/mnt/ml-team/homes/grzegorz.los/cats_and_dogs/valid/cat/cat.10114.jpg",
        "/mnt/ml-team/homes/grzegorz.los/cats_and_dogs/valid/cat/cat.5516.jpg"
    ]
    img_bag = ImageBag(paths, transform=transform)
    loader = DataLoader(img_bag,
                        batch_size=2,
                        shuffle=False,
                        num_workers=0)
    model = load_model()
    model = model.cuda()
    for _ in range(5): # repeat prediction a few times
        for batch in loader:
            with torch.no_grad():
                im_tensor, paths_tuple = batch
                im_tensor = im_tensor.cuda()
                logits_tensor = model(im_tensor)
                probs = softmax(logits_tensor, dim=1)
                for path, prob in zip(paths_tuple, probs):
                    print(Path(path).stem, prob.cpu().detach().numpy())
                print('-'*50)

if __name__ == '__main__':
    main()

What the code does is essentially:

  • prepare a data loader of two images,
  • load a model,
  • use the model to predict labels of these two images a few times (in a loop).

And this is the output I received:

cat.10114 [0.96862036 0.03137956]
cat.5516 [0.7262352  0.27376482]
--------------------------------------------------
cat.10114 [0.9201531 0.0798469]
cat.5516 [0.8188935  0.18110651]
--------------------------------------------------
cat.10114 [0.97118205 0.02881794]
cat.5516 [0.92866737 0.07133257]
--------------------------------------------------
cat.10114 [0.949634   0.05036597]
cat.5516 [0.8162648 0.1837352]
--------------------------------------------------
cat.10114 [0.95325416 0.04674589]
cat.5516 [0.8434967  0.15650335]
--------------------------------------------------

The probability of a cat varies from 0.92 to 0.97 on one image, and from 0.73 to 0.93 on the other. I realize that GPU computations can be non-deterministic, but a difference of ~0.2 must point to a bigger issue.

I will be grateful for any advice.

torch==0.4.0
torchvision==0.2.1

You should set your model to evaluation mode using model.eval().
This will change the behavior of some layers, e.g. nn.Dropout won’t drop units anymore and nn.BatchNorm will use its running estimates instead of the batch statistics.
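In the main() from your snippet, that would look roughly like this (remember to switch back with model.train() before any further training):

model = load_model()
model = model.cuda()
model.eval()  # nn.Dropout and nn.BatchNorm now use their inference behaviour
with torch.no_grad():
    for batch in loader:
        im_tensor, paths_tuple = batch
        im_tensor = im_tensor.cuda()
        probs = softmax(model(im_tensor), dim=1)
        for path, prob in zip(paths_tuple, probs):
            print(Path(path).stem, prob.cpu().numpy())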

model.eval() solved my original problem as well, thanks!

Still, I find it very interesting that the predictions were quite good when the validation data was shuffled and much worse without shuffling (i.e. first a series of batches of one category and then a series of batches of the other). I guess it's a matter of these “running estimates”, but it's quite surprising to me that the influence was so big.

Good to hear it’s working now!

You mean since you've kept model.train()?
If so, I think you are right. If you sort the data by class, the mean and std of the samples might differ quite a lot between classes, so the running estimates will end up dominated by the statistics of the classes seen last.
Of course it depends on the data you are using, and since you are apparently working with cat and dog pictures I wouldn't expect the effect to be that bad.
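Here's a rough toy illustration of that effect with made-up 1-D features (class A centred at -2, class B at +2, purely hypothetical numbers): feeding class-sorted batches through a BatchNorm layer in train mode leaves the running mean biased towards the class seen last, while shuffled batches keep it near the overall mean.

import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical features: class A centred at -2, class B at +2
class_a = torch.randn(200, 1) - 2.0
class_b = torch.randn(200, 1) + 2.0

# Sorted case: all class-A batches first, then all class-B batches
bn_sorted = nn.BatchNorm1d(1)  # default momentum of 0.1
bn_sorted.train()
for batch in torch.cat([class_a, class_b]).split(10):
    bn_sorted(batch)
print('running mean, sorted batches:  ', bn_sorted.running_mean.item())
# -> pulled towards the class-B mean (+2), far from the overall mean (0)

# Shuffled case: both classes mixed within each batch
bn_shuffled = nn.BatchNorm1d(1)
bn_shuffled.train()
for batch in torch.cat([class_a, class_b])[torch.randperm(400)].split(10):
    bn_shuffled(batch)
print('running mean, shuffled batches:', bn_shuffled.running_mean.item())
# -> stays close to the overall mean (0)

Once you call model.eval(), validation uses whatever running estimates were accumulated during training and doesn't update them, so the order of the validation batches no longer matters.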