CNN inference - different results on each run

I have a trained PyTorch CNN model that classifies 7 different blocks. When I run inference on the same batch of images, each run gives different results, for example:

=====================
Run 1
pred truth match

block_1 block_0 no
block_1 block_1 yes
block_1 block_1 yes
block_1 block_2 no
block_1 block_2 no
block_1 block_2 no
block_1 block_3 no
block_6 block_3 no
block_6 block_3 no
block_1 block_4 no
block_1 block_4 no
block_1 block_5 no
block_1 block_5 no
block_1 block_6 no
block_1 block_6 no

Run 2
pred truth match

block_3 block_0 no
block_3 block_1 no
block_3 block_1 no
block_3 block_2 no
block_3 block_2 no
block_6 block_2 no
block_3 block_3 yes
block_3 block_3 yes
block_3 block_3 yes
block_3 block_4 no
block_3 block_4 no
block_3 block_5 no
block_3 block_5 no
block_3 block_6 no
block_3 block_6 no

My model loader function:

def load_model(model, model_dir='models', model_file_name='blocks.pt'):

    model_path = os.path.join(model_dir, model_file_name)
    model.load_state_dict(torch.load(model_path), strict=False)

    return model

My prediction function:

def prediction(model, device, batch_input):

    model.to(device)
    model.eval()

    data = batch_input.to(device)
    output = model(data)

    prob = F.softmax(output, dim=1)
    pred_prob = prob.data.max(dim=1)[0]
    pred_index = prob.data.max(dim=1)[1]

    return pred_index.cpu().numpy(), pred_prob.cpu().numpy()

My image transforms, which match the training set dimensions etc.:

def image_common_transforms(mean=(0.4515, 0.3976, 0.3339), std=(0.3639, 0.3361, 0.3224)):

    common_transforms = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean, std)
    ])

    return common_transforms

The input batches are identical in each case. I was under the impression that model.eval() would stop any random behaviour. Am I misunderstanding something here?
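
Roughly, this is the loop I expect to print identical predictions every time (sketch only; MyBlockCNN and batch_input stand in for my actual model class and preprocessed batch):

import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# MyBlockCNN and batch_input are placeholders for my actual model class
# and the image batch built with image_common_transforms.
model = load_model(MyBlockCNN(num_classes=7))

for run in range(2):
    pred_index, pred_prob = prediction(model, device, batch_input)
    print(f'Run {run + 1} predictions: {pred_index}')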

Check the absolute differences in the model output as well as intermediate activations to narrow down where the differences are coming from.

Hi - could you advise how to do that?

You could use forward hooks, or store the activations directly in the forward method (e.g. in a list), for each run and compare them later.
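
Something along these lines should work (rough sketch, assuming the model, device, and batch_input from your code; you might want to hook only specific layers instead of every leaf module):

import torch

activations = {}

def save_activation(name):
    # Returns a hook that stores the layer's output under the given name.
    def hook(module, inputs, output):
        activations[name] = output.detach().cpu()
    return hook

# Register a hook on every leaf module (or pick specific layers by name).
handles = []
for name, module in model.named_modules():
    if len(list(module.children())) == 0:
        handles.append(module.register_forward_hook(save_activation(name)))

with torch.no_grad():
    out1 = model(batch_input.to(device))
    acts_run1 = dict(activations)

    out2 = model(batch_input.to(device))
    acts_run2 = dict(activations)

# Compare the final outputs and every stored activation between the two runs.
print('max abs output diff:', (out1 - out2).abs().max().item())
for name in acts_run1:
    diff = (acts_run1[name] - acts_run2[name]).abs().max().item()
    print(f'{name}: max abs diff = {diff}')

for handle in handles:
    handle.remove()

If the very first hooked layer already differs between runs, the divergence comes from the weights or the input rather than from the later layers.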

Thanks, forward hooks are a great tool. I found my error: I was missing an activation layer when running the model for inference.

This wouldn’t explain why rerunning the same model (with the missing activation layer) with the same input yields different results, would it?

I trained the model on one computer and wrote the inference code separately on another computer, where I forgot to add an activation layer after one of the fully connected layers. Now everything is working as expected.
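
For anyone else hitting this, the mismatch was along these lines (illustrative layer names and sizes only, not my actual model):

import torch.nn as nn
import torch.nn.functional as F

class BlockClassifierHead(nn.Module):
    # Illustrative only -- the layer names and sizes here are made up.
    def __init__(self, in_features=512, num_classes=7):
        super().__init__()
        self.fc1 = nn.Linear(in_features, 256)
        self.fc2 = nn.Linear(256, num_classes)

    def forward(self, x):
        # My first inference-side forward skipped the activation:
        #     x = self.fc1(x)
        # The training code applies ReLU after fc1, so inference must too:
        x = F.relu(self.fc1(x))
        return self.fc2(x)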