Pretrained model labels are reversed during evaluation

Hi there,

Thanks in advance for the help.

I am doing transfer learning with several pretrained torchvision models on an ImageFolder dataset. When I run my evaluation loop, the accuracy drops further below chance level the longer the model trains. But if I add a small hack that reverses the predicted labels, the accuracy jumps to a level comparable to the training accuracy. I printed the class-to-index dictionaries for my training and evaluation datasets, and the same indices correspond to the same labels.

i.e. running this:

    trainDataset = torchvision.datasets.ImageFolder(trainDataPath, 
    trainLoader = DataLoader(trainDataset,batch_size=batchSize, shuffle=True)
    numCategories = len(trainDataset.classes)
    valDataset = torchvision.datasets.ImageFolder(valDataPath, 
    valLoader = DataLoader(valDataset,batch_size=batchSize, shuffle=True)


returns identical dictionaries:

{'LeftFractures': 0, 'LeftNon-Fractures': 1, 'RightFractures': 2, 'RightNon-Fractures': 3}
{'LeftFractures': 0, 'LeftNon-Fractures': 1, 'RightFractures': 2, 'RightNon-Fractures': 3}
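
For anyone who wants to check this with an assertion rather than by eye: as far as I know, ImageFolder assigns indices by sorting the class sub-folder names and numbering them from 0. Below is a minimal sketch of that rule in plain Python. The helper name `folderClassToIdx` and the temporary folder layout are made up for illustration; with the real datasets the check would simply compare `trainDataset.class_to_idx` and `valDataset.class_to_idx`.

```python
import os
import tempfile

def folderClassToIdx(root):
    # Hypothetical helper: mimics the sorted-sub-folder rule assumed above.
    classes = sorted(d for d in os.listdir(root)
                     if os.path.isdir(os.path.join(root, d)))
    return {name: idx for idx, name in enumerate(classes)}

# Deliberately unsorted, to show that creation order does not matter.
categories = ["RightNon-Fractures", "LeftFractures",
              "RightFractures", "LeftNon-Fractures"]

with tempfile.TemporaryDirectory() as base:
    for split in ("Train", "Val"):
        for category in categories:
            os.makedirs(os.path.join(base, split, category))
    trainMap = folderClassToIdx(os.path.join(base, "Train"))
    valMap = folderClassToIdx(os.path.join(base, "Val"))

# Both splits must produce the same label-to-index mapping.
assert trainMap == valMap
print(trainMap)
```

If the two mappings ever differed (for example because one split is missing a class folder), the assertion would fail right away.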

My evaluation loop runs as a separate function:

def evaluateModel(model, valLoader, lossFunction, sensitivityFunction, specificityFunction, cuda, numCategories, labelSmoothing=False):


    model = model.eval()

    lossList = []
    accuracyList = []

    sensitivityList = []
    specificityList = []

    predictions = torch.LongTensor([]).cuda()
    ys = torch.LongTensor([]).cuda()
    for batch, (x,y) in enumerate(valLoader):
        #x = torch.randn_like(x) # FIXME: remove later
        if cuda:
            y = y.cuda()
            x = x.cuda()

        yHat = model(x)

        predictedCategory = torch.argmax(yHat, dim=1)
        predictedCategory = (predictedCategory * -1) + numCategories - 1 # **hack because the model's predictions seem to be reversed somehow** 

        predictions = torch.cat((predictions, predictedCategory), dim=0)
        ys = torch.cat((ys, y), dim=0)

        if labelSmoothing:
            # note that if labelSmoothing = True and the loss is something like BCE, it will throw an error
            smoothY = smoothLabel(y, numCategories, alpha = 0.2, cuda=cuda)
        else:
            smoothY = y

        loss = lossFunction(yHat, smoothY)
        lossList.append(loss.item())  # collect per-batch losses for the mean returned below

    model = model.train()

    sensitivities = sensitivityFunction(predictions, ys, numCategories)
    specificities = specificityFunction(predictions, ys, numCategories)

    accuracy = torch.sum(predictions == ys).item() / predictions.shape[0]

    return np.mean(lossList), sensitivities, specificities, accuracy


Note that it prints the same dictionary as before:

{'LeftFractures': 0, 'LeftNon-Fractures': 1, 'RightFractures': 2, 'RightNon-Fractures': 3}
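
To spell out what the hack on `predictedCategory` in the evaluation loop does, here is the same arithmetic applied to plain integers. The function name `reverseLabel` is only for illustration.

```python
numCategories = 4

def reverseLabel(predictedCategory, numCategories):
    # Same arithmetic as the hack in the evaluation loop, on a plain int.
    return (predictedCategory * -1) + numCategories - 1

for p in range(numCategories):
    print(p, "->", reverseLabel(p, numCategories))
# prints:
# 0 -> 3
# 1 -> 2
# 2 -> 1
# 3 -> 0
```

With the dictionary above, this swaps LeftFractures with RightNon-Fractures and LeftNon-Fractures with RightFractures, so both the side and the fracture status flip at once.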

For comparison, this is my training loop:

       for batchNum, (x, y) in enumerate(trainLoader):

            if cuda:
                x = x.cuda()
                y = y.cuda()

            yHat = model(x)

            predCat = torch.argmax(yHat, dim=1)

            acc = torch.sum(predCat == y).item() / y.shape[0]

            if labelSmooth:
                # note that if labelSmoothing = True and the loss is something like BCE, it will throw an error
                smoothY = smoothLabel(y, numCategories, alpha = 0.2, cuda=cuda)
            else:
                smoothY = y

            loss = lossFunction(yHat, smoothY)

Brownie points and all manner of gratitude if anyone can help me understand what I am doing wrong to make my model’s predictions come out reversed during evaluation time!

Also note that the phenomenon is consistent across several models, from Alexnet to Googlenet. In each case I replace the final linear layer to match the image size and number of categories (4).

Thanks again!

Did you check the images in the training and validation folders and make sure they weren’t copied in the wrong order?

Hi @ptrblck, thanks for the quick response!

The files were individually labeled according to their category before any splitting into train/test/val. When I manually check the file names in the category subfolders of the train, test, and val folders, they all correspond to the correct category. Furthermore, I separated the original data into train/test/val groups programmatically, and when I review my code, I don’t see any room for a mishap. If you would like to double-check that, here’s my code:

import os
import random
import shutil

def getTrainTestVal(inputFolder):

    files = []
    for file in os.listdir(inputFolder):
        files.append(inputFolder + "/" + file)

    numFiles = len(files)
    eightyPercent = int(0.8 * numFiles)
    tenPercent = int(0.1 * numFiles)

    train = files[0:eightyPercent]
    test = files[eightyPercent:(tenPercent + eightyPercent)]
    val = files[(tenPercent + eightyPercent):]

    print(len(train) + len(test) + len(val))

    return train, test, val

def makeSubDirectories(root):
    lf = root + "/LeftFractures"
    rf = root + "/RightFractures"
    lnf = root + "/LeftNon-Fractures"
    rnf = root + "/RightNon-Fractures"
    directories = [lf, rf, lnf, rnf]
    for directory in directories:
        if not os.path.exists(directory):
            os.makedirs(directory)

def makeDirectories(root):
    base = root + "/Separated"
    if not os.path.exists(base):
        os.makedirs(base)
    train = base + "/Train"
    test = base + "/Test"
    val = base + "/Val"

    directories = [train, test, val]
    for directory in directories:
        if not os.path.exists(directory):
            os.makedirs(directory)

originalRoot = "path_to_original_folder"
leftF = originalRoot + "/Fracturesjpg/LeftFractures"
rightF = originalRoot + "/Fracturesjpg/RightFractures"
leftNF = originalRoot + "/Nonfracturesjpg/LeftNon-Fracture"
rightNF = originalRoot + "/Nonfracturesjpg/RightNon-Fracture"

LFtrain, LFtest, LFval = getTrainTestVal(leftF)
RFtrain, RFtest, RFval = getTrainTestVal(rightF)
LNFtrain, LNFtest, LNFval = getTrainTestVal(leftNF)
RNFtrain, RNFtest, RNFval = getTrainTestVal(rightNF)

root = "location_of_output_directories"

def copyOver(root, imageType, dataCategory, imageList):
    root = root + "/Separated"
    root = root + "/" + dataCategory
    root = root + "/" + imageType + "/"

    for image in imageList:
        fileName = image.split("/")[-1]  # keep only the file name, without shadowing imageList
        path = root + fileName
        shutil.copy(image, path)

copyOver(root, "LeftFractures", "Train", LFtrain)
copyOver(root, "RightFractures", "Train", RFtrain)
copyOver(root, "LeftNon-Fractures", "Train", LNFtrain)
copyOver(root, "RightNon-Fractures", "Train", RNFtrain)

copyOver(root, "LeftFractures", "Test", LFtest)
copyOver(root, "RightFractures", "Test", RFtest)
copyOver(root, "LeftNon-Fractures", "Test", LNFtest)
copyOver(root, "RightNon-Fractures", "Test", RNFtest)

copyOver(root, "LeftFractures", "Val", LFval)
copyOver(root, "RightFractures", "Val", RFval)
copyOver(root, "LeftNon-Fractures", "Val", LNFval)
copyOver(root, "RightNon-Fractures", "Val", RNFval)
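
As a sanity check on the slicing in `getTrainTestVal`, here is a small sketch with made-up file names (103 of them, chosen only so the percentages do not divide evenly). It uses the same 80/10/remainder arithmetic and confirms that the three groups are disjoint and together cover every file.

```python
# Made-up file names purely for illustration.
files = ["img_%03d.jpg" % i for i in range(103)]

numFiles = len(files)
eightyPercent = int(0.8 * numFiles)
tenPercent = int(0.1 * numFiles)

# Same slices as in getTrainTestVal.
train = files[0:eightyPercent]
test = files[eightyPercent:(tenPercent + eightyPercent)]
val = files[(tenPercent + eightyPercent):]

# No file may appear in more than one group.
assert not set(train) & set(test)
assert not set(train) & set(val)
assert not set(test) & set(val)
# Every file ends up in exactly one group.
assert len(train) + len(test) + len(val) == numFiles

print(len(train), len(test), len(val))
```

The validation group absorbs the rounding remainder, so it can be slightly larger than the test group.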

Furthermore still, this is the second script I have written to separate the data into train/test/val groups. I redid the script thinking perhaps that somehow I had done just what you suggested, and mixed the labels up while separating the data.

Is there anything else I can provide to help?

Your code looks alright. Especially since you used the folder and class names directly, which doesn’t leave much room for errors.

If you don’t remap the validation targets, is the model worse than random guessing?
This is indeed the first time I’m seeing such an issue, and usually there is something wrong with the data.
Also, could you manually load some validation data, check the classes yourself (I assume you can classify the images), and check the predictions again? While the splitting seems to be alright, there might still be a mismatch somewhere in the overall data loading pipeline. E.g. are you using a custom sampler or collate_fn?
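
One way to make that check systematic is a confusion matrix over the validation set: if the predictions really are reversed, the counts should pile up on the anti-diagonal instead of the diagonal. A minimal sketch in plain Python follows. The label lists are made up for illustration and are not results from this thread; with the evaluation loop above, `ys.tolist()` and `predictions.tolist()` (taken before applying the hack) could be passed in instead.

```python
def confusionMatrix(targets, predictions, numCategories):
    # matrix[t][p] counts samples with true class t and predicted class p.
    matrix = [[0] * numCategories for _ in range(numCategories)]
    for t, p in zip(targets, predictions):
        matrix[t][p] += 1
    return matrix

# Made-up labels, chosen to look like a mostly reversed prediction pattern.
targets = [0, 0, 1, 1, 2, 2, 3, 3]
predictions = [3, 3, 2, 2, 1, 0, 0, 0]

numCategories = 4
matrix = confusionMatrix(targets, predictions, numCategories)
for row in matrix:
    print(row)

diagonal = sum(matrix[i][i] for i in range(numCategories))
antiDiagonal = sum(matrix[i][numCategories - 1 - i] for i in range(numCategories))
print("diagonal:", diagonal, "anti-diagonal:", antiDiagonal)
```

A matrix whose errors are spread evenly would point toward a model that is simply not learning, while a dominant anti-diagonal would support the reversal theory.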