Why does my test result change so much when all I change is "shuffle=False" to "shuffle=True" in my DataLoader at test time?

I use PyTorch 0.4 and Python 3.6.
My test code is:

import numpy as np
import torch
from tqdm import tqdm

def test_model(dataloader, model, test_num=3000):
    device = torch.device("cuda" if opt.use_cuda else "cpu")
    total = 0
    correct_pos = 0
    correct_num = 0
    predicted_pos = 0
    pos_num = 0
    with torch.no_grad():
        for ii, (img, label) in tqdm(enumerate(dataloader)):
            label = label.view(len(label)).numpy()
            img = img.to(device)
            output = model(img)  # input is 1*3*299*299
            _, predicted = torch.max(output.data, dim=1)
            predicted = predicted.cpu().numpy()
            total += len(label)
            predicted_pos += predicted.sum()
            correct = predicted == label
            correct_num += correct.sum()
            correct_pos += np.logical_and(label, correct).sum()
            pos_num += label.sum()
            if ii > test_num:  # note: this counts batches, not samples
                break
    recall = correct_pos / pos_num
    precision = correct_pos / predicted_pos
    accuracy = correct_num / total
    neg_precision = (correct_num - correct_pos) / (total - pos_num)
    f_num = 2 * recall * precision / (recall + precision)
    # print('Accuracy of the network :' + str(accuracy))
    # print('Recall of the network :' + str(recall))
    # print('Precision of the network :' + str(precision))
    # print('Neg_precision of the network :' + str(neg_precision))
    return {"Recall": recall,
            "Accuracy": accuracy,
            "Precision": precision,
            "Neg_precision": neg_precision,
            "F": f_num}

With shuffle=True:

{'Recall': 0.7745604963805585, 'Accuracy': 0.9315490043961727, 'Precision': 0.9413095387708935, 'Neg_precision': 0.9838965517241379, 'F': 0.8498326431043286}

With shuffle=False:

{'Recall': 0.19451913133402274, 'Accuracy': 0.639384535815878, 'Precision': 0.23404255319148937, 'Neg_precision': 0.7877241379310345, 'F': 0.2124583498051618}

You probably don’t want to shuffle your test data loader … it is even safer to test the samples one by one, I think.
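To make the result independent of sample order, one option is to drop the early break and run over the whole loader with the model in eval mode. A minimal sketch, assuming a hypothetical helper `evaluate_full` (not from the original post):

```python
import torch

def evaluate_full(dataloader, model, device):
    """Evaluate over the *entire* loader, with no early break, so the
    accuracy does not depend on whether the loader is shuffled."""
    model.eval()  # freeze dropout / batchnorm statistics
    correct, total = 0, 0
    with torch.no_grad():
        for img, label in dataloader:
            img, label = img.to(device), label.to(device)
            pred = model(img).argmax(dim=1)
            correct += (pred == label).sum().item()
            total += label.numel()
    return correct / total
```

Since every sample is visited exactly once, shuffle=True and shuffle=False give identical results here.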

but I want know what cause the differences. Do you know?

It's probably because you don't enumerate over all of the test data.
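To see why that matters: the loop breaks after `test_num` batches, so only a prefix of the dataset is evaluated. If the dataset on disk is ordered by class (the counts below are hypothetical, just for illustration), that prefix is not representative, while a shuffled subset of the same size is:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical class-ordered test set: 7000 negatives first, then 3000
# positives, as many image-folder datasets are laid out on disk.
labels = np.array([0] * 7000 + [1] * 3000)

# Taking only the first 3000 samples without shuffling sees no positives...
prefix = labels[:3000]
print(prefix.mean())  # fraction of positives in the evaluated subset -> 0.0

# ...while a shuffled subset of the same size matches the true class balance.
shuffled = rng.permutation(labels)[:3000]
print(abs(shuffled.mean() - 0.3) < 0.05)  # True: roughly 30% positive
```

So with shuffle=False the metrics are computed on a class-skewed subset, which explains the large gap between the two runs.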

I should use model.eval() before I test it.

But why does model.eval() influence the result?
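model.eval() switches layers like Dropout and BatchNorm into inference behavior: dropout stops zeroing activations and batchnorm uses its running statistics instead of per-batch ones. A minimal sketch with just a Dropout layer:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 100)

drop.train()   # training mode: roughly half the units are zeroed (and rescaled)
train_out = drop(x)

drop.eval()    # eval mode: dropout becomes a no-op
eval_out = drop(x)

print(torch.equal(eval_out, x))        # True: eval leaves the input unchanged
print((train_out == 0).any().item())   # some units were dropped in train mode
```

Without model.eval(), the randomness of dropout and the per-batch statistics of batchnorm make the outputs depend on batch composition, which also changes with shuffling.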

By using shuffle=True you shuffle the dataset, which makes the training batches more varied; that in turn makes your model more generalized, so it performs better on unseen data.