Problem:
I have a test set that is too large to classify in a single run (it causes a memory error).
The test set is sorted by class, [0…1…2]: 400 samples of class '0', then 400 of class '1', then 400 of class '2' => 1200 samples in total.
(The trained model reaches ~80% validation accuracy, so I expect roughly 80% test accuracy.)
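For concreteness, here is a small sketch showing that with this ordering, batch_size=400 and shuffle=False put exactly one class into each batch (the construction of y below is hypothetical, just to illustrate the layout):

import torch

# Hypothetical reconstruction of the sorted label vector: 400 of each class.
y = torch.cat([torch.full((400,), c) for c in range(3)])

# With batch_size=400 and shuffle=False, batch k covers indices [400k, 400(k+1)),
# so every batch holds a single class:
for k in range(3):
    batch_labels = y[400 * k : 400 * (k + 1)]
    print(k, batch_labels.unique())  # batch 0 -> [0], batch 1 -> [1], batch 2 -> [2]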
Implemented solution:
test_loader = DataLoader(dataset=testset, batch_size=400, shuffle=False)
Result: test accuracy per batch: [25%, 65%, 35%] => This cannot be correct!
If I change it to:
test_loader = DataLoader(dataset=testset, batch_size=400, shuffle=True)
Result: test accuracy per batch: [77%, 80%, 79%] => This seems legit!
How can the results be so different just because the DataLoader yields batches of mixed classes instead of batches containing a single class? I am baffled; the model should not care about which samples it is classifying! Note that this is not just a per-batch reporting artifact: since all three batches are the same size, the shuffle=False accuracies average to ~42% overall, versus ~79% with shuffle=True.
Code:
import numpy as np
import torch
from torch.utils.data import DataLoader
from sklearn.metrics import accuracy_score

testset = DATA(train_X, train_Y)  # DATA is my Dataset class
test_loader = DataLoader(dataset=testset, batch_size=400, shuffle=False)

for i, data in enumerate(test_loader, 0):
    x_test, y_test = data
    with torch.no_grad():
        output_test = model(x_test.cuda().float())
    # the network outputs log-probabilities, so exp() recovers probabilities
    preds_test = np.argmax(torch.exp(output_test).cpu().numpy(), axis=1)
    acc_test = accuracy_score(y_test, preds_test)
    print(acc_test)
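As an aside, here is a minimal sketch of computing one accuracy over the whole test set by collecting predictions across batches (reusing model and test_loader from above); even with this, I would still expect each individual batch to score near 80%:

import numpy as np
import torch
from sklearn.metrics import accuracy_score

all_preds, all_labels = [], []
with torch.no_grad():
    for x_test, y_test in test_loader:
        output_test = model(x_test.cuda().float())
        all_preds.append(torch.exp(output_test).cpu().numpy().argmax(axis=1))
        all_labels.append(y_test.numpy())

# One accuracy over all 1200 samples, independent of how the batches are composed.
print(accuracy_score(np.concatenate(all_labels), np.concatenate(all_preds)))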