DataLoader yields different results for shuffle=True and shuffle=False

Problem:
I have a testset of samples that is too large for classification in one single run (memory error).
The testset is sorted by class: [0…1…2], with 400 samples of class ‘0’, 400 of class ‘1’ and 400 of class ‘2’ => 1200 samples.
(The trained model yields ~80% validation accuracy => I expect ~80% in test accuracy.)

Implemented solution:
test_loader = DataLoader(dataset=testset,batch_size=400,shuffle=False)
Result: test accuracy of batches: [25%,65%,35%] => Can not be correct!

If I change to:
test_loader = DataLoader(dataset=testset,batch_size=400,shuffle=True)
Result: test accuracy of batches: [77%,80%,79%] => This seems legit!

How can the results be so different if the dataloader loads batches of mixed classes instead of batches with only one class? I am baffled. The model should not care about what it classifies!?

Code:

testset        = DATA(train_X,train_Y)
test_loader    = DataLoader(dataset=testset,batch_size=400,shuffle=False)
for i, data in enumerate(test_loader, 0):
    x_test, y_test = data
    with torch.no_grad():
        output_test = model(x_test.cuda().float())
    preds_test      = np.argmax(list(torch.exp(output_test).cpu().numpy()), axis=1)
    acc_test        = accuracy_score(y_test, preds_test)
    print(acc_test)

I have no idea why the above occurs, but I was able to work around the “mystery” as follows, for future readers:

1. Return the indices of the batches from the class of your Dataset

class DATA(Dataset):

    def __init__(self,x,y,transform=None):  # transform is unused; the default lets DATA(x, y) work
        self.x_data    = x
        self.y_data    = y
        self.len = self.x_data.shape[0]

    def __getitem__(self, index):
        X_data = self.x_data[index]
        Y_data = self.y_data[index]
        return X_data,Y_data, index  ########## Returns the indices

    def __len__(self):
        return self.len

2. Insert the prediction at the correct indices:

preds = np.array(train_Y) ## Copies the whole target vector (1200 samples in my case)

for i, data in enumerate(test_loader, 0):
    x_test, y_test, index = data
    index = index.detach().numpy()
    with torch.no_grad():
        output_test = model(x_test.cuda().float())
    preds_test      = np.argmax(list(torch.exp(output_test).cpu().numpy()), axis=1)
    preds[index]    = preds_test   ### Insert batch (400 samples) predictions at correct indices in the 1200 array.

How are you calculating the accuracy given your predictions?
Are you setting your model to .eval() before executing the test pass?

I calculate the accuracy with accuracy_score:
‘from sklearn.metrics import accuracy_score’

The model is the last one after training has ended. It is not loaded from a file, and I do not call ‘model.eval()’.
It works with the indices, so you really do not have to spend time on this topic, but big thanks for being here :)

Good to hear it’s working, but I would still make sure there is no bug in PyTorch somewhere. ;)

If you keep the model in model.train() and use e.g. batch norm layers, each batch is normalized with its own statistics and the running statistics will still be updated.
If you shuffle the test data, these stats should represent the dataset better than those of a non-shuffled dataset, where each batch contains only one class.
Note that this could be considered data leakage.
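
A minimal sketch of the effect described above, assuming a standalone nn.BatchNorm1d layer and random stand-in data (neither is from the original poster's model). It checks whether a forward pass under torch.no_grad() changes the running mean in train mode and in eval mode:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(4)            # default setup: track_running_stats=True
x = torch.randn(400, 4) + 3.0     # stand-in "test batch" with a shifted mean

# Train mode: the forward pass updates the running stats, even under no_grad.
bn.train()
before = bn.running_mean.clone()
with torch.no_grad():
    bn(x)
changed_in_train = not torch.equal(before, bn.running_mean)

# Eval mode: the running stats are only read, never updated.
bn.eval()
before = bn.running_mean.clone()
with torch.no_grad():
    bn(x)
changed_in_eval = not torch.equal(before, bn.running_mean)

print(changed_in_train, changed_in_eval)
```

torch.no_grad() only disables gradient tracking; it does not switch layers such as batch norm or dropout to their inference behavior, which is what model.eval() does.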

Changed to model.eval() before prediction.
And it is now producing the same results for both shuffle: ‘False’ and ‘True’.
(I have both Dropouts and batch norm layers in my model)

Thanks! And no bug in PyTorch!

hi @ptrblck,
Can you please explain a bit more about how the batch norm stats cause data leakage (when we shuffle the data on the validation set)?
My current model’s validation results differ by 20-25% if I change this single parameter (i.e. shuffle=True to shuffle=False). Ideally, I shouldn’t shuffle the data on the validation set, but these results are quite deceptive.

If you don’t switch to .eval() before using the evaluation dataset (or test set), the batch norm stats will be updated, so your model will “train” using this dataset, which is considered data leakage.
Shuffling the validation or test dataset might even result in “better” stats to leak.

Note that shuffling does not change anything, if you set the model properly to eval() before validating the model.
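
To illustrate this point, here is a rough sketch that runs the same class-sorted inputs through a model in their original order and in a shuffled order, then compares the per-sample outputs. The small untrained model, the synthetic data, and the predict helper are all assumptions made for the example:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in model with both batch norm and dropout.
model = nn.Sequential(
    nn.Linear(8, 16), nn.BatchNorm1d(16), nn.ReLU(),
    nn.Dropout(0.5), nn.Linear(16, 3),
)

# Synthetic data sorted by class: 400 x class 0, 400 x class 1, 400 x class 2.
labels = torch.arange(3).repeat_interleave(400)
x = torch.randn(1200, 8) + 2.0 * labels.unsqueeze(1).float()
perm = torch.randperm(1200)

def predict(net, inputs, batch_size=400):
    """Run the inputs through the net in fixed-size batches."""
    outs = []
    with torch.no_grad():
        for start in range(0, len(inputs), batch_size):
            outs.append(net(inputs[start:start + batch_size]))
    return torch.cat(outs)

# Eval mode: each sample's output is independent of its batch.
model.eval()
same_in_eval = torch.allclose(predict(model, x)[perm], predict(model, x[perm]), atol=1e-5)

# Train mode: batch statistics and active dropout make outputs batch dependent.
model.train()
same_in_train = torch.allclose(predict(model, x)[perm], predict(model, x[perm]), atol=1e-5)

print(same_in_eval, same_in_train)
```

If the eval-mode comparison fails for a real model, that points to a layer that does not respect the eval flag rather than to the DataLoader.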

Got your point regarding .eval() mode, but I fear that shuffle is changing my validation results. My code snippet looks something like this:

for each epoch:
    self.model.train()
    # train data for each batch
    self.model.eval()
    # validate

My validation results differ by 20-25% when I change the shuffle parameter. Note that I am changing the shuffle parameter of my validation data loader (not the train data loader).

While debugging I stumbled upon this StackOverflow question, and surprisingly the OP’s 2nd scenario perfectly fits mine. When I turn off the shuffle (on my validation data loader) there is a huge difference between my train and validation accuracy (train being 95% while validation is stuck at 71%), but the difference vanishes when shuffle is turned on.

What model are you using?
Do you have a custom model with e.g. functional dropout calls in your forward method?
Could you post a code snippet to reproduce this issue?
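
For readers wondering what the functional dropout question is about, here is a small sketch with two hypothetical modules (not taken from anyone's model in this thread). F.dropout defaults to training=True, so calling it without the flag keeps dropout active even after model.eval():

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlwaysDropout(nn.Module):
    """Hypothetical buggy module: dropout ignores the eval flag."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(8, 8)

    def forward(self, x):
        return F.dropout(self.fc(x), p=0.5)  # training defaults to True

class RespectsEval(nn.Module):
    """Same layer, but dropout follows self.training."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(8, 8)

    def forward(self, x):
        return F.dropout(self.fc(x), p=0.5, training=self.training)

torch.manual_seed(0)
x = torch.randn(4, 8)
buggy, fixed = AlwaysDropout().eval(), RespectsEval().eval()

with torch.no_grad():
    # Two identical forward passes should agree if dropout is really off.
    buggy_deterministic = torch.equal(buggy(x), buggy(x))
    fixed_deterministic = torch.equal(fixed(x), fixed(x))

print(buggy_deterministic, fixed_deterministic)
```

Running the same input twice in eval mode and comparing the outputs is a cheap check for this kind of bug.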

I have the same problem as well. My model is fc->bn->relu->fc->bn->logsoftmax
I have already set model.eval() before testing.

If I set shuffle=True, the output is:
Accuracy: 2270/3200 (70.94%)
Testing accuracy: 70.94

If I set shuffle=False, the output is:
Test set: Average loss: 3.3379
Accuracy: 757/3200 (23.66%)
Testing accuracy: 23.66

There is a huge difference.

Could you post your code so that we could have a look?

Hi everybody,
I have a similar problem. I have trained my model (a CNN with batch normalizations and dropouts) and stopped the learning process after I reached ~80% on the validation set (using a DataLoader with shuffle=True). I saved my model with

torch.save(self.model.state_dict(), filename)

I wrote another script, with the following code:

def simple_validator(device, model_file, data_folder, chunk_size):
    val_data = riq.IQDataset(data_folder=data_folder, chunk_size=chunk_size, validation=True)
    val_data.normalize(torch.tensor([-3.1851e-06, -7.1862e-07]), torch.tensor([0.0002, 0.0002]))
            
    model = brain.CharmBrain(chunk_size)
    model.load_state_dict(torch.load(model_file))
    model.to(device)
    model.eval()
            
    tot = 0 
    correct = 0
    with torch.no_grad():
        for chunk, label in val_data: 
            chunk = chunk.to(device, non_blocking=True)
            output = model(chunk.unsqueeze(0))
            _, predicted = torch.max(output, dim=1) 
            if predicted.item() == label:
                correct += 1
            tot += 1
    print(f"Accuracy: {correct/tot}")
                    
                        
def validator(device, model_file, data_folder, chunk_size):
    val_data = riq.IQDataset(data_folder=data_folder, chunk_size=chunk_size, validation=True)
    val_data.normalize(torch.tensor([-3.1851e-06, -7.1862e-07]), torch.tensor([0.0002, 0.0002]))
    val_loader = torch.utils.data.DataLoader(val_data, batch_size=512, shuffle=True, num_workers=20, pin_memory=True)
                
    model = brain.CharmBrain(chunk_size) 
    model.load_state_dict(torch.load(model_file))
    model.to(device)
    model.eval()
    
    tot = 0
    correct = 0
    with torch.no_grad():
        for chunks, labels in val_loader:
            chunks = chunks.to(device, non_blocking=True)
            labels = labels.to(device, non_blocking=True)
            output = model(chunks)
            _, predicted = torch.max(output, dim=1)
            correct += int((predicted == labels).sum())
            tot += labels.shape[0]
        print(f"Accuracy: {correct/tot}")
     

The two functions (validator and simple_validator) operate with the same model parameter file and the same data; however:

  1. multiple runs of simple_validator return the same value, ~0.336
  2. multiple runs of validator return different values, but all around ~0.81
  3. multiple runs of validator setting shuffle=False return different values, but all around 0.3365

(Note, this model is a classifier with 3 classes, so accuracy of 0.33 means random output)

Any idea where the bug could be?

Thanks,
Luca

I retrained my model without Batch normalization and AlphaDropout. During learning, I could achieve an accuracy on the validation set (through the DataLoader, shuffle=True) of ~0.48, and now:

  1. validator returns different values on each run, but all in the range of ~0.48, with and without shuffle=True
  2. validator and simple_validator always return the same value when num_workers=1 (with both shuffle=True and shuffle=False)

My naive conclusions: model.eval() does not completely fix the values of the batch normalization and dropout layers; there is something weird happening when using multiple processes to load the data.

Can anybody explain any of these results?

model.eval() should use the running stats if the batchnorm layers were created in the default setup. Could you check if you are initializing them with track_running_stats=False?
Also, dropout should be disabled if the nn.Dropout module is used or the functional API is called via F.dropout(input, training=self.training). Could you check if you are using the functional API without passing the training flag?
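
A quick way to run the first check is sketched below. The helper find_batch_stat_layers and the toy network are hypothetical names for illustration; the sketch also shows that a layer created with track_running_stats=False makes a sample's output depend on the rest of its batch, even in eval mode:

```python
import torch
import torch.nn as nn

def find_batch_stat_layers(model):
    """Return names of batch norm layers that have no running stats."""
    bn_types = (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)
    return [name for name, m in model.named_modules()
            if isinstance(m, bn_types) and not m.track_running_stats]

def first_row_depends_on_batch(bn, x):
    """Put the same sample into two different batches in eval mode."""
    bn.eval()
    with torch.no_grad():
        out_a = bn(x[:3])[0]           # sample 0 with samples 1 and 2
        out_b = bn(x[[0, 4, 5]])[0]    # sample 0 with samples 4 and 5
    return not torch.allclose(out_a, out_b)

torch.manual_seed(0)
x = torch.randn(6, 4)
default_bn = nn.BatchNorm1d(4)
batch_stat_bn = nn.BatchNorm1d(4, track_running_stats=False)

default_depends = first_row_depends_on_batch(default_bn, x)
batch_stat_depends = first_row_depends_on_batch(batch_stat_bn, x)

net = nn.Sequential(nn.Linear(4, 4), batch_stat_bn)
flagged = find_batch_stat_layers(net)
print(default_depends, batch_stat_depends, flagged)
```

An empty list from the helper means every batch norm layer it recognizes will use its running stats after model.eval().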

Hi @ptrblck, thank you for your reply; indeed I was using track_running_stats=False. However, I am using dropouts with default settings (AlphaDropouts, to be precise).
If batch norm uses the wrong stats in validation, that could explain the drop in performance.
However, I am more puzzled by why using a loader (num_workers=1) gives different results compared to not using one, and by why using multiple loading processes (num_workers>1) gives different values on each run.

I wouldn’t say it’s using the wrong stats, but the batch statistics will be used, even in eval mode, if track_running_stats=False is set.

If you are using 3rd party libs in the data loading pipeline, you might need to seed them in the worker_init_fn. Since the fork method is used for multiprocessing on Linux, the random states might be cloned from the parent process.
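
A sketch of what such a worker_init_fn could look like, assuming NumPy and Python's random module are the third-party sources of randomness in the pipeline. The function name seed_worker and the toy dataset are placeholders:

```python
import random

import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

def seed_worker(worker_id):
    # Inside a worker, torch.initial_seed() already differs per worker,
    # so deriving the other seeds from it gives each worker its own stream.
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

# Toy dataset standing in for the real one.
dataset = TensorDataset(torch.arange(8).float())

# A seeded generator makes the shuffling order reproducible as well.
generator = torch.Generator()
generator.manual_seed(0)

loader = DataLoader(
    dataset,
    batch_size=4,
    shuffle=True,
    num_workers=2,
    worker_init_fn=seed_worker,
    generator=generator,
)
```

This only addresses cloned random states. It would not fix a dataset class that shares mutable state between workers.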

Hi,
it turned out the dataset class I had been using had a race condition bug. Whenever I used more than one process to fetch the data, I ended up with corrupted data.
Once I fixed the bug and removed the track_running_stats=False flag, everything started to work as expected.

Best,
Luca