Not getting the same predictions after two consecutive validation forward passes

Hi everyone,

I conducted the following sanity check experiment:

  • after defining the dataset and its respective data loaders (with shuffle=False), I perform a forward pass on the entire dataset (without modifying the weights) and compute a certain observable (in my case, the fraction of times a specific class is chosen by the model)

  • at this point, without altering the labels in the dataset, I redefine the data loader (identically to the previous one) and repeat the same experiment, recomputing my observable.

I obtain slightly different results in the two cases, which surprises me, since I am performing two consecutive forward passes over identical copies of the same dataset and:

  • the network remains unchanged during the forward pass (the weights don’t change)

  • the batches are sent in the same order (since shuffle=False)

Here is a simplified version of the code implementing the pipeline described above ([…] indicates code that I omit for conciseness):

train_ds = ImageFolder(data_dir+'/train', train_tfms) # import the dataset (applying some pre-processing)

train_dl = DataLoader(train_ds, batch_size, shuffle=False, num_workers=0, pin_memory=True) # define the dataloader (shuffle=False so the batches always come in the same order)

# wrap the dataloader so that each batch is moved to the GPU
device = "cuda:1" if torch.cuda.is_available() else "cpu"
train_dl = DeviceDataLoader(train_dl, device)


#define a model (in my case a simple multi-layer perceptron)
[...]

# computing the observable
[...]


# redefine a dataloader identical to the one above

train_dl = DataLoader(train_ds, batch_size, shuffle=False, num_workers=0, pin_memory=True)
train_dl = DeviceDataLoader(train_dl, device)

# recompute the observable
[...]

Just for the sake of clarity, I report the definition of the DeviceDataLoader class used above:

def to_device(data, device):
    """Move tensor(s) to chosen device"""
    if isinstance(data, (list,tuple)):
        return [to_device(x, device) for x in data]
    return data.to(device, non_blocking=True)

class DeviceDataLoader():
    """Wrap a dataloader to move data to a device"""
    def __init__(self, dl, device):
        self.dl = dl
        self.device = device
        
    def __iter__(self):
        """Yield a batch of data after moving it to device"""
        for b in self.dl: 
            yield to_device(b, self.device)

    def __len__(self):
        """Number of batches"""
        return len(self.dl)
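
For completeness, here is a minimal sketch of the same check done directly on the raw outputs instead of the aggregated observable (reusing model, train_ds, batch_size and device from above; collect_logits is just a helper name for this sketch, not my exact code):

@torch.no_grad()
def collect_logits(model, loader):
    """Run one full forward pass and return the concatenated outputs."""
    return torch.cat([model(images).cpu() for images, _ in loader], dim=0)

dl1 = DeviceDataLoader(DataLoader(train_ds, batch_size, shuffle=False, num_workers=0, pin_memory=True), device)
dl2 = DeviceDataLoader(DataLoader(train_ds, batch_size, shuffle=False, num_workers=0, pin_memory=True), device)

out1 = collect_logits(model, dl1)
out2 = collect_logits(model, dl2)
print(torch.equal(out1, out2))      # are the two passes bitwise-identical?
print((out1 - out2).abs().max())    # largest element-wise difference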
    

Do you have an idea of what could be the cause of the differences between the two estimates?

Did you call model.eval() before running the comparisons?

Hi @ptrblck,
no, I didn’t, but to the best of my knowledge the only differences (please correct me if I’m wrong) between model.eval() and model.train() are:

  • gradients are not computed in model.eval()
  • BatchNorm and Dropout layers are not active in model.eval()

Now, my model does not contain any BatchNorm or Dropout layers, and regarding the gradient computation, I put a @torch.no_grad() decorator on the function that I call to perform the forward pass.

To be more specific:


bs_out = [model.validation_step(batch, num_trdata_points, label_list, 'Train') for batch in train_dl] # compute statistics on single batches
ds_out = model.validation_epoch_end(bs_out, 'Train')  # collect the per-batch statistics into dataset-level ones


    @torch.no_grad()
    def validation_step(self, batch, num_data_points, label_list, mode):
        if mode=='Train':
            self.train()
        elif mode=='Eval':
            self.eval()
        
        images, labels = batch 
        out = self(images)                    # Generate predictions
        loss = F.cross_entropy(out, labels)   # Calculate loss
        acc = accuracy(out, labels)           # Calculate accuracy
        f = guesses(out, num_data_points, label_list) #Calculate class predictions according to the out values
        return {'val_loss': loss.detach(), 'val_acc': acc, 'val_f': f}

    def validation_epoch_end(self, outputs, mode):
        if mode=='Train':
            self.train()
        elif mode=='Eval':
            self.eval()
        
        batch_losses = [x['val_loss'] for x in outputs]
        epoch_loss = torch.stack(batch_losses).mean()   # Combine losses
        if mode=='Train':
            self.TrainLoss.append(epoch_loss.item())
        elif mode=='Eval':
            self.ValLoss.append(epoch_loss.item())
        batch_accs = [x['val_acc'] for x in outputs]
        epoch_acc = torch.stack(batch_accs).mean()      # Combine accuracies
        if mode=='Train':
            self.TrainAcc.append(epoch_acc.item())
        elif mode=='Eval':
            self.ValAcc.append(epoch_acc.item())
        batch_fs = [x['val_f'] for x in outputs]
        #print('per-batch fractions', batch_fs)
        epoch_f = torch.sum(torch.stack(batch_fs), dim=0)      # Combine class fractions
        if mode=='Train':
            self.Trainf0.append(epoch_f[0].item())
            self.TrainMaxf.append(np.max(epoch_f.numpy()))  
            self.Trainfs.append(epoch_f.numpy())
            self.TrainOrderedfs.append(np.sort(epoch_f.numpy())[::-1])

        elif mode=='Eval':
            self.Valf0.append(epoch_f[0].item())
            self.ValMaxf.append(np.max(epoch_f.numpy()))
            self.Valfs.append(epoch_f.numpy())
            self.ValOrderedfs.append(np.sort(epoch_f.numpy())[::-1])
        
        return {'val_loss': epoch_loss.item(), 'val_acc': epoch_acc.item(), 'val_f': epoch_f}

I don’t think there should be any difference between model.train() and model.eval() in my case; am I missing something?

No, gradient calculation won’t be changed by .train() and .eval() calls, and besides dropout and batchnorm layers, any layer that uses the self.training attribute could change its behavior. How large is the relative error? It could be expected non-determinism.
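
As a toy example of that second point (a hypothetical module, just to illustrate the mechanism), any custom layer can branch on self.training:

import torch
import torch.nn as nn

class NoisyLinear(nn.Module):
    """Toy module whose forward pass depends on self.training."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)

    def forward(self, x):
        if self.training:                       # toggled by .train()/.eval()
            x = x + 0.01 * torch.randn_like(x)  # noise injected only in train mode
        return self.linear(x)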


Thank you for the prompt reply.

Does this mean that the gradient is also computed when calling the model in .eval()?
I thought it would not be recorded, since it is not used during evaluation (we evaluate performance but do not touch the weights), as they seem to suggest, for example, here:

How large is the relative error? It could be expected non-determinism.

Variations range from ~0.5% to ~1% of the overall value (i.e. ~5e-4)

As a check, I also tried repeating the experiment in eval mode and got similar results (again, a variation between the observables of ~5e-4).

Gradient calculation is not disabled by calling model.eval(), so gradients can still be computed. PyTorch won’t compute them by default, though; you would still have to call loss.backward(). To disable gradient computation you should run the model in a torch.no_grad() context.
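
As a quick illustration (a sketch assuming model and a batch of images are already available, as in your validation_step):

model.eval()
out = model(images)
print(out.requires_grad)    # True: autograd still records the graph in eval mode

with torch.no_grad():
    out = model(images)
print(out.requires_grad)    # False: no graph is built, so gradients cannot be computed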

A relative error of 5e-4 might be expected, but it also depends on the actual model. You could try to force the usage of deterministic algorithms and check the output again. If that doesn’t help, you could try to isolate the operation causing the error.
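
For example, something along these lines at the very start of the script (a sketch; the exact calls depend on your PyTorch version):

import os
import torch

os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # needed by some deterministic CUDA ops
torch.use_deterministic_algorithms(True)           # raise an error on non-deterministic ops
torch.backends.cudnn.benchmark = False             # avoid non-deterministic algorithm selection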
