Model.eval() accuracy is low

Hello,
I am using a pretrained resnet50 to classify some images. When I had both model.train() and model.eval() in the same training function, the accuracies were fine (about 65% for both training and validation), but when I tried to separate them into different functions (one for model.train() and one for model.eval()), the validation accuracy dropped to 20% and stays constant every epoch. Does anyone have an idea of what’s happening?
I’m quite new to all this and I don’t know why it behaves like that.

There can be many different causes of this (e.g., inadvertently using different transformations for the validation data vs. the training data). Can you post a code snippet of the evaluation functions?

Yes sure.
The transformations I used are these:

data_transforms = {
    'train': transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor()]),
    'val': transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor()])
}

The functions:

def train_model(model, dataloaders, criterion, optimizer, scheduler, batch_size=5, num_epochs=10):
    since = time.time()
    val_acc_history = []
    best_model_wts = copy.deepcopy(model.state_dict())
    best_acc = 0.0
    #pdb.set_trace()

    for epoch in range(num_epochs):
        print('Epoch {}/{}'.format(epoch, num_epochs - 1))
        print('-' * 10)

        # Each epoch has a training and validation phase
        for phase in ['train']:#, 'val']:
            if phase == 'train':
                model.train()  # Set model to training mode

            running_loss = 0.0
            running_corrects = 0
            average_precis_train = 0.001
            average_precis_train_per_class = 0.001
            loss_values = []
            gr_truth_array = np.array([])  # convert to int dtype
            preds_array = np.array([])
            gr_truth_array = gr_truth_array.astype(int)
            preds_array = preds_array.astype(int)
            average_precision_array = np.array([]).astype(float)

            print('Iterating over data:')
            for batch_idx, (inputs, labels) in enumerate(dataloaders[phase]):
                inputs = inputs.to(device)
                labels = labels.to(device).float()
                gt_data = labels
                gt_data = gt_data.to(device)
                gt_data = gt_data.cpu().data.numpy()
                #average_precision_array = []

                # zero the parameter gradients
                optimizer.zero_grad()

                # forward
                # track history only if in train
                #pdb.set_trace()
                if phase == 'train':
                    with torch.set_grad_enabled(phase == 'train'):
                        outputs = model(inputs)
                        outputs = outputs.cpu()#.data.numpy()
                        preds = outputs.cpu().data.numpy()
                        preds = np.round(preds)  # set a condition for binary
                        preds_int = preds.astype(int)
                        gt_data_np = np.round(gt_data)
                        gt_data_int = gt_data_np.astype(int)
                        gt_data = torch.from_numpy(gt_data_np)
                        loss = criterion(outputs, gt_data)
                        gr_truth_array = np.append(gr_truth_array, gt_data_int)
                        preds_array = np.append(preds_array, preds_int)

                        # backward + optimize only if in training phase
                        if phase == 'train':
                            loss.backward()
                            optimizer.step()

                    # statistics
                    gr_truth_array = np.reshape(gr_truth_array, (-1, 40))
                    preds_array = np.reshape(preds_array, (-1, 40))
                    running_loss += loss.item() * inputs.size(0)
                    running_corrects += f1_score(gt_data, preds, average="samples")

            if phase == 'train':
                scheduler.step()
                average_precis_train += average_precision_score(gr_truth_array, preds_array, average="macro")
                average_precis_train_per_class += average_precision_score(gr_truth_array, preds_array, average=None)
                average_precision_array = np.append(average_precision_array, average_precis_train_per_class)
                #pdb.set_trace()
                av_precis_array = [j for i in zip(average_precision_array, attributes) for j in i]
                av_precis_array = np.array(av_precis_array)
                print("Average precision Training:", average_precis_train)
                print("Average precision per Class Training:", av_precis_array)

            #pdb.set_trace()
            epoch_loss = running_loss / len(dataloaders[phase].dataset)
            epoch_acc = running_corrects / len(dataloaders[phase].dataset)  # running_corrects.float()
            epoch_acc = np.round(epoch_acc, decimals=4)

            print('{} Loss: {:.4f}'.format(phase, epoch_loss))
            print("Acc:", epoch_acc)

    time_elapsed = time.time() - since
    print('Training complete in {:.0f}m {:.0f}s'.format(time_elapsed // 60, time_elapsed % 60))

    model.load_state_dict(best_model_wts)
    return model, val_acc_history

The evaluation function is almost the same, but the model is set to model.eval() and I use with torch.no_grad(): instead of set_grad_enabled.

I see the condition for the model.train() statement in the code, but it looks like model.eval() doesn’t have a corresponding branch?

OK, can you explain a bit more? Is this what is causing the problem?

I’m not sure this is the issue yet, but I don’t see model.eval() anywhere in the code you posted, just model.train().

Have you inspected the outputs of the model to see if they behave strangely during validation? For example, are they stuck at the same output (or the same class) for every example? Does the validation accuracy change at all between epochs?
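For example, something like this (a rough sketch, adjust the names to whatever your validation code uses) would show whether the rounded predictions collapse onto the same pattern regardless of the input:

# Rough sketch (hypothetical helper, not from your code): print summary stats of
# the raw outputs and the per-attribute positive counts for a few val batches.
import numpy as np
import torch

def inspect_predictions(model, val_loader, device, max_batches=3):
    model.eval()
    with torch.no_grad():
        for i, (inputs, labels) in enumerate(val_loader):
            if i >= max_batches:
                break
            outputs = model(inputs.to(device)).cpu().numpy()
            preds = np.round(outputs)
            print("batch {}: output mean {:.4f}, std {:.4f}".format(i, outputs.mean(), outputs.std()))
            print("batch {}: positives per attribute: {}".format(i, preds.sum(axis=0)))

If every batch prints the same counts, the model is effectively predicting one fixed attribute pattern during validation.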

I will and I will let you know.

The accuracy stays the same in every epoch

What happens when you remove the model.load_state_dict(best_model_wts)? It looks like the best model is never updated so this may just return the same model every iteration.
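For reference, the usual pattern looks roughly like this (a sketch only; it assumes the validation phase computes an epoch_acc):

# Sketch of the usual "keep the best weights" bookkeeping. Without an update
# like this, best_model_wts stays a copy of the initial weights and
# load_state_dict() just restores those same weights every time.
import copy

best_acc = 0.0
best_model_wts = copy.deepcopy(model.state_dict())

for epoch in range(num_epochs):
    # ... run the training phase ...
    # ... run the validation phase and compute epoch_acc ...
    if epoch_acc > best_acc:
        best_acc = epoch_acc
        best_model_wts = copy.deepcopy(model.state_dict())

model.load_state_dict(best_model_wts)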

I took it out, but it didn’t work. Nothing changes :confused:
The accuracy stays the same again

Ok, then can you verify the data is changing along with the model predictions during validation? Or are the predictions the same regardless of the input?
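For example (another rough sketch, assuming the same dataloaders/model/device names as in your code), printing a simple checksum of each batch next to its outputs makes this easy to eyeball:

# Rough sketch: compare a checksum of the inputs with the outputs for a few
# validation batches; the checksums should differ and so should the outputs.
import torch

with torch.no_grad():
    for i, (inputs, labels) in enumerate(dataloaders['val']):
        outputs = model(inputs.to(device)).cpu()
        print("batch {}: input checksum {:.2f}, first output row {}".format(
            i, inputs.sum().item(), outputs[0, :5].numpy()))
        if i == 4:
            break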

It seems that the outputs change with every iteration, so I guess there is no issue there

You might want to also add a sanity check that the model parameters are changing between validation epochs.

Can you tell me how to do that? Maybe give me an example or something?

This code gives an example of how to count the number of parameters in the model.
How do I check the number of parameters of a model? - PyTorch Forums
If you want to check that the parameters are changing, you can try printing the sum of the parameters rather than the count and see if this is changing between training epochs.

Thank you very much. I’ll try it tomorrow :slightly_smiling_face:

Hello again,
So in this piece of code

def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

since it returns the sum of the parameter counts, I should just take out the numel() in order to get the sum of the parameter values instead, right?

Something like that. You might need to do a second sum if you end up with just a list of summed parameters for each layer (or you can just compare them directly if the ordering is the same).
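For example, something along these lines (a sketch: reduce each parameter tensor to a scalar first, then add up the scalars):

# Sketch: sum each parameter tensor down to a scalar with .sum().item() before
# adding, so tensors of different shapes are never added together directly.
def parameter_checksum(model):
    return sum(p.sum().item() for p in model.parameters() if p.requires_grad)

# Print this before and after each training epoch; if the value never changes,
# the optimizer is not actually updating the weights.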

Ok, because I got this error here

----> return sum(p for p in model.parameters() if p.requires_grad)

RuntimeError: The size of tensor a (7) must match the size of tensor b (64) at non-singleton dimension 3