Model.eval() accuracy is low

You might want to do something like:

with torch.no_grad():
    # sum of all trainable parameters -- a cheap fingerprint of the current weights
    s = sum(torch.sum(p) for p in model.parameters() if p.requires_grad)
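
For example (train_one_epoch, validate, and num_epochs below are just placeholders for your own loop), you could print that sum once per epoch right before validation and see whether it moves:

for epoch in range(num_epochs):
    train_one_epoch(model, train_loader)    # your existing training step
    with torch.no_grad():
        s = sum(torch.sum(p) for p in model.parameters() if p.requires_grad)
    print(f"epoch {epoch}: param sum = {s.item():.6f}")
    validate(model, val_loader)             # your existing validation step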

Does the value change in the training loop?

Yes, they do change. But with model.eval() the issue is still there.

Do you mean that at validation time the value is always the same (across epochs)?
It might be useful to see an updated version of the code (for both training and validation) since I can’t easily understand what the current version with changes looks like.

It is a bit difficult to see the overall organization. Is the training and validation code in the same function? You also should not need to explicitly call set_grad_enabled here.
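
For a separate validation function, it is usually enough to put the model in eval mode and wrap the loop in torch.no_grad() — just a sketch, assuming a classification setup with a DataLoader called val_loader:

def validate(model, val_loader, device):
    model.eval()                              # dropout off, batchnorm uses running stats
    correct, total = 0, 0
    with torch.no_grad():                     # no gradient bookkeeping needed at eval time
        for inputs, labels in val_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            outputs = model(inputs)
            preds = outputs.argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.size(0)
    return correct / total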

Also, the loss calculation looks a bit strange here. Rather than doing outputs.cpu(), you likely want to do gt_data = gt_data.cuda(). The conversion to numpy also looks extraneous. Finally, I don’t see a loss.backward() or an optimizer.step(), which seems concerning.
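
For comparison, a bare-bones training step usually keeps everything on the GPU and calls backward/step, roughly like this (a sketch only, assuming a criterion such as nn.CrossEntropyLoss and an already-constructed optimizer):

for inputs, gt_data in train_loader:
    inputs, gt_data = inputs.cuda(), gt_data.cuda()   # move the targets to the GPU instead of pulling outputs back with .cpu()
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = criterion(outputs, gt_data)                # keep everything as torch tensors, no numpy conversion
    loss.backward()                                   # compute gradients
    optimizer.step()                                  # update the weights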

No, they are not in the same function. When they were in the same function, the training and validation accuracies were fine. But when I create separate functions for training and validation, the validation accuracy drops sharply and remains constant.
Don’t worry about the loss.backward() part; I just didn’t post the whole code since it’s huge and can be overwhelming.

If you have a link to a repo version somewhere that can help with viewing larger files. As it is I think the issue is probably something superficial with a stale copy of the model being reused somehow, but it is difficult to debug without seeing the overall context.

I will create a GitHub repository and send it to you as soon as possible. Thank you again for all your help.
One more question regarding the parameters that do not change: does this mean that the model is stuck on a layer of the NN and doesn’t get past it?

I think the issue is that an old version of the model is being kept around somehow so the validation function isn’t actually evaluating the most recent model.
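
In other words, whatever the training function returns needs to be the object that gets validated — roughly like this (the function names are just placeholders):

model = train_model(model, train_loader, num_epochs)   # returns the updated model
val_acc = validate(model, val_loader)                   # evaluate that same object, not an earlier copy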


OK, out of curiosity, is this code also what you used for tracking validation accuracy in the training function? I only see an accuracy calculation in if phase == 'train':, which makes me unsure how the validation version was implemented here.

Yes. When I only had one function, I separated the two phases with if and else statements in order to get both training and validation accuracies in each epoch.

Basically, without seeing that version it is hard to tell whether the error is in the current validation code or the previous validation code. Since the training is only run once and the validation code is run after all training completes in this version, we wouldn’t expect the validation accuracy to change between “epochs.”

With the current training loop, a potential error is that only the training step computes an accuracy, and the validation branch just silently reuses this result without recomputing, so the training accuracy ends up being reported as the validation accuracy.
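
If it helps, the combined-loop version usually accumulates the accuracy in both phases, along the lines of the sketch below (dataloaders, criterion, and optimizer are assumed to exist; in this single-loop form, set_grad_enabled is the usual way to switch gradient tracking per phase):

for phase in ['train', 'val']:
    if phase == 'train':
        model.train()
    else:
        model.eval()
    running_corrects, running_total = 0, 0
    for inputs, labels in dataloaders[phase]:
        inputs, labels = inputs.cuda(), labels.cuda()
        optimizer.zero_grad()
        with torch.set_grad_enabled(phase == 'train'):
            outputs = model(inputs)
            preds = outputs.argmax(dim=1)
            loss = criterion(outputs, labels)
            if phase == 'train':
                loss.backward()
                optimizer.step()
        # accuracy is accumulated in BOTH phases, not only in 'train'
        running_corrects += (preds == labels).sum().item()
        running_total += labels.size(0)
    print(f"{phase} acc: {running_corrects / running_total:.4f}")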

Taking a second look, I still don’t understand why the current code has a best_model_wts = copy.deepcopy(model.state_dict()) at the beginning of the training function with model.load_state_dict(best_model_wts) at the end, when best_model_wts is never updated. This means that the training will have no effect on the evaluation, as the initial (random?) weights will be loaded back into the model before it is returned.

I would at least remove both best_model_wts = copy.deepcopy(model.state_dict()) and model.load_state_dict(best_model_wts) for now.
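
If you do want best-weights tracking later, the usual pattern only updates the copy when the validation accuracy improves — a rough sketch, assuming helpers like train_one_epoch and validate exist:

import copy

best_acc = 0.0
best_model_wts = copy.deepcopy(model.state_dict())

for epoch in range(num_epochs):
    train_one_epoch(model, train_loader)
    epoch_val_acc = validate(model, val_loader)
    if epoch_val_acc > best_acc:                     # keep a copy only when validation improves
        best_acc = epoch_val_acc
        best_model_wts = copy.deepcopy(model.state_dict())

model.load_state_dict(best_model_wts)                # now restores the best weights, not the initial ones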
