Weird performance of a PyTorch DNN

I have a question and need an explanation. I trained a DNN on a dataset of speech embeddings. The validation accuracy during training reached 99.5%, so I stopped training and saved the best model. When I ran inference on the test dataset, the accuracy was 91%, which is acceptable. The problem is that when I ran inference on the validation set, the accuracy should have been approximately the same as during training, but in my case it was 84%. Can someone give me an explanation for this?

Make sure to use the same setup for the validation run in your training script and in your inference script.
In particular, check that both call model.eval(), use the same data preprocessing, etc. If model.eval() is missing, dropout stays active and batch norm keeps using batch statistics, which can easily account for a gap like 99.5% vs. 84%.
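One way to rule out setup differences is to evaluate with a single helper in both scripts, fed by the exact same DataLoader (same transforms, no augmentation). A minimal sketch, assuming a generic classifier and a loader yielding (inputs, labels) pairs; model, loader, and device are placeholders for your own objects:

```python
import torch

def evaluate(model, loader, device):
    # Disable dropout and make batch norm use its running statistics.
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():  # no gradients needed for inference
        for inputs, labels in loader:
            inputs, labels = inputs.to(device), labels.to(device)
            logits = model(inputs)
            preds = logits.argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.size(0)
    return correct / total
```

If calling this on the same validation loader from both scripts still gives different numbers, the remaining suspects are the preprocessing pipeline (e.g. augmentations applied at validation time in one script but not the other) or loading a different checkpoint than you think.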