I get better metrics with dropout activated at inference

So I’m training an end-to-end model for information extraction.
Sadly the code is a bit complicated, so I can’t post it here.
I noticed that my model performs much better in training than in testing.
So I tried testing with model.train(), and to my surprise the metrics jumped from 0.5 to 0.92.
To be sure that dropout is causing it, I ran the same test with model.train() but with all dropout probabilities set to 0, and the metrics dropped back to the original 0.5.
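For readers who want to repeat this control experiment, here is a minimal sketch of how all dropout probabilities could be set to 0 while the model stays in train mode. The helper name `set_dropout_p` and the toy model are hypothetical and only for illustration. It assumes dropout is applied through `nn.Dropout`-style modules; calls to `torch.nn.functional.dropout` inside a `forward` would not be affected.

```python
import torch
import torch.nn as nn

# Module types treated as "dropout" in this sketch (an assumption; extend as needed).
DROPOUT_TYPES = (nn.Dropout, nn.Dropout2d, nn.Dropout3d, nn.AlphaDropout)


def set_dropout_p(model: nn.Module, p: float) -> int:
    """Set the drop probability of every dropout module; return how many were changed."""
    changed = 0
    for module in model.modules():
        if isinstance(module, DROPOUT_TYPES):
            module.p = p
            changed += 1
    return changed


# Toy stand-in for the real model, which is not shown in the thread.
toy = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Dropout(0.5), nn.Linear(16, 4))
x = torch.randn(3, 8)

toy.eval()
with torch.no_grad():
    eval_out = toy(x)

toy.train()                      # train mode, but dropout is neutralised below
n_changed = set_dropout_p(toy, 0.0)
with torch.no_grad():
    train_out = toy(x)

outputs_match = torch.allclose(eval_out, train_out)
print(n_changed, outputs_match)
```

On a model whose only mode-dependent layers are dropout modules, train mode with p = 0 should behave like eval mode, which is consistent with the metrics falling back to 0.5.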
This makes me think that the dropout is deterministic in some way.
I’m using torch.utils.checkpoint.checkpoint for many blocks that contain dropout layers. The official documentation says that in this case the RNG state is saved and restored, so that the recomputed pass during backward uses the same dropout mask as the original forward pass. But the dropout mask should still change from one iteration to the next, right?
Can anyone give me some insight, please?
I’m using the latest PyTorch nightly.
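The question about checkpointing can be probed on a toy example. The sketch below is not the original model: `block` is a hypothetical stand-in for a checkpointed block containing dropout, and it assumes a PyTorch version whose `checkpoint` accepts the `use_reentrant` argument. It compares the dropout mask seen in the forward output with the one implied by the gradient (the recomputed pass), and then compares the masks of two separate iterations.

```python
import torch
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint


def block(x):
    # Hypothetical stand-in for a checkpointed block that contains dropout.
    return F.dropout(x, p=0.5, training=True)


def one_iteration(n=1000):
    # With an all-ones input, zeros in the output mark the dropped positions.
    x = torch.ones(n, requires_grad=True)
    out = checkpoint(block, x, use_reentrant=False)
    out.sum().backward()
    forward_mask = out.detach() != 0   # mask used in the original forward pass
    backward_mask = x.grad != 0        # mask used when the block was recomputed
    return forward_mask, backward_mask


torch.manual_seed(0)
fwd1, bwd1 = one_iteration()
fwd2, bwd2 = one_iteration()

# Same mask for forward and recomputation inside one iteration ...
same_within_step = torch.equal(fwd1, bwd1) and torch.equal(fwd2, bwd2)
# ... but a fresh mask for the next iteration.
same_across_steps = torch.equal(fwd1, fwd2)
print(same_within_step, same_across_steps)
```

If the masks matched across iterations in a setup like this, that would point to the RNG state being reset somewhere in the training loop rather than to checkpointing itself.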

Dropout reduces overfitting.
If the model overfits without it, using dropout will reduce the overfitting and so improve the model’s performance on the test set.

Thanks for your answer.
In fact, dropout is always activated in training; it is at inference (testing) that I have problems.
The model gets much better metrics at inference when dropout is activated via the model.train() line.

One thought is that perhaps the dropout is compensating for something poorly specified elsewhere in the model. For instance, perhaps a certain layer doesn’t have a bias parameter and the input data has a very positive mean, while the desired target has a mean closer to zero. The model might then benefit from dropout pulling the means closer to zero.

So two things you can try:

  • If you have max pooling layers in your model, switch those to average pooling.
  • Add normalization steps throughout your model to keep the data somewhat normalized, i.e. zero mean and close to unit variance.

These ought to reduce the effect I hypothesized above, where the model is effectively relying on Dropout to get better predictions.

Hey, thanks for your answer.
Actually, I already have many normalization layers.
Also, if I train the model without dropout, I get good results when testing. So I think that somehow checkpoint.checkpoint is making the same nodes drop to zero at every step, so that only part of the model is trained.
I will verify this hypothesis and come back to you.
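One way this hypothesis could be checked is sketched below. It counts, over several training-mode steps, how often each output unit of a checkpointed block is zeroed by dropout. The `Block` class, its sizes, and the step count are hypothetical placeholders, not the real model, and `use_reentrant=False` is assumed to be available. If the same nodes were dropped at every step, some units would be zero in all steps and others in none.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class Block(nn.Module):
    """Hypothetical stand-in for one checkpointed block with dropout."""

    def __init__(self, dim=64, p=0.5):
        super().__init__()
        self.linear = nn.Linear(dim, dim)
        self.dropout = nn.Dropout(p)

    def forward(self, x):
        return self.dropout(self.linear(x))


torch.manual_seed(0)
dim, steps = 64, 20
block = Block(dim)
block.train()

dropped_counts = torch.zeros(dim)
for _ in range(steps):
    x = torch.randn(1, dim)
    out = checkpoint(block, x, use_reentrant=False)
    out.sum().backward()
    # An exactly-zero output is taken to mean the unit was dropped in this step.
    dropped_counts += (out.detach()[0] == 0).float()

always_dropped = int((dropped_counts == steps).sum())  # units zeroed in every step
never_dropped = int((dropped_counts == 0).sum())       # units never zeroed
print(always_dropped, never_dropped)
```

With a mask that is redrawn every step, both counts should be zero or very close to it. Large counts would support the idea that a fixed subset of nodes is being dropped.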