Since dropout behaves differently during training and testing, the activations have to be scaled at some point.
Imagine a very simple model with two linear layers of size 10 and 1, respectively.
If you don’t use dropout, and all activations are approx. 1, your expected value in the output layer would be 10.
Now using dropout with p=0.5, we will lose half of these activations, so that during training our expected value would be 5. Since we deactivate dropout during testing, the values will have the original expected value of 10 again and the model will most likely output a lot of garbage.
One way to tackle this issue is to scale down the activations during testing simply by multiplying them with p.
Since we prefer to have as little work as possible during testing, we can instead scale the activations during training with 1/p, which is exactly what you observe.
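If it helps, here is a minimal sketch of this behavior using PyTorch's nn.Dropout. Note that PyTorch's p argument is the drop probability, so the training-time scale factor is 1/(1-p), which with p=0.5 equals 1/p = 2:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

x = torch.ones(1, 10)        # ten activations, each approx. 1
dropout = nn.Dropout(p=0.5)  # p is the drop probability in PyTorch

dropout.train()              # training mode: units are zeroed with prob. p,
out_train = dropout(x)       # survivors are scaled by 1/(1-p) = 2
print(out_train)             # e.g. mix of 0.s and 2.s
print(out_train.sum())       # expected value stays ~10

dropout.eval()               # eval mode: dropout is a no-op
out_eval = dropout(x)
print(out_eval)              # all ones
print(out_eval.sum())        # exactly 10
```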