Why is the dropout layer affecting all values, not only the ones set to zero?

The dropout layer from PyTorch changes the values that are not set to zero. Using the example from PyTorch's documentation (source):

import torch
import torch.nn as nn

m = nn.Dropout(p=0.5)
input = torch.ones(5, 5)
print(input)
tensor([[1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.]])

Then I pass it through a dropout layer:

output = m(input)
print(output)
tensor([[0., 0., 2., 2., 0.],
        [2., 0., 2., 0., 0.],
        [0., 0., 0., 0., 2.],
        [2., 2., 2., 2., 2.],
        [2., 0., 0., 0., 2.]])

The values that aren’t set to zero are now 2. Why?

You need to scale the activations either

  • during testing, by multiplying them with the keep probability, or
  • during training, by multiplying them with 1/keep_prob ("inverted dropout")

Otherwise the expected activation values will have a different range and your model will perform poorly.
PyTorch uses the second approach.
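The second approach is exactly what the example above shows: with p=0.5 the keep probability is 1 - p = 0.5, so every surviving activation is multiplied by 1/0.5 = 2. A quick sketch to check this (the seed is arbitrary):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, just for reproducibility

p = 0.5
m = nn.Dropout(p=p)
x = torch.ones(5, 5)
out = m(x)

# Every value is either dropped to 0 or scaled up by 1 / (1 - p).
mask = out != 0
print(torch.allclose(out[mask], x[mask] / (1 - p)))  # True
```

So for p=0.5 the non-zero entries are exactly 2.0, and for, say, p=0.2 they would be 1.25.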

hi @ptrblck, can you provide a little more explanation of why we need one of the two solutions you mentioned, and why it would otherwise cause poor performance?

The expected value range of the output activations wouldn't match if you drop activations during training with the specified p value but use all of them during evaluation.
From the dropout paper:

At test time, it is not feasible to explicitly average the predictions from exponentially
many thinned models. However, a very simple approximate averaging method works well in
practice. The idea is to use a single neural net at test time without dropout. The weights
of this network are scaled-down versions of the trained weights. If a unit is retained with
probability p during training, the outgoing weights of that unit are multiplied by p at test
time as shown in Figure 2. This ensures that for any hidden unit the expected output (under
the distribution used to drop units at training time) is the same as the actual output at
test time.
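(Note that the paper's p is the *retain* probability, while PyTorch's p argument is the *drop* probability.) Because PyTorch already scales by 1/(1 - p) during training, no weight rescaling is needed at test time: averaging many stochastic training-mode passes recovers the clean activations in expectation, and in eval mode dropout becomes a no-op. A small sketch to illustrate (seed and trial count are arbitrary):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed
m = nn.Dropout(p=0.5)
x = torch.ones(100)

# Average many stochastic forward passes: the scaled-up training-time
# outputs match the untouched activations in expectation.
avg = torch.stack([m(x) for _ in range(10000)]).mean(dim=0)
print(avg.mean())  # ≈ 1.0

m.eval()  # in eval mode dropout passes the input through unchanged
print(torch.equal(m(x), x))  # True
```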
