Why is the dropout layer affecting all values, not only the ones set to zero?

The dropout layer from PyTorch changes the values that are not set to zero. Using the example from PyTorch's documentation (source):

import torch
import torch.nn as nn

m = nn.Dropout(p=0.5)
input = torch.ones(5, 5)
print(input)
tensor([[1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.]])

Then I pass it through a dropout layer:

output = m(input)
print(output)
tensor([[0., 0., 2., 2., 0.],
        [2., 0., 2., 0., 0.],
        [0., 0., 0., 0., 2.],
        [2., 2., 2., 2., 2.],
        [2., 0., 0., 0., 2.]])

The values that aren’t set to zero are now 2. Why?

You need to scale the activations either

  • during testing, by multiplying them with the keep probability, or
  • during training, by multiplying them with 1/keep_prob ("inverted dropout")

Otherwise the expected activation values will have a different range and your model will perform poorly.
PyTorch uses the second approach.
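The second approach is exactly what the example above shows: with p=0.5 the keep probability is 1 - p = 0.5, so every surviving activation is multiplied by 1/0.5 = 2. A quick sketch to check this (the seed is arbitrary):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, just for reproducibility

p = 0.5
m = nn.Dropout(p=p)
x = torch.ones(5, 5)
out = m(x)

# Every value is either dropped to 0 or scaled up by 1 / (1 - p).
mask = out != 0
print(torch.allclose(out[mask], x[mask] / (1 - p)))  # True
```

So for p=0.5 the non-zero entries are exactly 2.0, and for, say, p=0.2 they would be 1.25.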

hi @ptrblck, can you provide a little more explanation of why we need one of the two solutions you mentioned, and why it would otherwise cause poor performance?

The expected value range of the output activations wouldn't match if you drop activations during training with the specified p value but use all of them during evaluation.
From the dropout paper:

At test time, it is not feasible to explicitly average the predictions from exponentially
many thinned models. However, a very simple approximate averaging method works well in
practice. The idea is to use a single neural net at test time without dropout. The weights
of this network are scaled-down versions of the trained weights. If a unit is retained with
probability p during training, the outgoing weights of that unit are multiplied by p at test
time as shown in Figure 2. This ensures that for any hidden unit the expected output (under
the distribution used to drop units at training time) is the same as the actual output at
test time.
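(Note that the paper's p is the *retain* probability, while PyTorch's p argument is the *drop* probability.) Because PyTorch already scales by 1/(1 - p) during training, no weight rescaling is needed at test time: averaging many stochastic training-mode passes recovers the clean activations in expectation, and in eval mode dropout becomes a no-op. A small sketch to illustrate (seed and trial count are arbitrary):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed
m = nn.Dropout(p=0.5)
x = torch.ones(100)

# Average many stochastic forward passes: the scaled-up training-time
# outputs match the untouched activations in expectation.
avg = torch.stack([m(x) for _ in range(10000)]).mean(dim=0)
print(avg.mean())  # ≈ 1.0

m.eval()  # in eval mode dropout passes the input through unchanged
print(torch.equal(m(x), x))  # True
```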
