Dropout internal implementation

Hi there, I am studying the Dropout implementation in PyTorch. The original paper says that

" If a unit is retained with probability p during training, the outgoing weights of that unit are multiplied by p at test time"

This is meant to compensate for the larger number of active connections between the layers at test time. However, I have just found that PyTorch does not seem to implement dropout this way: when the training flag is false, it simply returns the input without rescaling.
Furthermore, Dropout does not seem to just multiply the activations by a binary mask. Here is an example:

1- During training:

import torch

dr = torch.nn.Dropout(p=.5)
t = torch.tensor((2, 3), dtype=torch.float32)
print(t)
out = dr(t)
print(out)

tensor([2., 3.])
tensor([0., 6.])

I expected one entry to be zeroed and the other one left unchanged; instead, the surviving entry seems to be multiplied by 2 (different runs give me the other case: tensor([4., 0.])).
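
For reference, a quick sanity check of this (assuming the surviving entries are scaled by 1/(1-p), which is 2 for p=.5):

import torch

dr = torch.nn.Dropout(p=.5)
t = torch.tensor((2, 3), dtype=torch.float32)
out = dr(t)
kept = out != 0
# surviving entries appear scaled by 1/(1-p) = 2, dropped entries are 0
print(torch.allclose(out[kept], t[kept] * 2))  # True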

2- During validation:
The output remains unchanged, tensor([2., 3.]), instead of being rescaled by p=.5. In fact, the C++ implementation simply returns the input when train is false.

template<bool feature_dropout, bool alpha_dropout, bool inplace, typename T>
Ctype<inplace> _dropout_impl(T& input, double p, bool train) {
  TORCH_CHECK(p >= 0 && p <= 1, "dropout probability has to be between 0 and 1, but got ", p);
  if (p == 0 || !train || input.numel() == 0) {
    return input;
  }
...............

The source code I linked is in pytorch/aten/src/ATen/native/Dropout.cpp.
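
Indeed, putting the module in eval mode just returns the input unchanged (a minimal check, reusing the tensor from above):

import torch

dr = torch.nn.Dropout(p=.5)
t = torch.tensor((2, 3), dtype=torch.float32)
dr.eval()                     # sets self.training = False
print(torch.equal(dr(t), t))  # True: no mask, no rescaling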

PyTorch uses the inverse scaling during training in order to avoid the extra operation during inference; this yields the same overall result and is mentioned in the dropout paper in section 10:

Another way to achieve the same effect is to scale up the retained activations by multiplying
by 1/p at training time and not modifying the weights at test time.
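
To illustrate the equivalence, a small sketch (assuming p=.5 and simply averaging the training-time outputs over many forward passes):

import torch

torch.manual_seed(0)
dr = torch.nn.Dropout(p=.5)
t = torch.tensor((2, 3), dtype=torch.float32)

# each entry is kept with probability 1-p and scaled by 1/(1-p),
# so E[out] = (1-p) * x / (1-p) = x and no rescaling is needed at test time
samples = torch.stack([dr(t) for _ in range(10000)])
print(samples.mean(dim=0))  # close to tensor([2., 3.])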


During training with inverted dropout and keep probability p, the output of a neuron is (on average) p*x + (1-p)*0 = p*x. Now if we multiply by 1/p, the output of a neuron is on average x. Why then keep the 1/p factor during inference? Shouldn't 1/p become 1?

During inference the drop probability is not used at all, since the inverse scaling was already applied during training, as explained in my previous post.
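
A minimal sketch of inverted dropout to make this concrete (an illustrative re-implementation, not PyTorch's actual kernel; here p is the drop probability, as in nn.Dropout):

import torch

def inverted_dropout(x, p, training):
    # at inference time the function is the identity: no 1/p (or 1/(1-p)) factor is left
    if not training or p == 0:
        return x
    mask = (torch.rand_like(x) >= p).float()  # keep each entry with probability 1-p
    return x * mask / (1 - p)                 # scale survivors so the expected value stays x

t = torch.tensor((2, 3), dtype=torch.float32)
print(inverted_dropout(t, 0.5, training=True))   # e.g. tensor([0., 6.]) or tensor([4., 0.])
print(inverted_dropout(t, 0.5, training=False))  # tensor([2., 3.])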