The 'p' in Dropout


In the dropout paper, the probability p stands for the probability of retaining a unit, i.e. p=1 means keep all activations.

In PyTorch, it’s the opposite: p stands for the probability of an element being zeroed, i.e. p=1 means switch off all activations.
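A quick sketch of the PyTorch convention (and note that during training, surviving elements are scaled by 1/(1-p), so expectations stay the same at eval time):

```python
import torch
import torch.nn as nn

x = torch.ones(10)

# p is the probability of zeroing an element, so p=1.0 drops everything
drop_all = nn.Dropout(p=1.0)
print(drop_all(x))  # all zeros

# p=0.0 keeps every activation untouched
keep_all = nn.Dropout(p=0.0)
print(keep_all(x))  # all ones
```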

Why the difference?

It seems to be the way to go for a lot of frameworks:
Keras dropout
Lasagne dropout
Mxnet dropout

Tensorflow.nn, however, defines it as the keep probability (keep_prob).

So it varies a bit across frameworks, but mostly p is defined as the probability of zeroing out the input units.
