I was doing a hyperparameter search on a network and realized that at some point I had set p=1 for my dropout layer, yet the network was still able to learn MNIST to about 85% accuracy. It of course doesn't do as well as networks with a well-chosen p (the best value in my search was 0.3), but I don't understand how the network can learn at all.
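For context, my setup is roughly like the sketch below. This is not my actual code; it's a minimal placeholder using PyTorch's nn.Dropout, and the architecture and layer sizes are made up:

```python
import torch.nn as nn

# Rough shape of the models in my search; sizes are placeholders.
def make_model(p: float) -> nn.Sequential:
    return nn.Sequential(
        nn.Flatten(),
        nn.Linear(28 * 28, 256),
        nn.ReLU(),
        nn.Dropout(p=p),  # the layer in question; I accidentally passed p=1.0
        nn.Linear(256, 10),
    )

model = make_model(p=1.0)  # trains to ~85% on MNIST for me, with no error raised
```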
Wouldn't p=1 mean every activation is shut off on every forward pass? Also, inverted dropout scales the surviving activations by 1/(1-p) during training, which for p=1 should be a division by zero, yet I never get any error. Is this behavior even possible, or should it be raising an error and failing to learn, meaning I've implemented dropout wrong?
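Here is a minimal repro of the part that confuses me, again using PyTorch's nn.Dropout for illustration (my own implementation may differ, so this is just a sketch of what I'd expect a standard dropout layer to do):

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=1.0)
drop.train()  # dropout is only active in training mode

x = torch.randn(4, 8)
y = drop(x)
print(y)                       # all zeros: every activation is dropped
print(torch.count_nonzero(y))  # tensor(0), no NaN and no error

# In eval mode dropout is the identity, so inputs pass through untouched.
drop.eval()
print(torch.equal(drop(x), x))  # True
```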
Thanks.