EDIT: NVM, found this discussion.
@ptrblck I went with the solution you posted elsewhere. However, if I am reading torch.nn.functional.dropout()
correctly, it doesn’t apply the weight scaling inference rule:
From “Deep Learning” (Goodfellow et al.), Section 7.12, Dropout:
A key insight (Hinton et al., 2012c) involved in dropout is that we can approximate p_ensemble by evaluating p(y | x) in one model: the model with all units, but with the weights going out of unit i multiplied by the probability of including unit i. The motivation for this modification is to capture the right expected value of the output from that unit. We call this approach the weight scaling inference rule. […]
Because we usually use an inclusion probability of 1/2, the weight scaling rule usually amounts to dividing the weights by 2 at the end of training, and then using the model as usual. Another way to achieve the same result is to multiply the states of the units by 2 during training. Either way, the goal is to make sure that the expected total input to a unit at test time is roughly the same as the expected total input to that unit at train time, even though half the units at train time are missing on average.
Or am I missing something?
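For context, here is a minimal sanity check (just a sketch, assuming a recent PyTorch build; the dropout mask is random, so the check looks at the surviving elements) showing which convention F.dropout uses at training time:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.ones(10000)

# Train mode: elements are either zeroed or scaled by 1/(1-p) ("inverted dropout").
out_train = F.dropout(x, p=0.5, training=True)
kept = out_train[out_train != 0]
print(kept.unique())        # tensor([2.]) -> survivors are multiplied by 1/(1-p) = 2
print(out_train.mean())     # ~1.0, so the expected activation matches the input

# Eval mode: identity, no extra scaling is applied at test time.
out_eval = F.dropout(x, p=0.5, training=False)
print(torch.equal(out_eval, x))  # True
```

If that reading is right, the scaling from the weight scaling inference rule is effectively applied during training (survivors are divided by the keep probability), which is why eval mode can leave the weights and activations untouched.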