Hi,
Why does the dropout implementation scale the outputs at test time rather than the weights?
Wouldn’t scaling the weights give the same result as scaling the outputs, while requiring only a one-time operation?
For example, suppose dropout is applied with keep probability p.
For a single neuron, o = x * w, where o is the output, x is the input, and w is the weight. Then
p * o = x * (p * w).
But the RHS multiplication can be done once, ahead of time (better for a deployed inference model).
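Here is a minimal sketch of the equivalence I mean, assuming p is the keep probability and a bias-free linear layer (the layer names and p value are just for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
p = 0.5  # keep probability used for test-time scaling in classic (non-inverted) dropout

x = torch.randn(4, 10)                 # a batch of inputs
layer = nn.Linear(10, 3, bias=False)   # the trained layer

# Option A: scale the layer's output at every inference call
out_scaled = p * layer(x)

# Option B: fold the scale into the weights once, before deployment
folded = nn.Linear(10, 3, bias=False)
with torch.no_grad():
    folded.weight.copy_(p * layer.weight)
out_folded = folded(x)

print(torch.allclose(out_scaled, out_folded))  # True (up to floating-point error)
```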
What was the design reasoning behind scaling the outputs in model.eval() mode?
Thank you,
Siddhanth Ramani