Dropout design choice

Hi,
Why does the dropout implementation scale the outputs at test time rather than the weights?
Wouldn’t scaling the weights give the same result as scaling the outputs, while requiring only a one-time operation?

Ex.: if dropout is applied with probability “p”, and for a single neuron o = x·w, where o is the output, x is the input, and w is the weight, then
p·o = x·(p·w).
But the RHS multiplication (folding p into the weights) is a one-time operation, which is better for a deployed inference model.
What was the design choice behind scaling the outputs in model.eval() mode?

Thank you,
Siddhanth Ramani
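
A minimal numerical sketch of the equivalence described in the question, assuming a single linear layer (the tensor shapes and the factor p are illustrative, not taken from any particular model):

```python
import torch

torch.manual_seed(0)

p = 0.5                   # scale factor applied at test time in classic dropout
x = torch.randn(4, 8)     # batch of inputs
w = torch.randn(8, 3)     # weights of a single linear layer

# Option 1: scale the outputs on every forward pass
out_scaled = p * (x @ w)

# Option 2: fold p into the weights once, then run plain matmuls
w_folded = p * w          # one-time operation before deployment
out_folded = x @ w_folded

print(torch.allclose(out_scaled, out_folded))  # True: identical for a linear layer
```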

You are thinking about DropConnect, which drops the weights rather than the activations.
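
A rough sketch of the difference, for reference (the masks here are sampled ad hoc and are only illustrative, not the exact DropConnect implementation):

```python
import torch

p_drop = 0.5
x = torch.randn(4, 8)
w = torch.randn(8, 3)

# Dropout: randomly zero activations (the inputs/outputs of a layer)
act_mask = (torch.rand_like(x) > p_drop).float()
out_dropout = (act_mask * x) @ w

# DropConnect: randomly zero individual weights instead
w_mask = (torch.rand_like(w) > p_drop).float()
out_dropconnect = x @ (w_mask * w)
```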

I don’t understand this claim: dropout is usually not used during inference, and even if you do want to use it at inference time, the masking will still be random.
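
For context, PyTorch’s nn.Dropout uses “inverted” dropout: the surviving activations are scaled by 1/(1-p) during training, and the module is a no-op in eval mode, so no scaling of outputs or weights is needed at inference at all. A quick check:

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(10)

drop.train()
print(drop(x))   # surviving elements scaled to 2.0 (= 1 / (1 - p)), the rest zeroed

drop.eval()
print(drop(x))   # identity: all ones, no scaling applied at inference
```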