This is not PyTorch-specific; I just hope to get advice from practitioners.
In their paper "Fixup Initialization: Residual Learning Without Normalization", the authors suggest the following recipe (my attempt at a PyTorch sketch follows the list):
Fixup initialization (or: How to train a deep residual network without normalization)
- Initialize the classification layer and the last layer of each residual branch to 0.
- Initialize every other layer using a standard method (e.g., Kaiming He), and scale only the weight layers inside residual branches by … .
- Add a scalar multiplier (initialized at 1) in every branch and a scalar bias (initialized at 0) before each convolution, linear, and element-wise activation layer.
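Here is a minimal sketch of how I read those three rules for a basic residual block. The names (`FixupBasicBlock`, `branch_scale`) are mine, not from the paper, and `branch_scale` just stands in for the depth-dependent scaling factor from step 2, which I've left out above.

```python
import torch
import torch.nn as nn

class FixupBasicBlock(nn.Module):
    """Hypothetical residual block following the three rules above (my reading)."""

    def __init__(self, channels, branch_scale=1.0):
        super().__init__()
        # Rule 3: scalar biases (init 0) before each conv / element-wise activation,
        # and one scalar multiplier (init 1) for the branch.
        self.bias1 = nn.Parameter(torch.zeros(1))
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bias2 = nn.Parameter(torch.zeros(1))
        self.relu = nn.ReLU(inplace=True)
        self.bias3 = nn.Parameter(torch.zeros(1))
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.scale = nn.Parameter(torch.ones(1))
        self.bias4 = nn.Parameter(torch.zeros(1))

        # Rule 2: standard Kaiming init for the non-last conv, then rescale its
        # weights by the factor from step 2 (passed in here as branch_scale).
        nn.init.kaiming_normal_(self.conv1.weight, nonlinearity="relu")
        with torch.no_grad():
            self.conv1.weight.mul_(branch_scale)

        # Rule 1: the last layer of the residual branch starts at exactly 0,
        # so the whole block computes the identity at initialization.
        nn.init.zeros_(self.conv2.weight)

    def forward(self, x):
        out = self.conv1(x + self.bias1)
        out = self.relu(out + self.bias2)
        out = self.conv2(out + self.bias3)
        out = out * self.scale + self.bias4
        return self.relu(x + out)


# Rule 1 also applies to the classification head (sizes here are arbitrary):
classifier = nn.Linear(64, 10)
nn.init.zeros_(classifier.weight)
nn.init.zeros_(classifier.bias)
```

If I've set this up correctly, every residual block (and the classifier) outputs zero contribution at initialization, which is exactly the part that surprises me.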
I used to think that initializing weights to zero was a pitfall. What am I missing?