Fixup initialisation for residual networks

This is not PyTorch-specific; I am just hoping to get some advice from practitioners.

In the paper "Fixup Initialization: Residual Learning Without Normalization", the authors suggest:

Fixup initialization (or: How to train a deep residual network without normalization)

  1. Initialize the classification layer and the last layer of each residual branch to 0.
  2. Initialize every other layer using a standard method (e.g., Kaiming He), and scale only the weight layers inside residual branches by … .
  3. Add a scalar multiplier (initialized at 1) in every branch and a scalar bias (initialized at 0) before each convolution, linear, and element-wise activation layer.
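To make the three rules concrete, here is a minimal sketch of how I read them, in plain Python on vectors rather than as a real PyTorch module. Everything here is my own assumption, not code from the paper: the `FixupBlock` class, the helper names, and the two-layer branch are hypothetical. The scaling factor from rule 2 is elided in the quote above, so the sketch takes it as a caller-supplied `branch_scale` argument instead of guessing it.

```python
import math
import random


def kaiming_normal(n_out, n_in, rng):
    """He-style normal init: zero mean, std = sqrt(2 / fan_in)."""
    std = math.sqrt(2.0 / n_in)
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]


def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


class FixupBlock:
    """Hypothetical residual block with a two-layer branch and no normalization."""

    def __init__(self, dim, branch_scale, rng):
        # Rule 2: standard init, then scale the weight layer inside the branch.
        # branch_scale is supplied by the caller (the quoted text elides it).
        self.w1 = [[branch_scale * v for v in row]
                   for row in kaiming_normal(dim, dim, rng)]
        # Rule 1: the last layer of the residual branch starts at zero.
        self.w2 = [[0.0] * dim for _ in range(dim)]
        # Rule 3: one scalar multiplier (init 1) and scalar biases (init 0).
        self.scale = 1.0
        self.bias1a = 0.0  # before the first linear layer
        self.bias1b = 0.0  # before the activation
        self.bias2a = 0.0  # before the second linear layer

    def hidden(self, x):
        """Activations that feed the zero-initialized last layer."""
        h = matvec(self.w1, [v + self.bias1a for v in x])
        return [max(0.0, v + self.bias1b) for v in h]  # ReLU

    def forward(self, x):
        h = self.hidden(x)
        branch = matvec(self.w2, [v + self.bias2a for v in h])
        # Skip connection plus the scaled residual branch.
        return [xi + self.scale * bi for xi, bi in zip(x, branch)]


block = FixupBlock(dim=4, branch_scale=0.5, rng=random.Random(0))
x = [1.0, -2.0, 3.0, 0.5]
print(block.forward(x))  # equals x: the branch contributes nothing at init
print(block.hidden(x))   # computed from the random w1, not from zero weights
```

In this sketch only `w2` is zero, so each block starts out as the identity map, while `w1` keeps its random values. The gradient with respect to `w2` is proportional to the hidden activations, which come from the random `w1`, so this is not the same situation as setting every layer to zero.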

I used to think that zero initialization was a pitfall. What am I missing?

(Image: a cutout from Figure 1)