Why would someone zero out the parameters of a module in the constructor?

I’m reading through Improved Denoising Diffusion Probabilistic Models. In the implementation, the authors sometimes zero out the parameters of a module in the constructor.

See this link: improved-diffusion/unet.py at 783b6740edb79fdb7d063250db2c51cc9545dcd1 · openai/improved-diffusion · GitHub

        self.out_layers = nn.Sequential(
            normalization(self.out_channels),
            SiLU(),
            nn.Dropout(p=dropout),
            zero_module(
                conv_nd(dims, self.out_channels, self.out_channels, 3, padding=1)
            ),
        )

where

def zero_module(module):
    """
    Zero out the parameters of a module and return it.
    """
    for p in module.parameters():
        # detach() so the in-place zero_() is not tracked by autograd
        p.detach().zero_()
    return module
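For reference, here is a minimal sketch (assuming PyTorch, with an arbitrary small conv for illustration) checking that `zero_module` really zeroes every parameter, so the branch it wraps contributes nothing at initialization and a residual block `x + branch(x)` starts out as the identity:

```python
import torch
import torch.nn as nn


def zero_module(module):
    """Zero out the parameters of a module and return it."""
    for p in module.parameters():
        p.detach().zero_()
    return module


# Hypothetical small conv standing in for the last layer of out_layers.
conv = zero_module(nn.Conv2d(8, 8, 3, padding=1))
x = torch.randn(1, 8, 16, 16)

# Every parameter (weight and bias) is zero, so the branch output is zero...
assert all((p == 0).all() for p in conv.parameters())
assert (conv(x) == 0).all()

# ...and the residual block x + conv(x) is exactly the identity at init.
assert torch.equal(x + conv(x), x)
```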

Why would someone want to do this?

I think this is the trick from Fixup Initialization, where you initialize the last layer of each residual branch to zero so that it doesn’t contribute to the output initially. Each block then starts out as the identity, which stabilizes early training and lets gradients flow to the initial layers rather than the last layers learning everything.

Thanks! This makes sense. However, if the last layers are initialized to zero, then they will never update, right? How do we get the last layers to learn?

There’s also an issue here: if the weights of the last layer are 0, the ResBlock that the module is a part of won’t do anything at initialization, since it will effectively just be a skip connection. So there must be something else going on.

Marking this as the solution b/c the paper authors did say this was a part of fixup initialization.

They will learn. Even though the weights start at zero, the gradient of the loss with respect to those weights depends on the layer’s input, not on the weights themselves, so it is generally nonzero and the layer moves away from zero on the very first update. What *is* zero initially is the gradient flowing back through the zeroed conv to the earlier layers in that branch; once the conv weights become nonzero, those layers start learning too. Also, in general you don’t have to zero the bias — you can initialize it to a sensible “prior” value, which further helps early learning.
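To see this concretely, here is a quick PyTorch sketch (with a hypothetical small conv) showing that a zero-initialized layer still receives a nonzero weight gradient on the very first backward pass, while the gradient reaching the input through that layer is killed by the zero weights and only survives via the skip connection:

```python
import torch
import torch.nn as nn

# Zero-initialize a conv, as zero_module would.
conv = nn.Conv2d(4, 4, 3, padding=1)
for p in conv.parameters():
    p.detach().zero_()

x = torch.randn(2, 4, 8, 8, requires_grad=True)
out = x + conv(x)  # residual block: identity at init
out.sum().backward()

# The zeroed conv gets a nonzero weight gradient on the first step,
# because dL/dW is built from the layer's input, not its (zero) weights.
assert conv.weight.grad.abs().sum() > 0

# Gradient flowing *through* the zeroed conv back to x is zero (it is
# multiplied by W = 0), but the skip connection still delivers gradient,
# so x.grad is exactly the upstream gradient of ones.
assert torch.equal(x.grad, torch.ones_like(x))
```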
