Why would someone zero out the parameters of a module in the constructor?

I’m reading through Improved Denoising Diffusion Probabilistic Models. In the implementation, the authors sometimes zero out the parameters of a module in the constructor.

See this link: improved-diffusion/unet.py at 783b6740edb79fdb7d063250db2c51cc9545dcd1 · openai/improved-diffusion · GitHub

        self.out_layers = nn.Sequential(
            normalization(self.out_channels),
            SiLU(),
            nn.Dropout(p=dropout),
            zero_module(
                conv_nd(dims, self.out_channels, self.out_channels, 3, padding=1)
            ),
        )

where

def zero_module(module):
    """
    Zero out the parameters of a module and return it.
    """
    for p in module.parameters():
        # detach() so the in-place zero_() is not tracked by autograd
        p.detach().zero_()
    return module
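For reference, here is a minimal sketch (assuming PyTorch, with an arbitrary small conv for illustration) checking that `zero_module` really zeroes every parameter, so the branch it wraps contributes nothing at initialization and a residual block `x + branch(x)` starts out as the identity:

```python
import torch
import torch.nn as nn


def zero_module(module):
    """Zero out the parameters of a module and return it."""
    for p in module.parameters():
        p.detach().zero_()
    return module


# Hypothetical small conv standing in for the last layer of out_layers.
conv = zero_module(nn.Conv2d(8, 8, 3, padding=1))
x = torch.randn(1, 8, 16, 16)

# Every parameter (weight and bias) is zero, so the branch output is zero...
assert all((p == 0).all() for p in conv.parameters())
assert (conv(x) == 0).all()

# ...and the residual block x + conv(x) is exactly the identity at init.
assert torch.equal(x + conv(x), x)
```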

Why would someone want to do this?

I think this is the trick from Fixup Initialization, where you initialize the last layer of each residual branch to zero so that it doesn’t contribute to the output initially. Each block then starts out as the identity, which stabilizes early training and lets gradients flow to the initial layers rather than the last layers learning everything.

Thanks! This makes sense. However, if the last layers are initialized to zero, then they will never update, right? How do we get the last layers to learn?

There’s also an issue here: if the weights of the last layer are 0, the ResBlock that the module is a part of won’t do anything at initialization, since it will effectively just be a skip connection. So there must be something else going on.

Marking this as the solution b/c the paper authors did say this was a part of fixup initialization.

They will learn. Even though the weights start at zero, the gradient of the loss with respect to those weights depends on the layer’s input, not on the weights themselves, so it is generally nonzero and the layer moves away from zero on the very first update. What *is* zero initially is the gradient flowing back through the zeroed conv to the earlier layers in that branch; once the conv weights become nonzero, those layers start learning too. Also, in general you don’t have to zero the bias — you can initialize it to a sensible “prior” value, which further helps early learning.
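To see this concretely, here is a quick PyTorch sketch (with a hypothetical small conv) showing that a zero-initialized layer still receives a nonzero weight gradient on the very first backward pass, while the gradient reaching the input through that layer is killed by the zero weights and only survives via the skip connection:

```python
import torch
import torch.nn as nn

# Zero-initialize a conv, as zero_module would.
conv = nn.Conv2d(4, 4, 3, padding=1)
for p in conv.parameters():
    p.detach().zero_()

x = torch.randn(2, 4, 8, 8, requires_grad=True)
out = x + conv(x)  # residual block: identity at init
out.sum().backward()

# The zeroed conv gets a nonzero weight gradient on the first step,
# because dL/dW is built from the layer's input, not its (zero) weights.
assert conv.weight.grad.abs().sum() > 0

# Gradient flowing *through* the zeroed conv back to x is zero (it is
# multiplied by W = 0), but the skip connection still delivers gradient,
# so x.grad is exactly the upstream gradient of ones.
assert torch.equal(x.grad, torch.ones_like(x))
```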
