Module registering None parameters

I just want to make sure that I am understanding this part correctly. I see in the MultiheadAttention module that if the $q, k, v$ tensors all have the same embedding dimension, the model calls self.register_parameter(<name>, None), which is then used later in the nn.functional.multi_head_attention_forward function.

If the registered parameters are None (or, as a few lines before that, nn.Parameter(torch.empty(<size>))), does that mean they are there but have no effect on the calculation, more like a placeholder so the attribute exists? Is this correct?

This means that they are defined but don’t have values.
This is useful to make sure that user code won’t try to use this attribute for something else (as it is already reserved for a Parameter), which would lead to weird behavior later.

Note that torch.empty(<size>) actually returns a Tensor, but it contains uninitialized memory (so can have any values in it).
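A minimal sketch (hypothetical module and names) showing both patterns side by side: torch.empty returns a real tensor backed by uninitialized memory, while register_parameter(name, None) only reserves the attribute as None:

```python
import torch
import torch.nn as nn

class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        # Real Parameter backed by uninitialized memory: contents are arbitrary
        self.weight = nn.Parameter(torch.empty(3, 3))
        # Reserved name: the attribute exists but is None
        self.register_parameter("bias", None)

m = Toy()
print(m.weight.shape)  # torch.Size([3, 3]); the values are undefined
print(m.bias)          # None
# None-registered parameters are skipped by named_parameters()
print([name for name, _ in m.named_parameters()])  # ['weight']
```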


So then do these None parameters just have a null effect on the forward pass? Like multiplying by an identity matrix or something? Or is that part of the calculation just skipped in the forward pass?

It depends on how the forward function is implemented. If you try to access them, they will be None.
For example, the linear layer has this condition in the forward pass here where it checks if the bias is None or not.
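As a sketch of that pattern (a hypothetical re-implementation, not the actual nn.Linear source), the forward pass can simply pass the possibly-None bias along; F.linear itself skips the addition when bias is None:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MyLinear(nn.Module):
    def __init__(self, in_features, out_features, bias=True):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features))
        if bias:
            self.bias = nn.Parameter(torch.zeros(out_features))
        else:
            # No bias: reserve the name as None
            self.register_parameter("bias", None)

    def forward(self, x):
        # F.linear accepts bias=None and skips the addition in that case
        return F.linear(x, self.weight, self.bias)

x = torch.randn(2, 4)
y = MyLinear(4, 3, bias=False)(x)
print(y.shape)  # torch.Size([2, 3])
```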


Could you elaborate more on what register_parameter(“a”, None) means? Does this mean “a” would not be used, or just that a is initialized to None? Thanks.

It means it is initialized to None.
That allows you to do checks like if self.a is None instead of having to check if the attribute “a” exists every time.
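For example (hypothetical names), the check becomes a plain identity comparison:

```python
import torch.nn as nn

m = nn.Module()
m.register_parameter("a", None)

# The attribute exists and is None, so a simple identity check works;
# no hasattr(m, "a") dance is needed
if m.a is None:
    print("a is not set")
```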

What is the advantage of using self.a instead of the attribute “a”? They feel the same to me.

Yes they are very similar. The main advantages here are:

  • Users can’t set anything but a Parameter (or None) on a, so the attribute name is “reserved” for a Parameter
  • I would argue that it is nicer to read
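A quick sketch of the “reserved name” behavior (hypothetical names): once a name is registered as a parameter, nn.Module rejects assigning a plain Tensor to it:

```python
import torch
import torch.nn as nn

m = nn.Module()
m.register_parameter("a", None)

try:
    m.a = torch.zeros(3)  # plain Tensor, not an nn.Parameter
except TypeError as e:
    print("rejected:", e)

m.a = nn.Parameter(torch.zeros(3))  # assigning a Parameter is fine
```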

I see. Thanks a lot for your response. Could I ask one last question? Why does BatchNorm1d stop working in PyTorch 1.5.0 after setting its “weight” parameter to None, while in PyTorch 0.1.12 setting “weight” to None did not affect the usage of BatchNorm1d? Thanks so much.

Most likely due to some implementation details and refactors.
You should use affine=False if you don’t want the learnable affine transformation after the normalization.
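For example, with affine=False the weight and bias are registered as None and the layer still normalizes:

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(4, affine=False)
print(bn.weight, bn.bias)  # None None

x = torch.randn(8, 4)
y = bn(x)  # only normalization, no learned scale/shift
print(y.shape)  # torch.Size([8, 4])
```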

I see. Thanks a lot for your help!