Module registering None parameters

jwillette · February 23, 2020, 10:53am

I just want to make sure that I am understanding this part correctly. I see in the MultiheadAttention module (https://pytorch.org/docs/stable/_modules/torch/nn/modules/activation.html#MultiheadAttention) that if the $q, k, v$ tensors all have the same dimension, then the module calls self.register_parameter(<name>, None), which is then used later in the nn.functional.multi_head_attention_forward function.

If the registered parameters are None (or, as a few lines earlier, nn.Parameter(torch.empty(<size>))), does that mean they exist but have no effect on the calculation, acting more like placeholders? Is this correct?

albanD · February 23, 2020, 6:55pm

This means that they are defined but don’t have values.
This is useful to make sure that user code won’t try to use this attribute (as it is already reserved to be a Parameter) leading to weird behavior later.

Note that torch.empty(<size>) actually returns a Tensor, but it contains uninitialized memory (so can have any values in it).
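
To illustrate the difference from None, here is a minimal sketch. The shape (2, 3) and the choice of nn.init.xavier_uniform_ are assumptions for the example, not taken from the MultiheadAttention source. The point is only that a parameter built from torch.empty is a real tensor that needs initializing before its values mean anything.

```python
import torch
import torch.nn as nn

# torch.empty allocates memory without filling it, so the values are arbitrary.
w = nn.Parameter(torch.empty(2, 3))

# Unlike a None parameter, this is a real tensor with a shape and a gradient flag.
is_real_parameter = isinstance(w, nn.Parameter) and w.requires_grad

# Overwrite the arbitrary contents in place before using the parameter.
# (xavier_uniform_ is just one possible choice of initializer.)
nn.init.xavier_uniform_(w)
```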

jwillette · February 24, 2020, 12:05am

So then do these None parameters just have a null effect on the forward pass, like an identity matrix or something? Or is the calculation just skipped in the forward pass?

albanD · February 24, 2020, 1:06am

It depends on how the forward function is implemented. If you try to access them, they will be None.
For example, the linear layer has a condition in its forward pass that checks whether the bias is None or not.
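
A minimal sketch of that pattern, using a hypothetical module called ScaleShift (not a PyTorch class) with an optional bias. When the bias is registered as None, the forward pass simply skips the addition, and the parameter does not show up in named_parameters().

```python
import torch
import torch.nn as nn


class ScaleShift(nn.Module):
    """Hypothetical layer: elementwise scale plus an optional shift."""

    def __init__(self, size, bias=True):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(size))
        if bias:
            self.bias = nn.Parameter(torch.zeros(size))
        else:
            # Reserve the name "bias" without giving it a value.
            self.register_parameter("bias", None)

    def forward(self, x):
        out = x * self.weight
        # The None parameter is not an identity op; the step is skipped.
        if self.bias is not None:
            out = out + self.bias
        return out


layer = ScaleShift(3, bias=False)
x = torch.tensor([1.0, 2.0, 3.0])
y = layer(x)

# Parameters registered as None are skipped when iterating.
param_names = [name for name, _ in layer.named_parameters()]
```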

jetcai1900 · July 13, 2020, 2:30am

Could you elaborate more on what register_parameter(“a”, None) means? Does this mean “a” would not be used? Or is “a” just initialized to None? Thanks.

albanD · July 13, 2020, 1:41pm

It means it is initialized to None.
That allows you to do checks like if self.a is None instead of having to check if the attribute “a” exists every time.
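
A small sketch of the difference, with two hypothetical modules (WithSlot and WithoutSlot are names made up for the example). Registering the name means a plain None check works. Without registration, the attribute does not exist at all and has to be guarded with hasattr.

```python
import torch.nn as nn


class WithSlot(nn.Module):
    def __init__(self):
        super().__init__()
        self.register_parameter("a", None)


class WithoutSlot(nn.Module):
    pass


with_slot = WithSlot()
without_slot = WithoutSlot()

# The registered name resolves to None, so a direct check is safe.
a_is_none = with_slot.a is None

# The unregistered name raises AttributeError on access,
# so it needs an existence check first.
has_a = hasattr(without_slot, "a")
```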

jetcai1900 · July 13, 2020, 1:43pm

What is the advantage of using self.a instead of using the attribute “a”? I feel they are the same?

albanD · July 13, 2020, 1:45pm

Yes they are very similar. The main advantages here are:

Users can’t set anything but a Parameter (or None) on a, so the attribute name is “reserved” for a Parameter.
I would argue that it is nicer to read.
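
The “reserved” behaviour can be sketched as follows. Reserved is a hypothetical module name. The sketch relies on nn.Module rejecting non-Parameter assignments to a registered parameter name with a TypeError.

```python
import torch
import torch.nn as nn


class Reserved(nn.Module):
    def __init__(self):
        super().__init__()
        self.register_parameter("a", None)


m = Reserved()

# Assigning a plain tensor to a registered parameter name is refused.
try:
    m.a = torch.zeros(2)
    rejected = False
except TypeError:
    rejected = True

# Assigning an actual Parameter is accepted and fills the reserved slot.
m.a = nn.Parameter(torch.zeros(2))
param_names = [name for name, _ in m.named_parameters()]
```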

jetcai1900 · July 13, 2020, 1:51pm

I see. Thanks a lot for your response. Could I ask one last question? Why does BatchNorm1d stop working in PyTorch 1.5.0 after setting its “weight” parameter to None, whereas in PyTorch 0.1.12 setting the “weight” parameter to None does not affect the usage of BatchNorm1d? Thanks so much.

albanD · July 13, 2020, 1:54pm

Most likely due to some implementation details and refactors.
You should use affine=False if you don’t want the linear transformation after the batchnorm.
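
A minimal sketch of the affine=False route. The feature count 4 and batch size 8 are arbitrary choices for the example. With affine=False the module exposes weight and bias as None and has no learnable parameters, but it still normalizes each feature over the batch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# No learnable scale or shift: weight and bias are None.
bn = nn.BatchNorm1d(4, affine=False)
learnable = list(bn.parameters())

x = torch.randn(8, 4)
y = bn(x)  # training mode: normalizes with the batch statistics

# Each feature column of the output should have a mean close to zero.
column_means = y.mean(dim=0)
```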

jetcai1900 · July 13, 2020, 2:04pm

I see. Thanks a lot for your help!