I just want to make sure that I am understanding this part correctly. I see in the MultiheadAttention module (https://pytorch.org/docs/stable/_modules/torch/nn/modules/activation.html#MultiheadAttention) that if the q, k, v tensors all have the same embedding dimension, then the module calls self.register_parameter(<name>, None), and those names are used later in the forward pass. If the registered parameters are None (or, a few lines earlier, nn.Parameter(torch.empty(<size>))), does that mean they are there but have no effect on the calculation, more like placeholders so the attributes exist? Is this correct?
This means that they are defined but don’t have values.
This is useful to make sure that user code won’t try to use this attribute for something else (as it is already reserved to be a Parameter), which could lead to weird behavior later.
torch.empty(<size>) actually returns a Tensor, but it contains uninitialized memory (so can have any values in it).
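A quick illustration of that point: torch.empty allocates memory of the requested shape but does not initialize it, so the values are whatever happened to be in that memory.

```python
import torch

t = torch.empty(2, 3)  # allocates memory but does not initialize it
print(t.shape)         # torch.Size([2, 3]); the values are arbitrary
```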
So then do these None parameters just have a null effect on the forward pass, like an identity matrix or something? Or is the calculation just skipped entirely in the forward pass?
It depends on how the forward function is implemented. If you try to access them, they will be None.
For example, the linear layer has this condition in the forward pass here where it checks if the bias is None or not.
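Here is a minimal sketch of that pattern (MyLinear is a simplified stand-in, not the actual nn.Linear source): the forward pass explicitly branches on whether the parameter is None and skips the corresponding computation.

```python
import torch
import torch.nn as nn

class MyLinear(nn.Module):
    """Sketch of a forward pass that branches on a None parameter."""
    def __init__(self, in_features, out_features, bias=True):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features))
        if bias:
            self.bias = nn.Parameter(torch.zeros(out_features))
        else:
            # name is registered, but no tensor is stored
            self.register_parameter("bias", None)

    def forward(self, x):
        out = x @ self.weight.t()
        if self.bias is not None:  # the None check: skip the add entirely
            out = out + self.bias
        return out

layer = MyLinear(4, 2, bias=False)
print(layer.bias)  # None
```

So it is not an identity operation; the branch that would use the parameter is simply never taken.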
Could you elaborate more on what register_parameter(“a”, None) means? Does it mean “a” would not be used, or just that a is initialized to None? Thanks.
It means it is initialized to None. That allows you to do checks like if self.a is None instead of having to check whether the attribute “a” exists every time.
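For example (using a bare nn.Module just for illustration), after registration the attribute always exists, so you can test its value directly instead of reaching for hasattr:

```python
import torch.nn as nn

m = nn.Module()
m.register_parameter("a", None)

# The attribute exists, so no hasattr/getattr dance is needed:
assert hasattr(m, "a")
if m.a is None:
    print("'a' is registered but currently unset")
```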
What is the advantage of using self.a over just checking for the attribute “a”? They feel the same to me.
Yes, they are very similar. The main advantages here are:
- Users can’t set anything but a Parameter on a, so the attribute name is “reserved” for a Parameter.
- I would argue that it is nicer to read.
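The first point can be demonstrated directly: once a name is registered as a parameter, nn.Module’s __setattr__ rejects assigning a plain tensor (or anything else that is not a Parameter or None) to it.

```python
import torch
import torch.nn as nn

m = nn.Module()
m.register_parameter("a", None)

# The name "a" is now reserved for a Parameter (or None):
try:
    m.a = torch.zeros(3)  # plain Tensor, not a Parameter -> TypeError
except TypeError as e:
    print("rejected:", e)

m.a = nn.Parameter(torch.zeros(3))  # assigning a Parameter is allowed
```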
I see. Thanks a lot for your response. Could I ask one last question? Why, after setting the “weight” parameter of BatchNorm1d to None, does BatchNorm1d stop working in PyTorch 1.5.0, while in PyTorch 0.1.12 setting the “weight” parameter to None does not affect the usage of BatchNorm1d? Thanks so much.
Most likely due to some implementation details and refactors between those versions. You should use affine=False if you don’t want the learnable affine transformation after the batchnorm.
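With affine=False, BatchNorm1d itself registers weight and bias as None, and the layer still runs; it just performs the normalization without the learnable scale and shift:

```python
import torch
import torch.nn as nn

# affine=False registers weight and bias as None, so only
# normalization happens, with no learnable scale/shift
bn = nn.BatchNorm1d(4, affine=False)
print(bn.weight, bn.bias)  # None None

x = torch.randn(8, 4)
y = bn(x)  # still works: normalization only, no affine transform
```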
I see. Thanks a lot for your help!