I was reading the code of Mask R-CNN to see how they fix their BN parameters. I noticed that they use self.register_buffer to create the weight and bias, while in the PyTorch BN definition, self.register_parameter is used when affine=True. Could I simply think that buffers and parameters have everything in common, except that a buffer skips the operations to compute gradients and update its values?
By the way, what is the difference between directly defining an nn.Parameter in the module and using register_parameter?
Yes, you are correct in your assumption. If you have parameters in your model which should be saved and restored in the state_dict, but not trained by the optimizer, you should register them as buffers.
Buffers won't be returned in model.parameters(), so the optimizer won't have a chance to update them.
Both approaches work the same regarding training etc.
There are some differences in the function calls, however. Using register_parameter you have to pass the name as a string, which can make the creation of a range of parameters convenient. Besides that, I think it's just a matter of coding style which one you prefer.
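For example, a minimal sketch showing the three options side by side (the module and tensor names are just for illustration):

import torch
import torch.nn as nn

class MyModule(nn.Module):
    def __init__(self):
        super().__init__()
        # direct attribute assignment of an nn.Parameter
        self.weight = nn.Parameter(torch.randn(3))
        # register_parameter takes the name as a string
        self.register_parameter('bias', nn.Parameter(torch.zeros(3)))
        # register_buffer: saved in the state_dict, but not trained
        self.register_buffer('running_stat', torch.zeros(3))

model = MyModule()
print([name for name, _ in model.named_parameters()])  # ['weight', 'bias']
print(list(model.state_dict().keys()))                 # ['weight', 'bias', 'running_stat']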
If I have some parameters that I don't want to be trained, can I just add them as self.some_params inside the nn.Module to preserve state? Does register_buffer do anything special in that case as compared to just storing it inside self?
If your self.some_params are nn.Parameter objects, then you don't have to worry about this. If they're tensors, then they won't be in the state_dict (unless registered as a buffer).
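A quick illustration (names made up): a plain tensor attribute silently disappears from the state_dict, while a registered buffer is kept:

import torch
import torch.nn as nn

class Tracker(nn.Module):
    def __init__(self):
        super().__init__()
        self.plain_stat = torch.zeros(3)                      # plain tensor attribute
        self.register_buffer('buffered_stat', torch.zeros(3)) # buffer

m = Tracker()
print(list(m.state_dict().keys()))  # ['buffered_stat'] -- plain_stat is missing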
What are the downsides of not using a buffer? I am currently using self.some_param inside nn.Module to keep a tensor that keeps track of running average statistics of activations. I don't need it for backprop, only to make decisions during runtime. I want to learn more about why my approach is not an optimal one. If you could explain or give some readings, that'd be great.
I am sorry if this is a stupid question, but I am not sure if I want that. I checked this, but I still don't see why I would need that. Would I need buffers if I want to save the model later? Are there any other reasons I would like to use state_dict rather than just assigning to self?
As @pierrecurie explained, one reason to register the tensor as a buffer is to be able to serialize the model and restore all internal states.
Another one is that all buffers and parameters will be pushed to the device, if the corresponding call (e.g. model.cuda()) is made on the parent model:
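Something like this (reconstructing the snippet; my_buffer and my_param are illustrative names):

import torch
import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.my_tensor = torch.randn(1)                    # plain attribute
        self.register_buffer('my_buffer', torch.randn(1))  # buffer
        self.my_param = nn.Parameter(torch.randn(1))       # parameter

model = MyModel()
model.cuda()  # assumes a GPU is available
print(model.my_tensor.device)  # cpu
print(model.my_buffer.device)  # cuda:0
print(model.my_param.device)   # cuda:0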
As you can see, model.my_tensor is still on the CPU, where it was created, while all parameters and buffers were pushed to the GPU after calling model.cuda().
@ptrblck probably another dumb question, but why wouldn't I just use nn.Parameter for both my_tensor and my_param and just set requires_grad=False for the first? How would that be different from the example in your post?
I think there wouldn't be a difference regarding the model training, gradient flow etc., so you could probably use this approach.
However, it might be confusing to other users of your code to see some "buffers" in model.parameters().
Also, you would pass these buffers to the optimizer, if you just pass all model.parameters().
Again, this won't mess with your training, but the optimizer will unnecessarily have to skip these buffers in its step() method.
I would describe it as a "cleaner" coding style to separate buffers and parameters.
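For instance (an illustrative sketch), a frozen nn.Parameter still shows up in model.parameters() and would be handed to the optimizer:

import torch
import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(3))
        # a "buffer" disguised as a non-trainable parameter
        self.frozen_stat = nn.Parameter(torch.randn(3), requires_grad=False)

model = MyModel()
print([name for name, _ in model.named_parameters()])
# ['weight', 'frozen_stat'] -- both would be passed to the optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # frozen_stat gets skipped in step()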
Ah, thanks. An example where I find this distinction difficult is in the context of fixed positional encodings in the Transformer model. Typically I see implementations where the fixed positional encodings are registered as buffers, but I'd consider these tensors as non-learnable parameters (that should show up in the list of model parameters), especially when comparing between methods that don't rely on such injection of fixed tensors.
Re. your last remark, I guess this should do the trick, but from that thread I understand it is poor coding practice.
So in general:
buffers = "fixed tensors / non-learnable parameters / stuff that does not require gradient"
parameters = "learnable parameters, requires gradient"
Sort of hijacking the thread, but I am struggling with implementing a capsule net. It needs some non-trainable variables which I don't want in the state_dict, since they are just computed statistics.
The problem is that these variables live in the model code, where I create them with something like torch.zeros(b, h, w).cuda().
But this is ugly, and if I use torch.zeros(b, h, w) instead, these variables will not be sent to the GPU when we call model.to(device).
Please let me know if there is a better way to construct them.
Could you describe the usage of these tensors a bit?
I assume they are not defining the model state, as you don't want to have them in the state_dict, which means these tensors are independent of the model?
Could you create these tensors during runtime instead, e.g. by using the device attribute of a parameter or buffer?
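For example, a sketch creating the temporary statistics inside forward on the same device as the module's parameters (shapes and names are made up):

import torch
import torch.nn as nn

class CapsuleLayer(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features))

    def forward(self, x):
        # temporary, non-persistent statistics created on the parameters' device
        # (alternatively, device=x.device works just as well)
        stats = torch.zeros(x.size(0), self.weight.size(0), device=self.weight.device)
        out = x @ self.weight.t()
        stats += out.detach()  # e.g. accumulate some statistics
        return out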
Yeah, it's a nicer workaround. Thanks.
But it would be better if there were a way to do this without setting the device in the model code, so the whole model can be sent to GPU or CPU as we call model.to(device).
model.to() transfers all "states" to the specified device.
However, your use case sounds as if the mentioned tensors should not be in the state_dict, which seems like a special use case.
Could you therefore explain the use case a bit more?
You could override the to or apply methods of your module to include transferring that specific tensor. This way you would not have to pass the device to any additional parts of your module.
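A sketch of that idea (names are illustrative; this hooks into _apply, which to(), cuda() and cpu() all route through internally):

import torch
import torch.nn as nn

class CapsuleLayer(nn.Module):
    def __init__(self, b, h, w):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(h, w))
        # plain attribute: not a parameter, not a buffer, so not in the state_dict
        self.routing_stats = torch.zeros(b, h, w)

    def _apply(self, fn, *args, **kwargs):
        # let nn.Module move parameters/buffers as usual ...
        super()._apply(fn, *args, **kwargs)
        # ... then apply the same conversion (device/dtype) to the extra tensor
        self.routing_stats = fn(self.routing_stats)
        return self

layer = CapsuleLayer(2, 3, 4)
layer.to('cuda')  # assumes a GPU is available
print(layer.routing_stats.device)  # cuda:0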
Hi, one more question:
I have a huge tensor (700MB, precomputed, requires_grad=False) which is used for a tensor multiplication inside a Module (as shown in the snippet).
When training the model with multiple GPUs, I need to push it to all GPUs. The easiest way would be using register_buffer in a module. However, this means the state_dict would be larger than 700MB (definitely not a good idea). So I was wondering what the best way is to push such a large tensor to all GPUs?
BTW, if I simply use tensor.to(device), is the tensor going to be pushed to all GPUs or only the default one? (Had a test; it seems like it ends up on the default GPU, not all GPUs.)
Thanks in advance!
import torch
import torch.nn as nn

class NewModule(nn.Module):
    def __init__(self, pre_matrix):
        super(NewModule, self).__init__()
        # pre_matrix: N x P, ~700MB, requires_grad=False
        self.pre_matrix = pre_matrix
        self.pre_matrix.requires_grad = False
        # self.register_buffer('pre_matrix', pre_matrix)  # this would make the state_dict larger than 700MB

    def forward(self, input):
        # input: M x N, on multiple GPUs
        # output: M x P, on multiple GPUs
        out = input @ self.pre_matrix
        return out
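One possible alternative, assuming a reasonably recent PyTorch version (1.5 or later), is a non-persistent buffer: it is still moved by model.to() and replicated across GPUs with the module, but is excluded from the state_dict:

# inside __init__, instead of the plain attribute:
self.register_buffer('pre_matrix', pre_matrix, persistent=False)
# pre_matrix now follows model.to(device) / multi-GPU replication,
# but model.state_dict() no longer contains the 700MB tensor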