Update self defined parameters when using distributed data parallel

I am using distributed data parallel to train a model on multiple GPUs, and I have run into a problem: I used register_buffer to define a parameter. In addition, I need to update it manually. How can I achieve this? I tried to apply the same update as when the model is trained on one GPU, but the results are not correct. It seems that the value of this parameter is not synchronized across GPUs. Thanks a lot.

If this is a model parameter, any reason for using register_buffer instead of register_parameter?

In addition, I need to update it manually. How can I achieve this?

If it is a parameter (not a buffer) and you don’t expect the autograd engine to compute gradients for you, you can set its .requires_grad field to False before passing the model to the DDP constructor. Then DDP won’t sync its grads and the optimizer won’t update the parameter value.
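As a minimal sketch of that approach (the model class and the parameter name scale are made up for illustration; it assumes the process group has already been initialized):

import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

class MyModel(nn.Module):                    # hypothetical model, just for illustration
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 4)
        # a real parameter that will be updated by hand rather than by the optimizer
        self.scale = nn.Parameter(torch.ones(1))

    def forward(self, x):
        return self.fc(x) * self.scale

# assumes torch.distributed.init_process_group() was already called for this rank
model = MyModel().cuda()
model.scale.requires_grad_(False)            # freeze it BEFORE wrapping the model in DDP
ddp_model = DDP(model, device_ids=[torch.cuda.current_device()])

# manual update: DDP does not sync a grad for `scale`, and the optimizer never touches it
with torch.no_grad():
    ddp_model.module.scale.add_(0.1)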

I tried to apply the same update as when the model is trained on one GPU, but the results are not correct. It seems that the value of this parameter is not synchronized across GPUs.

I might be missing something. It looks like you want to manually update a parameter but still want DDP to help synchronize it across GPUs/processes? I don’t fully understand the use case; could you please elaborate? If this parameter is manually updated, would it be possible to let all processes set it to the same value?

“I might be missing something. It looks like you want to manually update a parameter but still want DDP to help synchronize it across GPUs/processes?”
Yes. This is what I would like to achieve.

For example:

import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        a = torch.zeros((3, 1))
        self.register_buffer("a", a)

    def update_a(self, b):
        self.a.add_(b)

Here b is a vector that depends on the input data, so I cannot simply set it to the same value on every process.

Let me know whether I have described the problem clearly. Thanks a lot for your help.
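For reference, a minimal sketch of what manual synchronization could look like (this is an assumption, not a solution confirmed in this thread: it presumes the process group is initialized and that averaging b over ranks is the semantics you want; note also that with the default broadcast_buffers=True, DDP re-broadcasts buffers from rank 0 at every forward pass, which would overwrite per-rank updates):

import torch.distributed as dist

def update_a(self, b):
    # b is computed from the local batch, so it differs on every process;
    # sum it over all ranks and average so each replica applies the same update
    dist.all_reduce(b, op=dist.ReduceOp.SUM)
    b = b / dist.get_world_size()
    self.a.add_(b)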


Hi @jetcai1900, have you solved this problem? I have run into the same issue; I would appreciate any advice you can share. Thanks!