What is the difference between register_parameter (requires_grad=False) and register_buffer in PyTorch?

During training, the values stored in this variable are updated under torch.no_grad().

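For context, this is roughly the update pattern I mean (a minimal sketch; the buffer name matches my example below, but the update rule itself is hypothetical):

```python
import torch
import torch.nn as nn

class ToyModule(nn.Module):
    def __init__(self):
        super().__init__()
        # Non-trainable state, analogous to `matrix` in the example below.
        self.register_buffer("matrix", torch.zeros(197, 192))

    def forward(self, x):
        # The stored value is updated without tracking gradients.
        with torch.no_grad():
            self.matrix += x.mean(dim=0)  # hypothetical update rule
        return x
```
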
When training with DDP, the value is just an average over the batch samples computed on each GPU, so no synchronization should be required. Is there a difference between the two? The register_buffer version performs worse. No gradient flows through this tensor, so I don't understand why the performance gap is so significant.

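For reference, DistributedDataParallel has a broadcast_buffers flag (default True) that copies registered buffers from rank 0 to the other ranks at the start of every forward pass; a minimal wrapping sketch (the function name is mine, and the process group is assumed to be initialized elsewhere):

```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_for_ddp(model: nn.Module, local_rank: int) -> DDP:
    # Assumes torch.distributed.init_process_group(...) was already called.
    return DDP(
        model.cuda(local_rank),
        device_ids=[local_rank],
        # With broadcast_buffers=True (the default), registered buffers are
        # copied from rank 0 to all other ranks at the start of each forward
        # pass, so per-rank buffer updates on non-zero ranks get overwritten.
        broadcast_buffers=True,
    )
```
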
Example in the model's __init__:

  1. self.matrix = nn.Parameter(torch.zeros(197, 192).cuda(), requires_grad=False)
  2. self.matrix = torch.zeros(197, 192).cuda()
  3. self.register_buffer('matrix', torch.zeros(197, 192))

Options 1 and 3 give the same performance, but option 2 is different.
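
Below is a self-contained sketch contrasting where each of the three variants ends up (the attribute names are hypothetical; the comments reflect standard nn.Module behavior):

```python
import torch
import torch.nn as nn

class MatrixVariants(nn.Module):
    def __init__(self):
        super().__init__()
        # 1. Frozen parameter: in parameters() and state_dict(); DDP broadcasts
        #    it once at construction but never reduces a gradient for it.
        self.matrix_param = nn.Parameter(torch.zeros(197, 192), requires_grad=False)
        # 2. Plain attribute: not registered, so it is absent from state_dict()
        #    and is not moved by .to()/.cuda(); DDP ignores it entirely.
        self.matrix_plain = torch.zeros(197, 192)
        # 3. Buffer: in buffers() and state_dict(), moved by .to()/.cuda();
        #    with broadcast_buffers=True, DDP re-broadcasts it from rank 0
        #    every forward pass.
        self.register_buffer("matrix_buffer", torch.zeros(197, 192))

if __name__ == "__main__":
    m = MatrixVariants()
    print([n for n, _ in m.named_parameters()])  # ['matrix_param']
    print([n for n, _ in m.named_buffers()])     # ['matrix_buffer']
    print(list(m.state_dict().keys()))           # ['matrix_param', 'matrix_buffer']
```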

Cross-posted from here without a follow-up.